
Approaches to Automatic Biographical Sentence Classification:
An Empirical Study
By
Michael Ambrose Conway
Submitted in partial fulfilment of the requirements for the degree of
Doctor of Philosophy
at
The University of Sheffield

May 2007
Department of Computer Science
Who’s Who
A shilling life will give you all the facts:
How Father beat him, how he ran away,
What were the struggles of his youth, what acts
Made him the greatest figure of his day;
Of how he fought, fished, hunted, worked all night,
Though giddy, climbed new mountains; named a sea;
Some of the last researchers even write
Love made him weep his pints like you and me.
With all his honours on, he sighed for one
Who, say astonished critics, lived at home;
Did little jobs about the house with skill
And nothing else; could whistle; would sit still
Or potter round the garden; answered some
Of his long marvellous letters but kept none.
W. H. Auden (1935)
Solomon Grundy
Solomon Grundy,
Born on Monday,
Christened on Tuesday,
Married on Wednesday,
Took ill on Thursday,
Worse on Friday,
Died on Saturday,
Buried on Sunday.
That was the end
Of Solomon Grundy.
Traditional
In general, most biographical material about me on the Internet is
seriously flawed, if not outright wrong, and I know other writers
are experiencing the same problem with their own data - so it must
be something to do with the way Google and Yahoo squeeze information and make it do odd tricks. Sometimes I’ll be introduced onstage at book events by a speaker saying, “Mr Coupland is German
and once did an advertisement for Smirnoff vodka. He collects meteorites and lives in Scotland in a house with no furniture.” I know.
What are you supposed to say when you hear something like this?
Douglas Coupland (www.coupland.com)
Abstract
This thesis addresses the problem of the reliable identification of biographical
sentences, an important subtask in several natural language processing application areas (for example, biographical multiple document summarisation, biographical information extraction, and so on). The biographical sentence classification task is placed within the framework of genre classification, rather than
traditional topic based text classification.
Before exploring methods for doing this task computationally, we need to establish whether, and with what degree of success, humans can identify biographical sentences without the aid of discourse or document structure. To
this end, a biographical annotation scheme and corpus were developed and assessed using a human study. The human study showed that participants were
able to identify biographical sentences with a good level of agreement.
The main body of the thesis presents a series of experiments designed to find
the best sentence representations for the automatic identification of biographical sentences from a range of alternatives. In contrast to previous work, which
has centred on the use of single terms (that is, unigrams) for biographical sentence representations, the current work derives unigram, bigram and trigram
features from a large corpus of biographical text (including the British Dictionary of National Biography). In addition to the use of corpus-derived n-grams,
a novel characteristic of the current approach is the use of biographically relevant syntactic features, identified both intuitively and through empirical methods.
The experimental work shows that a combination of n-gram features derived from the Dictionary of National Biography and biographically orientated syntactic features yields a performance that surpasses that gained using n-gram features alone. Additionally, in accordance with the view of biographical sentence
classification as a genre classification task, stylistic features (for example, topic
neutral function words) are shown to be important for recognising biographical sentences.
Acknowledgements
First of all, I would like to thank my supervisor Rob Gaizauskas for his invaluable guidance and advice during the preparation and writing of this thesis. I
would also like to thank Rob for his initial suggestion that the British Dictionary of National Biography would be an interesting corpus for examining the
characteristics of biographical texts.
The members of my thesis committee have provided valuable feedback on my
research plan, for which I would like to thank them. Finally, I would like to
thank my family and friends for their support.
Contents

1 Introduction
  1.1 Hypotheses
  1.2 Applications
  1.3 Biographical Sentence Classification: The Current Situation
  1.4 Thesis Outline
    1.4.1 Background Chapters
    1.4.2 Human Biographical Text Classification
    1.4.3 Automatic Biographical Text Classification

2 Background Issues for Biographical Sentence Recognition
  2.1 Genre
    2.1.1 Systemic Functional Grammar
    2.1.2 The Multi-Dimensional Analysis Approach
  2.2 Stylistics and Stylometry
    2.2.1 Stylistics
    2.2.2 Stylistic Analysis
    2.2.3 Stylometrics: Authorship Attribution
  2.3 Biographical Writing
    2.3.1 Characteristics of Biographical Writing
    2.3.2 Development of Biographical Writing
  2.4 Classification
    2.4.1 Automatic Text Classification
  2.5 Machine Learning
    2.5.1 Learning Algorithms
    2.5.2 Evaluating Learning
    2.5.3 Feature Selection
  2.6 Conclusion

3 Review of Recent Computational Work
  3.1 Automatic Genre Classification
    3.1.1 Recent Work on Genre in the Computational Linguistics Tradition
    3.1.2 Feature Selection for Topic Based Text Classification
    3.1.3 Feature Selection for Genre Classification
  3.2 Systems that Produce Biographies
    3.2.1 The Summarisation Task
    3.2.2 Multiple Document Summarisation
    3.2.3 New Mexico System
    3.2.4 Mitre/Columbia System
    3.2.5 Southern California System
    3.2.6 Southampton System
    3.2.7 Other Relevant Work
  3.3 Conclusion

4 Methodology and Resources
  4.1 Methodology
  4.2 Software
  4.3 Corpora
    4.3.1 Dictionary of National Biography
    4.3.2 Chambers Biographical Dictionary
    4.3.3 Who's Who
    4.3.4 Dictionary of New Zealand Biography
    4.3.5 Wikipedia Biographies
    4.3.6 University of Southern California Corpus
    4.3.7 The TREC News Corpus
    4.3.8 The BROWN Corpus
    4.3.9 The STOP Corpus

5 Developing a Biographical Annotation Scheme
  5.1 Existing Annotation Schemes
    5.1.1 Text Encoding Initiative Scheme
    5.1.2 University of Southern California Scheme
    5.1.3 Dictionary of National Biography Scheme
  5.2 Synthesis Annotation Scheme
    5.2.1 Developing a New Biographical Scheme
    5.2.2 A Synthesis Biographical Annotation Scheme
    5.2.3 Assessing the Synthesis Annotation Scheme
  5.3 Developing a Small Biographical Corpus
    5.3.1 Text Sources
    5.3.2 Issues in Developing a Biographical Corpus
  5.4 Conclusion

6 Human Study
  6.1 Introduction
  6.2 Agreement
    6.2.1 Percentage Based Scores
    6.2.2 The KAPPA Statistic
  6.3 Pilot Study
  6.4 Main Study
    6.4.1 Motivation
    6.4.2 Study Description
    6.4.3 Results
    6.4.4 Discussion
  6.5 Conclusion

7 Learning Algorithms for Biographical Classification
  7.1 Motivation
  7.2 Experimental Procedure
  7.3 Results
  7.4 Discussion
  7.5 Conclusion

8 Feature Sets
  8.1 Standard Features
  8.2 Biographical Features
  8.3 Syntactic Features
  8.4 Key-keyword Features
    8.4.1 Naive Key-keywords Method
    8.4.2 WordSmith Key-keywords Method
  8.5 Conclusion

9 Automatic Classification of Biographical Sentences
  9.1 Procedure
  9.2 Syntactic Features
    9.2.1 Results
    9.2.2 Discussion
  9.3 Lexical Methods
    9.3.1 Results
    9.3.2 Discussion
  9.4 Keywords
    9.4.1 Results
    9.4.2 Discussion
  9.5 Conclusion

10 Portability of Feature Sets
  10.1 Motivation
  10.2 Experimental Procedure
  10.3 Results
  10.4 Discussion
  10.5 Conclusion

11 Conclusion
  11.1 Contributions
    11.1.1 Hypothesis 1, Annotation Scheme and Human Study
    11.1.2 Hypothesis 2, Automatic Biographical Sentence Classification
  11.2 Future Work
    11.2.1 Biographical Sentence Classifier Module
    11.2.2 Improving Biographical Sentence Classification
    11.2.3 Extensions to the Biographical Sentence Classification Task
    11.2.4 Other Text Classification Tasks
    11.2.5 Genre Analysis
  11.3 Conclusion

A Human Study: Pilot Study
  A.1 Introduction
  A.2 Task Description
    A.2.1 Core Biographical Category
    A.2.2 Extended Biographical Category
    A.2.3 Non Biographical Category
  A.3 Task Questions
  A.4 Participant Responses

B Human Study: Main Study
  B.1 Instructions to Participants
    B.1.1 Introduction
    B.1.2 Six Biographical Categories
    B.1.3 Example Sentences
  B.2 Sentences
    B.2.1 Set One
    B.2.2 Set Two
    B.2.3 Set Three
    B.2.4 Set Four
    B.2.5 Set Five
  B.3 Agreement Data
    B.3.1 Set 1 Agreement Data
    B.3.2 Set 2 Agreement Data
    B.3.3 Set 3 Agreement Data
    B.3.4 Set 4 Agreement Data
    B.3.5 Set 5 Agreement Data

C Identifying Syntactic Features
  C.1 Distance From the Mean
  C.2 Standard Deviations from the Mean

D Syntactic Features

E Ranked Features

F Coverage of New Annotation Scheme
  F.1 Four Biographies from Various Sources
    F.1.1 Ambrose Bierce
    F.1.2 Phillip Larkin
    F.1.3 Alan Turing
    F.1.4 Paul Foot
  F.2 Four Biographies from Wikipedia
    F.2.1 Jack Anderson
    F.2.2 Kerry Packer
    F.2.3 Richard Pryor
    F.2.4 Stanley Williams

G The Corrected Re-sampled t-test
  G.1 An Outline of the Corrected Re-sampled t-test

H Factor Analysis
  H.1 Constructing a Correlational Matrix
  H.2 Identifying Factors from a Correlational Matrix
List of Figures

2.1 Relationship Between Genre and Register (based on Eggins (1994)).
2.2 Example Registers.
2.3 Possible and Impossible Registers.
2.4 Linguistic Characteristics of Dimensions in Biber (1988).
2.5 Two Examples of Stylistically Different Texts from (Amis, 1985) and (Dennett, 1992), Respectively.
2.6 Examples of "Deep" and "Contingent" Features Described in McEnery and Oakes (2000).
2.7 Inverted Pyramid for Biographies.
2.8 Genre Features Decision Tree.
2.9 Decision Tree Rules Example.
2.10 Example Decision Tree for 3000 Instances.
2.11 Rule Based Learning Example.
2.12 Constituents of a Contingency Table.
3.1 System Network from Whitelaw and Argamon (2004).
3.2 Conversion of Documents to Document Vectors.
3.3 Genre Categories Used in Karlgren and Cutting (1994).
3.4 New Mexico System (Cowie et al., 2001).
3.5 Sample Output from Schiffman et al. (2001).
3.6 System Architecture for Zhou et al. (2004) MDS System.
3.7 Sample Output from ARTEQUAKT System (Kim et al., 2002).
4.1 Discrepancies in Annotation Styles in the USC (Curie Biographies).
4.2 BROWN Corpus Hierarchy of Text Types.
4.3 Truncated Example Entry from the STOP Corpus (Semino and Short, 2004): Michael Caine's Autobiography.
4.4 Hierarchy of Texts Included in the STOP Corpus (Semino and Short, 2004).
5.1 Dictionary of National Biography Opening Schema.
5.2 Relationship Between the Dictionary of National Biography and University of Southern California Biographical Schemes.
5.3 Relationship Between New Synthesis Scheme, Text Encoding Initiative Scheme, and University of Southern California Scheme.
5.4 Entry for Alan Turing in the Dictionary of National Biography Annotated Using New Six Way Scheme.
5.5 Wikipedia Biography for Richard Pryor Annotated Using New Six Way Scheme.
5.6 Sources of Documents Used From the Literary Genres of the STOP Corpus.
5.7 Types of Document Used in Biographically Tagged Corpus.
7.1 Mean Performance of Learning Algorithms with 10 x 10 Cross-Validation on "Gold Standard" Data Using a Unigram Based Feature Representation.
7.2 Root Section of a C4.5 Decision Tree Derived From the Gold Standard Training Data.
9.1 Comparison of the Performance of Unigrams, Bigrams and Trigrams.
9.2 Comparison of the Performance of Syntactic and Pseudo-Syntactic Features.
9.3 Experimental and Null Hypotheses — Syntactic Features.
9.4 Experimental and Null Hypotheses — Pseudo-Syntactic Features.
9.5 Comparison of the Performance of Differing Lexical Representations.
9.6 Experimental and Null Hypotheses — Stemming.
9.7 Experimental and Null Hypotheses — Function Words.
9.8 Experimental and Null Hypotheses — Stopwords.
9.9 Comparison of the Performance of Keywords, Key-Keywords, and Frequencies.
9.10 Experimental and Null Hypotheses — Key-Keywords.
9.11 Comparison of Partial Decision Trees for Each Feature Set.
10.1 Biographical Unigram Extraction from the USC Corpus.
10.2 Comparison of the Performance of Unigrams Derived from USC Annotated Clauses and Biographical Dictionary Unigram Frequency Counts.
10.3 Experimental and Null Hypotheses: USC and Biographical Dictionary Derived Features.
List of Tables

2.1 Genres Used by Biber (1988).
2.2 Text Typology Derived by Biber (1989).
2.3 Example Training Sentence Representations.
2.4 Example Test Sentence Representation.
2.5 Contingency Table.
4.1 Descriptive Statistics for Biographical Corpora.
5.1 Coverage of New Annotation Scheme Using Different Sources.
5.2 Coverage of New Annotation Scheme on Short Wikipedia Biographies (Deaths in December 2005).
5.3 Percentage of Biographical Sentences Based on 1000 Sentence Sample.
5.4 Descriptive Statistics for Biographical Corpora.
5.5 Average Number of Biographical Tag Types per Text.
6.1 Raw Agreement Scores (Idealised Example Data).
6.2 Types of KAPPA; Methods for Calculating Expected Probability.
6.3 Idealised Data for KAPPA Example.
6.4 Inter-classifier Agreement Results.
6.5 KAPPA Scores for Each Sentence Set.
7.1 Six Learning Algorithms Compared Using "Gold Standard" Data and a Feature Representation Based on the 500 Most Frequent Unigrams in the DNB: 10 x 10 Fold Cross Validation.
7.2 Six Learning Algorithms Compared Using "Gold Standard" Data and a Feature Representation Based on the 500 Most Frequent Unigrams in the DNB: 100 x 10 Fold Cross Validation.
8.1 100 Most Frequent Unigrams from the DNB.
8.2 100 Most Frequent Bigrams From the DNB.
8.3 100 Most Frequent Trigrams From the DNB.
8.4 Syntactic Features Used by Biber (1988).
8.5 Twenty Syntactic Features Most Characteristic of Biography Ranked by Maximum Distance from Mean.
8.6 Twenty Syntactic Features Characteristic of Biography Ranked by Positive Association with Biographical Genre.
8.7 Twenty Syntactic Features Characteristic of Biography Ranked by Negative Association with Biographical Genre.
8.8 Frequent Unigrams in the Biographical Corpus.
8.9 Unigrams in the Biographical Corpus Ranked by Naive Key-keyness.
8.10 Unigrams in the Biographical Corpus Ranked by WordSmith Key-keyness.
9.1 Performance of Syntactic and Pseudo-syntactic Features.
9.2 Performance of Alternative Lexical Methods.
9.3 Performance of Keyword and Key-keyword Features Relative to a Baseline.
10.1 Classification Accuracies of the USC and DNB/Chambers Derived Features on Gold Standard Data.
A.1 Pilot Study Data.
B.1 Agreement Data for Set 1.
B.2 Agreement Data for Set 2.
B.3 Agreement Data for Set 3.
B.4 Agreement Data for Set 4.
B.5 Agreement Data for Set 5.
C.1 Sixty-seven Features Ranked by Distance from the Mean (Irrespective of Whether the Distance is Positive or Negative).
C.2 Sixty-seven Features Ranked by Distance from the Mean.
C.3 Sixty-seven Features Ranked by Number of Standard Deviations from the Mean.
C.4 Sixty-seven Features Ranked by Number of Standard Deviations from the Mean.
E.1 100 Features Identified for DNB and TREC Data.
Chapter 1
Introduction
1.1 Hypotheses
This thesis presents and defends the claim that biographical writing can be reliably identified at the sentence level using automatic methods. In other words,
biographical sentences can be identified as such, independently of the context
in which they occur (that is, the document or surrounding text).
As sub-hypotheses to the more general hypothesis, this thesis explores two
secondary hypotheses:
Hypothesis 1 Humans can reliably identify biographical sentences without
the contextual support provided by a discourse or document structure.
Hypothesis 2 “Bag-of-words” style sentence representations augmented with
syntactic features provide a more effective sentence representation for
biographical sentence recognition than “bag-of-words” style representations alone.
The first hypothesis seeks to identify whether the research programme is feasible. That is, if humans can identify isolated biographical sentences without the
aid of a supporting discourse structure, then it is likely that a machine learning
algorithm will be able to perform the same task. Once we have established that
humans are able to identify biographical statements with good agreement (see
Chapter 6), we can move to Hypothesis 2, which claims that the topic orientated text representations commonly used in text classification (and information retrieval) research are, on their own, less useful than a combination of topic orientated
and stylistic (non-topical) features for the biographical sentence classification
task. Hypothesis 2 is designed to explore the usefulness of syntactic and stylistic features for genre classification.
The main hypothesis (and two sub-hypotheses) provides a framework for the
thesis, but other research questions are addressed within that framework (for
example, the utility of the key-keywords methodology (see page 21) for identifying biographical features).
The remainder of this introductory chapter provides some motivation for the
research work, focusing on potential uses for a successful biographical sentence classifier, and then gives a chapter-by-chapter outline of the thesis, concentrating on how each chapter addresses the hypotheses identified.
1.2 Applications
Biographically orientated summarisation or biographically orientated multiple
document summarisation is dependent on the reliable identification of biographical sentences; sentences that may not be judged salient using standard
information retrieval techniques and which may be “buried” in otherwise event
orientated text. In situations where gigabytes of data are analysed for biographical sentences, more linguistically orientated natural language processing approaches to identifying biographical sentences — say, full parsing — are
inappropriate due to the significant overhead of producing a linguistic representation of the document and its constituent sentences. In this situation, standard “bag-of-words” representations combined with highly focused syntactic
features may prove a reliable and scalable solution, able to identify biographical writing hidden among potentially huge document collections.
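As an illustration of the kind of lightweight representation meant here, the sketch below combines counts over a small unigram vocabulary with a few cheap surface cues standing in for syntactic features. It is a hypothetical Python example written for this discussion, not the representation used in this thesis: the feature names, cues and vocabulary shown are illustrative only.

    # Hypothetical sketch: bag-of-words counts plus cheap surface cues that
    # stand in for syntactic features. Not the thesis implementation.
    import re
    from collections import Counter

    def represent(sentence, vocabulary):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        counts = Counter(tokens)
        features = {word: counts[word] for word in vocabulary}   # bag-of-words part
        features["third_person_pronouns"] = sum(
            counts[p] for p in ("he", "she", "his", "her"))       # shallow cue
        features["ed_suffix_tokens"] = sum(
            1 for t in tokens if t.endswith("ed"))                # crude past-tense proxy
        features["contains_year"] = int(
            bool(re.search(r"\b(1[5-9]|20)\d\d\b", sentence)))    # dates are frequent in biography
        return features

    vocabulary = ["born", "died", "educated", "married", "appointed"]  # stand-in for corpus-derived unigrams
    print(represent("He was born in 1912 and educated at Sheffield.", vocabulary))

Because every feature here is obtained from token-level pattern matching, the per-sentence cost is small, which is the point of avoiding full parsing at this scale.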
Potential uses for a biographical sentence classifier could include:
• A sentence filter in a person orientated information extraction system. That is, sentences classified as biographical with respect to the person of interest could be selected for further processing.

• A research tool for journalists and writers seeking only biographical information from information retrieval results on the web. For example, a writer constructing a biography for a named historical individual may be interested in filtering out those sentences that are not biographical with respect to the individual of interest, even if this causes some textual incoherences/disfluencies.

• A module in a more general genre identification system. For example, a system that classified web pages with respect to their primary information function, with the proportion of biographical sentences above a given threshold leading to the document being tagged as biographical as opposed to some other genre.
More potential uses of a biographical sentence classifier (along with areas of
further research) are outlined in the concluding chapter.
1.3 Biographical Sentence Classification: The Current Situation
While recent work on identifying biographical writing has been carried out in the context of the development of architectures and algorithms for biographically orientated automatic summarisation systems (Schiffman et al., 2001; Kim et al., 2002; Cowie et al., 2001; Zhou et al., 2004), little work in the computational linguistics tradition has been directed towards the elaboration of a theoretical framework for the identification of biographical writing. Previous work has concentrated on heuristics for identifying biographical phrases or sentences (for example, the assumption that apposition is a strong indicator of biographical clauses (Schiffman et al., 2001)), or on the construction of traditional domain specific ontologies (for example, a system that is limited to the domain of artists' biographies (Kim et al., 2002)). Zhou et al. (2004) describe a system developed at the University of Southern California that uses corpus evidence to construct biographical schemas (that is, to identify the kinds of information that form a biography). However, their sample is limited to 120 biographies of 10 individuals, collected as a preliminary study during the construction of a biographical multiple document summarisation system.
Neither the text classification nor genre identification communities have focused research efforts directly on the identification of biographical sentences,
although the identification of biographical sentences can easily be reframed
as a sentence level text classification problem and also as a genre identification problem if we accept that biographical writing constitutes a distinct
genre.
1.4 Thesis Outline
The thesis can be divided into three parts. Chapters 2, 3 and 4 discuss background issues necessary for understanding the work as a whole. Chapters 5 and
6 address Hypothesis 1 (Humans can reliably identify biographical sentences
without the contextual support provided by a discourse or document structure). Chapters 7, 8, 9 and 10 address Hypothesis 2 ("Bag-of-words" style sentence representations augmented with syntactic features provide a more effective sentence representation for biographical sentence recognition than "bag-of-words" style representations alone) and also present results pertaining to
the automatic classification of biographical sentences more generally. The concluding chapter summarises the results of the thesis. Additional material is
presented in appendices, which are referenced when appropriate in the text of
the thesis.
1.4.1 Background Chapters
These three chapters contain the background necessary to understand the thesis as a whole.
Chapter 2: Background Issues for Biographical Sentence Recognition This
chapter provides essential background to the central themes of the thesis. The
first section explores the notion of genre, and discusses two recent theoretical approaches to the study of genre: Systemic Functional Grammar and Multi-Dimensional Analysis. The second section sets forth a survey of stylistics (a subdiscipline of linguistics relating to the study of style) and its statistically influenced intellectual cousin, stylometry. The third section presents a summary of the history of biographical writing in English, along with a discussion of some of
the characteristics of biographical writing. The fourth section discusses classification theory and especially issues associated with text classification. The fifth
and final section reviews several of the important machine learning algorithms
used in the work, along with a discussion concerning methods of evaluating
the performance of learning algorithms.
Chapter 3: Review of Recent Computational Work This chapter aims to describe recent computational work relevant to the biographical sentence classification task. The chapter is divided into two sections. The first section describes work in automatic genre classification (focusing on the different strategies used in topic categorisation and genre categorisation). The second section describes several working systems that produce biographies from unstructured data.
Chapter 4: Methodology and Resources The chapter is divided into three
sections. The first section describes the methodology used in the work. The
second and third sections outline the two kinds of resources — software and
corpora, respectively — used in the research.
1.4.2 Human Biographical Text Classification
These two chapters address the issue of whether people are able to reliably
identify biographical sentences. The two chapters describe the formulation of
a biographical annotation scheme (that is, a criterion for deciding whether a
sentence is biographical or non-biographical), the development of a biographical
corpus based on this scheme, and the utilisation of that scheme in a human
study which assesses the extent to which participants agree on what is and is
not a biographical sentence.
Chapter 5: Developing a Biographical Annotation Scheme This chapter first
reviews some existing annotation schemes for biographical text (including the
scheme developed under the auspices of the Text Encoding Initiative). The second section describes a new synthesis annotation scheme, along with an initial assessment of that scheme. The final section describes a biographical corpus
based on the new annotation scheme.
Chapter 6: Human Study This chapter is divided into three main sections.
First, some necessary background on agreement statistics is presented, with
special reference to the KAPPA statistic. Second, a pilot study is reported which
uses a three way biographical sentence classification scheme. Finally, the main
study is presented, which shows that the biographical annotation scheme developed in Chapter 5 yields good levels of agreement between annotators. The
high levels of agreement obtained in this study support the claim made in
Hypothesis 1, that people are able to distinguish reliably between biographical
and non-biographical sentences.
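For reference, the kappa statistic referred to here is standardly defined as

\[ \kappa = \frac{P(A) - P(E)}{1 - P(E)} \]

where P(A) is the observed proportion of agreement between annotators and P(E) is the proportion of agreement expected by chance. Kappa is 1 for perfect agreement and 0 when observed agreement is no better than chance; Chapter 6 discusses the different ways of estimating P(E).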
1.4.3 Automatic Biographical Text Classification
Chapters 7, 8, 9 and 10 present results pertaining to the automatic classification of biographical sentences. The corpus of biographical sentences developed in Chapter 5 and validated in Chapter 6 forms the basis of a gold standard corpus of five hundred and one biographical and non-biographical sentences, which is utilised in these machine learning chapters as training and
test data. While Chapters 5 and 6 established that people are able to reliably
distinguish between biographical and non-biographical sentences, this part of
the thesis addresses, among other issues pertinent to automatic biographical
sentence classification, Hypothesis 2: “Bag-of-words” style sentence representations augmented with syntactic features provide a more effective sentence
representation for biographical sentence recognition than “bag-of-words” style
representations alone.
Chapter 7: Learning Algorithms for Biographical Classification This chapter compares the performance of five popular text classification algorithms
when applied to the biographical sentence classification task. Each learning
algorithm was tested using the gold standard biographical data and a unigram
based feature set consisting of the 500 most frequent words in the Dictionary
of National Biography. The chapter serves as a “first pass” of the data, allowing
indicative results to be drawn about the usefulness of different machine learning algorithms for the biographical sentence classification task. Later chapters
(that is, Chapters 9 and 10) concentrate on varying the feature sets used, rather
than the machine learning algorithms. The Naive Bayes algorithm generated
the best results.
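The following is a hedged sketch of that kind of comparison, written with scikit-learn rather than the tooling used in the thesis; the sentences, labels and vocabulary are placeholders standing in for the gold standard data and the 500 most frequent DNB unigrams.

    # Sketch only: fixed unigram vocabulary + Naive Bayes + 10 x 10 cross-validation.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

    sentences = (["He was born in 1812 and died in 1870 ."] * 20 +
                 ["Share prices fell sharply in early trading ."] * 20)  # placeholder data
    labels = [1] * 20 + [0] * 20                                         # 1 = biographical
    dnb_unigrams = ["born", "died", "was", "in", "the"]                  # placeholder vocabulary

    model = make_pipeline(CountVectorizer(vocabulary=dnb_unigrams), MultinomialNB())
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    scores = cross_val_score(model, sentences, labels, cv=cv, scoring="accuracy")
    print("Mean accuracy over 100 train/test splits:", scores.mean())

Repeating the ten-fold split ten times, as in the chapter, gives a more stable estimate of mean accuracy than a single cross-validation run.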
Chapter 8: Feature Sets In order to provide the necessary background to
Chapters 9 and 10, which focus on evaluating the performance of various feature sets, this chapter outlines the different feature sets used in the work. These
features include standard features (including "bag-of-words" style features), biographical features, syntactic features, and (so-called) key-keyword based features.
Chapter 9: Automatic Classification of Biographical Sentences This chapter is divided into three sections. Section one considers the performance of
syntactic features versus “bag-of-words” style sentence representations for the
biographical text classification task. It is discovered that augmenting a "bag-of-words" style representation with syntactic features improves classification
accuracy, but not at a statistically significant level, lending limited support to
Hypothesis 2. Section two considers the impact of function words on classification accuracy, in line with the intuition (borrowed from stylometry) that topic
neutral function words are important in representing the non-topical content
of a text. It was discovered that accuracy decreases at a statistically significant
level when function words are removed from feature sets. The third part of
the chapter assesses the key-keyword methodology for selecting genre specific
features and finds that selecting features using the method, which was developed as an alternative to Multi-Dimensional analysis (see Chapter 2), yields
worse results than simply using frequent unigrams.
Chapter 10: Portability of Feature Sets This chapter compares the performance of a feature set identified by Zhou et al. (2004) (using a method that required a considerable annotation effort) to a feature set derived automatically from frequent unigrams in a biographical corpus. It is shown that both approaches yield almost identical classification accuracy scores, suggesting that
there is little gain in using this labour intensive feature identification method.
Chapter 11: Conclusion The concluding chapter is divided into two sections. First, contributions made by the thesis work are described. Second,
possible future directions, based on the work conducted for this thesis, are
outlined.
Chapter 2
Background Issues for Biographical Sentence Recognition
This chapter is designed to provide essential background to the central theme
of the thesis, that biographical sentence classification can be viewed as a genre
classification task where non-topical sentence representations are useful (hence
the extended discussion of genre and stylistics — fields of study which stress
the non-topical characteristics of text — in Sections 2.1 and 2.2, respectively).
The chapter consists of five sections. First, different approaches to genre are
discussed (as the thesis places biographical text recognition within the wider
area of genre recognition). Second, a brief overview of stylistics is presented, as
stylistics is traditionally concerned with non-topical criteria for distinguishing
texts. Third, salient characteristics of biographical writing are outlined, including a brief history of the biographical genre. Fourth, an outline of classification
theory is presented, concentrating on the particular problems associated with
text classification. Fifth, a review of several of the important machine learning algorithms used in this work is presented, including a discussion of the
important subject of how to evaluate learning.
2.1 Genre
Aristotle (c. 340 BC) was the first to develop a systematic theory of genre.
Although his classification framework is rooted in the culture of antiquity, it
remains important as it established a framework for subsequent developments.
Dubrow (1982) borrows Whitehead's description of the history of philosophy
as a series of footnotes to Plato, and characterises the history of genre studies
as a series of footnotes to Aristotle.
This section focuses on two active research traditions that have genre as a core
concern: Systemic Functional Linguistics and Multi-Dimensional Analysis.
2.1.1 Systemic Functional Grammar
Systemic Functional Grammar1 (SFG) is a sociolinguistic theory of language use first developed by Michael Halliday in the 1960s (Halliday, 1961, 1966), and extended by Halliday and others throughout the 1970s, 1980s and 1990s (Halliday and Hasan, 1976; Martin et al., 1996). The current theory is presented comprehensively in the most recent edition of Introduction to Functional Grammar (Halliday and Matthiessen, 2004). This section aims to provide a brief overview of SFG, before going on to give some details of SFG grammatical analysis, and finally to describe the importance of SFG to genre.
Note that we do not leap into a discussion of the function of genre in SFG for
two main reasons. First, SFG is a very dense and complex theory, and requires
some exposition before an analysis of genre in relation to the theory can be
helpful. Second, SFG uses an extensive technical vocabulary, sometimes with
words (for example, the term “genre” itself) used in surprising ways.
A Brief Overview of Systemic Functional Theory
SFG is a functional theory of grammar, emphasising how language is embedded in a social context of use. This emphasis on language use distinguishes SFG
from formal — Chomskyan — theory, which concerns itself with the ways in
which minds shape and constrain possible grammars. SFG is a socio-linguistic
theory of grammar, whereas formal linguistics is a psychological theory (Martin et al., 1997). In this way, SFG can be viewed as an intellectual descendant of the later Wittgenstein (Wittgenstein, 1953), and more directly, of Firth's approach to linguistics (Honeybone, 2005). Malinowski's work in anthropology, which stressed the importance of language in negotiating, developing and consolidating human relationships (Malinowsky, 1935), was also very influential
in the development of SFG.
The core concern of SFG (and its chief point of contrast with formal linguistics)
is its preoccupation with the questions: What do people do with language, and
how do they do it? These questions are grounded in a view of language based
around four theoretical assumptions (articulated by Eggins (1994)).
1. Language use is functional.
2. The function of language is to make meanings.
3. Meanings are influenced by social and cultural contexts.
4. Language use is a process of making meanings by choosing.2
1 Systemic Functional Grammar is also known as Systemic Functional Linguistics.
2 Eggins (1994) uses the word semiotics to describe this process.
The social orientation of SFG and its view of language as a “strategic, meaning
making resource" (Eggins, 1994, p. 1) means that SFG does not only provide a method for grammatical analysis, but can aptly be applied to various practical problems. Eggins (1994) lists several areas in which SFG has been successfully applied, including language education (for example, Rothery (1991)), speech pathology (for example, Armstrong (1991)) and natural language generation (Teich, 1999). Ideas from SFG have also been applied to textual genre classification (for example, Whitelaw and Argamon (2004), described on page
54).
A distinctive feature of SFG is that it provides a theoretical structure at multiple levels of granularity. It is both able to provide fine grained grammatical analysis at the level of parts-of-speech within clauses, and to interpret
the broader communicative act in its social context, unified by the idea of
meaning as a function of linguistic choice. These meanings — the meanings
that a text produces — can be divided into three categories (Halliday and Matthiessen, 2004):
• Ideational meaning: Roughly "aboutness". Ideational meaning is concerned with processes and relationships.

• Interpersonal meaning: The social dynamic assumed in the text. Note that even expositionary text can be viewed as interpersonally meaningful if construed as a dialogue between writer and reader.

• Textual meaning: The textual meaning (as opposed to contextual or interpersonal meaning) is that component of the text that is "about" the text itself. That is, how the text is organised. For example, are persons or abstract nouns referred to?
SFG is grounded in the social use of language and, as such, it focuses (as we have mentioned) on entire texts rather than isolated sentences since, according
to SFG theory, communication has a beginning, a middle and an end, and can
only properly be analysed in its entirety.
A further feature of SFG, a natural development of the theory that meaning
is a function of linguistic choice, is the system network, a formalism for representing linguistic choice (see Figure 3.1 on page 56 for an example of a system
network).
Systemic Functional Grammar: Register and Genre
“Genre” and “register” are semi-technical3 terms within SFG. Register is typically defined as “the context of situation” and genre “the context of culture”
(Halliday and Matthiessen, 2004). What this means is that the context of a
text can be described at two levels of abstraction: the lower level (register) describes the situation, and the higher level (genre) explains the situation in terms
of purpose. The relationship between genre, register and language is represented diagrammatically in Figure 2.1.

3 These terms were identified by Malinowsky (1935).
Figure 2.1: Relationship Between Genre and Register (based on Eggins (1994)). [Diagram: LANGUAGE sits within REGISTER (field, tenor, mode), the context of situation, which in turn sits within GENRE, the context of culture.]
As register is “nearer” to language, and the basis on which genre is formed, we
will begin with an analysis of how register and text are related. Register can be
divided into three aspects (note that these are analogous to the three meanings
of a text, ideational, interpersonal and textual):
• Field is the topic of the text and the particular linguistic patterns associated with that topic. For example, consider a weather forecast and the typical linguistic features associated with it. Note that weather forecast phrases are a subset of weather related phrases generally and that the phrase "persistent heavy rain" is a likelier phrase in the context of weather forecasts than "raining buckets".4
• Tenor is the effect of the relationship between language producer and addressees. For example, we are unlikely to talk to a child in the same manner (vocabulary, level of formality, and so on) as we would to a potential employer in a job interview situation.

• Mode refers to the channel of communication and the way it shapes interaction. For instance, text message (SMS) writing is highly truncated and information dense compared to face-to-face interaction.

4 The set of acceptable phrases for describing the weather in the context of a weather forecast is severely constrained, and breaking these rules is very noticeable to the audience. Recently in the UK a weather forecaster was severely criticised after making the (presumably off guard) prediction that it would "piss it down".
The register of a text is its situational context, that is, a description of the situation in which the text occurs, defined according to the three parameters Field, Tenor and Mode (see Figure 2.2 on the following page for some examples of situations). This situational context (register) provides a basis for genre — the
cultural context — or the purpose of the text. The purpose of a text — its genre
— can be discerned from register information (field, tenor and mode) only
through a knowledge of culture. In other words, genre is a “meta” level of
analysis above descriptive (situational) register information. For example, we
may know the field, tenor, and mode of a situation (newspaper, shop worker
- customer, face-to-face, respectively), but only because we know about shops,
newspapers and customers (and the way these entities interact in a particular culture) can we correctly ascertain that the genre here is face-to-face routine
transaction.
Eggins (1994) identifies two ways in which genre is "mediated" through register. First, register helps to "fill in the slots" for a particular genre. To use Eggins's example, if we use "university essay" as a model genre, register serves the function of "filling in the specifics relevant to a particular situation of use of that genre" (Eggins, 1994, p. ). In other words, while all instances of the genre share the same structure (statement of thesis, presentation of evidence, and so on), the "field" will change according to the discipline of the essay (anthropology, sociology, and so on). Second, Eggins (1994) presents the concept
of genre potential, which is described as all possible register configurations that
are culturally possible (possible in a given culture). For instance, Figure 2.3 on
the next page shows two register configurations, one of which belongs to the
distance learning (or correspondence) course genre, and one which cannot belong to any genre (Karate cannot be learned through distance learning).
Different genres are characterised by distinct schematic structures. These schematic
structures are similar in structure to the frame and script structures associated
with traditional artificial intelligence (for example, Minsky (1974) and Schank and Abelson (1977)). Eggins (1994) uses the example of the recipe genre. The form of a recipe is highly predictable (recipe title, ingredients, instructions) and
any deviation from this norm is surprising to the reader. Not all genres are
characterised by this ideational function. Spoken conversations can be primarily concerned with the consolidation of social relationships — the interpersonal
function — rather than the transmission of ideas.
2.1.2 The Multi-Dimensional Analysis Approach
Biber’s work on genre and text types – and especially his multi-dimensional
methodology, where factor analysis is used to identify salient differences between
Figure 2.2: Example Registers.
Television weather forecast
field: weather
tenor: television presenter - anonymous audience
mode: television broadcast
Internet weather forecast
field: weather
tenor: corporate author - anonymous audience
mode: electronic text
Buying a newspaper at a newsagent
field: newspaper
tenor: shop worker - customer
mode: face-to-face
Figure 2.3: Possible and Impossible Registers.
Possible register: accountancy distance learning course
field: accountancy
tenor: lecturer - student
mode: written
Impossible register: karate distance learning course
field: karate
tenor: lecturer - student
mode: written
genres and text types5 — began in the 1980s in collaboration with his doctoral supervisor, Edward Finegan (Biber and Finegan, 1986) and was developed through the late 1980s (especially in the book length treatment Variations Across Speech and Writing (Biber, 1988), which was based on Biber's PhD thesis). The research programme focused on identifying dimensions of difference between texts in English (Biber, 1988) and then using these dimensions as a basis for the development of an empirically based theory of text types (Biber, 1989).6

5 The notion of text types, like the notion of genre, does not have a settled definition and is used in different ways in different research traditions. See Moessner (2001) for a discussion of this and other terminological issues.
6 The multi-dimensional approach can be viewed as part of the development of corpus linguistics. Instead of intuitively analysing the features associated with different genres, multi-dimensional analysis seeks to quantify these differences using a perspicuous and repeatable methodology.

The themes and methodologies identified in Biber's early work were taken up by other researchers and applied to new areas. For instance, Kim and Biber (1994) focuses on empirically determining the underlying dimensions of variation in contemporary Korean using a corpus based approach and multi-dimensional analysis, and Biber et al. (2002) uses a corpus of written and spoken academic discourse7 to identify prevalent registers in the hope that this can inform pedagogy for students of English as a second language.

7 TOEFL 2000 Spoken and Written Academic Language Corpus, which consists of extracts from textbooks, conversations between academics and students, and service encounters between university staff and students.

This section first explores the central distinction in multi-dimensional studies of variation: the difference between genre and text type, before going on to outline the broad methodology used in studies of linguistic variation. Then, the impact of the research programme developed by Biber and Finegan is described, with reference to work from various areas, including language teaching (as briefly mentioned above) and historical linguistics. Finally, some criticisms of, and potential substitutes for, the multi-dimensional approach are presented and assessed.

For Biber, a typology of text types is any classification scheme for texts. So the traditional categorisation provided by Aristotle (into comedy, tragedy, epic poetry, and so on) is a typology, as is the categorisation scheme given by traditional discourse theory, where modes of discourse are either narrative, descriptive, expositionary, or argumentative. Genre is one kind of typology, which Biber describes as a "folk-typology" (Biber, 1989). By describing genre as a "folk-typology" system, Biber means that the genre of a text is classified using its most easily discernable attribute: its external format — whether it is a newspaper article, technical manual, novel, and so on — rather than its internal linguistic features.

Biber presents a new typology of English, based on the empirically identified internal linguistic features of a text, rather than relying on either the hand-me-down genre distinctions of folk-typology or the intuitions of linguists about which features of a text may be important, resulting in the discovery that the traditional distinctions of folk-typology do not map clearly to the classification scheme generated by his empirical study:

    Genre distinctions do not adequately represent the underlying text types of English... Texts within particular genres can differ greatly in their linguistic characteristics; for example, newspaper articles can range from extremely narrative and colloquial in linguistic form to extremely informational and elaborated in form. On the other hand, different genres can be quite similar linguistically; for example, newspaper articles and popular magazines can be nearly identical in form. Linguistically distinct texts within a genre represent different text types; linguistically similar texts from different genres represent a single text type.
    Biber (1989, p. 6)
So, while genre may be a useful text typology insofar as it reflects the external characteristics and social functions of texts — whether they are newspaper
articles, children's books, and so on — it fails to reflect the internal linguistic
qualities of the text. A given genre may well contain texts of several different
text types. For example, the genre of academic papers includes argumentative papers and survey papers, each with different purposes, structures and
styles, yet despite their linguistic differences, both are included in the genre
“academic papers”. Similarly, argumentative texts may occur across a range of
traditional genre categories. Examples here could include newspaper editorials, academic papers, and political pamphlets.
Biber’s work is distinctive because it is empirically based. Instead of examining representative documents from each genre, and attempting to extract the
distinctive linguistic features common to all documents, Biber uses linguistic
features to identify text categories, and to construct a typology:
It should be noted that the direction of analysis here is opposite
from that typically used in the study of language use. Most analyses begin with a situated or functional description and identify linguistic features associated with that distinction as a second step...
The opposite approach is used here: quantitative techniques are
used to identify the groups of features that actually co-occur in texts
and afterwards these groupings are interpreted in functional terms.
The linguistic dimension, rather than the functional dimension is
given priority.
B IBER (1988, p. 13)
Biber’s system involves two stages: First, the linguistic features of a collection of texts are analysed using statistical techniques, and a set of dimensions
based on this statistical analysis is presented. Second, using cluster analysis,
a typology of texts for English is proposed, based on the previously derived
dimensions.
Multi-Dimensional Variation: Methodology
Stage One: Deriving Textual Dimensions
B IBER (1988) identifies six dimensions of variation in English across twenty-three genres (see Table 2.1 on page 16). Four hundred and eighty-one texts were
selected from the Lancaster-Oslo-Bergen corpus (J OHANSSON ET AL ., 1978)
and the London-Lund corpus of spoken English (S VARTVIK, 1990). Through
a study of the linguistics literature, Biber identified sixty-seven features that
seemed likely to be significant for genre identification (including the presence
of second person pronouns, demonstratives, stranded prepositions, and so on).
See Appendix D on page 259 for a brief description of each feature. These features were identified automatically using a set of programs, and then hand
checked. The requirement to verify the feature selection in each document by
hand limited the number of documents that could be processed. The features
in each document were then counted (and normalised to account for different document lengths), and the data was then subjected to factor analysis (see
Appendix H on page 274 for a brief outline of factor analysis), from which
seven factors (that is, patterns of linguistic co-occurrence) were identified.
On the basis of the factors identified, Biber determined six dimensions of variation:8
1. Involved versus informational production — Involved text is informal and
stresses interpersonal relationships. Examples of involved genres include
personal letters, interview transcripts, and so on. Features strongly associated with involved text are: present tense, contractions and second person pronouns. Informational text is primarily concerned with the transmission of information. It is formal, dense with nouns and prepositions,
and lacks contractions and pronouns. Examples of informational genres
would include academic papers and government reports.
2. Narrative versus non-narrative concerns — Narrative texts are characterised
by extensive use of the past tense and third person pronouns. Narrative
genres include fiction and biography. Non-narrative texts are associated
with heavy use of the present tense and can include genres like academic
papers, government documents, technical manuals, and so on.
3. Explicit versus situation dependent reference — B IBER (1988) describes this
dimension as a “dimension that distinguishes highly explicit and elaborated endophoric reference from situation dependent exophoric reference.” In other words, explicit (endophoric) genres (like academic prose
and government documents) are self-contained; they do not depend on
extensive reference to an unexplained situation in order to be intelligible.
Relative clauses are highly characteristic of explicit genres. Situation-dependent genres, however, assume a high level of familiarity with the domain in question. B IBER (1988) uses the example of a football commentary, which only makes sense because we have background knowledge
about football and the kind of events that occur at football matches.
4. Overt expression of persuasion — An important feature distinctive of persuasive writing is the occurrence of suasive verbs (for example, “should”).
Persuasive genres include press editorials and argumentative academic
prose. These contrast with genres whose function is solely to relay information (for example, technical manuals).
8 B IBER (1988) identified six dimensions, but in later work this was reduced to five dimensions
(B IBER , 1989). The dropped dimension was online informational elaboration. The first three dimensions have both positive and negative features; the last three dimensions have only positive features.
Table 2.1: Genres Used By B IBER (1988). “LOB” refers to the Lancaster-Oslo-Bergen Corpus (J OHANSSON ET AL ., 1978) and “London-Lund” refers to the London-Lund Corpus of Spoken English (S VARTVIK, 1990).

Genre                        Number of Texts   Source        Spoken/Written
Press reportage              44                LOB           Written
Editorials                   27                LOB           Written
Press reviews                17                LOB           Written
Religion                     17                LOB           Written
Skills and hobbies           14                LOB           Written
Popular law                  14                LOB           Written
Biographies                  14                LOB           Written
Official documents           14                LOB           Written
Academic prose               80                LOB           Written
General fiction              29                LOB           Written
Mystery fiction              13                LOB           Written
Science fiction              6                 LOB           Written
Adventure fiction            13                LOB           Written
Romantic fiction             13                LOB           Written
Humour                       9                 LOB           Written
Personal letters             6                 Biber         Written
Professional letters         10                Biber         Written
Face-to-face conversation    44                London-Lund   Spoken
Telephone conversation       27                London-Lund   Spoken
Public conversations         22                London-Lund   Spoken
Broadcast                    18                London-Lund   Spoken
Spontaneous speeches         16                London-Lund   Spoken
Prepared speeches            14                London-Lund   Spoken
5. Abstract versus non-abstract style — Features that indicate abstract writing
include the use of the passive voice and agentless verbs (characteristic of
academic writing). Non-abstract style is simply the absence of the features that indicate abstraction.
6. Online informational elaboration — Here “online” refers to those genres that
are characterised by the relay of information in real time (for example,
prepared speeches, and so on). These “online” genres are dense with that-complement clauses.
Linguistic features are associated with each “end” of the dimension (see Figure 2.4 on the next page) as determined by factor analysis. For example, if we
take the narrative versus non-narrative dimension, past tense is associated with the narrative end of the dimension and present tense with the non-narrative end of the dimension.
Figure 2.4: Linguistic Characteristics of Dimensions in B IBER (1988).

Dimension 1: Involved vs Informational Production
    Positive features: private verbs, contractions, present-tense verbs, second person pronouns, analytic negation, WH-questions, adverbs
    Negative features: prepositions, place adverbials

Dimension 2: Narrative vs Non-Narrative Concerns
    Positive features: past-tense verbs, third person pronouns, public verbs, present participle clauses
    Negative features: present-tense verbs, attributive adjectives, time adverbials

Dimension 3: Explicit vs Situation Dependent
    Positive features: phrasal co-ordination, nominalization
    Negative features: time adverbials, place adverbials

Dimension 4: Overt Expressions of Persuasion
    Positive features: infinitives, predictive modals, necessary modals, split auxiliaries
    Negative features: none

Dimension 5: Abstract vs Non-Abstract Style
    Positive features: conjuncts, agentless passives, past participle clauses, other adverbial subordination
    Negative features: none
Stage Two: Deriving a Textual Typology
B IBER (1989) builds on his identified dimensions to create an empirically grounded
typology of English texts. To derive the typology, he gave each text a “dimension
score” for each dimension, based on the frequency of linguistic features associated with that dimension. The score was determined by counting
the frequencies of features associated with one end of the dimension (for instance, for the Narrative versus Non-Narrative dimension, features associated with the Narrative end, like past tense
verbs or third person pronouns) and subtracting the frequencies of features
associated with the other end of the dimension (for the same example, features associated with the Non-Narrative end, like present tense verbs or time adverbials).
Table 2.2: Text Typology Derived by B IBER (1989).
Intimate interpersonal interactions
Scientific exposition
Imaginative narrative
Situated reportage
Informational Interaction
Learned exposition
General narrative exposition
Involved persuasion
A vector of the scores for each dimension was then subjected to cluster analysis, and the optimal grouping of the data resulted in eight
distinct clusters. These clusters form the basis of Biber’s text typology (see
Table 2.2).
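To make the dimension-score calculation concrete, the short Python sketch below illustrates the procedure for a single dimension. It is an illustration only, not a reproduction of Biber's implementation: the feature names are a small, hypothetical subset of those in Figure 2.4, and the counts are assumed to have already been normalised per 1,000 words of text.

    # Minimal sketch of Biber-style dimension scoring (illustrative only).
    # Assumes feature counts have already been normalised per 1,000 words.

    NARRATIVE = ["past_tense_verbs", "third_person_pronouns", "public_verbs"]
    NON_NARRATIVE = ["present_tense_verbs", "attributive_adjectives"]

    def dimension_score(freqs, positive_features, negative_features):
        """Sum the normalised frequencies of the features at one end of a
        dimension and subtract those at the other end."""
        positive = sum(freqs.get(f, 0.0) for f in positive_features)
        negative = sum(freqs.get(f, 0.0) for f in negative_features)
        return positive - negative

    # A text with hypothetical normalised counts per 1,000 words.
    text_freqs = {"past_tense_verbs": 62.1, "third_person_pronouns": 31.4,
                  "public_verbs": 8.2, "present_tense_verbs": 40.5,
                  "attributive_adjectives": 55.0}

    narrative_score = dimension_score(text_freqs, NARRATIVE, NON_NARRATIVE)
    # Higher (positive) scores indicate more narrative texts.

A vector of such scores, one per dimension, is computed for every text; these vectors are then the input to the cluster analysis that yields the text types in Table 2.2.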
The text types identified do not map directly to genres. That is, the text types
identified by internal linguistic characteristics do not map directly to the external genre categories of the corpus used by B IBER (1988) (see Table 2.1 on
page 16 for a list of the genres used by B IBER (1988)). This is in accordance with
Biber’s notion of text typologies as fully determined by the linguistic qualities of
a text rather than by external function or location. Consistent with this, texts
from the Biography genre were spread across four of the text types identified by
Biber: scientific exposition (7%), learned exposition (29%), imaginative narrative (7%) and general narrative exposition (57%). Over half the biographical examples
from the corpus thus fell into the General narrative exposition text type. One of the goals
of the current research is to identify those features that characterise biographical
writing across text types.
The Multi-Dimensional Analysis Approach: Further Applications
The multi-dimensional analysis approach to linguistic variation has been highly
influential in the corpus linguistics research tradition (M C E NERY and W ILSON,
1996), principally in the areas of synchronic studies, diachronic studies, studies
of variation in languages other than English, and studies of variation applied
to language learning. These four areas will be briefly described using one representative study for each area.
Synchronic Studies
Synchronic studies investigate linguistic variation between genres at a specific
point in time. That is, synchronic studies of variation attempt to account for
variation between registers (or genres) rather than the changes in a register
over time. An example of this kind of study is C SOMAY (2002), who used multidimensional analysis to study a corpus of US higher education spoken lectures.9
The aim of the project was to explore how lectures differ with respect to
educational level and degree of interactivity. Twenty-three linguistic features
associated with academic lectures were identified (including a high frequency
of nouns, use of attributive adjectives and the passive voice), and placed in dimensions similar to those of B IBER (1988). After subjecting the feature scores
to cluster analysis, C SOMAY (2002) found that non-interactive lectures showed
a similar pattern to academic prose, whereas lectures (at a similar level) with a
highly interactive component were more akin to spoken discourse (for example, they had fewer passive constructions).
Diachronic Studies
Diachronic studies investigate variation within a single register (or perhaps
groups of registers) over time. The multi-dimensional variation methodology
has been particularly important in historical linguistics, of which an example is
ATKINSON (1992), who analysed the development of scientific writing between
1735 and 1985 in the Edinburgh Medical Journal. The journal was sampled at intervals of approximately 45 years (that is, 1735, 1774, 1820, 1864, 1905, 1945 and
1985), with ten articles taken from each period. Further, after an informal analysis of the articles, they were placed into one of five possible categories:
1. case reports (narratives of single cases)
2. disease reviews (summaries of knowledge with respect to a particular
disease or ailment)
3. treatment reviews (summaries of possible treatments for a given disease
or condition)
4. experimental reports (reports of individual scientific experiments or studies)
5. reproduced speeches (reproduced speeches of medical luminaries)
When subjected to multi-dimensional analysis, ATKINSON (1992) found that
the Journal had become progressively less narrative over its history across all
five identified article types, and also that the language became less overtly persuasive over the same period, perhaps reflecting a more modern, “dispassionate” and objective scientific writing style.
Languages Other Than English
The use of the multi-dimensional methodology to analyse the extent of variation in Korean (K IM and B IBER, 1994) has already been mentioned briefly on
page 13. The study shows that the dimensions identified by factor analysis
are not universal — K IM and B IBER (1994) labelled one of the identified dimensions, “honourific”, reflecting the stress in Korean society on deferential
9 173 lectures were selected from the TK2-SWAL Corpus, a spoken and written corpus of
US higher education discourse.
social relationships, and its relatively recent widespread and officially sanctioned adoption of a non-logographic writing system.10 Another example of
multi-dimensional work on a language other than English is B IBER and H ARED
(1994), who investigate the development of news registers in Somali, a language with only a very short written tradition.11 The aim of this work was
to assess the impact of literacy in Somali by focusing on differences between
written registers (news text) and spoken registers (for example, conversations).
This was achieved by sampling a corpus of Somali news texts at three periods (roughly, 1974, 1979 and 1987) and identifying the changes that occurred
between these samples. B IBER and H ARED (1994) found that as the written registers matured, variation decreased between registers as linguistic use became
standardised. Additionally, variation increased between written and spoken registers as the written registers became more firmly established.
Language Learning Applications
Two examples of multi-dimensional analysis applied to language learning issues have been briefly mentioned above (C SOMAY, 2002; B IBER ET AL ., 2002).
B IBER ET AL . (2002) seeks to explore the diverse range of language tasks faced
by international students at US universities, in order that English language instruction can be made more appropriate to a university setting. The multidimensional study showed that, unlike general English, university English
is very strongly polarised between written and spoken genres, with university written registers being uniformally informationally dense prose and an
impersonal style, whereas spoken genres (even lectures) are characterised by
fewer features of impersonal style and features of involvement and interaction.
Multi-Dimensional Analysis: Criticism and Alternatives
This section briefly surveys some criticism of the multi-dimensional analysis
approach to the study of linguistic variation, before going on to discuss an alternative approach that avoids the computational and statistical “overhead” of multi-dimensional analysis.
Multi-Dimensional Analysis of Linguistic Variation: Some Criticisms
L EE (1999), as part of his PhD work, replicated Biber’s original 1988 work on
a larger scale, with a greater number of linguistic features (84 compared to
Biber’s 67), new feature identification programs, and a much larger corpus of
contemporary English — a four million word subset of the British National
10 Hangul, the Korean alphabet, has been the official writing system since 1945. Before 1945, a
logographic system based on Chinese characters was used.
11 Somali did not have a widespread written form until 1972, when a new orthographic system
was imposed “top-down” by the Somali government.
Corpus. As part of the replication process, L EE (1999) analysed the multidimensional approach and “found it to be lacking in some respects. In particular, it does not seem to support the kinds of conclusions and claims that
have been made for it” (L EE, 1999, p. 396). These criticisms fall into two main
groups:
1. Sampling Texts: The representativeness of Biber’s original corpus is not
well understood, and hence any extrapolations of findings to the English
language generally are statistically invalid. This is a principled objection
to the multi-dimensional approach (and it would seem, to most of corpus linguistics) and is not remedied by the increase in size of the corpus
— even a very large corpus (as used by L EE (1999)) is not known to be
representative of written English as a whole.
2. Sampling Linguistic Features: The particular choice of linguistic features
(and the exclusion of others) dictates the dimensions that are formed in
factor analysis (see Appendix H on page 274 for a brief review of factor analysis). L EE (1999) suggests that feature selection should be an iterative
process, rather than a matter of selecting features that intuitively seem
important, as intuitive feature selection is likely to lead to the creation of
factors “that are to some degree artefacts of the features chosen” (L EE,
1999). In other words, factors in multi-dimensional analysis will simply
reflect the theoretical commitments made when selecting which features
are relevant to genre; according to L EE (1999), an unacceptably circular
methodology. In contrast, Lee suggests that the proper use of multidimensional analysis involves using the maximum number of features
and identifying those which are most important in an iterative fashion.
The “Keyword” Approach
T RIBBLE (1998) presents an alternative to the multi-dimensional method for
the analysis of genre developed by B IBER (1988) based on the keyword function
available in the software tool WordSmith.12 T RIBBLE (1998) suggests that the
use of genre-specific keywords (as identified by WordSmith) serves as a low-effort alternative to multi-dimensional analysis. That is, instead of writing linguistically complicated feature extraction programs and then subjecting the
frequencies of the identified features to factor analysis (a relatively complex
statistical process), we can simply use empirically identified genre-specific keywords from the WordSmith program. The keywords approach to genre
analysis using WordSmith has three stages:
1. Given a corpus divided into genres of interest (henceforth a test corpus),
a wordlist for each genre is created (a list of word types for each genre).
2. The wordlist from each genre in the test corpus is compared against a
12 WordSmith consists of a concordancer, and keyword extractor (as well as other tools). It
is used widely in corpus linguistics research and lexicography. More details can be found at:
http://www.lexically.net/ Accessed 05-08-06.
reference corpus, using a feature selection method (see page 50). This reference corpus ought to be large and there should be an attempt to make
it representative. The frequency of each type in the test corpus wordlist
(remember there is a test corpus wordlist for each genre represented) is
compared to the frequency of that word in the reference corpus, and each
type is ranked according to its “keyness”13 (a sketch of stages 1 and 2 is given in code after this list).
3. In order to improve the “keyness” of the genre keywords, key-keywords
are used. These key-keywords are words that are keywords in more than
one text within a genre. That is, words that are key in only one text from
a particular genre are not key-keywords for that genre. The central idea
here is that by eliminating those words that are only key in one text, we
are left with key-keywords that better reflect the “essence” of a given
genre, rather than the specific topicality of individual texts.
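The first two stages can be sketched in a few lines of Python. The fragment below uses the log-likelihood statistic (one of the two keyness measures mentioned in footnote 13) to score the “keyness” of each word type in one genre against a reference corpus; it illustrates the general method only, not the WordSmith implementation, and the corpora passed to it are assumed to be plain lists of tokens.

    import math
    from collections import Counter

    def log_likelihood(a, b, c, d):
        """Log-likelihood (G2) keyness for a word occurring a times in a test
        corpus of c tokens and b times in a reference corpus of d tokens."""
        e1 = c * (a + b) / (c + d)
        e2 = d * (a + b) / (c + d)
        ll = 0.0
        if a > 0:
            ll += a * math.log(a / e1)
        if b > 0:
            ll += b * math.log(b / e2)
        return 2 * ll

    def genre_keywords(genre_tokens, reference_tokens, top_n=20):
        """Rank the word types of a genre corpus by keyness against a
        reference corpus (stages 1 and 2 of the keyword approach)."""
        genre_counts = Counter(genre_tokens)        # stage 1: genre wordlist
        ref_counts = Counter(reference_tokens)
        c, d = len(genre_tokens), len(reference_tokens)
        scored = [(word, log_likelihood(freq, ref_counts.get(word, 0), c, d))
                  for word, freq in genre_counts.items()]  # stage 2: keyness
        return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]

Stage 3 (the key-keywords) could then be approximated by running the same ranking over each individual text in a genre and retaining only those words that are key in more than one text.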
X IAO and M C E NERY (2005) tests the keyword approach identified by T RIBBLE
(1998) against a Biber-style multi-dimensional methodology for the analysis
of three genres in American English: conversation, presentational speech (for
example, sermons, lectures, and so on) and academic prose. While X IAO and
M C E NERY (2005) concludes that the keyword approach is not a substitute for a
full multidimensional analysis, its power, at least on the data used in the study,
is acknowledged, as the keyword method yields results broadly comparable
with Biber-style multidimensional analysis. Additionally, X IAO and M C E NERY
(2005) emphasises the relative difficulty of the multidimensional approach,
describing it as “very time consuming and computationally/statistically demanding” compared to the keyword approach, which uses “off-the-shelf” software and does not require programming or complex statistics.
The current work is not committed to using multi-dimensional analysis as a
tool for exploring linguistic variation. Instead, B IBER (1988)’s preliminary data
– the frequency tables of linguistic features across genres – are used to identify those features especially characteristic of a particular genre (in this case,
biographical writing). The usefulness of this data for research purposes is stressed by
L EE (1999).
2.2 Stylistics and Stylometry
This section reviews the linguistic sub-discipline of stylistics, before contrasting
stylistics with the allied discipline of stylometrics and its applications in authorship attribution studies.
13 “Keyness” is calculated using either the chi-square or log-likelihood feature selection methods
(S COTT , 2005) described on page 50 of this document.
2.2.1 Stylistics
Stylistics is the formal study of literary style. Style itself is, however, not a
straightforward concept, as it is normally used to describe the non-propositional
content of a text. That is, the aesthetic “residue” when propositional content
has been removed. The particular linguistic choices — ways of expressing —
associated with a given genre or writer are described as that genre or writer’s
style. For instance, the obituary genre has a particular style characterised by
the heavy use of euphemism. For example, “she is survived by three adult
sons and a husband” and “she did not suffer fools gladly”, rather than “she
is outlived by her husband and three adult sons” and “she was irritable and
disagreeable”.
L EECH and S HORT (1981) provides a seminal study of stylistics in a literary
context. The authors are concerned with studying how certain linguistic choices
create specific artistic effects. A very simple example here would be the very
different aesthetic effects created by sentence length between, say, Henry James
and Ernest Hemingway (sophistication versus power). The difficulty of distinguishing between different styles is repeatedly emphasised by L EECH and
S HORT (1981). For them, style is a relational term; we can only define a style in
relation to another style.
Style is a relational term: we talk about “the style of x”, referring
through “style” to characteristics of language use, and correlating
these with extralinguistic x, which we may call the stylistic domain.
The x (writer, period, and so on) defines some corpus of writings in
which the characteristics of language use are to be found. But the
more extensive and varied the corpus of writings, the more difficult
it is to identify a common set of linguistics habits.
L EECH and S HORT (1981, p. 11)
The study of style is also relational with respect to the specific research question. For instance, if we are interested in a particular genre, we will try to focus on those features that are particular characteristics of the genre, and those
features that differ between different authors working within the same genre
will be discounted.
L EECH and S HORT (1981) developed a methodology for comprehensively analysing
texts (manually rather than automatically) for style. The stylistic features to be examined are divided into four categories (lexical, grammatical, figures of speech, and cohesion/content), described in more detail below:
1. Lexical Categories
General: Is the vocabulary simple or complex? Is it descriptive or
evaluative? Are any rare or specialised words used?
Nouns: Are the nouns abstract or concrete? How are proper names
used?
Adjectives: How frequent are the adjectives and what are the attributes to which they refer? For example, evaluative or colour adjectives, and so on.
Verbs: Are they transitive or intransitive? Do they refer to physical
acts, speech acts?
Adverbs: How frequent are adverbs? What function do they play?
For example, time, degree, place, and so on.
2. Grammatical Categories
Sentence Types: Are questions, commands and exclamations used
in addition to propositional sentences?
Sentence Complexity: How complex are the sentences? What is the
average sentence length? Do the sentences vary greatly in length
and complexity?
Clause Types: What kind of dependent clauses are used: relative,
adverbial or nominal?
Clause Structure: Is there anything unusual about clause structure?
Noun Phrases: Are noun phrases simple or complex? Are there sequences of adjectives? Is apposition used?
Verb Phrases: Is the simple past tense used? If not, what kind of
tenses are present?
Other Phrase Types: Are prepositional phrases, adjective phrases
or adverb phrases used?
Minor Word Classes: How are function words used? For example,
prepositions, auxiliaries, determiners, and so on.
General: Any other unusual constructions (for example, superlatives, comparatives, and so on).
3. Figures of Speech
Grammatical and Lexical Schemes: Is the technique of structural
repetition used? For example, anaphora, parallelism.
Phonological Schemes: Are rhyme or alliteration used?
Tropes: Is there any obvious linguistic “deviation”? For example,
neologisms and so on.
4. Content and Cohesion
Cohesion: How does the text manage links between sentences? Are
the links implicit, logical, or via coordinating conjunctions? How is
repetition used?
Context: How is speech represented, directly or indirectly? Is first
or third person narrative used? Are there differences in style in the
reported speech of different characters?
2.2.2 Stylistic Analysis
Stylistics as envisaged by L EECH and S HORT (1981) and S HORT (1996) is designed to analyse small sections of literary text, in an attempt to bring systematic techniques traditionally associated with linguistics to the study
of English literature. This enterprise is known as stylistic analysis. The technique is empirical in that it respects the primacy of the text, and seeks to explore and
describe the literary devices used in a text. The identification of important features, however, relies on the skills and intuitions of an experienced reader. For
example, if we compare two short text examples presented in Figure 2.5 on the
next page (from Martin Amis’s 1985 novel Money and Daniel Dennett’s philosophical monograph Consciousness Explained, respectively) using only some of
the analytic methods identified by L EECH and S HORT (1981) we can mark out
clear differences between the texts. This identification of relevant features becomes more difficult however, if instead of using examples from radically different genres (contemporary prose fiction and philosophy in this instance) we
choose texts from the same genre.
The most striking difference between the two examples is their use of vocabulary: we can see that Example 2 uses several specialised, technical words
(heterophenomenology, blindsight), and standard words on the edge of acceptability (“ultracautious”) whereas Example 1 relies on a non-technical vocabulary. Neither text contains concrete nouns, suggesting that both of them are
designed to convey information about abstract qualities, rather than physical descriptions. Example 1 exhibits a more informal use of adverbs (“awful
slowly”) and a more informal tone generally. Example 1 contains a rhetorical
question, designed to create a conversational tone, whereas Example 2 concentrates on the clear exposition characteristic of academic writing. Both examples
exhibit similar levels of sentence complexity.
2.2.3 Stylometrics: Authorship Attribution
Stylometry is concerned with the statistical analysis of style.14 Although the
stylistics techniques exemplified by L EECH and S HORT (1981) and S HORT (1996)
are empirical in that they draw conclusions from the close analysis of data using
a “checklist” type approach, the interpretation of the selected features relies on
the skilled judgements of the reader. Stylometrics, can be thought of as a scientific sub-discipline of stylistics that has developed its own methodologies and
techniques (WALES, 1989). Stylometrics is also distinguishable from stylistics
14 Literally, “the measurement of style”.
Figure 2.5: Two Examples of Stylistically Different Texts from (A MIS, 1985) and
(D ENNETT, 1992), Respectively.
EXAMPLE 1: Opera certainly takes its time, doesn’t it? Opera really lasts, or
at least Otello [sic] does. I gathered that a second half would follow this one,
and this one was travelling awful slowly through its span. The other striking
thing about Otello [sic] is — it’s not in English. I kept expecting them to pull
themselves together and start singing properly, but no: Spanish or Italian or
Greek was evidently the deal.
EXAMPLE 2: Here is a place where the ultracautious policies of heterophenomenology pays dividends. Both blindsight subjects and hysterically
blind people are apparently sincere in their avowals that they are unaware
of anything occurring in their blind field. So their heterophenomenological
worlds are alike – at least in respect to their presumptive blind field.
not only because of its methodology, but also because of its limited application. Traditionally, stylometrics has concerned itself primarily with the issue of
authorship attribution.
Stylometry is the science which describes and measures the personal elements in literary and extempore utterances, so that it can
be said that one particular person is responsible for the composition rather than any other person who might have been speaking
or writing at that time on the same subject for similar reasons. Stylometry deals not with the meaning of what is said or written but
how it is being said or written.
(M ORTON, 1978, p. 7)
A further distinction is often made between stylometry, computational stylometry, and computational stylistics (M C E NERY and O AKES, 2000). Stylometry, although greatly facilitated by the use of computers, is not dependent on them.
Indeed, early, painstaking work on stylometry (reviewed below) began well before the advent of digital computers. Computational stylometry normally refers
to automatic stylistic analysis using electronic texts, while computational stylistics is
slightly broader in scope, chiefly distinguished by its use of more complex features than traditional stylometry (W HITELAW and A RGAMON, 2004)
and its concern with providing a “bridge” between authorship-attribution stylometry and literary stylistics (C RAIG, 1999).
Development of Stylometry
The stylometric method for author identification was first outlined in 1851
in a letter by Augustus de Morgan (reported by K ENNY (1982)). De Morgan
suggested that a dispute over the authorship of the Pauline Epistles could be
settled by an analysis of the word lengths of the various Epistles. In 1887,
Mendenhall (described in K ENNY (1982) and also O AKES (1998)) compared
the frequency distributions of word lengths for Shakespeare and several other
writers (including Bacon, J. S. Mill, and Marlowe), and discovered that the distributions for each writer had distinctive shapes.
In the middle of the Twentieth Century, computers were inaccessible to most
language researchers. Y ULE (1944) — who was the author of a standard textbook on general statistics according to L OVE (2002) — did however make use
of the advances in statistics since the end of the Nineteenth Century. Instead of
(impressionistically) comparing the appearance of frequency distributions (like
Mendenhall), Y ULE (1944) used statistical tests based on lexical features (for example, total vocabulary size, use of unique nouns) in order to investigate the
disputed authorship of the medieval religious text De Imitatione Christi, which
had been variously attributed to both Thomas a Kempis and Jean Charlier de
Gerson. Yule’s analysis strongly favoured Thomas a Kempis as author.
M ORTON (1965), in the tradition established by de Morgan, analysed the most
common word in the Pauline Epistles (“kai” – the Greek for “and”) as a proportion of the total words in each Epistle, finding that the Epistles fell into two
distinct groups, indicating that St Paul authored only some of the texts traditionally attributed to him. However, when this method was applied to a corpus
of Morton’s own essays, E LLISON (1967) (reported in O AKES (1998)) claims that
the result suggests multiple authors, indicating that reliance on single connectives as features is not adequate to reliably distinguish between authors.
More recently, there has been an attempt to create an experimental framework
for authorship attribution studies; a major part of this effort is the creation of
a software environment for (repeatable and perspicuous) author identification
experimentation (JGAAP — Java Graphical Authorship Attribution Program)
(J UOLA ET AL ., 2006).
Standard Stylometric Features
M C E NERY and O AKES (2000) describe six easily computable features commonly used in stylometric studies (see Figure 2.6 on the following page). This
work draws a distinction between features that are “contingent”, readily under the control of the author and reflective of the genre of the text (sentence
length, word length, vocabulary richness, and relative frequencies of parts-of-speech), and those authorial decisions that are made at the unconscious level and that may indicate differences in “deep” authorial style (ratios of the relative frequency of synonyms and character-based frequencies).
Figure 2.6: Examples of “Deep” and “Contingent” Features Described in
M C E NERY and O AKES (2000).
C ONTINGENT F EATURES
1. Sentence Length has the advantage of being easy to compute, but has
been shown to be an unreliable indicator of authorship.
2. Word Length, like sentence length, is very easy to compute, but it too seems to indicate genre rather than a specific author.
3. Vocabulary Richness is a measure of the variety of vocabulary used by the author. The
most common and simplest measure to compute is the type/token ratio: that is,
the number of types divided by the number of tokens in a given text.
4. Part-of-Speech Relative frequencies of nouns, verbs, and so on. This
choice of feature requires that the text be either hand tagged for part-of-speech categories (a laborious task) or tagged using an automatic
part-of-speech tagger.
D EEP F EATURES
1. Word Ratio including the relative frequency of synonyms (like
while/whilst and since/because).
2. Letter Based that is, the frequency of different characters. This technique
is more common in non-English alphabets.
Authorship attribution studies often draw a distinction between the “deep”
and “contingent” styles of an author. An author’s deep style remains constant
over time (in adulthood) and between different genres. It cannot be easily disguised. The “contingent” style of an author is that which changes over the life
course, and between different genres (for example, sentence length may vary
with a particular author working in different genres, as may vocabulary richness). The features characteristic of “deep” style are less amenable to conscious
control (M C E NERY and O AKES, 2000; L OVE, 2002). Figure 2.6 details common
features used in authorship attribution studies and their status as “deep” or
“contingent” based on M C E NERY and O AKES (2000).
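The “contingent” features in Figure 2.6 are straightforward to compute. The short Python fragment below is a minimal sketch only; the whitespace-and-full-stop tokenisation is deliberately naive, and a real study would substitute a proper tokeniser and, for the part-of-speech feature, an automatic tagger.

    def contingent_features(text):
        """Compute three simple 'contingent' stylometric features:
        mean sentence length, mean word length and type/token ratio."""
        sentences = [s for s in text.split(".") if s.strip()]  # naive split
        tokens = text.lower().split()
        n_tokens = max(len(tokens), 1)
        return {
            "mean_sentence_length": len(tokens) / max(len(sentences), 1),
            "mean_word_length": sum(len(t) for t in tokens) / n_tokens,
            "type_token_ratio": len(set(tokens)) / n_tokens,
        }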
Authorship Attribution: The Federalist Papers
This section briefly reviews some important work on the Federalist Papers, a series of seventy-seven articles published in four New York newspapers in 1787
and 1788, which provide a challenging and frequently used test corpus for assessing different authorship attribution techniques. The papers were written
to persuade the population of New York to ratify a new constitution for the
United States, which initially they failed to do, and the papers were republished as a book in 1788 (with eight extra essays). The papers were initially
published under the pseudonym “Publius”, and it is generally agreed that three
men were responsible for all eighty-five papers. These were Alexander Hamilton, John Jay and James Madison (later President of the United States).
Of the eighty-five papers, Madison and Hamilton both claimed authorship of
twelve. The papers are especially useful for authorship attribution studies because all three writers attempted to write in a single consistent style
(as the persona Publius), and the genre and subject matter are remarkably homogeneous (18th Century American political rhetoric), providing a solid test of
those features and statistical methods that discriminate texts solely on the basis
of authorship.
There is a central problem in comparing the efficacy of different feature sets
(see Section 8 on page 144 for the difference between features and feature sets)
for attributing authorship to those papers whose author is known. Different
studies use not only different feature sets but also different classification algorithms, meaning that direct comparisons between feature sets are difficult to
make. The following three approaches give a flavour of the kinds of techniques
used (a simple sketch of the underlying feature-counting step follows the list):
1. M OSTELLER and WALLACE (1984), in a book-length study, used “marker
words” as features, after discovering that sentence length failed to distinguish adequately between the authors (owing to the attempts of the
authors to adopt a uniform style). These “marker words” were synonym
pairs (for example, “upon/on” and “while/whilst”) based on an analysis of Hamilton’s and Madison’s wider output outside the Federalist Papers. The approach used a multi-stage Bayesian methodology to classify
the papers which, despite promising results, did not become popular in
the authorship attribution community owing to the statistical sophistication required to implement it successfully (H OLMES and F ORSYTH, 1995;
M C E NERY and O AKES, 2000). M OSTELLER and WALLACE (1984) does,
however, remain a landmark study in authorship attribution.
2. M C C OLLY and W EIER (1983) used the features identified by M OSTELLER
and WALLACE (1984) and compared their performance to sixty-four context-independent function words using a likelihood ratio approach. It
was found that classification accuracy was much lower using function
words rather than the synonym pairs identified by M OSTELLER and WALLACE (1984), indicating that the relative frequencies of function words are
more a result of genre constraints than of specifically authorial style.
3. H OLMES and F ORSYTH (1995) use (among other methods) a genetic algorithm15 to compare the “marker words” identified by M OSTELLER and
WALLACE (1984) and forty-nine frequent function words. As with M C C OLLY
15 An algorithm that is based on the analogy of Darwinian evolution to develop new rules by a
process of mutation and fitness testing.
and W EIER (1983), it was found that the function words were less successful as discriminating features than the “marker words” identified by
M OSTELLER and WALLACE (1984). This result is consistent with the claim
that the distribution of function words is dictated by genre rather than
authorial style, and suggests that function words may be useful for genre
classification (see the experimental work described in Chapter 9).
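The common thread in the three studies is the comparison of the relative frequencies of candidate discriminators across texts of known and disputed authorship. The Python sketch below computes rates per 1,000 words for a small set of marker words of the kind used by M OSTELLER and WALLACE (1984), together with a crude profile distance; the word list is illustrative only, and the actual classification step (Bayesian, likelihood ratio, or genetic algorithm) is omitted.

    from collections import Counter

    MARKER_WORDS = ["upon", "on", "while", "whilst"]  # illustrative subset

    def marker_rates(tokens, markers=MARKER_WORDS):
        """Rate per 1,000 words of each marker word in a tokenised text."""
        counts = Counter(t.lower() for t in tokens)
        n = max(len(tokens), 1)
        return {m: 1000.0 * counts[m] / n for m in markers}

    def profile_distance(profile_a, profile_b):
        """Crude distance between two marker-word profiles."""
        return sum(abs(profile_a[m] - profile_b[m]) for m in profile_a)

    # Profiles built from papers of known authorship can then be compared
    # with the profile of a disputed paper using profile_distance.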
2.3 Biographical Writing
This section briefly surveys some of the defining characteristics of biographical writing, before outlining the history of biographical writing (focusing on
biographical writing in English).
2.3.1 Characteristics of Biographical Writing
Biography as a genre is the history of the lives of individuals with its own literary
form (S HELSTON, 1977; G ITTINGS, 1978; M AUROIS, 1929). A biography needs to
be historical; a factually grounded narration describing the unfolding of events
over time. A biography needs to be focused on an individual; other persons are
considered only insofar as they relate to the individual of interest. It also needs
to have a certain form, presenting certain kinds of information (birth and death
dates, location of birth, education, and so on) in a given order, and retaining a
chronological sequence.16
All three conditions have to be met for a text to be described as biographical. For instance, a person’s dental records are about an individual (they satisfy
the individuality criterion), and they also focus on that individual over time
(they satisfy the historical criterion). Dental records, however accurate, do not
satisfy the form criterion, in that they focus entirely on biographically less relevant information. To use another example, while the play Hamlet is directly
concerned with recounting the life of an individual (the Prince of Denmark) and
has a biographical form (describing events in the life of Hamlet), it fails to satisfy the history criterion (the events did not take place and the person does not
and did not exist).
Biographical texts use a version of what journalists term the “inverted pyramid” for presenting information: “an old-fashioned device, the origins of which
are unclear, but the rules of which stand the test of time.” (PAPE and F EATH ERSTONE , 2005). In standard newspaper writing practice, the first paragraph
of an article is a short summary of the story; the essential facts. The next paragraphs will expand on these essential facts, providing background information
16 Often book length literary biographies deviate from this form at a superficial level – say, in the
first few paragraphs of the book – but when the overall structure of the text is considered, the form
chosen is almost always chronological.
and perhaps analysis. The final paragraph will bring the article to a graceful
end. This technique is useful for the following two reasons:
1. It allows the reader to read as little or as much as they want and still gain
a coherent account.
2. If the article has to be edited, it can be done speedily simply by removing
paragraphs from the end of the article, without compromising its meaning.
In the case of biographical texts the pyramid is more prescribed; certain facts
are obliged to appear at the widest part of the pyramid (see 3.2 on page 57).
These facts include dates of birth and death, place of birth, profession, notable
achievements and significant events. Additionally, information about family
(for example, marital status, details of children and parents) may be included.
The first paragraph of the biography is an attempt at summarising the life, on
which subsequent paragraphs elaborate. A rule of thumb is that the longer
the biography, the more background information is presented; this is obvious
if we think of a published biography of a politician. The skeleton facts can be
presented in a single paragraph; the rest of the book is a deeper analysis of
these facts, and of the historical background necessary to enhance understanding. It is notable that very short biographies (one paragraph) consist entirely
of these central biographical facts with no elaboration or padding. Sometimes
this type of biography rejects the norms of published narrative English and
adopts instead an information-rich, restricted “telegraph language”, normally
following a highly prescribed biographical scheme. For instance, the quotation below
forms the beginning of a biography of George Washington, reproduced from
the website of the US Congress:
WASHINGTON, George, (granduncle of George Corbin Washington), a Delegate from Virginia and first President of the United
States; born at Wakefield, near Popes Creek, Westmoreland County,
Va., February 22, 1732; raised in Westmoreland County, Fairfax County
and King George County; attended local schools and engaged in
land surveying; appointed adjutant general of a military district in
Virginia with the rank of major in 1752; in November 1753 was sent
by Lieutenant Governor Dinwiddie, of Virginia, to conduct business with the French Army in the Ohio Valley; in 1754 was promoted to the rank of lieutenant colonel and served in the French
and Indian war, becoming aide-de-camp to General Braddock in
1755; appointed as commander in chief of Virginia forces in 1755;
resigned his commission in December 1758 and returned to the
management of his estate at Mount Vernon in 1759; served as a justice of the peace 1760-1774, and as a member of the Virginia house
of burgesses 1758-1774.
http://bioguide.congress.gov
The inverted biographical pyramid is best illustrated in longer, multi-paragraph
biographies written in continuous prose rather than the “choppy”, truncated
style of the previous example. The initial paragraph is used to give basic information, and the rest of the text expands on these facts. This style can be
seen below in the first paragraph of a piece on Charles Dickens:
1812-70, English author, born in Portsmouth, one of the world’s
most popular, prolific, and skilled novelists. The son of a naval
clerk, Dickens spent his early childhood in London and in Chatham.
When he was 12 his father was imprisoned for debt, and Charles
was compelled to work in a blacking warehouse. He never forgot this double humiliation. At 17 he was a court stenographer,
and later he was an expert parliamentary reporter for the Morning
Chronicle. His sketches, mostly of London life (signed Boz), began appearing in periodicals in 1833, and the collection Sketches
by Boz (1836) was a success. Soon Dickens was commissioned to
write burlesque sporting sketches; the result was The Posthumous
Papers of the Pickwick Club (1836-37), which promptly made Dickens and his characters, especially Sam Weller and Mr. Pickwick,
famous. In 1836 he married Catherine Hogarth, who was to bear
him 10 children; the marriage, however, was never happy. Dickens had a tender regard for Catherine’s sister Mary Hogarth, who
died young, and a lifelong friendship with another sister, Georgina
Hogarth.
http://www.bartleby.com
It is important to note, however, that in book-length or more literary
biographical texts, the highly constrained “inverted pyramid” is often less likely to
apply, just as the journalistic “inverted pyramid” is less likely to apply in longer newspaper feature articles.
2.3.2 Development of Biographical Writing
The earliest known biographies (or proto-biographies) are the classical histories of Herodotus and Thucydides (H ERODOTUS, c 440 BC; T HUCYDIDES, c
411 BC). These texts were primarily historical, but did contain person-focused
interludes. This tradition was continued with P LUTARCH (c 100). Similarly,
many sacred texts contain sections which are, at least in intent,
biographical according to the three characteristics of biography given above.
That is, a biography should be historical, focused on an individual and be of a
distinct literary form.
The earliest English biographies were written in Latin and had religious subjects (for example, Adamnan’s Life of St Columba (A DAMNAN, c 690), or Bede’s
Life of St Cuthbert (B EDE, c 700)). These lives of the saints — hagiographies —
stressed the enthusiastic description of miraculous religious happenings above
factual accuracy. By the sixteenth century, however, a more familiar biographical form had emerged, written in English and concerned with presenting verified fact (a good example of this development is William Roper’s biography of Thomas More (R OPER, 1550)).
Figure 2.7: Inverted Pyramid for Biographies.

INTRODUCTION PARAGRAPH
    Essential: birth and death dates, location of birth, profession, notable achievements, significant events.
    Optional: marital status, details of children, details of parents.

EXPANSION PARAGRAPH/S
    The next paragraphs will expand on the initial facts, usually with a chronological, narrative structure.

BACKGROUND PARAGRAPH/S
    This group of paragraphs provides relevant background information.

CONCLUSION
    The piece is brought to a graceful conclusion. Sometimes an appendix (publications, etc.).
The first attempt at a National Biographical
Dictionary, the Athenae Oxonienses,17 was completed in 1691 (W OOD, 1691).
The Eighteenth Century saw the publication of Dr Johnson’s Lives of the Poets
(J OHNSON, 1781), a work that follows the contemporary form of more literary
biographies: basic information is presented, along with criticism and analysis of the biographical subjects’ achievements. Boswell’s Life of Johnson, first published in 1791 (B OSWELL, 1791), exemplifies the literary biographical form in its accuracy, attention to detail and emphasis on character-revealing incidents.
17 The Athenae Oxonienses was subtitled “An Exact History Of All The Writers and Bishops Who
have had their Education in The most ancient and famous University Of Oxford, From The Fifteenth Year of King Henry the Seventh, Dom. 1500, to the End of the Year 1690. Representing
The Birth, Fortune, Preferment, and Death of all those Authors and Prelates, the great Accidents of
their Lives, and the Fate and Character of their Writings.”
Perhaps the most common biographical subtype is the obituary. Obituaries are
distinct from book-length treatments as, although they obey the basic form of
biographies, they are more anecdotal and selectively focused. They are also
generally (though not necessarily) biased in favour of the subject. Obituaries
have been part of British and American newspapers since the late nineteenth
century. They concentrate on achievement, and in the British tradition, do not
dwell on (or often even mention) the cause of death (F ERGUSSON, 2000). An
important variation in British obituary writing concerns the anonymity of the writer:
in The Guardian obituaries are signed, while in The Times and The Telegraph they are
anonymous.
Currently, there are a number of forums for biographical writing: the traditional book-length treatment, obituaries, encyclopedias and dictionaries of biography. Many of these resources are reproduced online, and there are also
numerous websites that generate biographical sketches of various lengths and
in various domains. For example, the US government publishes biographical profiles of current and former congressmen online18 and various websites
electronically publish short biographies of scientists.19
18 For example, http://bioguide.congress.gov/biosearch/biosearch.asp Accessed 05-08-06
19 For example, http://scienceworld.wolfram.com/ Accessed 05-08-06
2.4 Classification
As this thesis is concerned with biographical sentence classification, it is important to introduce some key issues in classification theory and to show how these
issues apply to text classification.
According to the classical theory of classification (presented by Aristotle, whom
we have previously mentioned in relation to literary genre on page 7), there
is a set of necessary and sufficient conditions for category membership. Also,
any member of a category is an equally good member of that category. The
classical (Aristotelian) theory was further entrenched when systematic classification with respect to biology was developed as part of a scientific biology in
the early eighteenth century with the creation of Linnaeus’ familiar taxonomy
of genus and species (L INNAEUS, 1735). Linnaeus attempted to introduce a
more formal system in place of (or parallel with) the more imprecise categories
of folk taxonomy that could not support new developments in biology.
The twentieth century saw a weakening of the classical theory of classification with the application of ideas from philosophy and psychology to the theory of classification (L AKOFF, 1987); particularly the later Wittgenstein’s theory that the relationships between category members are better characterised
in terms of “family resemblance” rather than common properties determining class membership. W ITTGENSTEIN (1953) used the example of “games”
to illustrate the point; there are many ways an activity can be a game; it can
be competitive or non-competitive, indoor or outdoor, athletic or sedentary.
Tiddly-winks, lacrosse, tennis, and bowls do not share any single common
property that makes them belong to the category “games”, yet we have no difficulty in identifying them as games. Rosch’s influential experimental work
on human classification also undermined the classical theory; in a series of experiments R OSCH (1973) found that within categories, some members are universally judged to be better examples of a class than others. When applied
to colours, this meant that there were “prototype” colours (that is, a “best” or
“prototype” red that is a better example of red than the other shades
of red), showing that not all members of a category are equally good representatives of that category, and undermining the classical view of category
membership. It seems that the number of features shared between category
members is pivotal here; the more features shared with other members of the
category, the “better” the example is judged to be (R OSCH, 1973), providing
empirical support for Wittgenstein’s philosophical intuition. For example, although football and solitaire both belong to the category “games”, football is a
better example of a “game” than solitaire, as football has more features characteristic of games than solitaire (competitive, number of participants greater
than one, a clear winner, and so on.). Rosch’s work formed the foundation of
the prototype theory of categorisation (TAYLOR, 2003).
Additional support for this abandonment of classical classification theory, with
its emphasis on necessary and sufficient conditions for class membership, is
provided by L ABOV (1975), who in a series of experiments utilising household receptacles demonstrated that there is no hard and fast boundary between
neighbouring categories. L ABOV (1975) presented his participants with various
containers; when the container was as deep as it was wide, and also had a handle, then participants called it a “cup”. When width was increased however,
a higher proportion of participants called it a “bowl” and when depth was increased, more participants called it a “vase”. The important point to note here
is that there is no strict rule that distinguishes “bowl” from “cup” or “cup”
from “vase”.
Prototype theory is a general theory of classification that can easily be applied
to the special case of interest here; the classification of texts by genre. For instance, if we are seeking to classify newspaper articles into two classes, editorial
and reportage, then although many articles are likely to be prototypical editorial articles (for example, leaders) and many are likely to be prototypically
reportage articles (for example, factual reports on the activities of politicians),
many articles will also be “outlying” editorials (that is, less good examples of editorials like, for instance, advice columns) or “outlying” reportage (that is, less
good examples of reportage, like business reports with extensive background
material).
The classification of texts at the document level (particularly physical books)
has traditionally been a central concern for librarians, and a considerable literature and expertise has developed, particularly over the last one hundred and
fifty years. Substantial effort was expended on the creation, update and elaboration of comprehensive library cataloguing systems in the late Nineteenth
Century (for example, Dewey decimal) (B ATTLE, 2004). Traditionally, these
classification schemes were thought of as analogous to Linnaeus’ taxonomy of
natural kinds (B ROADFIELD, 1946), but library subject categories are not obviously natural kinds; we cannot classify a book as belonging to a given category
in the same way that we could confidently classify a fruit with respect to an obvious quality, like its shape (L AKOFF, 1987). The classification task performed
by librarians lacks a developed theoretical basis and can properly be described
as a professional craft skill, not based in an integrated theory of classification,
but rather classification with respect to a particular academic discipline’s cultural requirements (M AI, 2004).
The focus of this thesis is however, automatic text classification; an area which
has its own problems distinct from human approaches to text classification. Automatic text classification is the subject of the next section.
2.4.1 Automatic Text Classification
The term automatic text classification (also categorisation20)
has traditionally been used to describe a group of related tasks (S EBASTIANI,
2002):
1. The automatic assignment of texts to predefined categories. This is the
dominant sense, and the one assumed here.
2. The automatic identification of a set of categories. This is a usage from the
early days of text categorisation research (B ORKO and B ERNICK, 1963).
3. The identification of a set of categories, and the subsequent assignment of
documents to those identified categories. This is more normally referred
to as text clustering.
4. The general name for document categorisation tasks — that is, what
we are referring to as text classification and text clustering both belong
to the more general class of text categorisation. This is the terminology
used in Manning and Schütze's textbook on statistical natural language
processing (M ANNING and S CH ÜTZE, 1999).
An important distinction exists between categorisation based entirely on the
contents of documents, and categorisation where extra-document metadata is
available to aid with the classification task (for example, where the document
or text has been indexed).
20 These terms are used interchangeably.
Automatic categorisation based entirely on the contents of the document itself is referred to as endogenous categorisation, whereas
categorisation based on a document augmented by metadata is referred to as
exogenous categorisation. Most research work on automatic text categorisation
focuses on endogenous categorisation exclusively. Additionally, classification
tasks can also be distinguished with respect to their accommodation of overlapping categories; that is, overlapping categories allow instances — items to
be classified — to belong to more than one category, and non-overlapping categories restrict instances to one category. Items can also be assigned to categories probabilistically (that is, each item is said to belong to each category
with a certain degree of probability). Binary classification (of which biographical sentence categorisation is an example) is a special case of non-overlapping
categorisation where the number of categories is limited to two. Automatic
text classification is used successfully in a number of different application areas. Examples (of which there are many) include spam filtering, automatically
indexing academic papers, author attribution studies, and language identification.
2.5 Machine Learning
Since the early 1990s, machine learning techniques for automatic text categorisation have won popularity over knowledge engineering approaches to the task,
partly because of the availability of copious training data necessary for machine learning, partly because of the cost (and brittleness) of human expertise
in framing task specific rules, and partly due to a shift in theoretical emphasis towards empirical techniques (all these factors interact in precipitating the
shift towards machine learning). General theoretical texts on machine learning
include M ITCHELL (1997), and more specifically focused on the classification
of texts M ANNING and S CH ÜTZE (1999, chap 16). More practically orientated
machine learning focused texts include W ITTEN and F RANK (2005).
This section is designed to provide enough information and background on
machine learning for the reader to make sense of those research chapters (7,
8, 9 and 10) that rely on machine learning techniques. First, the six learning
algorithms used in the research are outlined. Second, some comments on the
evaluation of learning are presented. Finally, the notion of feature selection (that
is, the selection of those features most useful for classification) is introduced,
along with a description of the $\chi^2$ algorithm, a commonly used method for
feature selection.
2.5.1 Learning Algorithms
Automatic classification relies on the use of learning algorithms. Learning algorithms use examples of correctly classified items in order to classify previously
unseen items.21 Most learning algorithms make no claim at psychological plausibility.
Several learning algorithms were used at different points in this work, ranging
from the very straightforward Zero Rule algorithm (which provides a baseline
against which the performance of the other algorithms can be tested) to the
more sophisticated C4.5 decision tree algorithm. This section describes the six
different algorithms used.
Zero Rule Learning Algorithm
The Zero Rule algorithm provides a baseline for the assessment of other, more
sophisticated learning algorithms.22 Zero Rule simply predicts that each example to be classified belongs to the most prevalent class in the training data.
For example, if 51% of the training sentences are biographical, and 49% non-biographical, then 100% of the test data will be classified as biographical, as
biography forms the majority class in the training data.
One Rule Learning Algorithm
The One Rule algorithm is based on the intuition that very simple rules are
capable of classifying with great accuracy.23 The idea is that the feature with
the highest accuracy on the training data is selected as the single feature on
which the test data is tested. For example, if the training data shows that the
pronoun feature has the greatest classification accuracy, then that is the only
feature used to classify the test data. H OLTE (1993) compared the performance
of the One Rule Algorithm to a variant of Quinlan’s decision tree learning algorithm (Q UINLAN, 1988) on sixteen standard machine learning data sets, and
discovered that classification accuracy suffered only slightly with the One Rule
algorithm.
21 This “learning by example” is often referred to as supervised learning and is compared to unsupervised learning, which first identifies categories, and then sorts unseen instances into these identified categories. Unsupervised learning is also referred to as clustering.
22 The Weka implementation of the Zero Rule learning algorithm (ZeroR) was used.
23 The Weka implementation of the One Rule learning algorithm (OneR) was used. The One Rule
algorithm can be conceptualised as a one level decision tree.
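For illustration, the following minimal Python sketch (not the Weka ZeroR and OneR implementations actually used in this work) shows the behaviour of both baseline algorithms over binary features; the feature names and training sentences are purely hypothetical.

from collections import Counter

def zero_rule(train_labels):
    """Zero Rule: always predict the most frequent class in the training data."""
    majority_class, _ = Counter(train_labels).most_common(1)[0]
    return lambda instance: majority_class

def one_rule(train_instances, train_labels, features):
    """One Rule: pick the single feature whose one-level rule is most accurate
    on the training data, and classify unseen instances with that rule alone."""
    best_feature, best_rule, best_correct = None, None, -1
    for f in features:
        # For each value of the feature, predict the majority class among
        # training instances that have that value.
        rule = {}
        for value in set(x[f] for x in train_instances):
            labels = [y for x, y in zip(train_instances, train_labels) if x[f] == value]
            rule[value] = Counter(labels).most_common(1)[0][0]
        correct = sum(rule[x[f]] == y for x, y in zip(train_instances, train_labels))
        if correct > best_correct:
            best_feature, best_rule, best_correct = f, rule, correct
    return lambda instance: best_rule.get(instance[best_feature])

# Illustrative toy data: each sentence is a dict of binary feature values.
train = [{"family_name": "yes", "pronoun": "yes"}, {"family_name": "yes", "pronoun": "no"},
         {"family_name": "no", "pronoun": "no"}]
labels = ["biographical", "biographical", "non-biographical"]
classify = one_rule(train, labels, ["family_name", "pronoun"])
print(classify({"family_name": "no", "pronoun": "yes"}))  # non-biographical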
C4.5 Decision Tree Learning Algorithm
The C4.5 decision tree learning algorithm (Q UINLAN, 1993) is an extension of
the ID3 algorithm (Q UINLAN, 1988).24 This section describes some relevant features of decision trees, before providing a brief overview of decision
tree induction algorithms with particular reference to the C4.5 algorithm. This
section draws heavily from M ITCHELL (1997).
Decision Tree Representation
In this context, decision trees are rule sets constructed from training data, and
used to provide a perspicuous decision procedure for the classification of new
instances. For instance, the decision tree depicted by Figure 2.8 on the next
page and the set of rules reproduced in Figure 2.9 on page 41 are equivalent.
The rules can be understood as paths through the tree, from root (family name)
to one of eight leaves.
Consider Example 2.1. The root node of the decision tree depicted in Figure 2.8 on the following page is (family name). If we assume that “Smith”
has been correctly identified as a family name then the sentence will be classified as biographical.
(2.1) The prize winning painter was Bill Smith.
Consider Example 2.2. Beginning at the root node (family name), the first rule
is answered no. The sentence does not contain he and therefore the second rule
answers no, leading to the third rule — she — which again is answered no,
likewise with the fourth rule her. The fifth rule is answered yes, and reaches
the node By. The sentence is classified as non-biographical. These rules are
presented in Figure 2.9 on page 41.
(2.2) The prize winning painter was born in 1972.
Decision Tree Induction
Decision trees are learned by a recursive procedure which branches on the
“best” available feature. The procedure stops when either all training instances
are correctly classified or all features have been considered down a particular
branch. Figure 2.10 on page 42 shows an idealised binary decision tree to illustrate this. The root of the tree shows the feature which best predicts target
class membership. As the features used are binary (yes or no) the instances are
split into two groups reflecting whether Feature K is yes or no. The test is
repeated for each of the descendent nodes of Feature K. For example, of the
1400 instances for which Feature K has the value yes, Feature N provides
the most powerful classification feature.
24 The Weka implementation of the C4.5 learning algorithm (J48) was used.

Figure 2.8: Genre Features Decision Tree.
[Tree diagram: the root node tests Family_Name; "yes" leads to a Bio leaf, while "no" leads to a test on He, and so on through She, Her, Born, Forename and By, each test either reaching a Bio or Non-Bio leaf or passing on to the next test. The equivalent rules are listed in Figure 2.9.]

At each decision point the feature which predicts the target class most effectively must be selected. This can be done naively — for example the One Rule
algorithm discussed above simply selects that feature which agrees with the
target class most often — or by using more sophisticated methods. Information
gain is regarded as providing a better gauge of the classification potential of a
feature than relying on accuracy alone (M ITCHELL, 1997).

Figure 2.9: Decision Tree Rules Example.
IF family name = yes THEN Biographical
IF family name = no AND he = yes THEN Biographical
IF family name = no AND he = no AND she = yes THEN Biographical
IF family name = no AND he = no AND she = no AND her = yes THEN Biographical
IF family name = no AND he = no AND she = no AND her = no AND born = yes THEN Biographical
IF family name = no AND he = no AND she = no AND her = no AND born = no AND forename = no THEN Non-biographical
IF family name = no AND he = no AND she = no AND her = no AND born = no AND forename = yes AND by = yes THEN Biographical
IF family name = no AND he = no AND she = no AND her = no AND born = no AND forename = yes AND by = no THEN Non-biographical
The information gain metric is dependent on the more basic concept of entropy,
a quantification of uncertainty among different possible outcomes (S HANNON,
1948).
In the case of binary classification, the entropy of a set of instances $S$ with respect
to a target classification can be calculated using Equation 2.3, where $|S|$ is the number of instances in the collection $S$, $S_{yes}$ are the
instances belonging to $S$ classified as "yes" with respect to a target classification, and $S_{no}$ are the instances belonging to $S$ classified as "no" with respect to
that target classification. Entropy is 1 if equal numbers of instances belong to
each class (maximum uncertainty) and 0 if all the instances belong to the same
class.

(2.3)  $\mathrm{Entropy}(S) = -\dfrac{|S_{yes}|}{|S|}\log_2\dfrac{|S_{yes}|}{|S|} - \dfrac{|S_{no}|}{|S|}\log_2\dfrac{|S_{no}|}{|S|}$
The information gain metric of a given feature is the forecasted reduction in uncertainty when this feature is used to classify the data. The higher the information gain, the more discriminating the feature with respect to the classification
task. The information gain for a given binary-valued feature $f$ with respect
to a set of instances $S$ is given by Equation 2.4, where $S$ is the collection of all instances, $S_{f=yes}$ are the instances belonging to $S$ for which $f$ has the
value "yes", and $S_{f=no}$ are the instances belonging to $S$ for which $f$ has the value
"no" with respect to the target classification task.

(2.4)  $\mathrm{Gain}(S, f) = \mathrm{Entropy}(S) - \dfrac{|S_{f=yes}|}{|S|}\mathrm{Entropy}(S_{f=yes}) - \dfrac{|S_{f=no}|}{|S|}\mathrm{Entropy}(S_{f=no})$

Figure 2.10: Example Decision Tree for 3000 Instances.
[Tree diagram: the root node Feature_K splits the 3000 instances into 1600 "no" instances, which are then tested on Feature_P, and 1400 "yes" instances, which are then tested on Feature_N; each of these nodes in turn splits its instances between the Class A and Class B leaves.]
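For illustration, Equations 2.3 and 2.4 can be computed directly as follows; this is a minimal Python sketch with hypothetical feature values and labels, not part of the experimental apparatus of the thesis.

import math

def entropy(labels):
    """Entropy of a binary-labelled collection (Equation 2.3)."""
    n = len(labels)
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * math.log2(p)
    return result

def information_gain(instances, labels, feature):
    """Expected reduction in entropy from splitting on a binary feature (Equation 2.4)."""
    gain = entropy(labels)
    for value in ("yes", "no"):
        subset = [y for x, y in zip(instances, labels) if x[feature] == value]
        if subset:
            gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

# Hypothetical instances: binary feature values per sentence.
instances = [{"born": "yes"}, {"born": "yes"}, {"born": "no"}, {"born": "no"}]
labels = ["bio", "bio", "bio", "non-bio"]
print(entropy(labels))                               # approximately 0.811
print(information_gain(instances, labels, "born"))   # approximately 0.311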
Overfitting, where a decision tree is built around the idiosyncrasies of the training data, and fails to perform optimally on exposure to new data, is an endemic problem with decision trees. The C4.5 algorithm (Q UINLAN, 1993) post
processes the decision tree (post-pruning) in an attempt to address this problem. Decision trees are first converted to rules (as stated above, each path from
root node to leaf constitutes a rule). In this representation, antecedents of these
rules are made up of conjunctions of terms.
(2.5) IF (Family name = yes) AND (he = no) AND (she = yes) THEN
Biographical
Consider Example 2.5. If removal of the first term (Family name = yes) or
both first and second terms ((Family name = yes) AND (he = no)) result in the
rule classifying more accurately, then these initial terms will be abandoned. In
other words, if the third term (she = yes) alone classifies with greater accuracy
than all three conditions, then the first two conditions are jettisoned. While
some algorithms use a validation data set — a collection of instances distinct
from the test and training data — for post pruning, C4.5 utilises statistical tests
to identify those rules that are not contributing to classification accuracy and
trims them.
Ripper Rule Learning Algorithm
Propositional rule learning algorithms produce IF ... THEN ... rules that account for the positive instances in a training set.25 The learning process conventionally has two stages: rule production and rule post-processing.
1. Rule generation is an iterative process. First, the maximally discriminant
feature with respect to the classification task is identified (often using the
information gain metric described above) and converted into an antecedent
in an IF... THEN... rule. For example, if feature family name is identified as that feature having most potential, then the initial rule would be
IF family name=yes THEN biographical = yes. Remaining features are
tested in conjunction with the established rules, and that feature which
predicts biography best is selected as an additional antecedent (see Figure 2.11 on the next page for an example). This process is repeated until
the desired level of performance is achieved. After each complete rule
has been learnt, each instance successfully covered by that rule is removed from the training data, and the process of single rule learning
begun again on the diminished training data. This process is repeated
until the desired number of rules are produced, or all the data is correctly
classified.
2. Like decision trees, rule based learning derived rules are subject to overfitting problems. Heuristics can be used to identify and prune those rules
(or parts of rules) that tend to reduce the accuracy of the rule set on unseen data.
The rule learning algorithm used in this work was Ripper (Repeated Incremental Pruning to Produce Error Reduction) (C OHEN, 1995). Ripper's major innovation is in the post-pruning of rules. For each rule, a series of competing
rules is constructed using heuristics. Each competing rule
is then evaluated with respect to its classification accuracy for the data set,
and the best rule is selected.
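The rule-growing step described in point 1 above can be illustrated with the following simplified Python sketch, which greedily adds the antecedent that most improves precision on the positive class. It illustrates the general propositional rule-learning idea only, not Ripper's actual growing and pruning heuristics, and the training data is hypothetical.

def grow_rule(instances, labels, target, features):
    """Greedily add (feature, value) antecedents while they improve precision
    on the positive class `target`. Returns the rule as a list of antecedents."""
    rule = []
    covered = list(range(len(instances)))

    def precision(idx):
        if not idx:
            return 0.0
        return sum(labels[i] == target for i in idx) / len(idx)

    while True:
        best = None
        best_prec = precision(covered)
        for f in features:
            for value in ("yes", "no"):
                idx = [i for i in covered if instances[i][f] == value]
                if idx and precision(idx) > best_prec:
                    best, best_prec = (f, value), precision(idx)
        if best is None:
            return rule
        rule.append(best)
        f, value = best
        covered = [i for i in covered if instances[i][f] == value]

# Hypothetical training data.
X = [{"family_name": "yes", "date": "yes"}, {"family_name": "yes", "date": "no"},
     {"family_name": "no", "date": "yes"}, {"family_name": "no", "date": "no"}]
y = ["bio", "non-bio", "non-bio", "non-bio"]
print(grow_rule(X, y, "bio", ["family_name", "date"]))
# [('family_name', 'yes'), ('date', 'yes')], i.e. IF family_name=yes AND date=yes THEN bio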
Naive Bayes Learning Algorithm
The Naive Bayes classifier is a popular and computationally inexpensive tool
for text classification (J OHN and L ANGLEY, 1995; M ITCHELL, 1997; W ITTEN
and F RANK, 2005).26 It differs from the methods we have already considered
as it does not produce human readable representations in the form of rules and
trees. The algorithm is Bayesian in that it relies on Bayes’ method for calculating
probability and naive in that it assumes that the feature values are conditionally
independent (and equally important) with respect to the classification task. Despite these simplifications the algorithm is used extensively because it is highly
effective, easy to implement and scales well to large datasets and large numbers of features.
25 The Weka implementation of the Ripper algorithm (JRip) was used.
26 The Weka implementation of the Naive Bayes learning algorithm (NaiveBayes) was used.

Figure 2.11: Rule Based Learning Example.
[Diagram: for the target class bio = yes, candidate single-antecedent rules (IF the = yes, IF the = no, IF family_name = yes, IF family_name = no) are generated and evaluated; the best rule (IF family_name = yes THEN bio = yes) is then extended with a second antecedent drawn from the remaining features (for example, date = yes, date = no, past_tense = yes or past_tense = no), and the process repeats.]

The Naive Bayes classifier is given in Equation 2.6 (based on
M ITCHELL (1997)), where $C$ is the set of all classes, $c_j$ is the target class, and
$a_1, a_2, \dots, a_n$ are the feature values that constitute an instance.

(2.6)  $c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i} P(a_i \mid c_j)$
Table 2.3: Example Training Sentence Representations.

Family Name   Forename   "he"   "she"   "born"   class
yes           no         yes    yes     yes      biographical
yes           no         yes    no      no       biographical
yes           yes        no     no      no       biographical
yes           no         no     no      yes      non-biographical
no            yes        yes    yes     yes      non-biographical
Table 2.4: Example Test Sentence Representation.

Family Name   Forename   "he"   "she"   "born"   class
yes           no         yes    no      yes      ???

(2.7)  $c_{NB} = \arg\max_{c_j \in C} P(c_j)\, P(\textit{family name}=\text{yes} \mid c_j)\, P(\textit{forename}=\text{no} \mid c_j)\, P(\textit{he}=\text{yes} \mid c_j)\, P(\textit{she}=\text{no} \mid c_j)\, P(\textit{born}=\text{yes} \mid c_j)$
The attractive simplicity of the technique is best illustrated with a simple example. Table 2.3 shows five sentences represented with five features (family name,
forename, he, she and born). Table 2.4 shows the representation of the test
sentence to be classified. The data in Table 2.4 is instantiated in Equation 2.6 to
yield Equation 2.7. The likeliest class given the data for the instance presented
can be derived from the training data (presented in Table 2.3) where there is
a 0.6 (3/5) probability of a sentence being biographical and a 0.4 (2/5) probability of a sentence being non-biographical. It is straightforward to calculate
conditional probabilities from the training data. For example, P(familyname =
yes | biographical) = 3/3 (1), or to give another example, P(familyname = yes |
non-biographical) = 1/2 (0.5). The likelihood score can be calculated using
Equation 2.7, yielding:
Biographical Class: (3/3) × (1/3) × (2/3) × (1/3) × (1/3) = 0.0237
Non-biographical Class: (1/2) × (1/2) × (1/2) × (1/2) × (2/2) = 0.0625
As the non-biographical class yields a higher result, the example instance given
in Table 2.4 is classified as non-biographical. Note that the figures derived are
not genuine probabilities, although probabilities can easily be calculated (see
M ITCHELL (1997)).
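The calculation above can be expressed compactly in code. The following Python sketch computes class-conditional likelihood scores for a test instance in the same manner (class priors omitted, as in the worked example); the training data is hypothetical rather than the exact contents of Table 2.3.

def naive_bayes_scores(train_instances, train_labels, test_instance):
    """Return, for each class, the product of the class-conditional probabilities
    of the test instance's feature values (priors omitted, as in the worked example)."""
    scores = {}
    for cls in set(train_labels):
        rows = [x for x, y in zip(train_instances, train_labels) if y == cls]
        score = 1.0
        for feature, value in test_instance.items():
            matching = sum(row[feature] == value for row in rows)
            score *= matching / len(rows)
        scores[cls] = score
    return scores

# Hypothetical training sentences (binary feature values) and labels.
train = [{"family_name": "yes", "born": "yes"}, {"family_name": "yes", "born": "no"},
         {"family_name": "no", "born": "yes"}, {"family_name": "no", "born": "no"}]
labels = ["bio", "bio", "non-bio", "non-bio"]
test = {"family_name": "yes", "born": "yes"}
scores = naive_bayes_scores(train, labels, test)
print(max(scores, key=scores.get))  # predicts 'bio' (score 0.5 against 0.0)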
A disadvantage of using the Naive Bayes algorithm is that, as it assumes that
the feature values are independent of each other, it does not perform well on
those datasets that demonstrate strong interdependence between features. The
algorithm seems highly competitive with decision tree based algorithms when
applied to textual data, however (L EWIS, 1992b).
Support Vector Machine Learning Algorithm
Support vector machines (SVM’s)27 are a comparatively recent innovation in
the classification literature (C ORTES and VAPNIK, 1995), and there is some evidence to show that SVM’s perform well in text categorisation tasks (J OACHIMS,
1998).28 As SVM’s are an extension of traditional linear models, this section
will first describe some relevant features of linear models for text classification,
before briefly describing the particular characteristics of SVM’s. As SVM’s
are considerably more complex from a mathematical point of view than the
other algorithms considered, their treatment here is less comprehensive, and
the reader is referred to B URGES (1998) for an extensive tutorial introduction to
the use of SVM’s for classification generally.
Linear Models
This section relies heavily on (W ITTEN and F RANK, 2005). Linear regression is
a technique that can be used to classify numeric data, where the aim is "to
express the class as a linear combination of the attributes with predetermined
weights." (W ITTEN and F RANK, 2005) (see Equation 2.8, where $w_0, w_1, \dots, w_k$ are the weights
and $a_1, a_2, \dots, a_k$ are the feature values).

(2.8)  $x = w_0 + w_1 a_1 + w_2 a_2 + \dots + w_k a_k$
) According to W ITTEN and F RANK (2005), “One way of looking at multi-response
linear regression is to imagine that it approximates a numeric membership
function for each class. The membership function is 1 for instances that belong to that class and 0 for other instances. Given a new instance we calculate
its membership for each class and select the biggest.”
Support Vector Machines
One obvious problem in using straightforward linear regression is that it assumes that the instance space can be characterised linearly. This is by no means
clear for text categorisation tasks (J OACHIMS, 1998). Support Vector Machines
avoid this problem by mapping the original instance space into a new instance
space that can be characterised linearly.
27 See W ITTEN and F RANK (2005) for a simplified account of SVM's.
28 The Weka implementation of the Support Vector Machine (SMO) learning algorithm was used.
SVM’s determine a hyperplane29 that is optimally discriminatory with respect
to the two classes. In other words, a hyperplane “gives the greatest separation within the classes” (W ITTEN and F RANK, 2005). Support vectors are those
instances closest to the hyperplane (that is, those instances that define the hyperplane) of which there must be one for each class. This optimal hyperplane
is referred to in the literature as the maximum margin hyperplane. We can think
of SVM’s as an optimisation of straightforward linear methods.
2.5.2 Evaluating Learning
This section briefly surveys some issues in the evaluation of classifiers.30 First,
methods for assessing the accuracy of different classifiers are presented. Second, statistical methods for comparing the performance of different classifiers
are described.
Assessing Accuracy
Accuracy is the percentage of instances correctly classified. The two standard
methods for assessing the accuracy of classifiers are the use of training/test evaluation and cross-validation. Both these techniques require a separation of training and test data, as evaluating a classifier on data that has been used to train
it is unlikely to reflect the accuracy of the classifier on unseen data.
Training/Test Evaluation
In training/test evaluation, data is divided into two groups; test data and training data. Each group is stratified to reflect the class distribution in the entire
data set. Normally, between two thirds and three quarters of data is used for
training, and the remaining portion used for testing the trained classifier.
Cross Validation
Cross validation is an extension of the simpler training/test method. The data
is divided into $k$ equally sized sections — each stratified to reflect the class
distribution in the wider data set — with each of the $k$ sections serving as test
data for the remaining $k - 1$ sections in turn. The final accuracy score is the
mean accuracy for all $k$ runs. For example, if stratified 4-fold cross validation
is adopted, the data is divided into quarters, with each quarter containing a
class distribution that reflects the dataset as a whole (that is, stratified). Each
quarter is “held out” in turn while the classifier is trained on the remaining
29 A hyperplane divides $n$-dimensional space. In the case of one, two and three dimensional space,
hyperplanes are normally referred to using more familiar terms (point, line and plane respectively).
However, when the number of dimensions is greater than three, the more general term hyperplane
is used.
30 Here classifier refers to a particular combination of learning algorithm, feature set and data set.
three quarters of the dataset. Accuracy is then the mean accuracy of each of the
four runs. In this way, all data is used to maximum effect as both training and
test data. In order to gain more reliable results, multiple cross-validations are
performed. For instance 10 x 4-fold cross validation requires that 4-fold cross
validation be performed 10 times with the average of the 10 runs being a better
estimate of classification accuracy. Note that for each of the 10 runs (stratified)
data is randomly allocated to test/training sets.
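As an illustration, 10 x 4-fold stratified cross validation of a classifier can be expressed with scikit-learn as follows; the thesis experiments themselves were run with Weka and bespoke scripts, and the data below is randomly generated for the example.

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary feature matrix (rows = sentences) and class labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 20))
y = np.array(["bio"] * 50 + ["non-bio"] * 50)

# 10 x 4-fold stratified cross validation: each of the 10 repetitions shuffles
# the data and splits it into 4 stratified folds; the score is the mean accuracy.
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=10, random_state=1)
scores = cross_val_score(BernoulliNB(), X, y, cv=cv)
print(scores.mean(), scores.std())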
Statistical Significance
While accuracy scores are useful for comparing the performance of classifiers,
sometimes we require a more reliable indicator that one classifier performs better than another. In this situation we must turn to statistical tests. The sections
below outline the context in which statistical tests are used (hypothesis testing)
before reviewing some special issues in evaluating classifiers with statistical
tests.
Hypothesis Testing
Hypothesis testing requires the identification of two hypotheses; the null hypothesis and the experimental hypothesis.31 In the context of this research, the
null hypothesis could be, for example, “Classifiers A and B are equally accurate.” An example experimental hypothesis could be “There is a difference in
the classification accuracy between Classifier A and Classifier B.” To put this
another way, the null hypothesis claims that any observed difference between
Classifiers A and B is merely due to chance, and the experimental hypothesis
claims that the results of the two classifiers are drawn from different populations.
Rather than attempting to prove the experimental hypothesis, the researcher’s
aim is (generally) to disprove the null hypothesis. The usual convention allows
the rejection of the null hypothesis if the probability of achieving the observed
data, given the null hypothesis, is less than 0.05.32 A Type 1 error is said to occur
if the null hypothesis is rejected, when in fact it is true, and a Type 2 error is said
to occur when the null hypothesis is accepted when it is in fact false. Another
way of putting this is that Type 1 error occurs when a test indicates that there
is a difference when there is no such difference, and Type 2 error indicates no
difference when there is, in fact, a difference.
A one-tailed experimental hypothesis specifies the direction of the difference;
that classifier A is better or worse than classifier B. A two-tailed hypothesis
claims that there is a difference between the two classifiers, but does not specify
the direction of that difference.
31 See O AKES (1998). The textbook covers basic (and more advanced) statistics applied to corpus
linguistics.
32 This cut off point is referred to as a “P-value”.
Evaluating Classifiers with Statistical Tests
D IETTERICH (1998) reviews the performance of five different statistical tests
for comparing classifiers. Each test was used on three sets of data using two
different machine learning algorithms. The five tests used were:
1. The McNemar test.
2. A test for the differences of two proportions.
3. The re-sampled paired t-test.
4. The k-fold cross-validated paired t-test.
5. The 5x2cv paired t-test.
D IETTERICH (1998) concluded that the re-sampled t-test should never be used,
as the likelihood of obtaining Type 1 errors is unacceptably high. Similarly,
the k-fold cross-validated paired t-test should not be used as it too produces
too many Type 1 errors. D IETTERICH (1998) recommends the use of either the
5x2cv paired t-test or the less computationally expensive McNemar statistic.
N ADEAU and B ENGIO (2003) compare two further statistical tests against those
identified by D IETTERICH (1998), and found that the corrected re-sampled t-test
was associated with a much lower likelihood of generating Type 1 errors. Taking D IETTERICH (1998) and N ADEAU and B ENGIO (2003) as a starting point,
B OUCKAERT and F RANK (2004) suggest that repeatability is an important criterion in selecting a statistic, as well as "appropriate Type 1 error and low
Type 2 error". According to the empirical study presented in B OUCKAERT and
F RANK (2004), the corrected re-sampled t-test does not produce high Type 2 errors (in contrast to the McNemar test) and, as reported by N ADEAU and B ENGIO (2003), is associated with a much lower probability of Type 1 error than
the uncorrected re-sampled t-test. Additionally, B OUCKAERT and F RANK (2004)
show that the corrected re-sampled t-test shows very high reliability, especially when used in conjunction with 10 x 10 fold cross validation. On the basis
of this research, the corrected re-sampled t-test, in conjunction with 10 times 10
fold cross validations, was adopted as the significance test in evaluating classifiers in this work.33 The reader is referred to Appendix G on page 273 for the
used.
machine learning toolkit includes an implementation of the corrected re-sampled
-test.TheThisWEKA
implementation has its limitations, however, and a Perl implementation of the test was
33
used for the bulk of this research (see Appendix G).
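For illustration, the corrected re-sampled t-test can be computed as in the following Python sketch, which follows the commonly cited form of the N ADEAU and B ENGIO (2003) correction; the exact formula and implementation used in this work are given in Appendix G, and the accuracy differences below are hypothetical.

import math
from statistics import mean, variance
from scipy import stats

def corrected_resampled_ttest(diffs, n_train, n_test):
    """Corrected re-sampled t-test (Nadeau and Bengio, 2003) on a list of
    per-run accuracy differences between two classifiers."""
    k = len(diffs)                        # number of train/test runs (e.g. 100 for 10 x 10 CV)
    d_bar = mean(diffs)
    var_d = variance(diffs)               # sample variance of the differences
    # The naive 1/k variance term is corrected by the ratio of test to training set size.
    t = d_bar / math.sqrt((1.0 / k + n_test / n_train) * var_d)
    p = 2 * stats.t.sf(abs(t), df=k - 1)  # two-tailed p-value
    return t, p

# Hypothetical differences in accuracy between classifiers A and B over 10 runs.
diffs = [0.02, 0.01, 0.03, 0.00, 0.02, 0.01, 0.02, 0.03, 0.01, 0.02]
print(corrected_resampled_ttest(diffs, n_train=750, n_test=250))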
2.5.3 Feature Selection
Feature selection — the process of minimising the number of features necessary
for a classification task without reducing accuracy34 — is conducted for two
main reasons, the first pragmatic and the second more principled:
1. Minimising processing times. Although some learning algorithms scale
well to large numbers of features (for example, Naive Bayes, SVM’s), others — like decision trees for instance — are not so flexible.
2. Removing “noise” from data. In this case, “noise” refers to those features
that lack discriminatory power with respect to the classification task. For
example, the unigram “the” is removed in much topic orientated text
classification as it is regarded as unlikely to contribute to classification
accuracy.
YANG and P EDERSEN (1997) showed that aggressive feature selection can increase classification accuracy for certain kinds of texts (newswire articles). They
discarded 98% of the unigram features from the Reuters corpus, and retained
only the 2% identified as optimal by the chi-squared ($\chi^2$) method (discussed
below). The result was an increase in classification accuracy attributed to noise
reduction. W ITTEN and F RANK (2005) provides an overview of feature selection in general, while F ORMAN (2002) surveys feature selection for text classification in particular. G UYON and E LISSEEFF (2003) reviews various different
feature selection algorithms, particularly information-theoretic approaches to
the selection problem.
The $\chi^2$ feature selection algorithm was used extensively in the current work,
as it has proven success in text classification applications (O AKES ET AL ., 2001;
K ILGARRIFF and R OSE, 1998), it is not computationally intensive and it is straightforward to understand. The $\chi^2$ algorithm — in this context — is designed to identify those features that are most characteristic of a class with respect to a second class. So, in the biographical example, the algorithm identifies those
features most characteristic of the biographical class through contrast with the
non-biographical class.
The following description of the algorithm is based largely on O AKES ET AL .
(2001) and O AKES (1998). The technique requires the construction of a contingency table for each feature. If we have two sets of sentences, one biographical (the BC) and one non-biographical (NBC), the union of these two classes
is the combined sentences (CC). A contingency table of observed frequencies is
constructed for each feature $f$. The constituents of the table are listed in Figure 2.12 on the next page, and a representation of a contingency table shown in
Table 2.5 on the following page.
34 Feature selection methods can also be used to identify keywords representative of a corpus
of interest, if we have a corpus consisting of texts from the genre of interest, and a more general
(normally larger) reference corpus.
Table 2.5: Contingency Table.

                    Biographical Class   Non-Biographical Class
Feature $f$                  a                      b
Not feature $f$              c                      d

Figure 2.12: Constituents of a Contingency Table for feature $f$.
a = Frequency of instances of feature type $f$ in BC
b = Frequency of instances of feature type $f$ in NBC
c = The sum of frequencies of instances of all feature types in BC apart from instances of feature type $f$
d = The sum of frequencies of instances of all feature types in NBC apart from instances of feature type $f$
Once the observed frequencies have been calculated for each feature, the
expected frequencies can be ascertained. The expected frequency of a feature is
the frequency one would expect given the size of the corpus and the rarity of
the word. It is straightforward to calculate the estimated frequency for each
position in the contingency table. Consider Equation 2.9, where $i$ and $j$ refer to
columns and rows of the contingency table respectively, $C_i$ is the total of column $i$, $R_j$ is the total of row $j$, and $N$ is the total number of observations. For example, position
$a$ would be calculated using Equation 2.10.

(2.9)  $E_{ij} = \dfrac{C_i \times R_j}{N}$

(2.10)  $E_a = \dfrac{(a + c)(a + b)}{a + b + c + d}$
The observed and expected frequency tables enable us to calculate the $\chi^2$ statistic directly. For each of a, b, c and d, if $O$ is the observed frequency and $E$ is the
expected frequency, then $\chi^2$ is the sum of $\frac{(O - E)^2}{E}$ for each table element.
If $\chi^2 \geq 3.84$ (the critical value for one degree of freedom at the 0.05 level) for a feature, we can be 95% confident that the feature does occur
more frequently in one of the two categories. This information can be used to
rank features according to their discriminating power (that is, from most to least
discriminatory), and a cut off point chosen either by capping the number of
features (for example, the 500 most discriminating features) or by selecting an
appropriate significance level (for example, the features we are 95% confident
occur more frequently in one of the classes).
Note that those features where the expected frequency ($E$) of any table element is less than five should be discarded, as $\chi^2$ becomes unreliable at low frequencies. Other methods for selecting features, which handle low frequency
features, include the log-likelihood method (D UNNING, 1993), used widely in
corpus and computational linguistics, as it delivers good results with minimal data. F ISHER (1922) describes the Fisher test, a variant of the $\chi^2$ test that
also employs a contingency table and can be used with low frequencies. $\chi^2$ was used in this case, as we are primarily interested in those features that are
maximally discriminatory from a very large population of features, and those
features that occur very infrequently — with an expected frequency of less than
five — are unlikely to be helpful.35
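A minimal Python sketch of the $\chi^2$ calculation for a single feature is given below, using hypothetical frequency counts; in practice every candidate feature is scored in this way and the features are then ranked as described above.

def chi_squared(a, b, c, d):
    """Chi-squared statistic for one feature from the 2x2 contingency table of
    Table 2.5: a, b = frequency of the feature in BC and NBC; c, d = frequency
    of all other features in BC and NBC."""
    n = a + b + c + d
    observed = [a, b, c, d]
    # Expected frequency for each cell: (row total x column total) / grand total.
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: the feature occurs 120 times in the biographical corpus
# and 40 times in the non-biographical corpus, out of 10000 tokens in each.
score = chi_squared(a=120, b=40, c=9880, d=9960)
print(score, score >= 3.84)  # True means 95% confidence the feature favours one class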
2.6 Conclusion
This chapter has elaborated on some background themes necessary for understanding the thesis as a whole, particularly the notions of style, genre, biography, classification and machine learning. The related notions of genre and style
become important in later chapters as we investigate and assess possible representations for biographical genre classification. These investigations employ
machine learning algorithms as a vital methodological tool.
The next chapter reviews recent work in automatic text classification by genre
and in addition surveys some systems that produce biographies.
35 See F ORMAN (2002) for an extensive empirical study comparing the performance of different
feature selection methods for text classification.
C HAPTER 3
Review of Recent
Computational Work
The first part of this chapter reviews recent literature relevant to automatic
genre classification. The second part of the chapter reviews some working systems that produce biographies, and relates them to the current work.
3.1 Automatic Genre Classification
This section reviews recent work on the study of genre from the perspective of
computational linguistics, then goes on to examine recent important literature
in feature selection for topic orientated text classification. Finally, recent work
on feature selection for genre classification is surveyed.
3.1.1 Recent Work on Genre in the Computational Linguistics
Tradition
Recent work on genre in the computational linguistics tradition is geared towards practical problems, rather than foundational questions. For a description of two theoretical perspectives on genre see Section 2.1.1 on page 8 and
Section 2.1.2 on page 11, which describe Systemic Functional Grammar and the
Multi-Dimensional Approach respectively. Research effort in the computational
linguistics tradition is focused on finding the optimal method for distinguishing between genres computationally, rather than constructing a theoretical edifice (complete with rigorous definitions of genre and associated concepts). Recently, however, there have been attempts at placing the computational study
of genre within the theoretical framework of computational stylistics (for example, K ARLGREN (2004) and A RGAMON ET AL . (2003)). The notion of stylistics
(see Section 2.2 on page 22) gives us the basis from which we can ground a
better understanding of genre. W HITELAW and A RGAMON (2004) (see page 54)
is particularly interesting as it grounds the computational study of genre in the
Systemic Functional Grammar research tradition.
K ARLGREN (2004) defines a style as the result of an author’s “consistent and
distinguishable tendency” to favour certain lexical and syntactic patterns in
their writing, to structure material in a given way, and to write with a certain
kind of audience in mind. In its turn, a genre (described again by K ARLGREN
(2004) as “a vague but well established term”) is defined as a collection of documents that are stylistically consistent, accepted as belonging together by a sophisticated reader, familiar with the genre in question. For instance, leader articles
in traditional UK broadsheet newspapers fulfil both these requirements; they
are stylistically similar (use of persuasive language and argumentation) and
also form a coherent grouping to those familiar with the genre. The two conditions tend to mirror each other; the more stylistically similar a group of documents are, the better they hang together as a genre. For instance, if we extend
our newspaper leader example beyond traditional UK broadsheet newspapers,
to include leaders in UK newspapers, we can see that although tabloid leaders
(like broadsheet newspapers) contain argumentation and persuasive language,
their syntactic structure is different (more contractions, shorter sentences, and
so on). Hence, readers familiar with both text sources (tabloid and broadsheet)
are less likely to recognise them as belonging to the same genre.
W HITELAW and A RGAMON (2004) place genre recognition within the context
of Systemic Functional Grammar (SFG) as SFG provides a mechanism for representing the stylistic meaning of the text (see Section 2.1.1 on page 8). The stylistic
meaning of a text is that component of the text’s meaning that is non-topical
(that is, syntactic patterns, document structure, and so on). In other words, the
stylistic meaning is the residue meaning once the – in the terms of W HITELAW
and A RGAMON (2004) — denotational meaning is removed.
A distinctive feature of SFG is its emphasis on choice; language is characterised
as a text creator’s choices between different constructions at different points
in the writing process, and these choices are represented as system networks, a
representation that allows each word or phrase to be tagged with the choices
that led to its selection (see Figure 3.1 on page 56 for an example of a system
network presented by W HITELAW and A RGAMON (2004)). Note that as system
networks are primarily designed for language description by linguists, they
are not optimised for computational representation.1 W HITELAW and A RGAMON (2004) concentrate on features likely to be revealing of genre (for example, different types of conjunction, pronouns and so on) using a computationally manageable representation, based on system networks.
1 If we examine system networks from left to right, we can understand each choice as a disjunction that constrains subsequent choices until we arrive at the leaves of the tree; actual text elements. If we take Figure 3.1 on page 56 as an example, we begin at the root, "conjunction", and then choose one of three ways of expressing "conjunction", either "elaboration", "extension" or "enhancement". If we choose elaboration, the modes of expression under "extension" are not available to us. We have available to us the choice of "apposition" or "clarification". If we choose clarification, we have the choice of "rather", "in any case" or "specifically".
The strength of the
SFG approach for genre classification is that it recognises that documents function as documents, and are not merely strings of concatenating unigrams. The
cost of this “whole document” approach is increased complexity. W HITELAW
and A RGAMON (2004) gives the example of how a letter (that is, a document
belonging to the letter genre) can be identified using systemic features. If the token
“Dear” appears near the beginning of a document, and later in the document,
a well-wishing phrase occurs — here well-wishing is a higher level construct
including phrases like, “yours sincerely” and “yours faithfully” — then this
“Dear” + well-wishing feature forms a useful attribute for genre categorisation,
assuming that letters were one of the target genres.
Stylistic features identified using the SFG approach were evaluated in the context of detecting fraudulent email (so-called “Nigerian” emails, which solicit
money transfers to anonymous bank accounts). These Nigerian emails were
contrasted with two other data sources (newswire text and texts from the British
National Corpus sports and leisure domain) using the Pronoun Type systemic
feature. The “Nigerian” emails differ from the other two categories in (to use
SFG terminology) field (that is, topic. The “Nigerian” emails are about financial transactions) and tenor (that is, social relationships; the “Nigerian” emails
attempt to establish a friendly rapport with the reader). Classification accuracy
of 99.6% was achieved using a support vector machine learning algorithm. The
performance of “bag-of-words” style approaches was not tested.
Since the early 1990s the growth of the web has provided a new context for
the study of genre. The synergy of technological development and cultural
change have led to the rapid maturation of distinct genres that have no clear
analogue outside the World Wide Web. Examples here include the now traditional homepage, with its contact details, outlines of major professional or
recreational activities, personal details, or photographs. New web genres —
like wikis and blogs — have also emerged. In the light of the particular constraints and opportunities provided by the web as a medium, it is unlikely that
an optimal taxonomy of web genres will mirror that of established physical
media.
Attempts to provide a web genre taxonomy include M EYER ZU E ISSEN and
S TEIN (2004), which assesses the utility of using the concept of genre in the context of information retrieval. Among other potential uses, search engines could
utilise effective genre recognition in order to exclude documents belonging to
genres of peripheral interest to the user’s information need. For instance, a user
searching for information on a certain model of car may wish to exclude web
pages classified as advertising. M EYER ZU E ISSEN and S TEIN (2004) conducted
a user study — via questionnaire — which identified the ten most useful web
genres from an information retrieval perspective. These genres were: help, articles, discussion, shopping, portrayal (personal home pages, and so on), link
collections and download sites. Another application of genre identification in
information retrieval is suggested by B OESE (2005).
Figure 3.1: System Network from W HITELAW and A RGAMON (2004)
[System network diagram for CONJUNCTION. Elaboration branches into Apposition ("that is", "in other words", "for example") and Clarification ("rather", "in any case", "specifically"); Extension branches into Additive ("and", "or", "moreover"), Adversative ("but", "yet", "on the other hand") and Verifying ("besides", "instead", "or, alternatively"); Enhancement branches into Matter ("with regard to", "in one respect"), Spatio-Temporal (Simple: "then", "next", "afterward"; Complex: "soon", "meanwhile", "until now"), Manner ("similarly", "likewise") and Cause-Conditional (Causal: "therefore", "consequently", "since"; Conditional: "then", "albeit", "notwithstanding").]
As some web genres consist of largely static content (for instance, personal homepages which are —
normally — updated infrequently) in comparison to highly dynamic genres
(for example, newspaper home pages) spiders (search engine indexing agents)
can be more effectively utilised if directed disproportionately towards pages
with frequent content changes.
3.1.2 Feature Selection for Topic Based Text Classification
The text classification community has expended a huge amount of research effort on exploring the most effective features for representing documents.2 A
“raw” text document cannot be directly processed or interpreted by a machine
learning algorithm, but instead must be mapped to a succinct, vector based
representation, with each vector constructed of values for each feature in the
document.3 All documents involved in the classification process, including
those documents used for training, testing and in practical classification tasks
“in the wild”, must be converted into this computationally tractable representation (S EBASTIANI, 2002) (see Figure 3.2). The simplest and most commonly
used representation requires the use of each word in the document collection
as a feature, and then constructing a vector for each document that reflects the
presence or absence of that word. This is the so called “bag-of-words” representation. For example, if a document collection contains the term “elephant”
or “gazelle”, then each document will be represented by a document vector
that includes “elephant” and “gazelle” features (again, see Figure 3.2).
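The conversion is straightforward to express directly. The following Python sketch builds binary bag-of-words vectors for a handful of hypothetical documents; a frequency count or other value could be stored instead, as noted in footnote 3.

def bag_of_words(documents):
    """Map each document to a binary vector over the collection's vocabulary."""
    # Tokenize naively on whitespace and lower-case the tokens.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted(set(token for tokens in tokenized for token in tokens))
    vectors = [[1 if term in tokens else 0 for term in vocabulary] for tokens in tokenized]
    return vocabulary, vectors

# Hypothetical document collection.
docs = ["the elephant charged", "the gazelle ran", "a gazelle and an elephant"]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)  # one 0/1 vector per document, one position per vocabulary term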
Figure 3.2: Conversion of Documents to Document Vectors
[Diagram: input documents are tokenized and per-feature frequency counts are taken (for example, for the features "elephant" and "gazelle"), producing one document vector per document (doc 1, doc 2, doc 3, doc 4, ...), each vector recording the value of every feature for that document.]
An effective document representation will represent only those aspects of the
documents relevant to the task at hand. It will "distill" that aspect of the document with maximum potential for classification. Since the early 1990s however, more complex methods of representing documents have been assessed,
including the use of word based n-grams (where n is greater than 1)4 and
syntactic features (for instance, the use of noun phrases).5 An n-gram representation does not require the use of natural language processing tools –
part-of-speech taggers, syntactic parsers, and so on — demanded by syntactic features. Instead, a simple word tokenizer is all that is required. n-gram
features (where n > 1) can be thought of as pseudo-syntactic features, as they
are able to partially represent common syntactic features without the need for
computationally intensive actual syntactic analysis, like, for instance, sentence
parsing.
2 An extensive bibliography on automatic text classification generally is maintained by Evgeniy Gabrilovich: http://www.cs.technion.ac.il/ gabr Accessed on 02-01-07
3 This value could be, for example, a raw frequency count, a binary representation, or a nominal value.
F ÜRNKRANZ (1998) points out that the number of possible n-gram types in
a document collection increases exponentially with n. For example, the unigram type "the" is likely to occur with high frequency in any English language
text. If we choose to use a bigram representation, then while the number of tokens remains static, the number of types increases with the number of distinct
two word sequences beginning with "the". The number of tokens is almost
the same, while the number of types increases much faster; the number of bigrams is potentially the square of the number of distinct unigrams, and the number
of trigrams potentially the cube. Identifying highly discriminatory bigrams in this
situation is problematic. F ÜRNKRANZ (1998) suggests an algorithm for pruning
features (that is, reducing the number of features while retaining those with
high discriminatory power). A multi-pass strategy is used, with n-grams retained only if their constituent (n-1)-grams have met a predetermined frequency threshold. In other words, for an n-gram feature "the authors are"
to be included, the (n-1)-grams "the authors" and "authors are" must have
been observed with a certain threshold frequency in the previous pass. The
features were tested on two corpora (including Reuters) using the RIPPER algorithm (C OHEN, 1995). This work indicates that while bigram and trigram
features are more successful than unigram features, when n > 3, results begin
to deteriorate.6
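A simplified Python sketch of this multi-pass pruning strategy is given below: an n-gram is retained only if both of its constituent (n-1)-grams met the frequency threshold in the previous pass. The corpus and threshold are hypothetical, and this is an illustrative reading of the approach rather than a reimplementation of F ÜRNKRANZ (1998).

from collections import Counter

def pruned_ngrams(sentences, max_n=3, threshold=2):
    """Count n-grams up to max_n, keeping an n-gram only if both of its
    constituent (n-1)-grams reached the frequency threshold in the previous pass."""
    tokenized = [s.lower().split() for s in sentences]
    kept = {1: Counter(tuple([tok]) for sent in tokenized for tok in sent)}
    for n in range(2, max_n + 1):
        counts = Counter(tuple(sent[i:i + n])
                         for sent in tokenized
                         for i in range(len(sent) - n + 1))
        previous = {g for g, c in kept[n - 1].items() if c >= threshold}
        kept[n] = Counter({gram: c for gram, c in counts.items()
                           if gram[:-1] in previous and gram[1:] in previous})
    return kept

# Hypothetical corpus.
corpus = ["the authors are listed below", "the authors are acknowledged",
          "the results are listed below"]
ngrams = pruned_ngrams(corpus)
print(ngrams[2].most_common(3))
print(ngrams[3].most_common(3))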
In contrast to the frequency pruning method of feature reduction described
above, TAN ET AL . (2002) use a combination of an information-gain metric and
frequency counts to select appropriate n-grams. The feature selection algorithm first identifies those unigrams that have a frequency above a predetermined threshold in at least one of the categorisation groups, and then selects
only those bigrams with a flagged unigram as a constituent. The feature set
(unigrams and selected bigrams) was evaluated using the Naive Bayes algorithm on the Yahoo-Science corpus and Reuters corpus.
4 n-grams are sequences of n items from a larger sequence. 1-grams, 2-grams and 3-grams are normally referred to as unigrams, bigrams and trigrams, respectively. For n-grams where n > 3, the generic term "n-gram" is often used.
5 Semantic based representations — typically using hypernym or hyponym relations ascertained via WordNet — have also been used, but these representations are not considered here.
6 The area of feature selection has attracted much research attention; see this document on page 50. Additionally, F ORMAN (2003) provides a recent survey.
The bigram augmented representation improved classification performance when compared
to unigram features alone.
The use of syntactic phrasal representations, particularly noun phrases, would,
on the surface, seem to be beneficial to text classification, as they allow more
explicit representations of the conceptual content of the document, with less
ambiguity (for example, the noun phrase “river bank”, captures a different
concept to either “river” or “bank” individually). This research theme has been
pursued by a number of researchers (for example, L EWIS (1992b)).
In a comprehensive study analysing optimal representation in both an information retrieval and text categorisation context, L EWIS (1992b) found that the use
of syntactic phrases as features (here “syntactic phrases” mean noun phrases
identified through part-of-speech tagging and matching selected part-of-speech
categories) resulted in a deterioration in performance when compared to the
standard “bag of words” approach, using several different learning algorithms
and the MUC-3 and Reuters corpus (see also L EWIS (1992a)).
The benefits and drawbacks of using syntactic phrases are also assessed by
M OSCHITTI and B ASILI (2004), who found “overwhelming evidence” that syntactic features failed to improve topic orientated classification accuracy. Two
types of phrase were selected as features; proper nouns (identified using a capital letter sensitive grammar) and noun phrases (with noun phrases selected
from each classification category). A subset of the Reuters newswire corpus
was used for training and testing purposes, and a variant of the support vector
machine algorithm (B URGES, 1998) was used for cross-validation. M OSCHITTI
and B ASILI (2004) reports that phrasal representation is much less effective than
a “bag of words” approach.
S COTT and M ATWIN (1999), in a series of experiments again using the Reuters
newswire data (although a different subset to that used by M OSCHITTI and
B ASILI (2004)), reported that phrase based representations (in this case, noun
phrases) failed to improve classification accuracy and concluded that “it is
probably not worth pursuing simple phrase based representations further.”
3.1.3 Feature Selection for Genre Classification
While the area of feature selection for text classification is vast, feature selection
for genre categorisation remains a relatively under-researched area, despite a
trickle of papers since around the mid-1990s. S ANTINI (2004b) provides a comprehensive survey of the state of the art in automatic genre classification, in
addition to briefly reviewing the treatment of genre in the linguistics literature, using B IBER (1989)’ S work on text typology as a starting point, and emphasising the lack of agreement regarding basic terms (for example, differing
conceptualisations of “genre” in the literature).
An early and much cited treatment of the area is provided by K ARLGREN and
C UTTING (1994), which locates genre classification within the wider goal of
improving information retrieval (that is, post-processing information retrieval
results with respect to genre). Discriminant analysis was used in order to classify a subset of documents from the B ROWN Corpus according to their allotted
genre categories at three different levels of granularity (see Figure 3.3).

Figure 3.3: Genre Categories Used in K ARLGREN and C UTTING (1994).
[Hierarchy of Brown Corpus genre categories, comprising INFORMATIVE (PRESS: Reportage, Editorials, Reviews; Religion; Skills & Hobbies; Popular Lore; Belles Lettres), IMAGINATIVE (FICTION: General Fiction, Mystery, Science Fiction, Adventures, Romantic, Humour), MISC, and NON-FICTION (Government Documents, Scholarly Articles).]
It was noted that on evaluation, classification accuracy decreased with granularity (that is, the greater the number of genre categories, the lower the accuracy), leading the researchers to question the validity and usefulness (for these
purposes) of the classification scheme. For example, in the B ROWN corpus
classification scheme used in the research, it is not obvious if there is a marked
stylistic difference between the “mystery” and “adventure” genres.
A distinctive quality of K ARLGREN and C UTTING (1994)'s approach is the use
of syntactic features, based loosely on Biber's studies of text typology, but concentrating only on those features that can be reliably identified using a part-of-speech tagger (for example, Prepositions, First Person Pronouns). The intuition here is that syntactic features are ideal for capturing the topic independent
stylistic forms of a document – precisely what is important in genre classification.
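As a rough illustration of this intuition (and not K ARLGREN and C UTTING (1994)'s actual feature extractor), the sketch below computes two such part-of-speech based features, the rate of prepositions and of first person pronouns, using the NLTK part-of-speech tagger. The choice of NLTK and of the Penn Treebank tag set is an assumption made purely for the example.

```python
# A sketch of topic-independent, part-of-speech based style features,
# assuming NLTK and its default (Penn Treebank) tagger.
# Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk

def pos_style_features(text):
    tokens = nltk.word_tokenize(text)
    tags = nltk.pos_tag(tokens)
    n = len(tokens) or 1
    # "IN" covers prepositions and subordinating conjunctions in the Penn tag set
    prepositions = sum(1 for _, tag in tags if tag == "IN")
    first_person = sum(1 for word, tag in tags
                       if tag in ("PRP", "PRP$")
                       and word.lower() in {"i", "me", "my", "we", "us", "our"})
    return {"preposition_rate": prepositions / n,
            "first_person_pronoun_rate": first_person / n}

print(pos_style_features("We walked to the station before dawn, and I waited on the platform."))
```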
K ESSLER ET AL . (1997) stresses the potential usefulness of effective genre recognition for a range of natural language processing applications. One of the examples discussed is word sense disambiguation. In some genres, particular
word senses are unlikely to occur. For example, the word “pretty” is much
more likely to have the sense “attractive” or “beautiful” in formal genres, than
in very informal, conversational genres, where “pretty” is typically used as a
synonym for “rather”. Two reasons are given for the lack of research attention
directed at genre classification:
1. It was not until the mid 1990s and the rise of the World Wide Web that
some kind of genre classification became desirable. Prior to that, classification techniques had been applied to detect topic, while genre had traditionally been identified through the source of documents.
2. Theoretical understanding of genre is limited, especially when compared
to topic, which, although it has its own theoretical problems, does have a more coherent and developed basis than genre (that is, there is broad agreement about what topic is; this cannot be said for genre). Indeed, even if we can gain a theoretical understanding of genre, it remains an open question whether techniques exist to identify genre specific features computationally.
K ESSLER ET AL . (1997) refers to features for text classification as “generic facets”
which are indicated by “generic cues”. For Kessler, a facet “is simply a property that distinguishes a class of texts that answers to certain practical interests,
and which moreover is associated with a characteristic set of computational or
linguistic properties, whether categorised or statistical, which we will describe
as ‘generic cues’”. Four kinds of generic cues were used:
1. Structural cues (passives, nominalisations, syntactic features).
2. Lexical cues (for example, Mr, Mrs).
3. Character level cues (for example, punctuation).
4. Derivation cues (rates of lexical and character level features).
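A hedged sketch of how some of these cue types might be counted is given below. Structural cues, which require a parser or tagger, are omitted, and the particular cue words and ratios are illustrative choices rather than the feature set of K ESSLER ET AL . (1997).

```python
# Illustrative generic-cue counts in the spirit of lexical, character level
# and derivative cues; the specific cues chosen here are assumptions.
import string

LEXICAL_CUES = {"Mr", "Mrs", "Ms", "Dr"}

def generic_cues(text):
    words = text.split()
    n_words = len(words) or 1
    n_chars = len(text) or 1
    lexical = sum(1 for w in words if w.strip(string.punctuation) in LEXICAL_CUES)
    punctuation = sum(1 for c in text if c in string.punctuation)
    return {
        "lexical_cue_count": lexical,                    # lexical cues
        "punctuation_count": punctuation,                # character level cues
        "punctuation_per_word": punctuation / n_words,   # derivative cue (a rate)
        "punctuation_per_char": punctuation / n_chars,   # derivative cue (a rate)
    }

print(generic_cues("Mr Smith, who met Mrs Jones yesterday, said: 'No comment.'"))
```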
Five genre categories were specified by K ESSLER ET AL . (1997): reportage, editorial, science and technology, legal, and fiction. Kessler concludes that their
approach delivers reasonable classification accuracy, but does not suggest
that it is an improvement on existing, simpler methods.
S TAMATATOS ET AL . (2000a) describes the use of various features for the genre classification of a corpus of modern Greek texts. The corpus and genre classification scheme were prepared specifically for the experiment. A distinctive feature of the corpus is that it is constructed from two hundred and fifty unfiltered web pages. The genres used were: press editorial, press reportage, academic prose, official documents, literature, recipes, curricula vitae, planned
speech and broadcast news scripts.
The corpus was split into equal sized test and training sets. Two classification algorithms were used: multiple regression and discriminant analysis. The
textual representation is based on the stylometric characteristics of the texts.
A natural language processing tool specifically designed for the processing of
modern Greek text, SCBD, was used extensively in the study. The program
first identifies sentences using a straightforward heuristic, then performs intersentential chunking using a multi-pass approach. The detected chunks are verb
phrases, noun phrases, prepositional phrases and adverbial phrases. The final
representation consists of twenty two features, each recording the count of a particular stylistic property. The features are divided into three categories: token level features, phrase level features and analysis level features.
Token level features indicate the average length of sentences in the document,
and density of punctuation. Phrase level features point to the relative number
of phrase tokens in a text (for example, number of noun phrases identified divided by the total number of phrases in the document) or the average length of
a given phrase type in a given document (for example, total number of words
in noun phrases, divided by number of noun phrases). There are ten of these
phrase based relational features.
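The phrase level relational features can be pictured with the following sketch, which assumes that a chunker (such as SCBD for Greek) has already produced a list of (chunk type, word list) pairs; this input format, and the two features computed, are assumptions for illustration rather than the exact output of SCBD.

```python
# A sketch of phrase level relational features computed from chunker output,
# assuming chunks are given as (chunk_type, list_of_words) pairs.

def phrase_level_features(chunks):
    total_chunks = len(chunks) or 1
    noun_phrases = [words for kind, words in chunks if kind == "NP"]
    np_count = len(noun_phrases)
    np_words = sum(len(words) for words in noun_phrases)
    return {
        # relative number of noun phrase chunks in the text
        "np_ratio": np_count / total_chunks,
        # average length (in words) of a noun phrase
        "mean_np_length": np_words / np_count if np_count else 0.0,
    }

chunks = [("NP", ["the", "old", "sailor"]), ("VP", ["told"]),
          ("NP", ["a", "story"]), ("PP", ["about", "the", "sea"])]
print(phrase_level_features(chunks))
```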
The distinctive part of this approach is shown in the analysis level, where information from the SCBD processing of each text is used in the construction of
the feature representation. Nine features are identified here, including counts
of the number of words left unanalysed after each pass of the chunking algorithm, and ratios of keywords to total number of words in the document. These
features measure the syntactic complexity, and the ratio of unusual words in
the document (respectively). The use of stylometric features alone produces
classification accuracy of 88%. It is, however, difficult to compare this technique to a “bag of words” style representation.
In contrast to the computationally intensive method of analysing Greek text explored in S TAMATATOS ET AL . (2000a), S TAMATATOS ET AL . (2000b) uses word
frequency counts to identify those words that best serve as linguistic features.
However, in contrast to previously employed methods for identifying candidate words (for example, K ARLGREN and C UTTING (1994); K ESSLER ET AL .
(1997)), where genre categories are analysed to identify features, S TAMATATOS
ET AL . (2000b) uses frequencies in unrestricted text (that is, texts that have not
been pre-classified with respect to genre).
S TAMATATOS ET AL . (2000b) used three feature sets on a corpus made up of
four genres from the Wall Street Journal Corpus (editorial, letters to the editor, reportage, and sports news). The genres were identified using Wall Street
Journal header types. There was an equal split between training and test data,
and discriminant analysis was used as the classification algorithm. The three
feature sets used were:
1. Most common words from each of the four genre categories in the Wall Street Journal Corpus.
2. Most frequent words from the British National Corpus.
3. Most frequent words in the British National Corpus, augmented by punctuation frequencies (97%).
S TAMATATOS ET AL . (2000b) found that feature set two was more successful
than feature set one at accurately classifying the Wall Street Journal test data,
but that feature set three (that is, the most frequent words in the British National
Corpus augmented by punctuation frequencies) was most effective.
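The kind of representation used here can be sketched as follows: a document is represented by the relative frequencies of a fixed list of very common words, augmented with punctuation frequencies. The short common word list below is a stand-in for the most frequent words of the British National Corpus.

```python
# A sketch of a most-frequent-words representation augmented with punctuation
# frequencies; the common-word list stands in for corpus-derived frequent words.
import string
from collections import Counter

COMMON_WORDS = ["the", "of", "and", "a", "in", "to", "it", "is", "was", "that"]
PUNCTUATION = [",", ".", ";", ":", "!", "?"]

def frequency_features(text):
    tokens = [w.strip(string.punctuation).lower() for w in text.split()]
    n = len(tokens) or 1
    word_counts = Counter(tokens)
    char_counts = Counter(text)
    features = [word_counts[w] / n for w in COMMON_WORDS]
    features += [char_counts[p] / n for p in PUNCTUATION]
    return features

vector = frequency_features("It was the best of times, it was the worst of times.")
print(len(vector), vector[:5])
```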
Like K ARLGREN and C UTTING (1994), F INN and K USHMERICK (2003) envisage
genre recognition as important within an information retrieval framework, improving the user experience. The authors locate genre as an exclusively stylistic
concept, independent of topic. Genre is concerned with “what kind of document it is, rather than what the document is about” (F INN and K USHMER ICK , 2003). The example of querying a search engine with “chaos theory” is
given. Someone preparing a research paper on astrophysics and a ten-year-old preparing a homework assignment require very different kinds of documents.
F INN and K USHMERICK (2003) used three feature sets in their work, together
with a decision tree learning algorithm. The experiment attempted to distinguish between objective and subjective genres (that is, reportage and reviews
respectively). The feature sets used were:
1. Bag-of-words — The standard text classification representation.
2. Part-of-speech statistics — Each document is represented by thirty six features, one for each of thirty six part-of-speech categories. Instead of binary features, the features are percentages, reflecting the proportion of words of a given part-of-speech category in the document (for example, if 5% of the document’s words are prepositions, the preposition feature will have the value 5).
3. Text statistics — Average sentence lengths, frequency of function words,
and so on.
F INN and K USHMERICK (2003) found that classification performance could be
enhanced using a “meta-classifier” combining models created from each feature set, but also found that no single text representation performed best over
all genre categorisation tasks. Two classification scenarios were considered.
Part-of-speech based features proved most successful when attempting to distinguish between “objective” and “subjective” documents (for example, newspaper reportage and opinion pieces, respectively) at 84%. For the second classification task — assessing whether a review is either positive or negative7 —
a unigram (“bag-of-words”) representation was most successful at 82%.
7 Classification according to the emotional content of a document is often referred to as sentiment
analysis
For M EYER ZU E ISSEN and S TEIN (2004), genre classification is defined as the
discrimination of documents on the basis of “their form, style, or target audience”. Their approach uses a compact feature representation in conjunction with several classification algorithms to classify randomly selected web
pages.
Part of the work involved conducting a user study centred on the utility of classifying search engine results by genre: that is, whether post-processing documents judged relevant by the search engine into different genre categories is an aid to information seekers. Different genre labels were presented to the participants, who were asked to rate them as “very useful”, “sometimes useful”, “not useful” and “don’t know”. The experiment’s eight highest scoring categories in
terms of usefulness were: help, articles, discussion, shopping, portrayals of institutions, private portrayals, link collections and software downloads.
M EYER ZU E ISSEN and S TEIN (2004) stresses that for the purposes of practical web genre classification, it must be possible to construct feature sets from
web pages (documents) within the time constraints necessary for a “real time”
system. Three levels of feature complexity are considered:
1. Low complexity: Character and word counts (for example, common words,
punctuation, and so on.)
2. Medium complexity: Features that require dictionary look up, or make use
of structural properties of HTML documents (for example, proportion of
link tags to other kinds of tags).
3. High complexity: Features that are dependent on grammatical analysis or
part-of-speech tagging.
Two feature sets were contrasted: the first consisted of features from the low and medium complexity categories, while the second made use of features from all three complexity levels. The second feature set performed best, yielding accuracy levels of above 70%.
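A rough sketch of the low and medium complexity levels is given below, using only the Python standard library. The specific features chosen (word count, punctuation density and the proportion of link tags) are illustrative assumptions rather than the feature set of M EYER ZU E ISSEN and S TEIN (2004).

```python
# A sketch of low and medium complexity web page features using the standard
# library HTML parser; the chosen features are illustrative assumptions.
import string
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = 0
        self.link_tags = 0
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags += 1
        if tag == "a":
            self.link_tags += 1

    def handle_data(self, data):
        self.text.append(data)

def page_features(html):
    parser = TagCounter()
    parser.feed(html)
    text = " ".join(parser.text)
    words = text.split()
    return {
        # low complexity: character and word counts
        "word_count": len(words),
        "punctuation_per_word": sum(c in string.punctuation for c in text) / (len(words) or 1),
        # medium complexity: structural properties of the HTML
        "link_tag_ratio": parser.link_tags / (parser.tags or 1),
    }

print(page_features("<html><body><p>Buy now!</p><a href='/x'>more</a></body></html>"))
```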
S ANTINI (2004a) tested the use of part-of-speech trigrams as features, as “trigrams are large enough to encode useful syntactic information, and small enough
to be computationally manageable” (S ANTINI, 2004a). Ten genres from the
British National Corpus were used in the work: conversation, interview, planned
speech, public debate, academic prose, advertising, biography, instructional,
popular lore, and reportage.
Three feature sets of part-of-speech trigrams were used in conjunction with
a Naive Bayes classifier, yielding encouraging results, although these results
are not directly comparable to similar work because of differences in genre
categories, training and test data, and classification algorithm. Classification
accuracy as high as 82% was achieved.
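A minimal sketch of a part-of-speech trigram representation is shown below. It assumes that documents have already been reduced to sequences of part-of-speech tags (the tag sequences here are invented), and uses scikit-learn's DictVectorizer and multinomial Naive Bayes purely as stand-ins; it is not S ANTINI (2004a)'s implementation.

```python
# A sketch of part-of-speech trigram features with a Naive Bayes classifier,
# assuming documents are already reduced to sequences of POS tags.
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

def pos_trigram_counts(tags):
    # Each trigram of adjacent tags becomes a count feature.
    return Counter(" ".join(tri) for tri in zip(tags, tags[1:], tags[2:]))

# Invented tag sequences standing in for tagged documents from two genres.
documents = [
    ["PRP", "VBD", "DT", "NN", "IN", "DT", "NN", "."],
    ["PRP", "VBD", "IN", "DT", "JJ", "NN", "."],
    ["DT", "NN", "VBZ", "DT", "NN", "IN", "NNS", "."],
    ["DT", "NNS", "VBP", "JJ", "NNS", "IN", "DT", "NN", "."],
]
labels = ["biography", "biography", "reportage", "reportage"]

features = [pos_trigram_counts(doc) for doc in documents]
X = DictVectorizer().fit_transform(features)
classifier = MultinomialNB().fit(X, labels)
print(classifier.predict(X[:1]))
```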
While it has been shown that the use of syntactic representations has not improved classification accuracy in the context of general text classification (L EWIS, 1992b; M OSCHITTI and B ASILI, 2004; S COTT and M ATWIN, 1999), it is possible
that syntactic features may be especially useful for genre classification, as genre
classification is less topic-orientated than most text classification tasks. The
current experimental work on text representations for genre classification suggests that syntactic features may be useful (for example, S ANTINI (2004a) and
S TAMATATOS ET AL . (2000a)). It is, however, difficult to gain more than an impressionistic view of the issue, as the current experimental work uses different
learning algorithms, different data-sets, different genre classification systems
and different languages.
3.2 Systems that Produce Biographies
Most of the working systems reviewed here characterise the production of biographies from multiple documents as a summarisation problem8 (the Southampton A RTEQUAKT system — see page 78 — is a notable exception). Effective
document summarisation has been a goal of computational linguistics and
information retrieval work since the era of the first digital computers (L UHN, 1958). A
summary extracts what is most important from its source document, or as
S P ÄRCK J ONES (1999) puts it, a summary is “a reductive transformation of
source text to summary text through content reduction by selection and/or
generalisation of what is important in the source”. Of course, what is judged
important is heavily reliant on the context or task.
This section first reviews some background issues in summarisation and its extension, Multiple Document Summarisation (MDS),9 as many biography production systems work within an MDS framework. Then, four systems that
produce biographies are reviewed.
3.2.1 The Summarisation Task
Summaries and summarisation tasks can vary along a number of different dimensions (M ANI, 2001b,a; M ANI and B LOEDORN, 1999; M AYBURY and M ANI,
2001; S P ÄRCK J ONES, 1999). The most important of those are compression rate,
intended audience, relation to source document, function, coherence and language.
The compression rate of a summary refers to the ratio of summary length to
source document length, normally expressed in percentage terms. For example, a summary that is one tenth the length of the source document has a compression rate of ten percent. A summary that is nine tenths the length of the
source document has a ninety percent compression rate.
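Expressed as a simple formula (restating the definition above), for a summary S produced from a source document D, measured in the same units (for example, sentences):

\[
\text{compression rate} = \frac{|S|}{|D|} \times 100\%
\]

so the ten percent and ninety percent figures above correspond to ratios of 0.1 and 0.9 respectively.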
8 Other approaches outside the summarisation paradigm exist however. For instance, the biographical Question Answering system developed by F ENG and H OVY (2005) produces answers to
standard biographical questions from free text.
9 This section draws heavily on M ANI (2001a), a textbook on summarisation.
The intended audience of a summary is important in determining appropriate
content and is conventionally divided into two categories: user focused summaries, where either information or format is geared towards user interests,
and generic summaries, which are not designed for a specific set of user interests.
Summaries can differ in their relationship to the source document in two main
ways: they can be made up of extracts from the source document (that is, extracted sentences), or they can take the form of the traditional abstract, which
contains information about the source document that is not necessarily in the
document itself. Abstracts, unlike extractive summaries, are normally coherent.
The function of a summary is usually described as either indicative (to indicate
whether the document is worth reading), informative (where all information in
the document is captured at some level of detail), or critical (where an evaluative component is included in the summary).
Summaries can also be judged on the dimension of coherence, with extractive
summaries often less coherent than abstracts.
While summaries are normally mono-lingual (that is, the summary is in the
same language as the source document(s)), there is also the possibility that the
source document(s) are summarised into a different language (that is, they are
translated as part of the summarisation process).
The need for automatic summarisation, while present through the late 1950s
to late 1980s, only became a real pressure in the 1990s, with the exponential
increase in online textual materials. Before the nineteen nineties, the obvious
advantage of automatic summarisation over human summarisation — cost — was offset by summary quality concerns. Only when the volume of online
text became overwhelming was significant work directed at improving basic
methods which had moved on little since the work of L UHN (1958) and E DMUNDSON (1969).
The high level architecture of a summarisation system is conventionally divided into three processing modules (M ANI, 2001a; M AYBURY and M ANI, 2001;
S P ÄRCK J ONES, 1999; H OVY, 2003): analysis, transformation and generation.
1. Analysis — the initial stage, where a representation of the source document is constructed.
2. Transformation — the internal representation is manipulated to produce
a representation of the summary. This module is most important in abstract summaries. Extractive summaries tend to conflate stages 1 and 2.
3. Generation — a natural language output is generated from the representation of the summary.
In the context of automatic summarisation, the term “shallow processing” is
usually used to refer to extractive systems, which, as discussed above, output
sentences from the source document. To use the distinction made by S P ÄRCK J ONES
(1999), extractive systems extract text rather than facts from a source document.
In the three part summarisation architecture detailed above, systems that employ shallow processing conflate the first two stages and move directly to generation (stage 3). The very earliest attempts at automatic summarisation were
shallow, relying on word frequency counts (L UHN, 1958). Later attempts, using
word frequency counts augmented with some corpus evidence (E DMUNDSON,
1969), formed the framework for summarisation for many years (M ANI, 2001a).
3.2.2 Multiple Document Summarisation
Multi-document summarisation (MDS) is an extension of the traditional summarisation task, inheriting all the older task’s requirements, but bringing its
own special problems. If summarisation can be defined as “a reductive transformation of source text to summary text through content reduction by selection and/or generalisation of what is important in the source” (S P ÄRCK J ONES,
1999), then, extending S P ÄRCK J ONES (1999)’s definition, MDS can be described
as a reductive transformation of source text to summary text (where “source text” is a
collection of related documents) through content reduction by selection and/or generalisation of what is important in the source (while removing redundancy and possibly flagging inter-document differences and similarities). To put this another way, an
MDS summariser can be seen as a particular kind of summariser, with the following distinguishing characteristics:
It takes as input a collection of related documents.
It removes redundancy (that is, repetitive information) from the summary. For example, news articles covering the same event will (probably)
contain a great deal of repetitive information that ought only to appear
once in an output summary.
It flags cross-document differences and similarities. For example, discrepancies between news articles can be flagged.
The nature of the MDS task, as well as some of the distinctive applications
in which it might be employed (for instance, online news article summarisation versus traditional abstracting of single scientific papers according to a
template), make certain demands in addition to those explicit in the definition:
MDS can accept any number of documents greater than 1. It might be
more appropriate to use different methods for summaries based on different sizes of source document collections. For example, a summary
based on a two document collection may best be tackled using a radically different method from a summary based on a five hundred document collection.
Compression needs to be greater with MDS. For example, for documents
of a fixed size (say two hundred sentences) a ten percent summary of a
single document is twenty sentences long. A ten percent summary of a
five hundred document collection — with documents of the same length
again — is ten thousand sentences long; not a very useful summary.
To gain a twenty sentence summary from the five hundred documents, the compression rate would have to be 0.02 percent (the arithmetic is spelled out after this list). This kind of
compression rate would be very difficult to achieve with standard single
document extractive or abstractive techniques.
Cross-document co-reference is an inescapable problem in MDS. Simple
extractive techniques that sidestep co-reference are inadequate for MDS
due to the need for high compression.
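The compression arithmetic referred to in the second point above works out as follows, assuming two hundred sentences per document:

\[
500 \times 200 = 100{,}000 \text{ sentences}, \qquad 10\% \times 100{,}000 = 10{,}000 \text{ sentences},
\]
\[
\frac{20}{100{,}000} = 0.0002 = 0.02\%.
\]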
Redundancy of information is perhaps the central problem in MDS. We can see
the extent of this problem if we imagine a naive attempt at MDS that involves
simply feeding each source document through a single document summariser,
and concatenating the results. The product of this process would be both extremely long — which is clearly of limited value in a summary — and highly
repetitive. In order to produce a viable and useful summary, it is essential to
remove some of this repetition, or at least minimise it to (application specific)
acceptable levels. M ANI (2001a) suggests four signals that indicate repetition
when comparing two text elements,10 and where potential for eliminating elements exists:
1. Semantic equivalence: Two elements have exactly the same meaning (paraphrases).
2. Informational equivalence: Two elements contain the same information.
This is weaker than semantic equivalence.
3. Information subsumption: Text element A subsumes text element B if the
information in B is contained in A.
4. String identity: Two elements consist of exactly the same string.
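A very rough sketch of how the weaker of these signals might be approximated is given below. The token overlap tests are crude stand-ins for genuine semantic and informational comparison, which in practice requires much deeper analysis.

```python
# Crude approximations of some of Mani's redundancy signals: string identity,
# informational equivalence and information subsumption. Token-set overlap is
# only a stand-in for real semantic comparison.

def tokens(sentence):
    return set(sentence.lower().rstrip(".").split())

def string_identity(a, b):
    return a == b

def informational_equivalence(a, b, threshold=0.9):
    ta, tb = tokens(a), tokens(b)
    overlap = len(ta & tb) / len(ta | tb)      # Jaccard overlap as a proxy
    return overlap >= threshold

def information_subsumption(a, b):
    return tokens(b) <= tokens(a)              # does A contain everything in B?

a = "Tony Blair was born in Edinburgh in 1953."
b = "Tony Blair was born in Edinburgh."
print(string_identity(a, b), informational_equivalence(a, b), information_subsumption(a, b))
```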
While techniques have been developed to identify redundancy, research on
identifying and flagging inter-document differences is less well developed.
An exception is R ADEV (1999) who has developed a system of twenty four
relations that provides a structure for flagging differences (for example, CONTRADICTION, REFINEMENT, and so on).
M ANI (2001a) suggests a general architecture for MDS systems consisting of
five modules:
10 Text
elements are sentences or clauses.
1. Selection: select text elements from the document collection using standard summarisation approaches.
2. Matching: match the extracted text elements to identify and remove redundancy.
3. Salience: select the most salient elements. Rank and output according to
compression rates.
4. Reduction: use aggregation to reduce the text elements further and reduce
non-redundant information.
5. Generation: output the final summary, using natural language generation
techniques.
While MDS is a well developed area of research, comparatively little work
has been done in specifically biographical MDS. The usefulness of a functional
biographical summariser when, for instance, quickly producing succinct reports on named individuals from news articles is clear (for example, M C K E OWN ET AL . (1999)).
While it is currently not possible to produce a biography of the same quality
as a professionally written and published work, automatic biographical summarisation does hold out the possibility of speed gains when compared to humans sifting through large document collections. Great quantities of data can
be quickly filtered to provide a useful and informative summary, where significant events and facts about a person’s life are culled from source documents
and presented in an orderly and appropriately succinct manner.
Several attempts at building biography orientated MDS systems have been made in recent years; notable efforts (reviewed below) include the New
Mexico system (C OWIE ET AL ., 2001), the Mitre/Columbia system (S CHIFF MAN ET AL ., 2001), the Southern California system (Z HOU ET AL ., 2004) and
the Southampton system (K IM ET AL ., 2002), which differs from the preceding
three as it is not presented within a biographical MDS framework.
3.2.3 New Mexico System
The system described by C OWIE ET AL . (2001) is designed to aid in the classic information retrieval situation, where a user is searching for information
about, say, Tony Blair and is engulfed with a huge quantity of hits (Google
returns 788,000 hits for the string “Tony Blair”), many of which only mention
Tony Blair incidentally. Even given that the information retrieval results are all
highly relevant, the user is still required to read and synthesise a potentially
huge number of documents to satisfy their information need. The problem
is exacerbated when the results are in a language unfamiliar to the user. The
time and effort required to construct a summary of the target’s career is very
high.
Cowie’s system aims to resolve these problems by automatically producing
a “personal profile” (biography) for the query term from the retrieved documents (see Figure 3.4 on page 72). The system outputs a personalised profile
consisting of a chronologically ordered list of events, with links to source documents, quickly enough for the system to be used in “real time”.
The system is designed as a three module pipeline, consisting of an information
retrieval stage, a summarisation stage, and a merging and output stage:
1. Information retrieval — A collection of documents concerned with a given
individual is gathered using standard information retrieval techniques.
The documents are automatically filtered to exclude those not in the designated languages (English, Spanish or Russian). The user is given the
opportunity to filter out obviously inappropriate documents. Those documents that are only incidentally related to the query person, or those
documents that refer to a person who shares a name with the target, can
be filtered at this stage. For example, if we are seeking to produce a biographical summary of Tony Blair, British Prime Minister, we are unlikely
to be interested in documents pertaining to Tony Blair, New Zealand Cafe
owner.
2. Summarisation — For each document in the collection, find a date for the
document, select the most relevant chunks of text and determine a date
to associate with each chunk (the default document date if no explicit
date reference is made). If the source language is not English, translate to
English (retaining both the text chunk date and a reference to the source
document).
3. Merging and output — Each of the translated extracted text chunks is arranged in chronological order and output in HTML format with links to
the respective source document.
The system uses a query guided standard statistical text summarisation technique to extract text chunks from each document; that is, the summarising system is positively biased towards those sentences that contain the query term.
The process follows six steps:
1. Language recognition — The system determines the source language using a character based n-gram model.
2. Low level text processing — HTML tags and extraneous text are stripped
from the document.
3. Tokenisation – Sentence, word and paragraph tokenisation.
4. Dating — Get document date (normally this is either at the beginning or
end of the text).
5. Sentence ranking — Each sentence is scored and ranked (in the original
language) biased by the query term. The system tries to determine a
date for the sentence using simple pattern matching. If a date cannot be
determined, then the default document date is used.
6. Translation — If the source document is other than English, the extracted
text chunks are translated to English and the dates transformed to a standardised format.
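The query biased ranking step (step 5) can be pictured with the following sketch. The scoring function used here (average word frequency with a bonus for sentences mentioning the query term) is an assumption for illustration, not the scoring used by C OWIE ET AL . (2001).

```python
# A sketch of query-biased extractive scoring: sentences are scored by word
# frequency and boosted if they mention the query term. Illustrative only.
from collections import Counter

def rank_sentences(sentences, query, top_k=2, query_bonus=2.0):
    words = [w.lower().strip(",.") for s in sentences for w in s.split()]
    freq = Counter(words)
    def score(sentence):
        toks = [w.lower().strip(",.") for w in sentence.split()]
        base = sum(freq[w] for w in toks) / (len(toks) or 1)
        return base + (query_bonus if query.lower() in sentence.lower() else 0.0)
    return sorted(sentences, key=score, reverse=True)[:top_k]

sentences = [
    "Tony Blair visited the factory on Tuesday.",
    "The factory employs three hundred people.",
    "Blair said the visit had been productive.",
]
print(rank_sentences(sentences, "Blair"))
```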
The work is partly based on McKeown’s work on summarising medical literature for health professionals (M C K EOWN, 1998); work which was later generalised and extended to new domains (M C K EOWN ET AL ., 1999). Both M C K EOWN (1998) and M C K EOWN ET AL . (1999) emphasise the inadequacies of “key
sentence” summarisation techniques when the aim is to produce one summary
of multiple documents, rather than a single summary per document. The single document technique risks producing a highly repetitive (hence lengthy)
summary that may at its most extreme present the same problems as a simple
search engine query.
The unpredictable nature of the World Wide Web news sources presents special challenges. Unlike the medical domain, where journal articles are highly
conventional — often with a labelled conclusion flagging relevant sentences
— news articles from disparate sources exhibit sometimes radically different
styles and formats. It follows that document structure or formatting cannot be
used as a scaffolding mechanism for the identification of relevant sentences.
Further, it cannot be assumed that all Web documents are created equal. The
system needs some way of assessing the relative authority (reliability, trustworthiness) of the document or document creator.
Cowie suggests a number of enhancements to the system.11
Cross document co-reference — As the query term is a simple name string,
if the term is ambiguous between two (or more) well known people with
that name, then we will have a contaminated (noisy) output.12 This problem is partially addressed in the system by including domain specific
terms in the original IR query (for example, “Tony Blair” + politics). This
mechanism is however potentially inadequate when dealing with individuals associated with more than one domain.
Establishing dates — The current system assumes that the date of publication is the date of any extracted sentence, unless a date is specifically
referred to in the sentence. The addition of even simple temporal reasoning, able to manage terms like “yesterday” or “next week” in relation to
the base date would be a significant improvement.
Merging — The straightforward outputting of sentences could be enhanced to cope with potentially repetitive entries. Additionally, entries that are contradictory could be flagged at this stage as worthy of investigation by the user.
11 Personal communication.
12 Cowie uses the example of Berezovsky the politician versus Berezovsky the musician. Another obvious example is Freud the psychoanalyst and Freud the artist.
Figure 3.4: New Mexico System (C OWIE ET AL ., 2001).
The fact that encouraging (though unevaluated) results have been obtained
using fairly simple techniques suggests that automatic biography extraction from
the World Wide Web is a viable technique and could be greatly improved by
tailoring standard techniques to the specific task.
3.2.4 Mitre/Columbia System
The system — described in S CHIFFMAN ET AL . (2001) — uses corpus statistics and linguistic knowledge at different points in processing. It concentrates
on selecting appropriate biographical descriptions from the source documents
and removing redundancy, rather than producing a coherent summary. No
attempt is made at temporally ordering the material selected for the output
summary, and the final presentation uses canned text generation methods.
Figure 3.5: Sample Output from S CHIFFMAN ET AL . (2001).
EXAMPLE 1: Vernon Jordon is a presidential friend and a Clinton adviser. He is 63 years old. He helped Ms. Lewinsky find a job. He testified that Ms. Monical Lewinsky said that she had conversations with the president, that she talked to the president. He has numerous acquaintances, including Susan Collins, Betty Curries, Pete Domenici, Bob Graham, James Jeffords and Linda Tripp. (1,300 documents; 707,000 words; 607 Jordan sentences; 78 extracted sentences; 2 groups: friend, adviser.)
EXAMPLE 2: Victor Polay is the Tupac Amaru rebels’ top leader, founder and organization’s commander-and-chief. He was arrested again in 1992 and is serving a life sentence. His associates include Alberto Fujimoiri, Tupac Amaru Revolutionary and Nestor Cerpa. (73 documents; 38,000 words; 24 Polay sentences; 10 extracted appositives; 3 groups: leader, founder and commander-in-chief.)
The system is basically extractive, with some post extractive smoothing and
merging to improve coherence and reduce redundancy. Figure 3.5 gives two
examples of system output; both examples are reproduced from S CHIFFMAN
ET AL . (2001). Selected output descriptions have been “strung” together using
a canned text generation system.
Consider Figure 3.5, Example 1. The system used a newswire corpus of 13,000
documents concerned with the Clinton impeachment proceedings (the Clinton corpus). The corpus contains 607 sentences mentioning Vernon Jordan
explicitly, and 82 descriptions, 78 of which are appositives (discussed below)
and four relative clauses. Additionally, 65 sentences where Vernon Jordan is
an (again, explicitly named) deep subject are present — although these are not
used in further processing.
Apposition is conventionally defined as a relation where a phrase or word appears next to a phrase or word of the same kind. The term is most frequently
used to describe the relation between juxtaposed noun phrases (for example,
“I’ve lost my dog, Wilbur”). It is clear that this relation can be exploited to
gather information about named persons. Although not all the existing facts
can be picked out by apposition, those that are should be reliable (low recall,
but high precision in the language of information retrieval). Typical journalistic
practices are useful here, with their traditionally heavy stylistic reliance on appositive phrases (for example “The US President, George Bush” or “The British
Chancellor, Gordon Brown”). Appositive phrase detection (implemented with
Finite State Automata) carries most of the extractive burden in the system.
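A toy approximation of such recognisers is sketched below using regular expressions rather than finite state automata proper. The two patterns cover only the simplest pre and post modifying cases and are assumptions made for illustration.

```python
# A toy regular-expression approximation of appositive recognisers for
# post-modifying ("George Bush, the US President") and pre-modifying
# ("The US President, George Bush") patterns. Real systems use richer automata.
import re

NAME = r"(?:[A-Z][a-z]+ ){1,2}[A-Z][a-z]+"          # e.g. "Gordon Brown"
DESCRIPTION = r"[Tt]he [^,.;]+"                      # e.g. "the US President"

POST_MODIFYING = re.compile(rf"({NAME}), ({DESCRIPTION})")
PRE_MODIFYING = re.compile(rf"({DESCRIPTION}), ({NAME})")

def appositives(text):
    pairs = [(m.group(1), m.group(2)) for m in POST_MODIFYING.finditer(text)]
    pairs += [(m.group(2), m.group(1)) for m in PRE_MODIFYING.finditer(text)]
    return pairs

text = ("The British Chancellor, Gordon Brown, met George Bush, "
        "the US President, in Washington.")
print(appositives(text))
```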
Relative clauses modify the head of a noun phrase, typically using a pronoun
which shares a referent with that head (for example, “. . . who helped Lewinsky find a job”).13 If relative clauses can be identified accurately for named
individuals, it is again clear that these can be harnessed for extracting relevant
biographical facts (for example, “Gordon Brown, who became chancellor in
1997”).
These two relatively shallow recognisers — implemented using finite state
techniques — allow biographical information to be harvested from the document. The system relies on the intuition that an important fact will be mentioned multiple times over a large document set, and is therefore likely to appear at least once in an appositive description (or relative clause).
Corpus statistics are used at several points in processing to help identify and
rank the most suitable appositive descriptions for summarisation. For example, appositive phrases are clustered and ranked by analysing the corpus frequency of their head nouns. Additionally, the system utilises linguistic resources (in the form of WordNet) at several points in the processing. For example, when merging redundant extracted appositive phrases, if head nouns
from two appositive phrases share a common parent below P ERSON in the
WordNet concept classification system, then only the phrase containing the
most frequent head noun is retained. This use of disparate resources justifies
the claim that the system combines linguistic knowledge and corpus derived
statistics.
The system follows a pipeline architecture and is built around the extraction of
appositive phrases featuring named persons. Processing has four stages:
1. Pre-processing — Every document in the collection is tokenised at the sentence and word level, part-of-speech tagged (using the Alembic tagger14 )
tagged for named entities (only person names) and then parsed using
the CASS parser.15 Finite state automata are then used to locate and extract appositive phrases (these automata are designed to match both pre
and post modifying appositive phrases, for example “Current US president George Bush” and “George Bush, the US President”). Additionally,
13 It is not clear whether S CHIFFMAN ET AL . (2001) distinguishes between person orientated restrictive relative clauses (for example, “The man who you see”. Note how the clause helps identify
the referent) and person orientated non-restrictive relative clauses (for example, “Sarah, who got the
job, was very happy”. Note how the clause provides additional information about the referent.) It
seems likely that the non-restrictive type of relative clause is easier to identify heuristically.
14 http://www.mitre.org/technology/alembic-workbench/manual/AWB-content.html
Accessed on 02-01-07
15 http://gross.sfs.nphil.uni-tuebingen.de:8080/release/cass.html
Accessed on 02-01-07
relative clauses are picked out at the pre-processing stage, but do not
contribute to the final summary.
2. Cross-document co-reference — A named entity recogniser was used to sort
the named person references from each document into bins (a bin for
each distinct person). The program uses linguistic knowledge (special
rules about name abbreviations) and a statistical measure of similarity of
the immediate window of surrounding words between named persons
with potentially identical referents. The output of this cross document
co-reference stage of processing is that the extracted descriptions for each distinct person are grouped together.
3. Appositive processing — At this stage, there is already a set of (perhaps
highly repetitive) appositive descriptions for each distinct individual. Appositive processing has several steps:
The first stage of appositive smoothing involves removing duplicate
descriptions, which given the large document collection, and the
likely repetitive nature of competing news articles, could remove
a high proportion of the extracted phrases. Only one copy of a duplicated phrase needs to be retained.
The second stage identifies errors in pre-processing by identifying
phrases that do not seem to have a person as head. Even the most reliable named-entity taggers will make mistakes; identifying companies as people, and so on. The system employs a novel “person typing” program for identifying erroneous appositives. The program
relies on WordNet for linguistic knowledge and implements the
following rule for distinguishing between those appositive phrases
that have persons as heads, and those which do not: A string (that
is, head noun of an appositive phrase) refers to a person if at least 35%
of senses of that string are descended from the synset for PERSON in
WordNet (a sketch of this rule is given after this list).
The third stage involves removing redundancy; a deep problem for
all MDS systems but especially pertinent when dealing with journalistic material where multiple articles describe the same incident.
Redundancy, as mentioned previously, occurs when multiple nonidentical strings contain the same information, or one string subsumes the information in another string. Again, the system uses
WordNet and corpus statistics to help identify and merge repetitive
descriptions.
In a further stage of processing, duplicates
are eliminated and modifiers conjoined (for example, “British Prime
Minister” and “Member of Parliament for Sedgefield” might be conjoined as “British Prime Minister and Member of Parliament for
Sedgefield”).
4. Generation: Ranked descriptions are selected according to the desired
compression rate, and a shallow canned text generation technique is used
to “string” them together, resulting in occasionally incoherent, but intelligible output.
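A sketch of the person typing rule referred to in step three above is given below, using WordNet through NLTK. The PERSON synset and the 35% threshold follow the rule as stated; the remaining implementation details are assumptions.

```python
# A sketch of the WordNet-based person typing rule: a head noun counts as a
# person if at least 35% of its noun senses descend from person.n.01.
# Requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

PERSON = wn.synset("person.n.01")

def is_person_head(noun, threshold=0.35):
    senses = wn.synsets(noun, pos=wn.NOUN)
    if not senses:
        return False
    person_senses = sum(
        1 for s in senses
        if s == PERSON or PERSON in s.closure(lambda x: x.hypernyms())
    )
    return person_senses / len(senses) >= threshold

for word in ("chancellor", "spokesman", "factory"):
    print(word, is_person_head(word))
```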
While the system is successful in meeting its stated aims in that output text is
relevant and coherent, several improvements could easily be made:
Organising output in temporal order. No attempt is made at maintaining temporal order, an important requirement for conventional biographies that are expected to follow linear narrative conventions (see Section 2.3 on page 30). This temporal incoherence is a potential cause of
confusion to a reader expecting linear narrative. A naive time stamping method (especially in journalism, where publication dates are usually
flagged in conventional ways) could be one simple way of ordering output. Additionally, temporal information could form part of the ranking
criteria for outputting descriptions, with more recent information having
a higher weighting than older news.
Linking descriptions to source documents. The extract smoothing and
merging processes, and the fact that references to source documents are not retained, mean that a user cannot easily locate the context or source
of an extracted description. This could be important in a situation where
a description is unusual, surprising or important and the reliability (or
authority) of a source document needs to be determined. In a web based
context, a link to the source document would be most appropriate; other
summarisation contexts would require their own referencing methods.
Mechanisms for resolving disagreement. There is no clear method to resolve and flag disagreement between descriptions.
3.2.5 Southern California System
This biographically orientated MDS system developed by Z HOU ET AL . (2004)
at the University of Southern California is highly relevant to the current thesis.
As part of the development of the Southern California system, a biographical
corpus was created (described in Section 4.3.6 on page 94).
The system architecture is outlined in Figure 3.6 on the following page and can
be divided into five steps: information retrieval, identification of biographical sentences, sentence merging, ranking of sentences and redundancy removal.
The information retrieval stage of processing identifies those documents that
contain a given person’s name (the target person), from a large document collection.
Figure 3.6: System Architecture for Z HOU ET AL . (2004) MDS system. The pipeline runs: IR on Person's Name; Identify Biographical Sentences; Merge Biographical Sentences; Rank Biographical Sentences; Remove Redundancy/Restrict Length; Output Biography.
The identification of biographical sentences stage takes as input those documents containing the target person’s name, splits the documents at the sentence level, and classifies each sentence as belonging to one of ten categories:
bio (birth dates, death dates, and so on), fame factor, personality, personal, social, education, nationality, scandal, work and finally none, which is simply
the absence of a biographical category. The classification scheme is based on
a “checklist” approach, with every biography made up of sentences selected
from each of the nine biographical categories.
Various sentence representations were chosen for input to the sentence classification module, including frequent n-grams. The Naive Bayes classifier was trained and evaluated on the biographically marked up corpus. The representation chosen for the final system architecture is not specified.
The next stage of processing is the merging of biographical sentences. All sentences containing a variant of the target person’s name are added to the store
of biographical sentences. Those sentences identified by the classifier that are
direct quotations, or less than five words long are discarded, along with any
duplicate sentences.
After the sentences have been merged, the result is a list of distinct biographical
sentences. The task is now to rank them according to their importance, and a
variant of inverse document frequency was used to achieve this.
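The ranking step can be pictured with the following sketch, which scores each sentence by the average inverse document frequency of its words over the source collection. This is only one plausible reading of “a variant of inverse document frequency”; the exact formulation used by Z HOU ET AL . (2004) may differ.

```python
# A sketch of idf-based sentence ranking: words that are rare across the source
# documents contribute more to a sentence's importance score.
import math
from collections import Counter

def idf_table(documents):
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(w.lower().strip(",.") for w in doc.split()))
    return {w: math.log(n / df[w]) for w in df}

def rank_by_idf(sentences, documents):
    idf = idf_table(documents)
    def score(sentence):
        words = [w.lower().strip(",.") for w in sentence.split()]
        return sum(idf.get(w, 0.0) for w in words) / (len(words) or 1)
    return sorted(sentences, key=score, reverse=True)

documents = [
    "The senator was born in Ohio and studied law.",
    "The senator voted against the bill on Tuesday.",
    "Rain is expected across Ohio on Tuesday.",
]
sentences = ["The senator was born in Ohio and studied law.",
             "Rain is expected across Ohio on Tuesday."]
print(rank_by_idf(sentences, documents))
```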
Due to the potentially high degree of repetitive information characteristic of the
MDS task (with multiple – perhaps very similar – input documents) there
is a need for a redundancy removal filter. The system used a method outlined in M ARCU (1999) to remove sentences until the desired compression is
achieved.
The system’s output does not reflect the ten way classification system developed at the classification stage; instead, a two way (biographical and non-biographical) classification system was adopted, with each of the original biographical categories (education, work, scandal and so on) subsumed in one
single biographical category. Note that the non-biographical category is unchanged in the transition to a binary scheme. One way of further developing
the system, suggested by the authors, is the utilisation of the fine grained classification scheme in the output summary. For example, general biographical
information (birth dates, death dates, and so on) can be output in the first sentences, followed by fame factor sentences (explaining why the target subject is
notable) and then more detailed information. Additionally, this kind of biographical summarisation system can be tailored according to user interest (for
example, a user may be interested in the educational background of the target
individual, and could request that a summary contain only this kind of information).
From the point of view of this thesis, the most important element of the Southern California system is the experimental work on feature representation, which
is explored, using the corpus data created by Z HOU ET AL . (2004), in Chapter 10 on page 177 of this thesis.
3.2.6 Southampton System
The A RTEQUAKT system (K IM ET AL ., 2002) uses a combination of linguistic
resources, information extraction technology and knowledge engineering to
produce user tailored biographies of artists. These biographies are not de-
Figure 3.7: Sample Output from ARTEQUAKT System (K IM ET AL ., 2002)
French Impressionist painter, born at Limoges. In 1854 he began work as a
painter in a porcelain factory in Paris, gaining experience with the light, fresh
colours that were to distinguish his Impressionist work and also learning the
importance of good craftsmanship. His predilection towards light-hearted
themes was also influenced by the great Rococo masters, whose work he studied in the Louvre. In 1862 he entered the studio of Gleyre and there formed a
lasting friendship with Monet, Sisley, and Bazille. He painted with them in the
Barbizon district and became a leading member of the group of Impressionists
who met at the Cafe Guerbois. His relationship with Monet was particularly
close at this time, and their paintings of the beauty spot called La Grenouillere
done in 1869 (an example by Renoir is in the National museum, Stockholm)
are regarded as the classic early statements of the Impressionist style.
signed to rival human created attempts in terms of quality, but are designed
to be coherent and useful. In contrast to other biography creating systems examined here, the creators of the A RTEQUAKT system do not locate their work
in a summarisation framework but rather consider it a generation system: generating biographies from knowledge extracted from a collection of documents
rather than producing a biographically orientated MDS summary of those documents. While A RTEQUAKT shares its information extraction driven approach
for acquiring facts from documents with self-described MDS summarisation
systems (for example M C K EOWN ET AL . (2002)), A RTEQUAKT’s reliance on a
rich ontology (beyond ad hoc use of WordNet for hyponym, synonym, and
hypernym information) is the basis for its claim to be a biographical generation
system.
From a user point of view, the A RTEQUAKT system employs a web interface.
The system allows the user to tailor the biography produced along several parameters: artistic style, the painter’s family background, their influences, and
the extent of the user’s interest in their paintings. Example output is given as
a 150 word summary in Figure 3.7.
Processing is divided into three modules: knowledge extraction, information
management, and narrative generation. Each of these three modules is described below:
1. Knowledge extraction — This module strips factual information from documents returned by a search engine (with an artist’s name as the search
engine query). Instead of using a cascade of Finite State Automata —
the standard technique in information extraction approaches (C OWIE and
L EHNERT, 1996) — A RTEQUAKT uses syntactic and semantic analysis to
carry the extraction load. The system designers wished to develop a general purpose knowledge extraction method that did not require either
human designed extraction rules, or the extensive corpus annotation required for the Machine Learning of extraction rules (C ARDIE, 1997). The
designers did consider using newer, more adaptive Machine Learning
techniques (C IRAVEGNA, 2001; W ILKS and C ATIZONE, 1999; YANGAR BER and G RISHMAN , 2000) but the need to identify a suitable corpus —
clearly non-trivial in the case of unpredictable search engine results —
meant that a deeper approach was adopted.
The output from the knowledge extraction stage of processing is an XML
formatted document representing extracted facts, the source sentences
for the extracted facts, and the original document text and URL.
2. Information management — The second stage of processing involves adding
the extracted facts to a knowledge base of extracted facts. It is important
to emphasise that in contrast to other systems considered, A RTEQUAKT
stores its knowledge and generates biographies from this stored knowledge. This knowledge is built in two stages. First, by populating a special
purpose ontology from the extracted facts. Second, by running a set of
error checking routines over the newly constructed entities, checking for
obvious duplications.
In the context of the system, “ontology” is used to refer to “conceptualisation of a domain into machine readable format” (K IM ET AL ., 2002).
The ontology used is based on the Conceptual Reference Model (CRM) – an
ontology for representing the world of cultural artefacts (their location,
owners, and so on)16 – extended to cover the domain of artists and their
lives. The output of the extraction phase of processing — an XML file
with tags mapping to classes in the ontology — is parsed and used to
populate the knowledge base.
The populated knowledge base is then checked for duplicate information
and merging opportunities.
3. Narrative generation — The final stage of the process is narrative generation, where a biographical story is generated using facts and relations
stored in the knowledge base and user tailored templates are used to organise the output. The system designers use human written biographical
templates. For example, a template geared toward producing biographies that focus especially on artistic influences would include an “influences” entry. The decision to use this method is — as reported by
the system designers – grounded in narrative theory (B AL, 1985), where
a narrative is based on a sequence of events (a story), which in turn is
based on a collection of facts and events (a fabula). It could, however, equally well derive from the traditional reliance of Artificial Intelligence on
16 http://cidoc.ics.forth.gr/index.html Accessed on 02-01-07
script-like representations (S CHANK and A BELSON, 1977).17
The biographical templates used are written in XML, using the Fundamental Open Hypertext Model (FOHM) (M ILLARD ET AL ., 2000), a standard for representing hyper-media. The biographical templates are constructed from sub-structures, called “sequences”. These sequences are
made up of queries (expressed in terms of the knowledge base categories)
that must be inserted in the generated biographies, respecting the sequence order. Queries are put to the knowledge base and their responses
(often in the shape of URLs to complete source sentences) are inserted
in the summary.
While the A RTEQUAKT system is not fully implemented, some simple example output is available and the system seems successful in meeting its limited
aims (although no evaluative work has been attempted). There are, however,
several problems with extending the A RTEQUAKT approach to different applications:
1. The limited domain of the system is potentially a problem; its reliance
on a handmade ontology and biographical templates (specific to artists)
has serious implications for portability. The system could be extended to
manage other groups, but only after an intensive knowledge engineering
effort. It is possible that biographical templates could be developed for
politicians, business leaders, and so on, but it is not clear that the overall
quality of biographies produced by such a system would justify the extra
knowledge engineering cost.
2. The rigid separation of the knowledge base from the information retrieval stage could result in a failed user query — say for an obscure artist
— when highly relevant information existed, but had not found its way
into the knowledge base.
3. The use of fixed biographical templates, while providing a structured
narrative (providing, that is, that information is present in the knowledge base), does not cater for important biographical information that the
system designers had not anticipated. For instance, Da Vinci’s engineering and medical achievements rival his achievements as an artist, yet a
fixed template — which lacks “engineering achievement” and “medical
achievement” properties — runs the risk of excluding these biographically important facts. It is often the extent to which a person deviates
from the template that makes them biographically interesting.
17 Biographies are suitable for this kind of representation. All lives follow a similar script if
viewed from sufficient distance.
3.2.7 Other Relevant Work
Considerable research effort has been expended on the construction of biographical summaries in recent years under the auspices of the United States
government sponsored Document Understanding Conference18 (DUC) and the
Text Retrieval Conference19 (TREC). Competing groups are given several tasks
and scored on their system’s performance in the expectation that this kind of
focused competition and standardised evaluation of results will quicken the
pace of research.
Task 5 in DUC-200420 required that the competing systems generate answers to the question “Who is X?” from a collection of documents, where X can refer to an individual or, additionally, a group of people. An example system from
DUC-2004 is B LAIR -G OLDENSOHN ET AL . (2004). The system, developed at
Columbia University, relies heavily on definitional predicates to identify relevant
sentences (for example, “X is a Y”). Clauses containing these definitional predicates are identified by two methods:
1. Text categorisation based on machine learning techniques.
2. Finite State Automata based on corpus derived patterns.
Under the TREC umbrella, there are several different tracks, one of which is
question answering (QA). One of the QA tasks is the answering of definitional questions. Definitional questions are of the form “What is X?” or “Who is X?” Examples of definitional questions are given by V OORHEES (2001) in an overview of
the QA track of TREC-2003. They include “Who is Colin Powell?” and “What
is mould?” An example system from TREC-2003 is G AIZAUSKAS ET AL . (2003),
which uses a collection of fifty patterns (for example, “X is a Y”).
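A toy version of such a definitional pattern is sketched below. The single “X is a Y” regular expression stands in for the much larger pattern collections these systems use, and is an illustrative assumption rather than a pattern taken from G AIZAUSKAS ET AL . (2003) or B LAIR -G OLDENSOHN ET AL . (2004).

```python
# A toy definitional pattern matcher for sentences of the form "X is a Y",
# illustrating the pattern-based approach to definitional questions.
import re

def definitional_descriptions(name, sentences):
    pattern = re.compile(
        rf"\b{re.escape(name)}\b (?:is|was) (?:a|an|the) ([^,.]+)", re.IGNORECASE
    )
    descriptions = []
    for sentence in sentences:
        match = pattern.search(sentence)
        if match:
            descriptions.append(match.group(1).strip())
    return descriptions

sentences = [
    "Colin Powell is a former United States Secretary of State.",
    "Powell spoke at the conference yesterday.",
    "Colin Powell was the first African American Secretary of State.",
]
print(definitional_descriptions("Colin Powell", sentences))
```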
This section has reviewed some recent biographically orientated systems, from
the rich ontology driven A RTEQUAKT system (K IM ET AL ., 2002), through information retrieval based MDS approaches (C OWIE ET AL ., 2001) and linguistic
and corpus based MDS approaches (S CHIFFMAN ET AL ., 2001; Z HOU ET AL .,
2004) to the pattern matching approach favoured by DUC and TREC conference entrants focussing on definitional questions.
There are a number of different tasks subsumed under the biographical summarisation heading. The definitional questions of the DUC and TREC competitions are designed to “pick out” general purpose descriptions of individuals (“X is a Y”), unlike, say, the A RTEQUAKT system, which uses an ontology to populate a
database of particularly significant facts relevant to an artist’s life (K IM ET AL .,
2002). The A RTEQUAKT system is domain specific and an extensive knowledge
engineering effort would be required to transport the system to another domain, where different factors are important (for example, a politician’s career
is likely to require a radically different ontology from that of an artist). These
18 http://www-nlpir.nist.gov/projects/duc Accessed on 02-01-07.
19 http://trec.nist.gov/ Accessed on 02-01-2007.
20 http://duc.nist.gov/duc2004/tasks.html Accessed on 02-01-2007.
differences in the aims of the different systems make direct comparison between them difficult. One major difference between the systems considered and the current work is that the current work focuses on the identification of biographical sentences rather than the production of biographical summaries; issues of selecting and ordering sentences for coherence are therefore less relevant here.
3.3 Conclusion
This chapter has analysed recent work in automatic genre classification and
described functioning systems that produce biographies of named individuals
from multiple texts. The next chapter outlines the methodology and resources
used in the work, and the remaining chapters describe the research conducted
as part of this project.
Chapter 4
Methodology and Resources
This chapter describes the overall research methodology employed and the resources used in the thesis. The first section describes the methodology; the second and third sections describe the resources used (software and corpora, respectively).
4.1 Methodology
This section is designed to provide a bridge between the background chapters
(1, 2, 3 and 4) and the research chapters (5, 6, 7, 8, 9 and 10).
Here we describe how the human study and experimental work on automatic
genre classification support the main hypothesis that biographical writing can
reliably be identified at the sentence level using automatic methods. This claim has
two sub-hypotheses:
1. Humans can reliably identify biographical sentences without the contextual support provided by a discourse or document structure.
2. “Bag of words” style sentence representations augmented by syntactic features provide a more effective representation for biographical sentence recognition than bag of words representations alone.
We address hypothesis 1 by determining the agreement level achieved when
a group of human participants classify a set of sentences as biographical or
non-biographical. If the agreement level is high, this would indicate that humans are able to distinguish between biographical and non-biographical sentences without the aid of a supporting discourse structure. Hypothesis 2 is
approached by comparing the performance of syntactic and “bag-of-words”
features on a gold standard corpus of biographical and non-biographical sentences using the 10 x 10 cross-validation technique described in Section 2.5.2 on
page 47.
Note that it is possible for hypothesis 1 to be true, and hypothesis 2 false (and
vice versa). Additionally, it is possible that both sub-hypotheses are false,
while the main hypothesis is true. For example, it could be shown that people are not able to reliably identify biographical sentences and also that “bag-of-words” style features perform better than syntactic features. The truth of the main hypothesis is not dependent on the kind of sentence representation used (“bag-of-words” or syntactic), but rather on the existence of some sentence representation that provides good results in conjunction with a learning algorithm.
Hypothesis 1 is designed to indicate whether biographical sentence classification is a task that can be reliably performed by humans. It remains possible that
while humans may struggle to perform the biographical sentence classification
task, there exists an automatic method that performs the task reliably.
Hypothesis 2 claims that a particular type of sentence representation — a syntactic representation — is likely to be especially useful for biographical sentence classification. This claim is associated with the contention that biographical sentence classification is a genre classification task, and that features which
capture the non-propositional content of sentences will be especially useful
and enhance classification accuracy. Syntactic features are contrasted with
“bag-of-words” style features. “Bag-of-words” style features have traditionally been used with some success in topical text classification. However, in
examining representations for automatic sentence classification, this research
is not confined by hypothesis 2, but also explores other sentence representations that may be useful (for example, the use of function words and “key-keywords”).
The main hypothesis (and two sub-hypotheses) provides a framework for the
thesis, but other research questions are addressed within that framework (for
example, the utility of the key-keywords methodology (see page 21) for identifying biographically relevant features).
The first hypothesis is addressed in Chapters 5 and 6. Chapter 5 describes the
creation of a biographical annotation scheme and corpus. Chapter 6 — a human study — establishes that human beings can reliably identify biographical
sentences (according to the scheme identified in Chapter 5). This is an important first step, as it shows that biographical sentences can, in principle, be identified without supporting discourse or document structure. The study also provides gold standard data for subsequent automatic classification experiments.
The second hypothesis is broadly addressed in Chapters 7, 8, 9 and 10. Chapter 7 tests several different learning algorithms using the gold standard data
and a standard feature set, in order to identify the learning algorithm which
provides best accuracy for the biographical classification task. Chapter 8 does
not describe learning experiments, but rather outlines those feature sets used
in Chapters 9 and 10. Chapter 9 compares the performance of different feature sets using the gold standard data, particularly focusing on hypothesis 2
(that is, whether syntactic features improve classification accuracy). Chapter
10 tests the biographically orientated feature set identified by Z HOU ET AL .
(2004) using the gold standard data in order to discover whether the feature
identification mechanism used by Z HOU ET AL . (2004) is exportable to other
biographical data. The thesis concludes with a chapter detailing contributions
made and possible directions for future research.
4.2 Software
The human study (described in Chapter 6 on page 124) did not require extensive software resources. The main work was in producing the content and
marking it up in HTML for online use. The relevant web page was uploaded to
a publicly accessible web server with limited scripting capabilities from which
participant responses were harvested. Several scripts were written in Perl to
post-process the data into a form suitable for the R1 statistical programming
environment.
A script written in Perl (bio features)2 was developed as part of the project,
and used to convert input sentences into various sentence representations (sentence vectors) in a format appropriate for the Weka machine learning environment.3 The program utilises several existing NLP tools, including standard
implementations of sentence splitting algorithms, and a Perl part-of-speech
tagging library based on the Penn Treebank tagging scheme.
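The precise feature sets are described in Chapter 8. Purely as an illustration of the kind of transformation the script performs, and not a reproduction of the bio features script itself, the short Perl sketch below turns a pair of invented, labelled sentences into a binary unigram ARFF file of the sort Weka expects; the sentences, attribute names and relation name are all hypothetical.

#!/usr/bin/perl
# Illustrative sketch only (not the project's bio features script):
# convert labelled sentences into a binary unigram ARFF file for Weka.
use strict;
use warnings;

# Hypothetical [sentence, class] pairs.
my @data = (
    [ "He was born in Vienna in 1902",   "biographical" ],
    [ "The meeting starts on Wednesday", "non-biographical" ],
);

# Build the unigram vocabulary from all sentences.
my %vocab;
for my $pair (@data) {
    $vocab{ lc $_ } = 1 for grep { length } split /\W+/, $pair->[0];
}
my @features = sort keys %vocab;

# Emit the ARFF header: one binary attribute per unigram, plus the class.
print "\@relation biographical_sentences\n\n";
print "\@attribute $_ {0,1}\n" for @features;
print "\@attribute class {biographical,non-biographical}\n\n\@data\n";

# Emit one feature vector per sentence.
for my $pair (@data) {
    my %present;
    $present{ lc $_ } = 1 for grep { length } split /\W+/, $pair->[0];
    print join( ",", ( map { $present{$_} ? 1 : 0 } @features ), $pair->[1] ), "\n";
}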
The Weka machine learning and evaluation environment was used extensively
to evaluate the classification success of the various sentence representations
(W ITTEN and F RANK, 2005). Weka provides a suite of machine learning algorithms in both GUI driven and command line (scriptable) environments. Weka
also provides implementations of standard machine learning evaluation techniques. The system is written in Java, and is hence portable across all major
platforms.
Perl scripts were used to organise the computationally expensive multiple
runs of the bio features script and Weka machine learning algorithms on a
UNIX machine with enhanced RAM.
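As an indication of what such orchestration looks like (this is a sketch under assumptions, not the project's actual driver script), the fragment below invokes a Weka classifier from the command line ten times with different random seeds, each run performing ten-fold cross-validation on a hypothetical ARFF file; the choice of Naive Bayes, the file names and the classpath are illustrative only.

#!/usr/bin/perl
# Illustrative driver sketch: repeated 10-fold cross-validation runs in Weka.
# Assumes weka.jar and a prepared ARFF file (bio.arff) in the current directory.
use strict;
use warnings;

my $arff = "bio.arff";   # hypothetical output of the feature-extraction step
my $jar  = "weka.jar";   # path to the Weka jar

for my $seed (1 .. 10) {
    my $out = "naivebayes_seed_$seed.txt";
    my $cmd = "java -cp $jar weka.classifiers.bayes.NaiveBayes "
            . "-t $arff -x 10 -s $seed > $out";
    print "Running: $cmd\n";
    system($cmd) == 0 or warn "Weka run with seed $seed failed\n";
}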
Further scripts were written for data analysis purposes and to perform statistical tests. See Appendix G on page 273 for details of the Perl implementation
of the corrected re-sampled t-test.
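The full Perl implementation is the one given in Appendix G; the fragment below is only a minimal sketch of the general shape of the corrected re-sampled t statistic, computed over paired accuracy differences from 10 x 10 cross-validation, with the test-to-training size ratio of 1/9 that ten folds imply. The input differences here are invented.

#!/usr/bin/perl
# Minimal sketch of the corrected re-sampled t statistic (see Appendix G
# for the implementation actually used in the thesis).
use strict;
use warnings;

sub corrected_resampled_t {
    my ($diffs, $test_train_ratio) = @_;   # per-run accuracy differences
    my $k    = scalar @$diffs;
    my $mean = 0;
    $mean += $_ for @$diffs;
    $mean /= $k;
    my $var = 0;
    $var += ($_ - $mean) ** 2 for @$diffs;
    $var /= ($k - 1);
    # The usual 1/k variance factor is augmented by the test/train size ratio.
    return $mean / sqrt( (1 / $k + $test_train_ratio) * $var );
}

# Invented accuracy differences from 100 paired cross-validation runs.
my @d;
push @d, 0.02 + 0.01 * rand() for 1 .. 100;
printf "t = %.3f\n", corrected_resampled_t(\@d, 1 / 9);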
1 http://www.r-project.org Accessed on 02-01-07.
2 This code is available by contacting [email protected].
3 Weka input files have the suffix “.arff” and are colloquially referred to as “ARFF files”.
4.3 Corpora
In order to isolate the distinctive qualities of biographical text, it was necessary
to identify two types of corpora: biographical corpora and multi-genre corpora.
The biographical text used in the development of this work came from a variety
of different sources. All the corpora used were published in English (British,
American and New Zealand English) and are roughly contemporary (that is,
all the corpora consist of modern English). The biographical corpora described
in this section are: The Oxford Dictionary of National Biography, Chambers Biographical Dictionary, Who’s Who, The Dictionary of New Zealand Biography, Wikipedia biographies and a biographical corpus developed at the University of Southern California. The multi-genre corpora used in the experiments were the B ROWN corpus and the S TOP corpus. The TREC newstext corpus was also used, and is described in this chapter. Descriptive statistics for all the biographical corpora are presented in Table 4.1 on page 96.
4.3.1 Dictionary of National Biography
The Oxford Dictionary of National Biography, a substantially re-written new edition of which was published in 2004 (OUP, 2004), is designed as a historical reference work on notable British people:
The Oxford DNB aims to provide full, accurate, concise and readable articles on noteworthy people in all walks of life, which present
current scholarship in a form accessible to all. No living person is
included; the Dictionary’s articles are confined to people who died
before 31 December 2000. It covers people who were born and
lived in the British Isles, people from the British Isles who achieved
recognition in other countries, people who lived in territories formerly connected to the British Isles at a time when they were in
contact with British rule, and people born elsewhere who settled in
the British Isles for significant periods or whose visits enabled them
to leave a mark on British life.
OUP (2003)
The original Dictionary of National Biography was published in instalments, in sixty-three volumes, between 1885 and 1900, under the editorship of Leslie
Stephen (and Sidney Lee from 1891). The complete dictionary was reissued
in 1908-09, and supplements were published periodically between 1911 and
1996, when work on the new dictionary began in earnest (FABER and H ARRI SON , 2002).
An XML encoded version of the original Dictionary of National Biography — the
old DNB — was provided by the Oxford University Press for this study. The
old electronic DNB contains the 1908-09 edition text, and the supplementary
data produced up to 1996.
The entries follow a clear set of editorial guidelines. Each entry begins with
the subject’s name, birth and death dates, and a brief description of their occupation. The remaining part of the biography concentrates on the subject’s
achievements and importance. The length of the article is a reflection of the
importance attached to the subject and is thus an editorial decision (FABER
and H ARRISON, 2002). There is considerable variation in the lengths of biographical entries in the old DNB. See Table 4.1 on page 96 for descriptive
statistics.4
A short example entry from the old DNB concerning Charles Babbage is given
below:
Babbage, Charles 1792-1871, mathematician and scientific mechanician, was the son of Mr. Benjamin Babbage, of the banking firm of
Praed, Mackworth, and Babbage, and was born near Teignmouth
in Devonshire on 26 Dec. 1792. Being a sickly child he received
a somewhat desultory education at private schools, first at Alphington near Exeter, and later at Enfield. He was, however, his own
instructor in algebra, of which he was passionately fond, and, previous to his entry at Trinity College, Cambridge, in 1811, he had
read Ditton’s Fluxions, Woodhouse’s Principles of Analytical Calculation, and other similar works. He thus found himself far in advance of his tutors’ mathematical attainments, and becoming with
further study more and more impressed with the advantages of the
Leibnitzian notation, he joined with Herschel, Peacock (afterwards
Dean of Ely), and some others, to found in 1812 the Analytical Society for promoting (as Babbage humorously expressed it) “the principles of pure D-ism in opposition to the Dot-age of the university.”
The translation, by the three friends conjointly (in pursuance of the
same design), of Lacroix’s Elementary Treatise on the Differential and
Integral Calculus (Cambridge, 1816), and their publication in 1820
of two volumes of Examples with their solutions, gave the first impulse to a mathematical revival in England, by the introduction of
the refined analytical methods and the more perfect notation in use
on the continent.
Babbage graduated from Peterhouse in 1814 and took an M.A. degree in 1817. He did not compete for honours, believing Herschel
sure of the first place, and not caring to come out second. In 1815
he became possessed of a house in London at No. 5 Devonshire
Street, Portland Place, in which he resided until 1827. His scientific activity was henceforth untiring and conspicuous. In 1815-17
he contributed to the Philosophical Transactions three essays on the
calculus of functions, which helped to found a new, and even yet
little explored, branch of analysis. He was elected a fellow of the
4 Queen Victoria’s entry in the Dictionary of National Biography is by far the lengthiest at one
hundred thousand words.
Royal Society in 1816. He took a prominent part in the foundation
of the Astronomical Society in 1820, and acted as one of its secretaries until 1824, subsequently filling the offices, successively, of
vice-president, foreign secretary, and member of council. . .
OUP (2004)
We can see from this (truncated) example how the entry adheres to the prescribed pattern: name, birth and death dates, occupation, and family background, before elaborating on the achievements of the subject (in this case, his
numerous important publications and memberships). In addition to the bulk
of the biography, which is descriptive, the biography has an evaluative element
(for example, the description of his early education as “desultory”). The biography also adds source materials and references at the end of the article.
4.3.2 Chambers Biographical Dictionary
The Chambers Biographical Dictionary (C HAMBERS, 2004) is a single volume from
a British publisher. The dictionary aims to give single paragraph descriptions
of people of historical or contemporary importance from a British perspective.
However, biographical subjects are not exclusively British but drawn from a
wider, international pool. Examples here include Jeff Bridges (US actor), Julian Barnes (British novelist) and Ban Gu (Chinese historian). The biographies
are designed to provide a brief introduction rather than critical evaluation,
and contain stereotypically biographically significant information (that is, birth
date, death date, occupation, location of birth). For this study, Chambers provided an XML encoded electronic copy of a subset of the dictionary (those
names beginning with “B”). The example below shows the form of a typical
entry:
Babbage, Charles 1791-1871 English mathematician Born in Teignmouth,
Devon, and educated at Trinity and Peterhouse colleges, Cambridge,
he spent most of his life attempting to build two calculating machines. The first, the difference engine, was designed to calculate
tables of logarithms and similar functions by repeated addition performed by trains of gear wheels. A small prototype model described to the Astronomical Society in 1822 won the Society’s first
gold medal, and Babbage received government funding to build a
full-sized machine. However, by 1842 he had spent large amounts
of money without any substantial result, and government support
was withdrawn. Meanwhile he had conceived the plan for a much
more ambitious machine, the analytical engine, which could be programmed by punched cards to perform many different computations. The cards were to store not only the numbers but also the
sequence of operations to be performed, an idea too ambitious to
be realized by the mechanical devices available at the time. The
idea can now be seen to be the essential germ of today’s electronic
computer, with Babbage regarded as the pioneer of modern computers. He held the Lucasian chair of mathematics at Cambridge
from 1828 to 1839.
C HAMBERS (2004)
4.3.3 Who’s Who
Who’s Who (B LACK, 2004) is a comprehensive, single volume collection of biographical sketches whose subjects are substantially connected to the United
Kingdom. In contrast to the DNB, the biographical subjects in Who’s Who are
all living (there is a companion volume — Who was Who — to cater for the
deceased). The first volume was published in 1849, and the work has been reissued regularly (in recent times, every year) since that date.
In contrast to the other biographical dictionaries considered here, Who’s Who is
autobiographical (that is, the entries are written by the subjects). The biographical subject completes a form containing pertinent information, which is the primary source for the generation of the final biography. Certain groups of people
are included by default: for example, Members of the British Parliament, High Court Judges, and certain subgroups of the British aristocracy. Other professional groups — for example, sportspeople, artists, journalists, and so on — are selected for inclusion in the dictionary by a selection committee.
The form of a Who’s Who entry is designed to provide information rather than
evaluation; the format can seem rather schematic compared to the discursive
style of the multi-volume DNB. An example entry is presented below:
FRY, Stephen John, writer, actor, comedian. b. 24 Aug. 1957 of
Alan John Fry and Marianne Eve (née Newman). Education Uppingham Sch.; Queens Coll., Cambridge (MA). Career TV series: Blackadder, 1987-89; A Bit of Fry and Laurie, 1989-95; Jeeves in Jeeves
and Wooster, 1990-92; Gormenghast, 2000; presenter, QI, 2003; Theatre: Forty Years On, Queen’s, 1984; The Common Pursuit, Phoenix,
1988; films: Peter’s Friends, 1992; I.Q., 1995; Wilde, 1997; Cold
Comfort Farm, 1997; The Tichborne Claimant, 1998; Whatever Happened to Harold Smith?, 2000; Relative Values, 2000; Gosford Park,
2002; (dir) Bright Young Things, 2003. Columnist: The Listener,
1988-89; Daily Telegraph, 1990 Publications: Me and My Girl, 1984
(musical performed in West End and on Broadway); A Bit of Fry
and Laurie: collected scripts, 1990; Moab is My Washpot (autobiog.), 1997; novels: The Liar, 1991; The Hippopotamus, 1994; Making History, 1996; The Stars and Tennis Balls, 2000 Recreations: smoking, drinking, swearing, pressing wild flowers. Address: c/o Hamilton, Ground Floor, 24 Hanway Street, W1P 9DD. Clubs: Savile, Oxford and Cambridge, Groucho, Chelsea Arts.
B LACK (2004)
It can be seen from the example above that the biography consists of important
dates, occupational information, career highlights (publications, television and
film appearances), contact details, and a list of recreations. The presentation is
obviously schematic, with the information set forth in list format under suitable headings.
4.3.4 Dictionary of New Zealand Biography
The Dictionary of New Zealand Biography (U NIVERSITY OF A UCKLAND P RESS,
1998) is a collaborative enterprise between Auckland University Press and the
New Zealand Department of Internal Affairs.5 There are two criteria for inclusion in the dictionary. First, the subject must have made a contribution to
(or had an impact on) New Zealand’s development. Second, the subject must
be dead. Rather like the Oxford DNB, the Dictionary of New Zealand Biography aims at a discursive, evaluative function. The three thousand entries have
multiple paragraphs and are written in continuous prose. They are designed
to supply the biographical basics, and also to provide context and evaluation.
Below is an extract from an entry on one-time New Zealand resident, Karl Popper:
Karl Raimund Popper was born in Vienna, Austria, on 28 July 1902,
the son of Simon Siegmund Carl Popper, a lawyer, and his wife,
Jenny Schiff. He turned his inquisitive and enterprising mind to a
variety of activities. He was apprenticed to a cabinet-maker, joined
a youth organisation where he worked with delinquent adolescents,
tramped in the Austrian mountains, taught himself mathematics
and physics, and became active in political movements during the
First World War as a socialist. Most significantly, he studied philosophy at the University of Vienna and was awarded a PhD in 1928.
On 11 April 1930 he married Josefine Anna Henninger in Vienna;
there were no children of the marriage. By that time he was an accomplished musician with a decided preference for classical music
and had qualified as a schoolteacher.
It was Popper’s practical and political interests that first directed
him to the philosophy of science, because he realised that it was
vital to be able to tell genuine knowledge from pseudo-knowledge
and superstition. Having become acquainted with the dominant
philosophical school in Vienna, known as the Vienna Circle, he concluded that its philosophy of science was deficient. It was based on
the old Baconian inductionist theory that science involves making
observations and generalising them into universal laws. This, Popper argued, explained neither how scientific thought actually proceeds, nor why we consider that its findings correspond to reality.
He therefore formulated a counter-proposal, which he published in
5 The dictionary can be accessed online at: http://www.dnzb.govt.nz Accessed on 01-02-07.
1934 under the title Logik der Forschung; its English translation, The
Logic of Scientific Discovery, was not published until 1959. On Popper’s account, science proceeds by formulating hypotheses. The criterion of a scientific hypothesis is that it generates predictions that
are capable of being falsified by reference to empirical data. These
hypotheses can never be conclusively proved true, so that scientific knowledge is necessarily provisional. This idea revolutionised
the understanding of the nature and value of scientific knowledge.
Albert Einstein read Logik in manuscript and applauded it vigorously. Popper’s radical revision of induction at once brought the
philosophy of science into line with actual practice and provided
an unprecedentedly convincing account of how science succeeds in
arriving at knowledge about nature.
With the advance of Nazism in Austria and the growth of antiSemitism, Popper, who was of Jewish origin though not a practising Jew, decided to emigrate. In 1936 he learnt of an advertisement
for a lectureship in philosophy at Canterbury University College,
Christchurch, New Zealand. He applied and took up the position
in early 1937. . .
U NIVERSITY OF A UCKLAND P RESS (1998)
The extract above shows the discursive nature of the New Zealand biographies.
The style can be distinguished from the other dictionaries in that it is chronological in form, and does not conform to the inverted pyramid form characteristic of the Dictionary of National Biography. For instance, in the first sentence,
although a birth date and location are provided, a death date is not.
4.3.5 Wikipedia Biographies
A wiki is a user editable web application.6 The traditional model of information
provision on the web involves a service provider serving content to a community. In the case of wikis however, the content is provided by the community
itself, in the form of user editable web pages. Any user has the opportunity to
edit information on a wiki, if that information is judged by the user to be incorrect or misleading, and in their turn, any corrections made may themselves
be corrected by another user.
Wikipedia is an attempt to create a new, free, user-editable encyclopedia using methods loosely based on open software development techniques. It was
launched in 2001, and currently has 1,690,892 articles in English.7
Wikipedia classifies and organises biographies in a number of ways, including death year and alphabetically by family name, along with numerous more
6 http://en.wikipedia.org/wiki/Wiki Accessed on 02-01-07.
7 http://en.wikipedia.org/wiki/Wikipedia%3AAbout As of 01-03-07.
whimsical categories.8 For this work, the biographies of those who died in
the first three years of the twenty-first century were used, yielding 959 biographies.9 In line with the previous examples, a biography of the nineteenth-century scientist and mathematician Charles Babbage is reproduced
below:
Charles Babbage (26 December 1791 – 18 October 1871) was an English mathematician, analytical philosopher, mechanical engineer
and (proto-) computer scientist who originated the idea of a programmable computer. Parts of his uncompleted mechanisms are
on display in the London Science Museum. In 1991, working from
Babbage’s original plans, a difference engine was completed, and
functioned perfectly. It was built to tolerances achievable in the
19th century, indicating that Babbage’s machine would have worked.
Nine years later, the Science Museum completed the printer Babbage had designed for the difference engine; it featured astonishing
complexity for a 19th century device.
Charles Babbage was born in England, most likely at 44 Crosby
Row, Walworth Road, London. A blue plaque on the junction of
Larcom Street and Walworth Road commemorates the event. There
was a discrepancy regarding the date of Babbage’s birth, which was
published in The Times obituary as 26 December 1792. However,
days later a nephew of Babbage wrote to say that Babbage was born
precisely one year earlier, in 1791. The parish register of St. Mary’s
Newington, London, shows that Babbage was baptised on 6 January 1792 .
Babbage’s father, Benjamin Babbage, was a banking partner of the
Praeds who owned the Bitton Estate in Teignmouth. His mother
was Betsy Plumleigh Babbage. In 1808, the Babbage family moved
into the old Rowdens house in East Teignmouth, and Benjamin Babbage became a warden of the nearby St. Michael’s Church.
His father’s money allowed Charles to receive instruction from several schools and tutors during the course of his elementary education. Around age eight he was sent to a country school to recover
from a life-threatening fever. His parents ordered that his “brain
was not to be taxed too much” and Babbage felt that “this great
idleness may have led to some of my childish reasonings.” He was
sent to King Edward VI Grammar School in Totnes, South Devon,
a thriving comprehensive school still extant today, but his health
forced him back to private tutors for a time. He then joined a 30student academy under Reverend Stephen Freeman. The academy
had a well-stocked library that prompted Babbage’s love of mathematics. He studied with two more private tutors after leaving
the academy. Of the first, a clergyman near Cambridge, Babbage
8 Categories include “professional cyclists who died during a race”, “famous left handed people” and “people known as The Great”.
9 http://en.wikipedia.org/wiki/Lists_of_people Accessed on 02-01-07.
said, “I fear I did not derive from it all the advantages that I might
have done.” The second was an Oxford tutor from whom Babbage
learned enough of the Classics to be accepted to Cambridge. . .
http://en.wikipedia.org
Unlike the previously described biographical corpora, Wikipedia entries are
not subject to a strict editorial policy; nevertheless, most adhere to the intuitively plausible “biographical pyramid” idea (see Figure 3.2 on page 57), with name, location of birth, and birth and death dates in the first sentences. They are also homogeneous in the length of the first, summary paragraph, indicating that
while an externally imposed editorial policy is not in force, the expectations of
the Wikipedia community compel writers towards writing “model” biographies.
4.3.6 University of Southern California Corpus
The University of Southern California Corpus (USC Corpus)10 is a small biographical corpus consisting of 130 multi-paragraph biographies (the average
number of words being 1339 per biography). Unlike the other corpora considered here, the USC Corpus focuses on only ten named biographical subjects
(Curie, Edison, Einstein, Gandhi, Hitler, King, Mandela, Monroe, Picasso, and
a group of people, the Beatles). The biographies are harvested from various
online biographical websites and contain much repetitive, redundant information. Another distinctive feature of the corpus is that it has been tagged for
biographically relevant clauses throughout. Details of the annotation scheme
used are provided in Section 5.1.2 on page 103.
An extract from one of the ten biographies of Martin Luther King included in
the USC corpus is reproduced below:
Martin Luther King, Jr., <bio>(January 15, 1929-April 4, 1968)</bio> <bio>was born</bio> Michael Luther King, Jr., but later had his name changed to Martin. His grandfather began the family’s long tenure as pastors of the Ebenezer Baptist Church in Atlanta, serving from 1914 to 1931; his father has served from then until the present, and from 1960 until his death Martin Luther acted as co-pastor. <edu>Martin Luther attended segregated public schools in Georgia, graduating from high school at the age of fifteen</edu>; <edu>he received the B. A. degree in 1948 from Morehouse College</edu>, a distinguished Negro institution of Atlanta from which both his father and grandfather had been graduated. After three years of theological study at Crozer Theological Seminary in Pennsylvania where he was elected president of a predominantly white senior class, <edu>he was awarded the B.D. in 1951</edu>. With
10 The USC corpus was kindly provided by Ling Zhou & Eduard Hovy at the Information Sciences Institute, University of Southern California.
a fellowship won at Crozer, <edu>he enrolled in graduate studies
at Boston University, completing his residence for the doctorate in 1953 and receiving the degree in 1955</edu>. In Boston <personal>he met and married Coretta Scott</personal>, a young woman of uncommon intellectual and artistic attainments. <personal>Two sons and two daughters <bio>were born</bio> into the family</personal>. In 1954, Martin Luther <work>King accepted the pastorale of the Dexter Avenue Baptist Church in Montgomery, Alabama</work>.
USC corpus (Z HOU ET AL ., 2004)
The mark-up scheme used in the USC corpus is not applied consistently throughout the corpus. Sometimes clauses are identified, and sometimes biographical words. Some examples of the discrepancies in annotation are shown in Figure 4.1.
Figure 4.1: Discrepancies in Annotation Styles in the USC (Curie Biographies).
Tagged biographical clause: <education>Successfully passed</education> the examinations in medicine
Tagged biographical words (note that the corpus is reported as annotated at the clause level only): <education>inventor</education> <personal>sister</personal> <personal>father</personal> <fame>celebrity</fame> <personal>married</personal>
4.3.7 The TREC News Corpus
A one hundred megabyte extract from the 1998 APW (Associated Press Wire)
TREC (Text Retrieval Conference) corpus was used in this work.11 The corpus was used as a source of non-biographical data. Traditionally, local news
providers have subscribed to a news wire service, which provides a skeletal text
for breaking news stories; the local news provider then augments this story according to the house style. The news stories, although they frequently mention
persons, are not biographical in purpose. That is, biographical information is
sometimes presented in the context of a wider news story. In order to assess the
11 http://trec.nist.gov/ Accessed
on 02-01-07.
Table 4.1: Descriptive Statistics for Biographical Corpora.

                              Chambers         NZ          DNB         WW       Wiki       USC
No. Entries                        921        2977        36466      36687        959       130
Total No. Words                  96707     2878412     33853718    5280196     152481    174141
Mean Words Per Entry            105.39     1002.01       928.42     143.97     159.00   1339.55
Stan. Dev. Words Per Entry       51.32      480.44      1364.20      98.95     215.34   1185.76
No. Chars Per Word                5.18        6.01         6.55       5.35       4.57      6.09
Mean Sent Per Entry               5.58       32.12        34.89          -      17.58     79.40
Stan. Dev. Sent Per Entry         2.19       16.90        50.90          -      24.31     63.77
proportion of biographical sentences in the corpus, a sample of one thousand
sentences was taken. These sentences were then manually classified according to the criteria specified in Section 5.2.2 on page 112. It was found that
11% of sentences were classified as biographical, and 89% were classified as
non-biographical. Of the 11% of sentences classified as biographical, the vast
majority were examples of apposition.
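For illustration only (the original sampling scripts are not reproduced in the thesis body), drawing such a sample is a one-step operation once the corpus has been sentence split, one sentence per line; the file name below is hypothetical and the file is assumed to contain at least one thousand sentences.

#!/usr/bin/perl
# Illustrative sketch: draw a random sample of 1,000 sentences for manual
# classification from a one-sentence-per-line file.
use strict;
use warnings;
use List::Util qw(shuffle);

my $file = "apw_sentences.txt";   # hypothetical sentence-split TREC extract
open my $fh, '<', $file or die "Cannot open $file: $!";
my @sentences = <$fh>;
close $fh;

# Shuffle the sentences and keep the first 1,000.
my @sample = (shuffle @sentences)[0 .. 999];
print @sample;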
Topic coverage is wide; from sports results to the activities of US politicians and
international political entities (for example, the World Health Organisation).
An example entry is reproduced below:
LONDON (AP) The European Union will hold a two-day trade-boosting meeting with 12 Mediterranean countries in Parlermo, Italy,
this week, Foreign Secretary Robin Cook said Monday. Britain,
which currently hold the presidency of the EU, will chair the meeting with foreign ministers from Algeria, Cyprus, Egypt, Israel, Jordan, Lebanon, Malta, Morocco, Palestinian Authority, Syria, Tunisia
and Turkey. The meeting, starting Wednesday, is part of the so-called EuroMed process started at a conference in Barcelona, Spain,
in 1995 to boost political, economic and cultural links between the
EU and neighbors on the southern and eastern Mediterranean rim.
http://trec.nist.gov/
4.3.8 The B ROWN Corpus
The B ROWN corpus12 was developed by Francis and Kucera at Brown University and made available digitally in 1964. It is thus one of the earliest electronic corpora, and remains widely used in the twenty-first century.
12 A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. The B ROWN corpus is supplied with the Python Natural Language toolkit (lite) http://nltk.sourceforge.net Accessed on 02-01-07. An online manual for the B ROWN corpus is available at http://icame.uib.no/brown/bcm.html Accessed on 02-01-07.
The corpus consists of around one million words from 500 articles (approximately 2000
words per article). The corpus is balanced in that it consists of texts from various
sources in an attempt at providing a snapshot of written American English in
the 1960s (see Figure 4.2 on the following page for the text types presented in
the corpus). The version of the corpus used in this work was originally part-of-speech tagged, but these tags were removed for the purposes of the current research. Below is a brief extract from an article classified under “Popular Lore”
in the B ROWN corpus:
Yet, in spite of this, intensive study of the taped interviews by teams
of psychotherapists and linguists laid bare the surprising fact that,
in the first five minutes of an initial interview, the patient often reveals as many as a dozen times just what’s wrong with him; to spot
these giveaways the therapist must know either intuitively or scientifically how to listen. Naturally, the patient does not say, “I hate my
father”, or “Sibling rivalry is what bugs me” . What he does do is
give himself away by communicating information over and above
the words involved. Some of the classic indicators, as described by
Drs. Pittenger, Hockett, and Danehy in The First Five Minutes, are
these: ambiguity of pronouns, stammering or repetition of I, you,
he, she, et cetera signal ambiguity or uncertainty.
The B ROWN Corpus
4.3.9 The STOP Corpus
The STOP (Lancaster Speech, Thought and Writing Presentation) Corpus (S EMINO
and S HORT, 2004)13 was developed at Lancaster University in the 1990s. The
corpus consists of around 250,000 words, made up of 120 documents, each
around 2000 words long. The corpus is classified approximately equally into
three narrative genres (fiction, newspaper news, and (auto)biography) (see Figure 4.4 on page 100, note that the leaves of the tree depicted represent example
texts used in the corpus) and is heavily annotated according to the theory of
speech and thought presentation described in L EECH and S HORT (1981) (see
Section 2.2.1 on page 23 for more on the approach to stylistics presented by
L EECH and S HORT (1981)). The annotation used in the corpus was not utilized
in the current work — all annotation was stripped before processing — and
will not be discussed here. However, a truncated example entry (with markup)
is provided in Figure 4.3 on page 99.
13 The
corpus manual is available at:
http://www.comps.lancs.ac.uk/computing/users/eiamjw/stop/handbook/ba.html
(Accessed on 02-01-07) and the corpus itself is available from the Oxford Text Archive.
Figure 4.2: B ROWN Corpus Hierarchy of Text Types
BROWN CORPUS (500 texts)
Informative prose (374 texts): Reportage (44 texts), Editorial (27 texts), Review (17 texts), Religion (17 texts), Skills & Hobbies (36 texts), Popular Lore (48 texts), Belles Lettres (75 texts), Misc (30 texts), Learned (80 texts)
Imaginative prose (126 texts): General Fiction (29 texts), Mystery (24 texts), Science Fiction (6 texts), Adventure (29 texts), Romance (29 texts), Humour (9 texts)
Figure 4.3: Truncated Example Entry from the STOP Corpus (S EMINO and
S HORT, 2004): Michael Caine’s Autobiography
<header> Author: Michael Caine
Title: What’s it all about?
Date: 1992
Publisher: Century
B Michael Caine (first person narrator)
C Rene Clement
G Harry Salzman
J “the powers that be”(in filmmaking)
K Johnny Morris
L Spanish policeman
M Brigitte Bardot
O Eric Sykes
U Sean Connery
X unknown
Y “ the British” [troops in the film ’Play Dirty’]
</header> <body> <head> Bardot tries it on </head> <pb n=242> <sptag cat=N next=NRSAP whonext=X s=1 w=19> I have been in over seventy-three films in thirty years and by the time you read this it will probably be seventy-six.
<sptag cat=NRSAP who=X next=N s=1 w=15> People often criticise me for not being discriminating enough and even for working so hard. <sptag cat=N next=NI whonext=B s=17+0.82 w=454> Why bother? As
far as discrimination is concerned I have a definite standard by which I choose films:
I choose the best one available at the time I need one. Of course this has often led
me down dubious artistic paths, but even they are not without their advantages. It is
much more difficult to act well in a bad film with a bad director than in any other type
of movie and it gives you great experience in taking care of yourself. It also means that
when a good script does turn up you’re ready for it. It’s not unlike athletes in training
who will practise running on sand so they find it easy to run on a solid track in competition. Plus of course there’s the money. You get paid the same for a bad film as you
do for a good one- because no one knows for sure if the bad film is going to be bad or
the good film is going to be good until the premiere. You can wind up, as I do when
a good role comes along, absolutely prepared, having worked right up to date, or you
can sit there waiting for it for five years, scared </body>
Figure 4.4: Hierarchy of Texts Included in the STOP Corpus (S EMINO and S HORT, 2004)
STOP CORPUS
News: Serious (Guardian, Independent); Popular (Mirror, Star)
Fiction: Serious (Greene, Brighton Rock; Woolf, Night & Day); Popular (Lewis, Get Carter; Peters, The Holy Thief)
(Auto)biography: Serious (C. S. Lewis, biography; Laurie Lee, autobiography); Popular (Michael Caine, autobiography; Doris Stokes, autobiography)
Chapter 5
Developing a Biographical Annotation Scheme
In order to produce a set of gold standard biographical sentences for machine
learning experiments, an annotation scheme specifying a procedure for identifying biographical sentences was developed, and a small corpus then created
based on this annotation scheme.
This chapter describes the development of a biographical annotation scheme
and corpus. First, three existing annotation schemes for tagging biographical
texts are outlined. Second, a new scheme is described, informed by existing
schemes, but aimed at ease of use in annotating biographical texts. This scheme
is then tested against biographical data, in order to assess whether the scheme can be used to comprehensively annotate short biographical texts. Finally, a small biographical corpus, based on the annotation scheme, is described.
5.1 Existing Annotation Schemes
There are numerous annotation schemes in existence that have some relevance
to describing significant life events for individuals. A review of specialist annotation schemes relevant to biography has been produced by the Text Encoding Initiative1; these include schemes designed to represent genealogical data,
inter-family relationships and archaeological artifacts.
This section reviews in detail the annotation schemes specified by the Text Encoding Initiative, the University of Southern California (Z HOU ET AL ., 2004)
and the scheme used as a guideline to contributors to the Dictionary of National
Biography (OUP, 2003).
1 Report on XML Markup of Biographical and Prosopographical data, http://www.tei-c.org.
Accessed on 01-08-06. Prosopography is a research method in history which examines the relationships between historical figures in order to identify common experiences (among other things).
5.1.1 Text Encoding Initiative Scheme
The Text Encoding Initiative (TEI) 2 publishes SGML and XML standards for
the description of textual data for the humanities and allied areas. As part of
their standard, the TEI publishes a special purpose module for “the encoding
of proper names and other phrases descriptive of persons, places or organisations, and also of dates and times.”3 Although the TEI standard tag set
allows for the identification of proper names, it does not allow for the tagging
of a string’s constituent parts (for example, forename, family name, and so
on). The TEI Names and Dates module does however allow for this more detailed level of analysis.4 The most directly relevant specifically biographical
part of the Names and Dates TEI module, is section 20.4 — Biographical and
Prosopographical Data. The authors of the scheme envisage three possible usage
situations:
1. The conversion of existing biographical records (for example, the Dictionary of National Biography).
2. The creation of structured biographical data from a document collection
or corpus.
3. The creation of biographical (or curriculum vitae like) data structures in
business contexts (for example, human resources).
The scheme is built around three “basic principles”:
1. Personal characteristics or traits are the qualities of an individual not under that individual’s control. These include sex, ethnicity, eye colour.
2. Personal states are (among others) marital state, occupation, and place
of residence. These states are temporally extended, normally having a
clear beginning and a clear end (for example, marriage, living in a certain
location and so on) and normally reflect the choice of the individual.
3. Events are changes in personal states associated with a specific date (or
narrow range of dates).
The TEI scheme divides its biographical tags into three groups, reflecting the
division between personal characteristics, personal states and events. These
three tag groups are described in detail below:
Personal Characteristics
– <faith>: refers to an individual’s religious beliefs.
– <langKnowledge>: describes a person’s language knowledge (for example, languages spoken).
2 For
the many activities of the Text Encoding Initiative Consortium see
http://www.tei-c.org. Accessed on: 01-08-06
3 The full guidelines for the TEI annotation scheme are available at www.tei-c.org. Accessed
01-08-06.
4 Section 20.1, Personal Names of the TEI guidelines describes this facility.
– <langKnown>: describes the person’s knowledge of a given language of interest.
– <nationality>: describes nationality (or previous nationality).
– <sex>: describes the sex of the person.
– <socecStatus>: socio-economic status; describes a person’s social or economic status.
– <persTrait>: a general tag that describes any personality trait of interest (for example, “She was known for her generosity”).
Personal States
– <persName>: contains the person’s name or part thereof (including titles, honourifics, and so on).
– <relation>: describes relationships (family, professional, social).
– <occupation>: describes a person’s job, occupation or career.
– <residence>: details a person’s place of residence (or past place of residence).
– <affiliation>: describes a person’s relationship with some organisation.
– <education>: describes a person’s educational experiences.
– <floruit>: describes a person’s “flourishing” period (that is, the period in their life in which they were productive).
Personal Events
– <birth>: details information about a person’s birth (for example, location, date).
– <death>: details information about a person’s death (for example, location, date).
– <persEvent>: a general tag that describes any event (excluding birth and death) of significance or importance in the life of that person.
5.1.2 University of Southern California Scheme
A small annotated biographical corpus already exists (Z HOU ET AL ., 2004) at
the University of Southern California and has been used in this work (see page
94 for a description of this corpus). The annotation scheme and corpus are
unsuitable for use as gold standard data, however, as:
1. The corpus is created entirely from explicitly biographical texts (that is,
biographical articles from the web) rather than general texts that contain biographical information. Texts are harvested from a single type of
source. There may be differences between web based and published biographies. For example, published biographies may adopt a more formal
style.
2. The annotation scheme developed by Z HOU ET AL . (2004) is underspecified and hence applied inconsistently to the corpus.
Z HOU ET AL . (2004) uses nine factors, identified from biographical texts. Broad-brush categories are described, but the fine points of how categorisation decisions are made in difficult cases are not supplied. It can be noted that annotation styles differ considerably within the Z HOU ET AL . (2004) corpus: biographical clauses are sometimes tagged (this is the stated aim of the corpus) and sometimes biographical words are tagged. For example:
<work>King was ordained in 1947 and became (1954) minister of a Baptist church in Montgomery, Alabama</work>
Marilyn <work>appeared</work> in the <work>production</work> of George Cukor’s Let’s Make Love.
See Figure 4.1 on page 95 for more examples of inconsistent tagging in the USC
corpus.
Z HOU ET AL . (2004) used nine annotation categories (<bio>, <fame>, <personality>, <personal>, <social>, <edu>, <nation>, <scandal> and <work>). See page 94 for more details of these categories.
XML style tags were used for each of the nine categories (although the documents were not validated using XML technology) (W YNNE, 2004). It is assumed that if a sentence or clause is not tagged, then it is non-biographical.
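Assuming the angle-bracketed tag style shown in the extracts above and below, clause-level annotations of this kind can be pulled out with a single pattern; the following sketch is illustrative only and does not form part of the project’s tool chain.

#!/usr/bin/perl
# Illustrative sketch: extract tagged clauses from USC-style markup.
# Assumes well-formed, non-nested, angle-bracketed tags.
use strict;
use warnings;

my $text = 'In Boston <personal>he met and married Coretta Scott</personal>, '
         . 'a young woman of uncommon intellectual and artistic attainments. '
         . '<edu>he was awarded the B.D. in 1951</edu>.';

while ( $text =~ m{<(\w+)>(.*?)</\1>}gs ) {
    print "$1\t$2\n";   # category, then the tagged clause
}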
The nine categories focus on essential facts, suitable for inclusion in a short
summary about a person’s life. Further, biographical facts may be embedded
in a document that is primarily concerned with another individual (for example, there may well be biographical sentences referring to Tony Blair in an article about George Bush). Additionally, biographical facts may be included as
incidental information in general texts (for example, a general news story may
include biographical information about “British Deputy Prime Minister, John
Prescott”). The nine categories are detailed below, along with examples taken
from the USC corpus. The annotation aims to pick out biographical information. The subject of the biographical information — who it is about — is not
relevant.
BIO : Information on birth and death. Clause may also contain information about location:
– “? was born in Warsaw on March 14th 1879”
– “was born on Jan 16th 1879”
– “Thomas Alva Edison was born to Sam and Nancy on Feb 11th
1947”
– “He died in Princeton”
– “Einstein died in Princeton on April 18th 1955”
– “Marie Curie died of leukemia in 1934”
– “Marie Curie died at the age of 67”
– “Hitler’s mother died in 1907”
FAME : What a subject is famous for. This kind of information is broadly
positive (for example, awards, honours, achievements). More negative
notable events (notoriety) come under the scandal heading.
– “was awarded the 1964 Nobel Peace Prize for his efforts”
– “Hitler received the Iron Cross, 2nd class”
– “King was the youngest man to receive the Nobel Peace Prize”
– “Ghandi is that rare great man held in universal esteem”
– “Ghandi became the international symbol of India”
– “He asked the whole nation to strike for one day.”
CHARACTER : Attitudes, qualities, character traits, political or religious
attitudes.
– “Edison was conservative”
– “Einstein, though not religious, was a believer”
– “She was always exceedingly modest about her achievements”
– “His mental abilities and powers of concentration were extraordinary”
– “The young Hitler was a resentful, discontented child”
PERSONAL : Information concerning relationships with intimate partners, parents, siblings, children, friends. Also, non-fatal illnesses.
– “She was also her mother’s faithful companion”
– “She had recently got married”
– “He had six children, three by each wife”
– “They were married in 1953 and would have four children”
– “His parents rejected her because of her family’s impoverished financial situation”
SOCIAL : Introduction to friends, partners, collaborators and colleagues.
Changes in location and social milieu.
– “During 1923, he visited Palestine.”
– “Edison met Eadweard Muybridge [sic]5 at West Orange”
EDUCATIONAL : Institutions attended, dates, evaluative judgements
on time in education, educational choices. Types of education.
– “Marie attended science classes”
– “In 1896 he entered the Swiss Federal Polytechnic School”
– “He majored in sociology and in his junior year decided to enter the
ministry”
NATIONALITY : References to a person’s or persons’ nationality.
– “Einstein renounced German citizenship”
– “He became a citizen of the United States”
– “He was unkind to his first wife, Serbian physicist, Mileva Marick”
SCANDAL : Understood as reasons for fame that are negative:
– “Marie’s critics have charged that she neglected her children while
younger.”
– “Bobby Kennedy was also reported to have had an affair with Marilyn.”
– “Paul Langevin challenged the editor of the newspaper to a duel
with pistols”
WORK : This includes references to position and job titles (including apposition):
– “In addition to teaching, Curie also began to spend time in the laboratory.”
– “During 1958, he published his first book.”
– “Edison’s company produced over 1700 movies.”
– “British Chancellor, Gordon Brown.”
5 Eadweard
Muybridge (1830-1904) was a pioneer photographer who was born in England, but
spent most of his adult life in the United States. A recent biography is entitled, The Man Who
Stopped Time: The Illuminating Story of Eadweard Muybridge: Pioneer Photographer, Father Of The Motion Picture, Murderer, (C LEGG , 2007).
5.1.3 Dictionary of National Biography Scheme
The Dictionary of National Biography (DNB) biographical scheme — which is
not strictly speaking an annotation scheme, but rather a set of guidelines for
biography writers (OUP, 2003) — stipulates that biographies should contain
“standard factual components” when these are available. These “standard factual components” fall into four categories: personal data, family data, career
and sources of information.6 Each category contains a number of “standard
factual components”, some of which are obligatory (required) and some optional:
Personal Data
Required:
– Name: full names, alternative names, nicknames, short forms.
– Full dates of birth (or as second best, baptism), death and burial.
– Titles: aristocratic titles, knighthoods, baronetcies, high ecclesiastical titles, and so on.
– Places of birth (or, as second best, baptism), death and burial: addresses should be given if possible, and places identified by country,
modern place name, or other means.
– Places of settled residence: addresses should be given if possible
and places identified by county, modern place name or other means
if necessary.
– Cause of death: disease, condition or other cause; where possible a
contemporary report should be supplied, with subsequent interpretation, if any.
Optional:
– Physical appearance.
– Character traits.
Family Data
Required:
– Father: full names, alternative names, titles, vital dates (years only),
occupation.
– Mother: maiden name, alternative names, titles, vital dates (years
only), occupation (when other than “wifely”).
– Subject’s spouse(s) or partner(s) other than spouse (common-law
spouse, mistress, established lover): full names, for women maiden
6 The “sources of information” section has been ignored as it deals primarily with bibliographical data and archived material.
name and former name if previously married, titles, vital dates (years
only), occupation, date of marriage or start of the relationship, date
of its dissolution.
Optional:
– Subject’s place in the family: number of sisters and brothers, seniority in relation to the subject.
– Children: number, name(s) of parent(s) where the subject had more
than one spouse or partner, more information if relevant.
Career
Required:
– Religious affiliation(s): faith and sect, degree of adherence, evidence
of lack of religious affiliation.
– Geographical/ethnic interest: countries, regions, and cultures with
which the subject was associated, and which had an impact on his/her
life and career.
– Place(s) of education: school, college, university, Inn of Court, apprenticeship, and so on, with dates of attendance; degrees or other
awards and qualifications with dates.
Optional:
– Occupation(s).
– Offices and ranks held (with dates): precise dates (day, month, year)
of appointment to major offices should be given.
– Honours conferred (with dates): the number listed should be determined by importance relative to other information in the text.
– Works by the subject: major works, with summary of minor works.
– Historiographical context: comment on significance and changing
historiographical reputation (depending on the importance of the subject).
The opening of an entry follows a rigidly defined format, whereas other factual components (obligatory and non-obligatory) can be integrated in the prose
structure as required. The New Dictionary of National Biography: Notes for Contributors Handbook (OUP, 2003), uses the example of Gladstone’s biography:
Gladstone, William Ewart (1809-1898), statesman and author, was born on 29 Dec 1809 at 62 Rodney Street, Liverpool, the fifth of six children of John Gladstone (1764-1861), merchant and MP, and his second wife, Anne (1773-1835), daughter of Provost Andrew Robertson of Dingwall and his wife, Annie.
The rigidly defined format is given schematically in Figure 5.1.
Figure 5.1: Dictionary of National Biography Opening Schema.
subject family name, subject forename, (year of birth - year of death), job role/s, date of birth, year of birth, location of birth, birth order, father name (year of birth - year of death), father job role/s, mother name (year of birth - year of death), mother’s father name, mother’s father’s location, mother’s name.
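As a simple illustration of how rigid this opening format is (and not a description of any processing actually carried out in the thesis), a single regular expression is enough to recover the schematic fields from an old DNB style opening; the abridged entry text and the field labels below are supplied only for the example.

#!/usr/bin/perl
# Illustrative sketch: recover the schematic opening fields of a DNB-style
# entry (family name, forename, birth/death years, occupation description).
use strict;
use warnings;

my $opening = "Babbage, Charles 1792-1871, mathematician and "
            . "scientific mechanician, was the son of Mr. Benjamin Babbage";

if ( $opening =~ /^(\w+),\s+(\w+)\s+(\d{4})-(\d{4}),\s+([^,]+),/ ) {
    print "Family name: $1\n";
    print "Forename:    $2\n";
    print "Born:        $3\n";
    print "Died:        $4\n";
    print "Occupation:  $5\n";
}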
5.2 Synthesis Annotation Scheme
This section describes the issues and considerations involved in developing a
biographical annotation scheme for the purposes of this work. After describing the scheme, and how it differs from the existing biographical schemes described in Sections 5.1.1, 5.1.2 and 5.1.3, the new scheme will be tested on short
biographies in order to assess how well it accounts for biographical data.
5.2.1 Developing a New Biographical Scheme
The new scheme must be simple enough for people to annotate quickly, confidently and effectively without a prolonged training period (that is, it must not
be too “deep” in structure).
The Dictionary of National Biography’s three top level categories subsume six
of the University of Southern California’s schemes categories (that is, bio ,
nation , character , work , education and personal ), but do not
account well for the three remaining categories in the University of Southern
California scheme ( fame , social and scandal ). For instance, it is conceivable that a person could be famous due to his or her family background (for
example, someone related to the British Queen), for their career (for example,
the British Prime Minister) and for some personal fact (for example, an aristocratic title, which would be counted as a personal fact under the Dictionary
of National Biography scheme). The relationship between the Dictionary of National Biography categories and the University of Southern California categories
is depicted in Figure 5.2 on the following page.
Figure 5.2: Relationship Between the Dictionary of National Biography and University of Southern California Biographical Schemes. (Diagram mapping the DNB categories Personal, Career and Family onto the USC categories <bio>, <personality>, <nation>, <education>, <work>, <fame>, <social>, <personal> and <scandal>.)
The TEI scheme covers similar ground to the University of Southern California scheme, but is more explicitly detailed. It also includes the categories
<persTrait> and <persEvent>. These are general-purpose tags that can be
used to describe any character trait or event in a person’s life that seems significant to the annotator. This catch-all approach is less appropriate in a situation
where we are assessing inter-annotator agreement and wish to produce a set of
clear guidelines for making annotation decisions.
Elements of all three schemes were used to construct a biographical annotation
scheme designed for both ease and consistency of annotation (see Figure 5.3 on
the next page for a comparison between the new scheme and the USC and TEI
schemes7 ). Some differences between the USC scheme and the new scheme
are summarised below:
– In the new scheme, the (USC) <nation> tag is discarded, and nationality information is contained in the <key> tag (Key Life Facts).
– In the new scheme, cause of death is included in the <key> tag.
– In the new scheme, place of settled residence is included in the <key> tag.
7 The Dictionary of National Biography scheme is a set of authorial guidelines rather than an annotation scheme proper, and therefore is not included in Figure 5.3.
– The USC tag <social> is dropped, as this kind of information is included in <relationships> in the new scheme.
– The USC tag <scandal> has been dropped from the new scheme, as scandal can be understood as “negative” fame (that is, infamy, notoriety).
Some differences between the Text Encoding Initiative scheme and the new
scheme are given below:
– Several TEI biographical tags relating to key life facts (that is, the TEI tags <nation>, <persName>, <residence>, <birth> and <death>) are subsumed in the new <key> category.
– Some tags in the TEI scheme — like <floruit>, the period in which a person “flourishes” — have not been retained in the new scheme.
– The TEI tags <occupation> and <affiliation> are subsumed in the (new scheme) tag <work>.
Figure 5.3: Relationship Between New Synthesis Scheme, Text Encoding Initiative Scheme, and University of Southern California Scheme. (Diagram with three columns, USC, New Scheme and TEI, showing how the tags of the new six-way scheme (<key>, <fame>, <character>, <relationships>, <education> and <work>) relate to the corresponding USC and TEI tags.)
5.2.2 A Synthesis Biographical Annotation Scheme
The synthesis annotation scheme, unlike the USC annotation scheme discussed
above, is designed to identify biographical text at the sentence level. This strategy has the advantage that, unlike clauses, sentences can be straightforwardly
identified both by people and automatically. A disadvantage of this sentence
level approach however is that biographical information may form only part of
a multi-clause sentence. For instance the sentence “The Economist Intelligence
Unit compiled the index on behalf of the Australian IT entrepreneur and philanthropist, Steve Killelea, who said he hoped it would encourage nations to
address the issue of peace”8, in addition to the biographical appositive phrase describing Steve Killelea, contains non-biographical information about the Economist Intelligence Unit and “the issue of peace”. According to the synthesis biographical
scheme, the entire sentence would be tagged as biographical (<work>).
The six tag biographical scheme is presented below:
<key>: Key information about a person's life course:
– Information about date of birth, or date of death, or age at death.
– Names and alternate names (for example, nicknames).
– Place of birth: “Orr was born in Ann Arbor, Michigan but was raised in
Evansville, Indiana”.
– Place of death: “He died of a heart attack while holidaying in the resort
town of Sochi on the Black Sea coast”.
– Nationality: “He became a naturalized citizen of the United States in
1941”.
– Cause of death: “He died of a heart attack in Bandra, Mumbai”.
– Longstanding illnesses or medical conditions: “He stepped down from
the position on grounds of poor health in February 2004”.
– Place of residence: “Sontag lived in Sarajevo for many months of the
Sarajevo siege”.
– Physical appearance: “With his movie star good looks he was a crowd
favourite”.
– Major threats to health and wellbeing (for example, assassination
attempts, car crashes).
<fame>: What a person is famous for. This kind of information can be broadly positive (for example, rewards, prizes, honours, and so on) or negative (for example, scandal, jail terms, and so on). Examples of <fame> tags include:
8 http://www.guardian.co.uk Accessed
on 01-05-07.
– “His study of Dalton won him the Whitbread prize”
– “In 1976 heroin landed him in Los Angeles County Jail, where he spent
two months for possession of narcotics”
<character>: Attitudes, qualities, character traits and political or religious views. For example:
– “He was raised Catholic, the faith of his mother”
– “Jones is recalled as a gentle and unassuming man”
<relationships>: Information concerning relationships with intimate partners and sexual orientation; relationships with parents, siblings, children and friends.
– “Her mother died when she was eleven”
– “Nine people testified against him at his trial, including another wife he
tried to set on fire”
<education>: Institutions attended, dates, educational choices, qualifications awarded (with dates if available). General comments on educational experiences. For example:
– “Corman studied for his master’s degree at the University of Michigan, but
dropped out when two credits short of completion”
<work>: References to positions, job titles, affiliations (for example, employers), lists of publications, films or other work-orientated achievements. General areas of interest (for example, industries, sectors, geographical regions).
– “He returned to England in 1967 to work for the offshore pirate radio station Wonderful Radio, London”
5.2.3 Assessing the Synthesis Annotation Scheme
One obvious method of exploring the utility of this scheme is to assess whether
it successfully accounts for “real world” biographical data available in short,
information-packed biographical summaries (for instance the short biographical entries found in Wikipedia biographies). That is, if all (or most) sentences
in short biographies can be tagged using the new scheme, then on the face of
it, this seems to indicate that the scheme is worth developing, testing more rigorously, and using as a standard for the development of gold standard data. In
other words, if the new annotation scheme covers or accounts for the sentences
in short biographical texts, then that is a first step to showing that sentences
tagged using the scheme are biographical.
As a first step towards assessing the scheme, four self contained biographies
were obtained from several different sources. Two of the biographies were
multiple paragraph texts (Philip Larkin and Alan Turing, both from the Dictionary of National Biography) and two were single paragraph biographies (Paul
Foot and Ambrose Bierce, from Wikipedia and Chambers Dictionary of Biography
respectively). A truncated example biography of Alan Turing (annotated using the new scheme) is shown in Figure 5.4 on the following page and all four
marked up biographies are reproduced in Appendix F on page 265. Note that
only two sentences out of twenty-three are unaccounted for using the new six
tag annotation scheme for the Alan Turing Text (these are shown in Examples
5.1 and 5.2).
(5.1) “He tackled the problems arising out of the use of this machine with a
combination of powerful mathematical analysis and intuitive short cuts
which showed him at heart more of an applied than a pure
mathematician”
(5.2) “He suggested that machines can learn and may eventually ‘compete
with men in all purely intellectual fields’ ”
Table 5.1 on page 116 summarises the data from the initial analyses of annotation scheme coverage on different data sources. Note that for both the single
paragraph biographies — Bierce and Foot — total coverage was achieved (that
is, all the sentences could be tagged using the six tags from the new scheme).
For the multi-paragraph Dictionary of National Biography biographies — Larkin
and Turing — some sentences were not accounted for by the scheme. In the
case of Larkin 30.8% of sentences in the entry could not be accounted for. The
figure for the Turing entry was 8.7%. Note that the Larkin biography was by
far the longest biography considered (almost twice the length of the Turing
entry). On the basis of this data, the scheme accounts less well for longer biographical essays than for short, punchy biographies. As outlined in Chapter
2, short biographies contain more focused biographical text, centering on key
facts about an individual, rather than the discursive patterns obvious in essay
or book length biographical texts.
As a second step towards assessing the scheme — after having achieved indicative results that the annotation scheme accounts for short biographies —
the scheme was tested on four Wikipedia biographies: Jack Anderson, Kerry
Packer, Richard Pryor and Stanley Williams.9 Table 5.2 presents the results of
this annotation exercise. It can be seen that for three of the four biographies
analysed, coverage was total. One sentence in one of the entries was unaccounted for (see Example 5.3) as it referred to a posthumous event. For an
example of a Wikipedia biography annotated according to the new scheme, see
Figure 5.5. Note that all four annotated Wikipedia biographies are reproduced
in Appendix F.
(5.3) “A few months after his death, the FBI attempted to gain access to his
files as part of the AIPAC case on the grounds that the information could hurt U.S. government interests”
9 All biographical subjects are classified as December 2005 deaths in the Wikipedia categorisation system: www.wikipedia.org. Accessed 01/08/06.
Figure 5.4: Entry for Alan Turing in the Dictionary of National Biography Annotated Using New Six Way Scheme.
relationships work key Turing, Alan Mathison, 1912-1954, mathematician, was born in London 23 June 1912, the younger son of Julius Mathison Turing, of the Indian Civil Service, and his wife, Ethel Sara, daughter of Edward Waller Stoney, chief engineer of the Madras and Southern Mahratta Railway. /key /work /relationships
relationships G. J. and G. G. Stoney were collateral relations. /relationships
character education He was educated at Sherborne School where he was able to fit in despite his independent unconventionality and was recognized as a boy of marked ability and character. /education /character
education He went as a mathematical scholar to King's College, Cambridge, where he obtained a second class in part i and a first in part ii of the mathematical tripos (1932-4). /education
education He was elected into a fellowship in 1935 with a thesis “On the Gaussian Error Function” which in 1936 obtained for him a Smith's prize. /education
fame In the following year there appeared his best-known contribution to mathematics, a paper for the London Mathematical Society “On Computable Numbers, with an Application to the Entscheidungsproblem”, a proof that there are classes of mathematical problems which cannot be solved by any fixed and definite process, that is, by an automatic machine. /fame
fame His theoretical description of a “universal” computing machine aroused much interest. /fame
work After two years (1936-8) at Princeton, Turing returned to King's where his fellowship was renewed. /work
fame work But his research was interrupted by the war during which he worked for the communications department of the Foreign Office; in 1946 he was appointed O.B.E. for his services. /work /fame
work The war over, he declined a lectureship at Cambridge, preferring to concentrate on computing machinery, and in the autumn of 1945 he became a senior principal scientific officer in the mathematics division of the National Physical Laboratory at Teddington. /work
work With a team of engineers and electronic experts he worked on his “logical design” for the Automatic Computing Engine (ACE) of which a working pilot model was demonstrated in 1950 (it went eventually to the Science Museum). /work
work In the meantime Turing had resigned and in 1948 he accepted a readership at Manchester where he was assistant director of the Manchester Automatic Digital Machine (MADAM). /work
As a third step towards assessing the scheme, one thousand sentences were
sampled from five data sources (summarised in Table 5.3 on page 118). These
one thousand sentences were then classified by the researcher using the annotation scheme. The proportion of biographical sentences for each sample is
presented in Table 5.3 on page 118. The data sources used, and the reasons for
using them, will be described in turn.
DNB-5 On the basis of an analysis of the Dictionary of National Biography (see
Section 4.3.1 on page 87) it was observed that the first five sentences of each biographical entry consistently contain facts characteristic of biographies (<key> data, according to our biographical scheme; birthdates, marital status, and so on). Intuitively, it seemed that this subsection of the Dictionary of National Biography would provide a good source for testing whether the annotation scheme developed captures exemplary biographical writing.
Table 5.1: Coverage of New Annotation Scheme Using Different Sources.

source      name    words  sentences  key  fame  char  relation  edu  work  unclas
Chambers    Bierce  129    7          3    0     0     0         0    5     0
DNB         Larkin  1213   32         2    4     3     2         3    10    10
DNB         Turing  635    23         3    4     4     3         3    8     2
Wikipedia   Foot    552    19         2    3     0     2         1    12    0
Table 5.2: Coverage of New Annotation Scheme on Short Wikipedia Biographies (Deaths in December 2005).

name      words  sentences  key  fame  char  relation  edu  work  unclas
Anderson  224    10         3    4     2     1         0    4     1
Packer    81     4          1    3     1     0         0    1     0
Pryor     218    9          1    4     1     0         0    3     0
Williams  197    8          4    7     1     0         0    0     0
Figure 5.5: Wikipedia Biography for Richard Pryor Annotated Using New Six Way Scheme.

work His catalog includes such concert movies and recordings as Richard Pryor: Live Smokin' (1971), That Nigger's Crazy (1974), Bicentennial Nigger (1976), Richard Pryor: Wanted Live In Concert (1979) and Richard Pryor: Live on the Sunset Strip (1982). /work
work He also starred in numerous films as an actor, usually in comedies such as the classic Silver Streak, but occasionally in the noteworthy dramatic role, such as Paul Schrader's film Blue Collar. /work
work He also collaborated on many projects with actor Gene Wilder. /work
fame He won an Emmy Award in 1973, and five Grammy Awards in 1974, 1975, 1976, 1981, and 1982. /fame
fame In 1974 he also won two American Academy of Humor awards and the Writers Guild of America Award. /fame
DNB-R Sentences from the Dictionary of National Biography that are not one of
the first five sentences of an entry were selected as it was hypothesised
that a lower proportion of these sentences would be biographical according to the scheme presented here. Entries in the Dictionary of National Biography typically become more discursive after the initial few sentences,
sometimes dwelling on historical background or context.
TREC-U Sentences from the TREC corpus (a corpus of news text, see Section 4.3.7 on page 95) were sampled as it was hypothesised that news
text would contain relatively few sentences containing biographical information (according to the scheme used here), and that the biographical
information present would mainly be in the form of apposition (for instance, “job title, name”).
TREC-F Sentences from the TREC corpus that do not contain person names
or personal pronouns. That is, the TREC corpus was first filtered to remove sentences containing person names and personal pronouns, then
1000 sentences from the remaining sentences were sampled. It was hypothesised that this sample would produce a very small proportion of
biographical sentences (close to zero).
CHA-A Sentences from the Chambers Biographical Dictionary (see page 89) were
used on the intuition that the short, information packed entries would
contain a high proportion of biographical sentences according to the biographical scheme.
The results of the analysis (see Table 5.3 on the next page) show that those sentences harvested from sources where we would reasonably expect a high density of biographical sentences (that is, CHA-A and DNB-5) score very highly
(93.5% for CHA-A and 84.9% for DNB-5). Those sentences taken from data
sources where we would expect a lower proportion of biographical sentences
(that is, DNB-R and especially TREC-U) have a much lower proportion of biographical sentences (29.1% for DNB-R, and 11.1% for TREC-U). Note that DNB-R contains a higher proportion of biographical sentences than TREC-U. This is
perhaps because, although DNB-R contains a lower proportion of biographical
sentences than DNB-5, the sentences are from a biographical dictionary, and
hence likely to contain a higher proportion of biographical text than newstext
(that is, TREC-U). The data source that includes no person names or personal
pronouns is TREC-F, consisting of only 0.6% biographical sentences. This result
is explained by the difficulty of expressing biographical information without
the use of names or pronouns.
This section has indicated through three different approaches that the annotation scheme developed is adequate to act as a provisional annotation scheme. First, the annotation scheme was tested on biographical texts
from different sources in order to establish the “coverage” of the scheme. Second, coverage of the scheme on a small set of short Wikipedia biographies was
assessed. Finally, the scheme was used to classify sets of 1000 sentences from
disparate sources (for example, published biographies and newswire text), in
order to test whether a higher proportion of sentences from explicitly biographical text (like Chambers) were accounted for by the biographical scheme, compared to newswire text.
Table 5.3: Percentage of Biographical Sentences Based on 1000 Sentence Sample.

Data Source  Description                                                                                Proportion Bio
CHA-A        Entries from the Chambers Dictionary of Biography                                          93.5%
DNB-5        The first five sentences of entries in the Dictionary of National Biography                84.9%
DNB-R        All but the first five sentences of entries from the Dictionary of National Biography      29.1%
TREC-F       Entries from a subset of the TREC corpus containing no person names or personal pronouns   0.6%
TREC-U       Entries from a subset of the TREC corpus that includes person names and personal pronouns  11.1%
Cumulatively then, the success of these three approaches to validating the synthesis biographical scheme suggests that the scheme is appropriate for the annotation of short biographies. The work conducted in Chapter 6 on page 124
(Human Study) on assessing the agreement of several annotators also provides support for the adequacy of the scheme.
5.3 Developing a Small Biographical Corpus
This section describes the creation of a small corpus of texts annotated using
the six tag biographical scheme described in Section 5.2.2 on page 112.10 The
corpus consists of 84,305 word tokens from 80 different documents.11 First, the
four sources of texts are described and an example of annotated text given for
each source. Second, some issues involved in creating the corpus and descriptive statistics are presented.
5.3.1 Text Sources
Four text sources were used: news text from The Guardian12 newspaper, text from BBC obituaries, obituaries from The Guardian newspaper, and finally literary texts selected from the multi-genre STOP Corpus.13

10 To access the corpus, email [email protected]
11 The texts were annotated using the SGML-aware EMACS text editor.
12 http://www.guardian.co.uk Accessed on 02-01-07.
Guardian News Text
Texts were sampled from The Guardian newspaper online edition on three days (11-08-06 (13 documents), 12-09-06 (12 documents), and 24-09-06 (12 documents)). The Guardian is a “serious” British newspaper known for its moderate left-wing bias. News items only were chosen, though theme or subject was
not restricted. A short extract from an article describing events in the British
Home Office is reproduced below:
work John Reid, the home secretary, called for solidarity “across
all sections of the community” today in the face of the “immense”
terrorist threat facing Britain. /work Mr Reid used a press briefing to announce that the “critical” terrorist alert would remain as a “precautionary measure” until further
notice.
work Both he and Douglas Alexander, the transport secretary,
will be meeting with national aviation security representatives later
today. /work http://www.guardian.co.uk
Guardian Obituaries
Seventeen obituaries were sampled from The Guardian newspaper from the first
half of 2006. The obituaries include those of prominent lawyers, civil servants,
diplomats and journalists (see Section 2.3 on page 30 for more on obituaries).
The extract below is from the obituary of the musician Gene Simmons:
fame key Among the others who created memorable rockabillystyle recordings was Gene Simmons, who has died aged 69 and
who achieved success in 1964 with Haunted House, a schlock-horror
number previously recorded by the R&B artist Johnny Fuller. /key /fame work key Born in Elvis's home town of Tupelo, Mississippi,
Simmons took up the guitar as a child after his two sisters brought
an instrument home. /key /work work He began his professional musical career at 15, playing
with his brother Carl at local dances and on radio as the Simmons
Brothers band /work .
http://www.guardian.co.uk
13 The Lancaster Speech, Thought and Writing Presentation Corpus, available from the Oxford Text
Archive: http://ota.ox.ac.uk. Accessed on 02-01-07.
BBC Obituaries
The eleven BBC obituaries used in the corpus were downloaded from the BBC
website in July 2006.14 They include writers, actors, politicians and princes. The
extract below is from the biography of novelist Saul Bellow:
fame With an awareness of death and the miracle of life at the
foundation of his work, Saul Bellow’s novels brought him huge success, and both the Nobel and Pulitzer Prize. /fame He is cited by many contemporary authors as a critical creative influence.
Bellow’s message was one of hope and affirmation. He said, “In the
greatest confusion, there is still an open channel to the soul.”
key Many of his novels were set in Chicago where his poor RussianJewish parents moved when he was a child. /key He later reported, “I saw mayhem all around me. By the age of eight, I knew
what sickness and death were.”
http://news.bbc.co.uk/obituaries
STOP Corpus
Fifteen texts were included from the STOP corpus (see Section 4.3.9 on page 97).
Although the STOP corpus includes texts from newspaper sources, only texts
from the (auto)biography and literary categories were included (see Figure 5.6 on
the following page for a list of the literary texts used). Each text used is around
two thousand words in length. The extract below is from a biography of
former British Prime Minister, Margaret Thatcher.
character Her eyes, according to Alan Watkins of the Observer,
took on a manic quality when talking about Europe, while her teeth
were such as ’to gobble you up’. /character More sinister
still, she slipped into the habit of using the royal “we” in public.
(“We are a grandmother”).
The STOP Corpus
Descriptive statistics for all the text sources that constitute the new biographical corpus are presented in Table 5.4 and Table 5.5. The proportion of documents of each type (STOP corpus, obituary and newstext) is depicted in Figure 5.7 on the next page. Note however that documents from the STOP corpus are considerably lengthier than obituaries or newstext documents, and therefore the proportion of text derived from the STOP corpus is higher than 19%.
14 http://news.bbc.co.uk/obituaries Accessed on 08-02-07.
15 Note that as single sentences can have multiple tags, there are fewer biographical sentences than biographical tags for each data source (see Table 5.4, rows 4 and 6).
16 Note that Table 5.5 shows the “Guardian News” category broken down into three subcategories according to the date on which the information was gathered.
Figure 5.6: Sources of Documents Used From the Literary Genres of the STOP
Corpus.
(AUTO)B IOGRAPHY
Alan Turing: The Enigma of Intelligence, by Andrew Hodges
Leonard Cohen: Prophet of the Heart, by L.S.Dorman
The Benny Hill Story, by John Smith
A Bag of Boiled Sweets, by Julian Critchley
Curriculum Vitae, by Muriel Spark
The Downing Street Years, by Margaret Thatcher
What’s it all About? by Michael Caine
F ICTION
Possession, by A.S.Byatt
Peach, by Elizabeth Adler
Brighton Rock, by Graham Greene
Money, by Martin Amis
The Moor’s Last Sigh, by Salman Rushdie
Lace, by Shirley Conran
Get Carter, by Ted Lewis
Daughter of Deceit, by Victoria Holt
Table 5.4: Descriptive Statistics for Biographical Corpora.

                                        Guardian News  BBC Obits  Guardian Obits  STOP Corpus
No. of Documents in Corpus              37             11         17              15
Avg. Length of Documents (in Words)     824            643        778             2257
Total Number of Bio Tags                194            173        327             107
Avg. No. of Bio Tags per Document       6.5            15.7       19.7            7.1
Total Number of Bio Sentences           170            150        247             90
Avg. No. of Bio Sentences per Document  4.6            13.6       14.5            6.0
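Table 5.4 above and Table 5.5 below report tag counts per document. As a hedged illustration of how such figures could be gathered from the SGML-annotated corpus, the short Python sketch below counts each biographical tag in one document, assuming markup of the form <key> ... </key> (the function name and the exact markup conventions are assumptions, not details taken from the thesis):

    from collections import Counter

    TAGS = ("key", "fame", "character", "relationships", "education", "work")

    def tag_counts(document_sgml):
        """Count occurrences of each biographical tag in one annotated document."""
        counts = Counter()
        for tag in TAGS:
            counts[tag] = document_sgml.count("<%s>" % tag)
        return counts

    # e.g. tag_counts("<key>Sontag lived in Sarajevo ...</key> <work>...</work>")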
Figure 5.7: Types of Document Used in Biographically Tagged Corpus.
Table 5.5: Average Number of Biographical Tag Types per Text.

Source Type             key   fame  char  relation  edu   work
Guardian News 11-08-06  0.38  0.61  0.15  0         0     3.38
Guardian News 12-09-06  3.01  0.5   0.08  0.75      0     3.33
Guardian News 24-09-06  0.41  0.25  1.16  0.33      0.25  3.41
BBC Obituaries          3.82  3.54  2.36  2.18      0.64  5.45
Guardian Obituaries     6.35  1.05  2.82  4.23      1.29  8.94
STOP Corpus             1.47  0.27  1.67  1.8       0.27  3.00
5.3.2 Issues in Developing a Biographical Corpus
In building and annotating a biographical corpus, we are interested in exploring features that distinguish biographical from non-biographical text (according to the annotation scheme adopted). Two major concerns guided our choice of source material:
First, that the corpus should provide non-trivial results. For example, if we
had chosen two data sources — short entries from a schematic biographical
dictionary and software manuals — it is likely that the software manual texts
would contain very few biographical sentences, and the biographical texts (for
example, obituaries) a high proportion of biographical sentences (to take an
example, for the BBC obituaries described above, approximately 75% of the
sentences are classified as biographical using this scheme). Genres included
should frequently include non-biographical (according to our scheme) person
orientated information (for example, information about a person that is not covered by the annotation scheme described in this chapter).
Second, data sources should be spread across different genres, in order to identify different types of biographical constructions, common in different genres.
For example, apposition is a very common stylistic device used by journalists
in news texts, but is less common in more literary writing (represented by the
STOP corpus).
It is important that both these issues are addressed, as the data is to be used as
training data in machine learning experiments to identify biographical text (see
Chapters 7, 8, 9 and 10).
5.4 Conclusion
This chapter has described the creation of a biographical annotation scheme,
and a corpus based on that scheme. The next chapter, Chapter 6, goes on to describe a human study designed to validate the biographical annotation scheme described in this chapter.
CHAPTER 6
Human Study
6.1 Introduction
This chapter reports the results of a web based human study. Participants were
invited to categorise a series of sentences as biographical or non-biographical,
and agreement between the assessors was calculated. The chapter is divided
into three main sections. First, some necessary background on inter-annotator
agreement1 issues is presented. Second, an initial pilot study using a ternary
classification scheme is described. Third, the main study, which uses data from
various biographical and multi-genre sources, along with the categorisation
scheme developed in Chapter 5, is set forth.
The study has three goals:
1. To validate the biographical annotation scheme developed in Chapter 5.
If participants agree on the status of sentences — as biographical or nonbiographical — it suggests that the biographical annotation scheme developed can be applied consistently.
2. Given that the biographical annotation scheme developed in Chapter 5 is
adequate (that is, point 1), to establish to what extent people are able to
reliably distinguish between isolated (that is, context-less) biographical
and non-biographical sentences.
3. To provide high quality “gold standard” data for experiments in automatic text classification (see Chapters 7, 9 and 10).
1 Inter-annotator agreement is also described as inter-rater agreement and inter-classifier agreement.
6.2 Agreement
This study involves presenting several participants with a list of sentences,
along with instructions for the suitable classification of those sentences. Consider Example 6.1. Each participant was asked to classify this sentence according to the annotation scheme provided. Assessing agreement has the primary
aim of checking that the annotators have a shared understanding of the categories. High agreement levels indicate that annotators have a good understanding of the concepts (categories) involved, and a clear decision procedure
for allocating sentences to those categories. In the context of the current work,
high agreement would suggest that annotators have a good understanding
of which features are characteristic of biographical and non-biographical sentences (as defined by the annotation guidelines).
(6.1) He was born in Widnes and educated at the University of Liverpool
Statistical techniques for measuring inter-classifier reliability are regularly used
in areas where human classifiers are required to place previously unseen instances into determined categories, without the benefit of pre-classified “gold
standard” examples with which to assess the individual classifier’s efforts. Example areas include medical statistics (comparing a group of specialists’ diagnoses (K RAEMER, 1992)), and optometry (comparing human and machine
methods for gathering optometric measurements (WATKINS, 2003)). Inter-classifier
agreement is important in the computational linguistics research tradition due
to a lack of gold standard data. In the computational linguistics research tradition, data tends to be derived from intuitive judgments, and one way of verifying (or supporting) intuitive judgments is to assess how many people who
are proficient in the language of interest have the same judgment in a given
situation.
There are numerous methods available for calculating agreement. Two common agreement metrics are presented here: percentage based scores and variants
of the KAPPA statistic.
6.2.1 Percentage Based Scores
Percentage based scores (which simply report the mean percentage of annotators who agree on the class of each item), while straightforward to understand,
are not optimal for assessing agreement, as they do not account for expected
agreement. For example, consider the (idealised) data presented in Table 6.1 on
the next page. The table shows the result of two annotators assigning ten sentences to one of two categories. The two participants agree in only 60% of
cases (that is, on six sentences). Are we then entitled to say that agreement is good,
despite the fact that for 40% of the sentences there was no agreement? Note
that even if the participants categorised the sentences randomly, it is likely that
there would be 50% agreement.
Table 6.1: Raw Agreement Scores (Idealised Example Data)

Sentences    1    2    3    4    5    6    7    8    9    10
Annotator 1  yes  no   yes  no   no   no   yes  no   yes  no
Annotator 2  yes  yes  yes  yes  no   no   yes  yes  yes  yes
6.2.2 The KAPPA Statistic
C ARLETTA (1996) challenged the usefulness of percentage measures for assessing agreement and instead proposed the use of the K APPA statistic, arguing
that K APPA allows a level of interpretability and insight into agreement data
that cannot be provided by raw percentage scores.2
The KAPPA (COHEN, 1960) statistic measures agreement between a pair of classifiers, with variants for measuring agreement where the number of classifiers
is greater than two (see page 128). KAPPA (Equation 6.2) measures the raw
agreement between classifiers, while discounting expected agreement. A score
of 0 indicates that any agreement can be accounted for by chance, and a score
of 1 indicates perfect agreement (CARLETTA, 1996).

$$\kappa = \frac{P(A) - P(E)}{1 - P(E)} \qquad (6.2)$$

Here P(A) is the proportion of times the classifiers agree and P(E) is the proportion of times we would expect the classifiers to agree by chance. There
are several different methods of calculating expected agreement (discussed below).
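To make Equation 6.2 concrete, the following minimal Python sketch applies it to the two annotators of Table 6.1, estimating P(E) from each annotator's own label proportions (Cohen's method; the function name and data layout are illustrative, not taken from the thesis):

    def cohen_kappa(labels_a, labels_b):
        """KAPPA for two annotators: (P(A) - P(E)) / (1 - P(E))."""
        n = len(labels_a)
        categories = set(labels_a) | set(labels_b)
        # Observed agreement P(A)
        p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Expected agreement P(E) from the two annotators' marginal proportions
        p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                  for c in categories)
        return (p_a - p_e) / (1 - p_e)

    # The idealised judgements of Table 6.1
    annotator_1 = ["yes", "no", "yes", "no", "no", "no", "yes", "no", "yes", "no"]
    annotator_2 = ["yes", "yes", "yes", "yes", "no", "no", "yes", "yes", "yes", "yes"]
    print(cohen_kappa(annotator_1, annotator_2))  # P(A)=0.6, P(E)=0.44, KAPPA ~0.29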
An important limitation with the use of KAPPA is the lack of an accepted significance level for agreement. While we know that 1 is total agreement and 0 is
no agreement beyond chance, where are the thresholds for average, good and
excellent agreement? Traditionally, KAPPA greater than 0.8 has been regarded
as “good reliability” and KAPPA greater than 0.67 but less than 0.8 as fair reliability “allowing tentative conclusions to be drawn” (C ARLETTA, 1996). This
scale is the default used in computational linguistics research since C ARLETTA
(1996), despite the fact that it is widely acknowledged as arbitrary (C RAGGS
and M C G EE W OOD, 2005; D I E UGENIO, 2000; D I E UGENIO and G LASS, 2004).
C RAGGS and M C G EE W OOD (2005) suggests that researchers should acknowledge “that there is no magic threshold that, once crossed, entitle us to claim
2 Note that D I E UGENIO and G LASS (2004) suggest that percentage scores should be used as one
of several methods for analysing agreement levels. C RAGGS and M C G EE W OOD (2005) however
rejects this as unnecessary, arguing that a single variant of the KAPPA statistic is sufficient.
Table 6.2: Types of KAPPA; Methods for Calculating Expected Probability.

Method 1              Method 2
SCOTT (1955)          COHEN (1960)
FLEISS (1971)
KRIPPENDORFF (1980)
that a coding scheme is reliable” (C RAGGS and M C G EE W OOD, 2005), and —
this is implicit rather than directly stated — that only indicative claims can be
supported with agreement statistics alone.
Other researchers in psychology and medical sciences consider KAPPA greater
than 0.75 as “almost perfect” (L ANDIS and K OCH, 1977) and “excellent” (E MAM,
1999). Acceptability levels dip as low as 0.5 in psychiatric diagnosis (G ROVE
ET AL . (1981) referenced by D I E UGENIO (2000)).
Types of KAPPA statistic
In recent years, there has been a debate about the most appropriate variant of
the KAPPA statistic to use in computational linguistics research. The variants
fall into two main groups, according to their method for calculating expected probability P(E) (see Equation 6.2 on the preceding page). Method 1 assumes
that the distribution of proportions over the categories is the same for all annotators (that is, that annotators make classification decisions in the same proportion). Method 2 does not assume that the distribution of proportions over the
categories is the same for all annotators. Instead, expected probability is calculated on the basis that each annotator has a distinct distribution of proportions
over the categories (that is, that some annotators may systematically favour
one classification over another). Table 6.2 gives examples of KAPPA variants
from each of these groups.3
D I E UGENIO and G LASS (2004) suggests that when reporting agreement in
computational linguistics research, two KAPPA statistics should be reported,
one each from Methods 1 and 2, as well as a percentage measure, as this will allow a more balanced view of the data. On the other hand C RAGGS and M C G EE
W OOD (2005), in response to D I E UGENIO and G LASS (2004), suggests that
only one KAPPA statistic is worthwhile reporting, a Method 1 KAPPA. C RAGGS
and M C G EE W OOD (2005) suggests that the use of multiple agreement measures shows a lack of confidence in the statistics, and goes on to explicitly reject
Method 2 KAPPA techniques as the “purpose of assessing the reliability of cod-

3 Note that some of these agreement statistics are variously referred to as PI and ALPHA rather than KAPPA.
Table 6.3: Idealised Data for KAPPA Example.

Sentences   Biographical Category  Non-Biographical Category
1           0                      10
2           5                      5
3           4                      6
4           6                      4
Total       15                     25
Proportion  0.375                  0.625
ing schemes is not to judge the performance of the small number of individuals
participating in the trial, but rather to predict the performance of the scheme in
general” (C RAGGS and M C G EE W OOD, 2005). They go on to suggest that any
“bias” exhibited by individual annotators (that is, marked differences in the
proportion of classification decisions between annotators) is best minimised by
increasing the number of annotators, a hypothesis confirmed by A RTSTEIN and
P OESIO (2005), who compared agreement scores from Method 1 and Method 2
KAPPA types (S COTT (1955) and C OHEN (1960), respectively) and showed that
as the number of annotators grows, bias decreases.
K APPA for more than one annotator
F LEISS (1971) describes a frequently used agreement statistic for more than
two annotators that satisfies the requirements set out by C RAGGS and M C G EE
W OOD (2005) (that is, a Method 1 agreement statistic). The statistic also allows
for multiple sets of annotators. That is, the annotators classifying one sentence
may be different to the annotators classifying another sentence. Fleiss’s KAPPA
statistic is also often implemented in standard statistical software.4 In order
to demonstrate how the statistic is calculated, a worked example is included,5
using the idealised data presented in Table 6.3, which shows four sentences,
two categories (biographical and non-biographical) and ten annotators.
Before describing the method for calculating Fleiss's KAPPA, it is necessary to
introduce some notation. $N$ is the total number of sentences, $n$ is the number
of annotators per sentence, the subscript $i$ is the sentence number, the subscript
$j$ is the category number, $k$ is the total number of categories, and $n_{ij}$ is the
number of annotators who assign the $i$th sentence to the $j$th category.
4 Fleiss’s KAPPA for two or more annotators is implemented in the IRR package of the open
source R statistical programming language.
5 A fuller example based on psychiatric diagnoses is presented in F LEISS (1971) (the original
paper).
The first step is calculating the total level of agreement (uncorrected for expected agreement). The proportion of agreeing pairs for each sentence is calculated using Equation 6.3 (Equation 6.4 shows the calculation for Sentence 1,
Table 6.3 on the page before). Note that $P_i$ is the proportion of agreeing pairs
in Sentence $i$.

$$P_i = \frac{1}{n(n-1)} \sum_{j=1}^{k} n_{ij}(n_{ij} - 1) \qquad (6.3)$$

$$P_1 = \frac{0(0-1) + 10(10-1)}{10(10-1)} = \frac{90}{90} = 1.0 \qquad (6.4)$$

The total agreement for all the data ($\bar{P}$) is the mean of the proportions $P_i$ for all $N$ sentences (see Equation 6.5 for the formula).

$$\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i = \frac{1.00 + 0.44 + 0.47 + 0.47}{4} \approx 0.59 \qquad (6.5)$$

We have now calculated agreement for the data presented in
Table 6.3 on the preceding page, but in order to calculate KAPPA we need to
identify and discount expected agreement ($\bar{P}_e$). Equation 6.6 shows the equation used to calculate expected agreement, and Equation 6.7 shows the calculation used to identify expected agreement for the data shown in Table 6.3 on
the page before. Note that $p_j$ (lowercase $p$) is the proportion of all assignments
belonging to category $j$.

$$\bar{P}_e = \sum_{j=1}^{k} p_j^2, \qquad p_j = \frac{1}{Nn} \sum_{i=1}^{N} n_{ij} \qquad (6.6)$$

$$\bar{P}_e = 0.375^2 + 0.625^2 \approx 0.53 \qquad (6.7)$$

We now have all the data required to perform the final KAPPA calculations. See
Equation 6.8 and Equation 6.9.

$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e} \qquad (6.8)$$

$$\kappa = \frac{0.59 - 0.53}{1 - 0.53} \approx 0.13 \qquad (6.9)$$
Using the accepted scale, KAPPA of approximately 0.13 is a very poor agreement score, indicating that there is very little agreement above chance in the idealised data
presented in Table 6.3 on page 128.
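The same calculation can be expressed in a few lines of code. Fleiss's KAPPA is implemented in the IRR package of the R statistical language (see footnote 4); the Python sketch below is purely illustrative, uses the counts of Table 6.3, and may differ slightly from the hand-rounded figures above:

    def fleiss_kappa(counts):
        """Fleiss's KAPPA for an N x k table where counts[i][j] is the number
        of annotators assigning sentence i to category j."""
        N = len(counts)              # number of sentences
        n = sum(counts[0])           # annotators per sentence
        k = len(counts[0])           # number of categories
        # Per-sentence agreement P_i (Equation 6.3) and its mean (Equation 6.5)
        P = [sum(c * (c - 1) for c in row) / (n * (n - 1)) for row in counts]
        P_bar = sum(P) / N
        # Expected agreement from pooled category proportions (Equation 6.6)
        p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
        P_e = sum(pj ** 2 for pj in p)
        return (P_bar - P_e) / (1 - P_e)   # Equation 6.8

    # Idealised data from Table 6.3: (biographical, non-biographical) counts
    table_6_3 = [[0, 10], [5, 5], [4, 6], [6, 4]]
    print(round(fleiss_kappa(table_6_3), 2))  # roughly 0.13: very poor agreement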
This section has explored some difficulties in quantifying agreement and provided a description of and justification for the use of Fleiss’s KAPPA. A worked
example has also been included. The next two sections describe how Fleiss’s
KAPPA has been applied to assess inter-annotator agreement in two related
studies.
6.3 Pilot Study
This section describes a pilot study designed to explore the ability of people
to distinguish between biographical and non-biographical sentences using a
provisional biographical classification scheme. Results of the study indicated
shortcomings in the biographical scheme used, and led to the development of
a new scheme (see Chapter 5 on page 101) and a more extensive study based
on that scheme (see Section 6.4 on page 132 in this chapter).
The pilot study was conducted using a web form questionnaire format.6 Seventeen potential participants were contacted of whom fifteen completed the
web questionnaire. All participants were between the ages of twenty and seventy and all were university educated native English speakers (British English,
Hiberno-English and New Zealand English) and were personally known to the
researcher as family, friends or colleagues.7
Each participant was asked to categorise one hundred sentences of varying
lengths.8 There were three possible categories:
1. Core Biographical — Relevant information here could include details about
birth and death dates, education history, nationality, employment, achievements, marital status, number of children, and so on. If the central purpose of a sentence is to convey information about an individual, then
that sentence can be classified as core biographical. Note that the sentence may contain anaphors rather than person names. For example, the
sentence “He was jailed for a year in 1959 but, given an unconditional
pardon, became Minister of National Resources (1961), then Prime Minister (1963), President of the Malawi (formerly Nyasaland) Republic (1966),
and Life President (1971)” about President Banda of Malawi is, according to this scheme, core biographical.

6 See http://www.dcs.shef.ac.uk/~mac/bioexp.html
7 An email containing the URL of the study was sent to potential participants.
8 The guidelines for completing the test and a list of the one hundred sentences used are reproduced in Appendix A on page 191.
2. Extended Biographical — Contains information about a person, but ancillary to the main thrust of the sentence. The distinction between core and
extended biographical sentences is that extended biographical sentences, while they may contain information about an individual, are not directly
about that individual. For example in the sentence “This new consumer
is a pretty empowered person,” said Wendy Everett, director of a study
commissioned by the Robert Wood Johnson Foundation.” Wendy Everett is not who the sentence is about — although we do learn that she
is the director of a study — the real focus of the sentence is the “new
consumer”.
3. Non Biographical — Contains no person names or titles or pronouns. For
example in the sentence “Of the 6 million notebooks Taiwan turned out
last year, Quanta produced 1.3 million sets, accounting for about 8 percent of the world output.” there is clearly a reference to a company
(Quanta) but no reference to a person either directly or anaphorically.
The test sentences were selected from five sources, twenty sentences from each.
Table 5.3 on page 118 provides details (and the shorthand code) for each of the
five sources used. Further details of these corpora can be found in Chapter
5.
On the basis of an analysis of the Dictionary of National Biography it was observed that the first five sentences of each entry consistently contain facts characteristic of biographies (birth dates, career, marital status, and so on). Intuitively, it seemed that this subsection of the Dictionary of National Biography
would provide the best available source of core biographical sentences. Sentences from the Chambers Dictionary of Biography were chosen for the same reason. These intuitions are supported by the result reported on page 115, where
84.9% of DNB-5 sentences and 93.5% of CHA-A sentences were classified as
biographical in a sample of 1000 sentences taken from each group, using the
classification scheme described in Chapter 5, rather than the three-way classification scheme described in this section.
Sentences from the Dictionary of National Biography that are not one of the first
five sentences of an entry were selected as a first attempt at approximating
an extended biographical category. Entries in the Dictionary of National Biography typically become more discursive after the initial few sentences, sometimes
dwelling on historical background to an extent that is not obviously biographical in the limited sense of the word, although often information about the
individual is given. Sentences from the subset of the TREC corpus containing
person names and personal pronouns (that is, TREC-U) were used for similar
reasons. This intuition is supported by the result reported on page 115, where
29.1% of the DNB-R sample, and 11.1% of the TREC-U sample were found to
be biographical.
Table 6.4: Inter-classifier Agreement Results.

Categories    KAPPA
2 Categories  0.752
3 Categories  0.431
A subset of the TREC corpus that contained no person names or personal pronouns (that is, TREC-F) was used on the intuition that sentences containing
no references to persons, either directly or indirectly, could not be classified as
biographical. It was expected that sentences from this source would reliably
be classed as non biographical. Note that only 0.6% of the TREC-F sample
discussed on on page 115 was classified as biographical using the binary annotation scheme described in Chapter 5.
While three categories were used in the experiment (core biographical, extended biographical and non biographical), it was decided to subsume the two biographical
categories in the light of feedback from participants (that is, sentences classified as core biographical and extended biographical were regarded as belonging to one single
biographical category; note that no data was discarded in this transition from a
ternary to a binary classification). Participants had little difficulty distinguishing a biographical sentence from a non biographical sentence, yet they reported
confusion over the distinction between the core and extended categories, suggesting that the distinction between the two biographical categories was under
specified. This reported difficulty is empirically validated in the results (see
Table 6.4) where the inter-classifier kappa using three categories was poor, and
the score using two categories (biographical and non-biographical, with the biographical category consisting of those sentences classified as core and extended
biographical) was very good at around 0.75. The relative success of the binary
classification suggested that a new study be conducted, employing a more understandable biographical categorisation scheme. This clearer, binary annotation scheme is described in Chapter 5 on page 101 and empirically assessed in
the next section.
6.4 Main Study
This section describes a study that involved twenty five participants classifying sentences as biographical or non-biographical, using the binary annotation
scheme developed in Section 5.2 on page 109. It is important to emphasise that
the participants were confronted with a binary classification task. In this main
study, the seven-way biographical scheme (key, fame, character, relationships,
education, work and unclassified) developed in Section 5.2 is reduced to two
classes. The biographical class subsumes the first six biographical
classes of the synthesis scheme, and the non-biographical class corresponds to
“unclassified” in the synthesis scheme.
6.4.1 Motivation
The three way classification scheme (extended, core and non-biographical) used
in the pilot study caused confusion among participants. The distinction between (extended) and core biographical categories, where the participants was
asked to decide if the sentence was about a person or merely contained biographical information, was particularly problematic. This distinction is dropped in
the binary classification scheme developed in Chapter 5 on page 101. Instead
of asking participants what the sentence is about — whether a sentences is about
a person (core) or incidentally contains information about a person (extended)
— the new scheme focuses entirely on the information content of the sentence.
That is, whether the sentence contains biographically relevant facts according
to the six biographical categories identified in Chapter 5 on page 112 (<key>,
<fame>, <character>, <relationships>, <education> and <work>).
This confusion is reflected in the KAPPA scores for the pilot study, where for
three categories (extended biographical, core biographical and non-biographical)
they are poor, but if we subsume the core and extended categories, so that the
categorisation task becomes binary, we achieve a good KAPPA score (0.75).
In order to assess the claim that a binary annotation scheme with well developed informational criteria for classifying sentences would achieve high agreement scores, a new study was designed, using the annotation scheme and
corpus developed as part of this research. Although the agreement results
achieved by subsuming the two biographical categories suggested that a single binary classification scheme was most appropriate for describing the distinction between biographical and non-biographical sentences, it was felt that
a new study, which presented participants with a new binary classification
scheme based on previous biographical annotation schemes (for example, the
Text Encoding Initiative biographical scheme, described on page 102), was a
more robust method of testing the hypothesis.
A second aim of this study is to provide a corpus of “gold standard” attested
sentences for the automatic sentence classification experiments described in
Chapters 7, 9 and 10.
6.4.2 Study Description
Twenty five participants used a web interface to classify five hundred sentences, guided by the biographical annotation scheme described in Chapter 5 on
page 101. The participants also were provided with annotation instructions in
the form of a PDF file (reproduced in Appendix B on page 202) which they were
advised to print and consult while answering questions.9
Each participant did not classify all five hundred sentences. Instead, sentences
were divided into stratified sets of one hundred sentences and five annotators
classified each set of sentences.
All sentences were derived from the biographical corpus described in Section 5.3 on page 118, and were therefore representative of a number of genres
(for example, newspaper text, web news reports, short published obituaries,
published fiction). For each set of sentences, forty-eight biographical sentences
were selected (eight from each of the six biographical subcategories listed on page 112) and fifty-two sentences were randomly selected from those untagged10 sentences in the biographical corpus.11 In other words, each stratified
set of sentences consisted of approximately 50% biographical sentences and 50%
non-biographical sentences.
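A minimal sketch of how such a stratified hundred-sentence set might be assembled is given below; the function and parameter names are illustrative, and the thesis does not specify the sampling code that was actually used:

    import random

    def build_sentence_set(tagged, untagged, per_category=8, n_untagged=52, seed=0):
        """Assemble one stratified set: eight sentences from each of the six
        biographical subcategories plus fifty-two untagged sentences.
        `tagged` maps a tag name to the list of sentences carrying that tag."""
        rng = random.Random(seed)
        sentence_set = []
        for tag in ("key", "fame", "character", "relationships", "education", "work"):
            sentence_set.extend(rng.sample(tagged[tag], per_category))
        sentence_set.extend(rng.sample(untagged, n_untagged))
        rng.shuffle(sentence_set)
        return sentence_set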
Of the twenty-five participants, thirteen were anonymous, and twelve provided personal information.12 Of the twelve who provided personal information, eleven were native English speakers (eight British English, two Hiberno-English, and one American English). One participant who provided personal
information was a native speaker of Finnish with a near native standard of
English. All those who provided information were aged between twenty and
sixty and were university educated. An email containing the URL of the study
was sent to possible participants. All participants approached to take part in
the study were personally known to the researcher as family, friends or colleagues.
6.4.3 Results
The agreement scores for each set of sentences calculated using Fleiss’s KAPPA
statistic are shown in Table 6.5 on the following page. Note that the KAPPA
score for each sentence set is at or above the 0.67 level regarded as “good”.
The mean KAPPA score for all five sentence sets is 0.72, well above the 0.67 threshold.
9 As a minimum, participants were asked to leave the PDF file containing instructions open in a
different window in order to consult it while classifying sentences.
10 Note that untagged sentences are those sentences that are not assigned a biographical tag and
hence are considered non-biographical.
11 As some biographical subcategories were less well represented in the biographical corpus (for
example, education ), two of the sentence sets use disproportionately more sentences from
the key subcategory.
12 Four fields were provided for participant information: forename, family name, email address and
age.
Table 6.5: KAPPA Scores for Each Sentence Set

Sentence Set  KAPPA Score
Set 1         0.75
Set 2         0.80
Set 3         0.71
Set 4         0.68
Set 5         0.67
6.4.4 Discussion
These results show that good agreement can be obtained between multiple
classifiers over a range of sentences using the binary classification scheme developed in Chapter 5 on page 101. The overall agreement score of 0.72 is
slightly lower than that obtained in the “binary” version of the pilot study
(that is, the subsumption of the extended and core biographical categories);
this can perhaps be explained by the nature of the data used in the pilot study;
particularly the use of sentences that had been filtered of pronouns and personal names. The difficult classification decision for participants in the pilot
study was deciding between the extended and core biographical categories,
rather than between the biographical categories and the non-biographical category. In contrast to the pilot study, the main study uses randomly selected
non-biographical sentences from the biographical corpus described in Chapter
5, which can be more challenging to the participant than those used in the pilot
study. For example, “Mr Blair was also snubbed by radical politicians linked
to Hizbullah”, although it references Tony Blair, is not biographical according
to the annotation scheme used in the main study.
The mean KAPPA score (0.72) exceeds the conventional minimum threshold of 0.67, but does not reach the 0.8 required for excellent agreement. Consider the data presented in Table 6.5. It can be seen that KAPPA varies considerably across the five sentence sets, from the lower threshold level of 0.67 to the "excellent" level of 0.8. No score falls below CARLETTA (1996)'s 0.67 threshold, however, suggesting that even with challenging sentences, the biographical annotation scheme developed in Section 5.2 on page 109 yields good agreement. It is important to qualify this judgement with the observation that commonly used agreement thresholds — unlike the significance levels of inferential statistics — are essentially arbitrary (CRAGGS and MCGEE WOOD, 2005). Therefore, even a KAPPA score of 0.9 would remain indicative (albeit very strongly indicative).
6.5 Conclusion
This chapter has shown (in the main study) that an information orientated binary annotation scheme consistently yields high agreement over a wide range
of sentences. This result supports the central hypothesis of this thesis, that people are able to reliably identify biographical sentences (where “reliable” means
with a good standard of agreement on challenging data).
Note that the classes of sentences in the gold standard data are determined by the annotated corpus, rather than by the judgements of experimental participants. That is, decisions about the biographical status of individual sentences were made by the researcher, using the biographical annotation scheme. Agreement between the researcher and the participants was, however, very high (93%).13
The remainder of this thesis uses the data gathered in the main study (that is,
five hundred sentences with high agreement14 ) in a series of machine learning
experiments in order to assess the accuracy of automatic sentence classification
using a variety of sentence representations. It is important to stress that the
gold standard data used in the machine learning experiments described in later
chapters is derived from the researcher’s annotation efforts rather than those of
the twenty-five participants involved in the main study. However, agreement
between the researcher’s annotation and the participants’ annotation is very
high (94%).15
13 Note that as there were five participants judging each sentence, it was straightforward to
accept the participants’ majority decision as the sentence class, and compare this with the researcher’s decision.
14 Note that these five hundred sentences are reproduced in Appendix B.
15 As there were five participants for each set of 100 sentences in the main study, the majority
decision for each sentence was recorded and compared to the annotation decision made by the
researcher.
CHAPTER 7

Learning Algorithms for Biographical Classification
This chapter compares six different learning algorithms using the “gold standard” data described in Chapter 5 and utilising a feature set consisting of the
500 most frequent unigrams in the Dictionary of National Biography. This
feature set was used as it contains a wide range of function words, as well
as words that could intuitively be regarded as being especially characteristic
of the biographical genre (“born”, “married” and so on). Deriving features
from the “gold standard” data set was avoided, as it was suspected that using
the gold standard data to derive features, to train a classifier, and to test that
classifier would artificially inflate classification accuracy. The chapter serves
as a “first pass” of the data, allowing indicative results to be drawn about the
usefulness of different machine learning algorithms for the biographical sentence classification task. Later chapters concentrate on varying the feature sets
used.
The chapter is divided into five sections: Motivation, Procedure, Presentation of Results, Discussion and Conclusion.
7.1 Motivation
In recent years there has been a steady trickle of published work comparing
feature sets for genre classification (see Section 3.1.3 on page 59). However,
there has been little work directed at the comparison of learning algorithms.
Previously published research has focused on the comparison of feature sets
using one or two algorithms. For example, FINN and KUSHMERICK (2003) (see page 63) uses the C4.5 decision tree algorithm (described on page 39) in conjunction with various feature sets to assess whether news articles were subjective
or objective, and whether reviews were positive or negative. FINN and KUSHMERICK (2003) varied the feature set but not the learning algorithm.
ZHOU ET AL. (2004) uses three learning algorithms — SVM, C4.5, and Naive Bayes (see Section 2.5.1 on page 38 for descriptions of these algorithms) — for biographical sentence classification. The main focus of this work was the identification of optimal features for biographical classification, rather than learning algorithms. ZHOU ET AL. (2004) compared the performance of the three algorithms on a "biographical corpus" annotated using the scheme described on page 103,1 and a feature set consisting of all the unigrams present in their biographical corpus. ZHOU ET AL. (2004) identifies Naive Bayes as the best performing algorithm (82.42%), followed by the C4.5 algorithm (75.75%) and finally the SVM algorithm (74.47%).
The current work builds on that presented in ZHOU ET AL. (2004), but differs in that it explores the performance of six algorithms on a gold standard data set (described in Chapter 5) using a feature set composed of the five hundred most frequent unigrams in the Dictionary of National Biography. Most importantly, in addition to raw accuracy scores, a statistical test — the corrected re-sampled t-test (see page 49) — is used to compare classifiers.
7.2 Experimental Procedure
The bio features Perl script was used to create a WEKA ARFF file (see Section 4.2 on page 86) from the 500 gold standard sentences described in Section 5.2.1 and validated in Chapter 6. The feature representation chosen (and used in all experiments in this chapter) was a binary representation based on the most frequent five hundred unigrams in the Dictionary of National Biography.2 Further feature selection was not used, as the purpose of this experiment is to compare the success of different learning algorithms with a constant, basic feature set (a sketch of the binary unigram representation is given after the list of algorithms below). Six learning algorithms were used:
ZeroR — This is a baseline classifier that simply assumes all test instances belong to the most common class (see page 38).
OneR — The attribute with most predictive power is used to classify all
instances (see page 38).
C4.5 — A decision tree algorithm (see page 39).
Ripper — A rule based algorithm (see page 43).
1 While the annotation scheme describes ten biographical categories, the comparison of algorithms was based on a binary classification scheme (biographical and non-biographical) where the
original biographical categories (work,fame, etc) are subsumed in one single biographical category.
2 The most frequent one hundred words in the Dictionary of National Biography are reproduced on page 145.
A complete list of all five hundred words is available at
http://www.dcs.shef.ac.uk/ mac/frequency lists/dnb 500frequent.txt.gz.
138
C HAPTER 7: L EARNING A LGORITHMS FOR B IOGRAPHICAL C LASSIFICATION
Naive Bayes — The commonly used variant of Bayes that assumes all
attributes are independent (see page 43).
SVM — A Support Vector Machine based algorithm (see page 46).
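As an illustration of the feature representation described above (and not a transcription of the bio features Perl script), the following Python sketch builds the kind of binary unigram vectors that were written out as a WEKA ARFF file; the four-word feature list is a stand-in for the 500 most frequent DNB unigrams.

    import re

    def tokenise(sentence):
        # lower-case alphabetic tokens only, as a simple stand-in tokeniser
        return set(re.findall(r"[a-z]+", sentence.lower()))

    def binary_vectors(sentences, feature_words):
        # one 0/1 value per feature word, per sentence
        return [[1 if w in tokenise(s) else 0 for w in feature_words] for s in sentences]

    features = ["born", "married", "died", "in"]       # stand-in for the DNB 500
    print(binary_vectors(["He was born in London in 1780."], features))
    # -> [[1, 0, 0, 1]]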
All experiments were run in the WEKA machine learning environment (see Section 4.2 on page 86), using the WEKA EXPERIMENTER interface.3 Each algorithm was assessed on the biographical data using 10 x 10 fold cross validation (see Section 2.5.2 on page 47) in order to allow reliable comparisons between algorithms. A comparison between the results produced by 100 x 10 fold cross validation and 10 x 10 fold cross validation is also reported, in order to informally test the adequacy of the 10 x 10 fold cross validation methodology.
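The corrected re-sampled t-test used to compare classifiers can be sketched as follows (after Nadeau and Bengio's variance correction, which WEKA's EXPERIMENTER applies); the code is illustrative, assuming two lists of 100 per-fold accuracy scores from a 10 x 10 fold cross validation run.

    from statistics import mean, variance

    def corrected_resampled_t(acc_a, acc_b, test_train_ratio=1.0 / 9.0):
        # per-fold accuracy differences between the two classifiers
        d = [a - b for a, b in zip(acc_a, acc_b)]
        k = len(d)                                   # 100 folds for 10 x 10 fold CV
        # the variance term is inflated by the train/test overlap correction
        denom = ((1.0 / k + test_train_ratio) * variance(d)) ** 0.5
        return mean(d) / denom                       # compare with Student's t, k - 1 d.f.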
7.3 Results
Results of the 10 x 10 fold cross validation run for each algorithm are presented
in Table 7.1 on the next page. Note that the Naive Bayes algorithm scores the
highest accuracy at 80.66%, followed by the SVM algorithm at 77.47% and the C4.5 decision tree algorithm at 75.87%. All algorithms performed better than the baseline algorithm, ZeroR, at a highly significant level when subjected to the corrected re-sampled t-test.4 The difference in performance between the Naive Bayes algorithm — the most accurate algorithm — and the SVM algorithm — the second most accurate algorithm — fails to reach the required significance threshold, although the p-value is low at 0.1. The Naive Bayes algorithm
does perform better than all the other algorithms studied (apart from the SVM
algorithm) at a highly significant level. In other words, the Naive Bayes algorithm performs better than the ZeroR, OneR, C4.5 and Ripper algorithms at a
highly significant level, and it also consistently performs better than the SVM
algorithm, although this difference fails to meet the significance threshold. Accuracy for the 10 x 10 fold cross validation is also depicted in Figure 7.1 on
page 141.
In order to test whether 10 x 10 fold cross validation was sufficient to gain a
reliable result, the experiment was repeated using 100 x 10 fold cross validation. These results are presented in Table 7.2 on the next page, where it can be
seen that the difference between the mean score for each algorithm is less than
0.5% for 10 x 10 fold cross validation and 100 x 10 fold cross validation. For
example, the score for the Naive Bayes algorithm under 10 x 10 fold cross validation is 80.66%, and for 100 x 10 fold cross validation is 80.70%; a difference of 0.04%. This finding is in line with BOUCKAERT and FRANK (2004)'s suggestion that 10 x 10 fold cross validation, in conjunction with the corrected re-sampled t-test, is sufficient for making reliable inferences concerning classifier performance (see Section 2.5.2 on page 47 for more on evaluating classification algorithms).
3 The WEKA EXPERIMENTER is a component of the WEKA machine learning toolkit, which facilitates the comparison of the performance of algorithms.
4 The terms "significant" and "highly significant" are used for results falling below two conventional p-value thresholds, the latter being the stricter of the two.
Table 7.1: Six Learning Algorithms Compared using "Gold Standard" Data and a Feature Representation Based on the 500 Most Frequent Unigrams in the DNB: 10 x 10 Fold Cross Validation

Algorithm       Mean (%)    Standard Deviation
ZeroR           53.09       0.95
OneR            59.94       4.85
C4.5            75.87       5.85
Ripper          70.18       6.33
Naive Bayes     80.66       5.14
SVM             77.47       5.62
Table 7.2: Six Learning Algorithms Compared using "Gold Standard" Data and a Feature Representation Based on the 500 Most Frequent Unigrams in the DNB: 100 x 10 Fold Cross Validation

Algorithm       Mean (%)    Standard Deviation
ZeroR           53.09       0.95
OneR            59.88       5.51
C4.5            76.02       5.45
Ripper          69.82       6.30
Naive Bayes     80.70       5.01
SVM             77.51       5.25
7.4 Discussion
The results gained confirm ZHOU ET AL. (2004)'s finding that, compared to the SVM and C4.5 algorithms, the Naive Bayes algorithm performs better on the biographical sentence classification task when using unigrams as features.5 Note, however, that ZHOU ET AL. (2004) used different implementations of these algorithms (that is, not the WEKA implementations used in this work). ZHOU ET AL. (2004) also used a different feature set (based on all unigrams from their biographical corpus) and different training data. The results gained in this research and in ZHOU ET AL. (2004) are remarkably similar, with Naive Bayes providing the most accurate results (82.46% and 80.66% for ZHOU ET AL. (2004) and the current research, respectively). For ZHOU ET AL. (2004) however, the
5 ZHOU ET AL. (2004) used a greater number of unigrams than the 500 used here.
Figure 7.1: Mean Performance of Learning Algorithms with 10 x 10 Cross-Validation on "Gold Standard" Data using a Unigram Based Feature Representation.
C4.5 algorithm performed better than the SVM, whereas, in the current work,
the reverse was true (that is, the C4.5 algorithm scored 75.87% and the SVM
77.47%; a difference of 1.6%). It can also be noted that the performance of the
C4.5 algorithm in the current research is within 0.12% of that reported by ZHOU ET AL. (2004).
The success of Naive Bayes in this task — with its assumption that all features are independent and equally important — compared to more sophisticated algorithms (like C4.5) would be surprising if Naive Bayes had not been shown to be successful in other text classification domains (LEWIS, 1992b; MANNING and SCHÜTZE, 1999). The assumption of the Naive Bayes classifier that all features are independent of one another allows the algorithm to "ignore" irrelevant features (that is, features that occur randomly with respect to the category of interest). The independence assumption in Naive Bayes contrasts with, for example, the C4.5 algorithm, where irrelevant features damage classification accuracy, as the decision tree is likely to "split" on an irrelevant feature, leading to the provision of sub-optimal data for subsequent decisions (WITTEN and FRANK, 2005).
The OneR algorithm was used to identify the single rule that provided maximum accuracy. On examination, this single rule was found to be the presence
or absence of the "in" feature. Of the 501 "gold standard" sentences, 219 contain "in" and 282 do not contain "in". In the case of the 219 sentences that do
contain “in”, 140 are biographical, and 79 are non-biographical. Example 7.1
shows a sentence from the training corpus that was correctly classified by the
OneR algorithm, and Example 7.2 shows a sentence that was incorrectly classified.
(7.1) And in 2000, aged 80 Doohan boldly went into fatherhood for the
seventh time when his then 43-year-old wife gave birth to a daughter,
Sarah
(7.2) Linex liked the way I was thinking but he said that you’d never get the
punters in and out quickly enough
The relative success of the “in” rule can be attributed to the relatively common
use of “in” as a temporal and geographical locator for biographically salient
life events and states in biographical texts. For example “He was born in 1780”,
“He lived in London for most of his adult life”. The almost half-and-half split
between biographical and non-biographical texts in the training data may well
have contributed to the selection of the “in” rule. If the data had been constituted from 10% biographical sentences, and 90% non-biographical sentences,
the biographically salient instances of “in” may well have been “drowned out”
by the non-biographical instances of “in”. This question cannot be resolved
using the current data, however.
The feature set used in this experiment was derived from the Dictionary of National Biography, a (predominantly) nineteenth-century British cultural product (see Section 4.3.1 on page 87). The five hundred most frequent unigrams in the DNB were used. These included function words ("in", "the" and so on), as
well as words we would intuitively consider to indicate biographical content
(“born”, “died”, “married” and so on).
7.5 Conclusion
This chapter has compared six different learning algorithms using the “gold
standard” data described in Chapter 5 utilising a feature set consisting of the
five hundred most frequent unigrams in the Dictionary of National Biography.
We have found, like ZHOU ET AL. (2004), that the Naive Bayes algorithm produces the best results for binary biographical sentence classification, at least for the feature set and data used. Although the difference between Naive Bayes and the next best algorithm, a Support Vector Machine, failed to reach a satisfactory significance level (using the two-tailed corrected re-sampled t-test), the difference was close to the significance threshold, which
provides some tentative support for the claim that Naive Bayes is the best
Figure 7.2: Root section of a C4.5 Decision Tree Derived From the Gold Standard Training Data

    school?
      yes -> Bio
      no  -> wife?
               yes -> Bio
               no  -> university?
                        yes -> Bio
                        no  -> won?
                                 yes -> Bio
                                 no  -> married?
                                          yes -> Bio
                                          no  -> york? ...
performing classification algorithm for the biographical sentence classification
task. It is also notable that on some feature sets the SVM algorithm outperforms
Naive Bayes (see Table 9.1 on page 163 for an example).
CHAPTER 8

Feature Sets
This chapter describes the different feature sets used in the empirical work in
Chapter 9. The feature sets are divided into four groups, standard features, biographical features, empirically derived syntactic features and keyword-based features.
Feature sets are collections of similar features. For example, the most frequent
five hundred unigrams in the Dictionary of National Biography constitute a feature set.
8.1 Standard Features
The standard features used are listed below:
2000 most frequent unigrams derived from the Dictionary of National Biography.
2000 most frequent unigrams (with function words removed) derived from the Dictionary of National Biography.
2000 most frequent unigrams derived from the Dictionary of National Biography stemmed using the Porter Stemmer.
2000 most frequent bigrams derived from the Dictionary of National Biography.
2000 most frequent trigrams derived from the Dictionary of National Biography.
319 function words.1
1 The list of English function words is available from the University of Glasgow, Department of
Computer Science:
http://www.dcs.gla.ac.uk/idom/ir resources/linguistic utils/stop word Accessed on 02-01-07.
Table 8.1: 100 Most Frequent Unigrams in the Dictionary of National Biography. Unigrams not Present in the 100 Most Frequent Unigrams in the British National Corpus are Italicised.

the, of, in, and, to, he, a, was, his, s, on, at, by, for, with, as, that, which, from, had, an, him, but, it, is, were, this, london, who, first, sir, be, not, her, i, one, p, been, after, john, when, or, have, years, died, ii, made, two, son, life, time, king, became, lord, some, also, she, college, year, st, all, c, william, daughter, their, published, there, under, work, may, where, england, no, great, other, house, royal, into, himself, new, are, appointed, church, they, death, born, its, d, second, english, more, married, general, henry, before, society, 2, many, thomas, took
These features are referred to as standard as they are based on feature identification methodologies commonly used in the text classification literature (SEBASTIANI, 2002).
All frequencies were derived from the Dictionary of National Biography (DNB), based on the intuition that these words were likely to be especially characteristic of the biography genre. It can be seen from Table 8.1 that there are marked differences between frequencies derived from the DNB biographical text and those from more general English text (in this case, frequency lists from the British National Corpus2). The frequent words obtained from biographical text include person names ("william", "john"), place nouns ("london", "england"), titles ("lord", "king") and life events ("born", "died", "appointed", "married"). Stemming — the reduction of inflected word forms to a single canonical stem (for example, "marry" and "married" may become "marri") — was achieved using a standard implementation of the Porter algorithm (PORTER, 1980). Stemming is usually performed so that the various inflections of a base word are not regarded as separate words.
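As an illustration (not the frequency-list generation actually used for the thesis), the following Python sketch derives the most frequent, optionally Porter-stemmed, unigrams from a corpus file; the file name is a placeholder, and NLTK's standard Porter stemmer stands in for the implementation used.

    import re
    from collections import Counter
    from nltk.stem import PorterStemmer

    def top_unigrams(path, n=2000, stem=False):
        stemmer = PorterStemmer()
        counts = Counter()
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                for token in re.findall(r"[a-z]+", line.lower()):
                    # count either the raw token or its Porter stem
                    counts[stemmer.stem(token) if stem else token] += 1
        return [word for word, _ in counts.most_common(n)]

    # e.g. top_unigrams("dnb.txt", n=2000, stem=True)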
2 Frequency lists are available from Adam Kilgarriff's web site:
http://www.itri.brighton.ac.uk/ Adam.Kilgarriff/bnc-readme.html Accessed
on 02-01-07.
The most frequent bigrams and trigrams from the Dictionary of National Biography were used as features, based on two intuitions. The first is that frequent n-grams in biographical text are likely to have special discriminatory power compared to n-grams frequent in standard English text collections. The second is that n-grams provide a computationally inexpensive method of capturing syntactic information (SANTINI, 2004a).
Examples of the most common bigrams and trigrams from the Dictionary of National Biography are shown in Tables 8.2 on the following page and 8.3 on the
next page respectively. It can be seen that while many of these bigrams seem
to be specific to the particular domain and subject matter of the Dictionary of
National Biography (for example, “duke of”, “the english”, “the british”), many
also seem to be plausibly characteristic of biographical sentences more generally (“his death”, “daughter of”, “died in” , “he married”, and so on). There
are also a number of bigrams composed entirely of function words that could
be expected to occur in a general corpus of English (for example, “but the”, “to
a”, “of the”, and so on).
The most frequent trigrams (Table 8.3 on the following page) are rarely constituted entirely of function words, and neither are they made up of general biographical phrases. Instead, many frequent trigrams seem highly specific to the
Dictionary of National Biography, with its emphasis on British history and culture. Examples here would include "the royal society", "the royal academy", "house of commons", "the earl of", and so on. Table 8.3 on the next page
shows up the gender bias in the Dictionary of National Biography. Of the 100
trigrams presented, forty have exclusively male referents. Only two female
referents occur in the most frequent one hundred trigrams, and both of these
refer to “his wife” (“by his wife”, “and his wife”).
8.2 Biographical Features
Four specifically biographical feature groups were used as part of the work.
These features are not empirically derived, but rather based on intuitions regarding the likely characteristics of biographical sentences:
1. Pronoun: This is a boolean feature. If the sentence contains a pronoun (he,
she, him, her, his, hers) then the feature is positive.
2. Name: Six boolean features are identified here, using a combination of
gazetteers and FSAs:
Title (for example, Mr, Ms, Captain).
Company (for example, IBM, International Business Machines).
Non Commercial Organisation (for example, Army, Parliament, Senate).
Table 8.2: 100 Most Frequent Bigrams from the Dictionary of National Biography.

of the; on the; with the; as a; to be; to his; the first; the royal; with a; his father; to a; him to; which was; for his; the house; his death; house of; after the; duke of; the church; in the; at the; he had; in his; one of; that he; to have; was appointed; where he; his wife; of which; of st; in which; in london; king s; was not; was buried; and of; but the; who was; he was; and the; and in; which he; and his; daughter of; and a; son of; that the; earl of; his own; as the; it is; under the; history of; life of; and to; the english; died at; but he; to the; for the; and was; from the; it was; in a; was the; had been; he died; have been; for a; of sir; with his; the british; when he; died in; the university; was in; returned to; he married; of his; by the; of a; was a; the king; by his; the same; and he; was born; and on; he became; on his; during the; by a; member of; was elected; the following; to him; published in; the most
Table 8.3: 100 Most Frequent Trigrams from the Dictionary of National Biography. Trigrams with Male Referents are Italicised.

one of the; to have been; a member of; the same year; the end of; was educated at; in the following; in which he; history of the; the church of; part in the; of which he; the age of; president of the; he had been; he was made; in the house; at the age; s life of; of the first; he was one; at the same; of the house; at the end; of his own; he was appointed; the house of; and in the; and was buried; the duke of; was born in; was buried in; which he was; of his life; seems to have; the royal society; s hist of; which he had; account of the; the british museum; of the church; was appointed to; where he was; and on the; he died in; of the english; whom he had; the history of; the royal academy; is in the; he was a; member of the; in the same; the death of; was born at; the earl of; said to have; the university of; as well as; house of commons; is said to; he was educated; he was also; of the british; of the king; he became a; he had a; the son of; the battle of; he died on; at the time; one of his; on the death; a man of; he was in; of the royal; the king s; was one of; he was elected; that he was; he died at; by his wife; and he was; the following year; his father s; and his wife; that he had; he was the; when he was; he returned to; the same time; part of the; to the king; he went to; a fellow of; and of the; buried in the; cal state papers; the author of; eldest son of
Forename (for example, David, Dave).
Surname (for example, Smith, Jones, Brown).
Family relationship (for example, father, son, daughter).
3. Year: This boolean feature is triggered if the sentence contains a year (for
example, 2005, 2005-06, or 2005-2010).
4. Date: This boolean feature is triggered if there is a date in the sentence.
Dates for these purposes include any month name (for example, January,
Jan) and also numerical dates of various kinds (for example, 09/09/2005,
9.9.2005, and so on3 ).
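A simplified Python sketch of these boolean features is given below. The thesis used gazetteers and finite-state automata (particularly for the name features, which are omitted here); the patterns below are illustrative stand-ins rather than the original implementation.

    import re

    PRONOUNS = {"he", "she", "him", "her", "his", "hers"}
    MONTHS = {"january", "jan", "february", "feb", "march", "mar", "april", "apr",
              "may", "june", "jun", "july", "jul", "august", "aug", "september",
              "sep", "october", "oct", "november", "nov", "december", "dec"}

    def biographical_features(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return {
            # 1. Pronoun: the sentence contains one of the listed pronouns
            "pronoun": any(t in PRONOUNS for t in tokens),
            # 3. Year: a four-digit year, optionally a range such as 2005-06 or 2005-2010
            "year": bool(re.search(r"\b\d{4}(-\d{2,4})?\b", sentence)),
            # 4. Date: a month name, or a numerical date such as 09/09/2005 or 9.9.2005
            "date": any(t in MONTHS for t in tokens)
                    or bool(re.search(r"\b\d{1,2}[./]\d{1,2}[./]\d{2,4}\b", sentence)),
        }

    print(biographical_features("He was born on 9.9.1875 and died in 1964."))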
8.3 Syntactic Features
Ten syntactic features were identified as particularly appropriate for representing biographical texts, based on data published as part of the research project described by BIBER (1988) (see Section 2.1.2 on page 11), who made available comprehensive frequency counts of syntactic features by genre in a corpus constructed largely from the Lancaster-Oslo-Bergen (LOB) corpus (JOHANSSON ET AL., 1978) (see page 16 for a list of the genres used).
From this data, it is straightforward to calculate those features most and least prevalent in the biographical genre. For each syntactic feature (see Table 8.4 on the next page and Appendix C), the mean frequency across genres (excluding the biographical genre) was calculated. Then the syntactic features with the maximum distance from that mean in the biographical genre were identified. For example, if the mean frequency per thousand words for the feature "past tense" is 43.7 across all genres excluding biography, and the biography frequency is 68.4 per thousand, then the biography genre has 24.6 past tense features per thousand words above the mean. All 67 features can be analysed in this way to produce a ranking of the features most distinctive of the biographical genre. Table 8.5 on page 150 shows the top twenty syntactic features in rank order according to each feature's distance from the mean (whether positive or negative). Another method for identifying biographically relevant features was tried, which used standard deviations (that is, z-scores) from the mean instead of raw distance (see Appendix C). The initial method, using features identified by their raw distance from the mean, was favoured, however, as the features identified using standard deviations included features that can sensibly be used only at the document level (for example, type/token ratios). Tables 8.6 on page 150 and 8.7 on page 151 show the twenty most characteristic and the twenty least characteristic syntactic features respectively, and Appendix C shows results for all the syntactic features identified by BIBER (1988).
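The ranking procedure just described can be sketched in a few lines of Python; the code is illustrative only, and the frequency dictionary passed to it would hold per-genre frequencies per thousand words of the kind published by Biber, not values invented here.

    def rank_by_distance(freqs, bio_genre="biography"):
        # freqs maps each syntactic feature to {genre: frequency per 1000 words}
        rows = []
        for feature, by_genre in freqs.items():
            non_bio = [v for g, v in by_genre.items() if g != bio_genre]
            mean_non_bio = sum(non_bio) / len(non_bio)
            # signed distance of the biographical genre from the non-biographical mean
            rows.append((feature, by_genre[bio_genre] - mean_non_bio))
        # rank by absolute distance, keeping the sign for later inspection
        return sorted(rows, key=lambda row: abs(row[1]), reverse=True)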
3 The regular expressions used to identify dates are reproduced at:
http://www.dcs.shef.ac.uk/ mac/date regexp.txt
Table 8.4: Syntactic Features Used by BIBER (1988).

past tense; present tense; time adverbials; second person pronouns; pronoun IT; indefinite pronouns; WH questions; gerunds; agentless passives; BE as main verb; THAT verb complements; WH clauses; present participle clauses; past prt. WHIZ deletions; THAT relatives: subj. position; WH relatives: subj. position; WH relatives: pied pipes; adv. subordinator - cause; adv. sub. - condition; prepositions; predictive adjectives; type/token ratio; conjuncts; hedges; emphatics; demonstratives; necessity modals; public verbs; suasive verbs; contractions; stranded prepositions; split auxiliaries; non phrasal coordination; analytic negation; perfect aspect verbs; place adverbials; first person pronouns; third person pronouns; demonstrative pronouns; DO as pro-verb; nominalisations; nouns; BY passives; existential THERE; THAT adj complements; infinitives; past participle clauses; present prt. WHIZ deletions; THAT relatives: obj. position; WH relatives: obj. position; sentence relatives; adv. sub. - concession; adv. sub. - other; attributive adjectives; adverbs; wordlength; downturners; amplifiers; discourse particles; possibility modals; predictive modals; private verbs; SEEM/APPEAR; that deletion; split infinitives; phrasal coordination; synthetic negation
Biber's original work was conducted in the late 1980s, at a time when natural language processing tools were less well developed. Highly accurate (97%+) part-of-speech tagging was not available for the original work; instead, in order to identify the linguistic features of interest, Biber relied heavily on a gazetteer-based approach, augmented with simple pattern matching. For instance, when identifying past tense verbs, Biber's work relied on the use of a
Table 8.5: Twenty Syntactic Features Most Characteristic of Biography Ranked by Maximum Distance from Mean.

Rank  Distance  Feature Name               Non-bio Mean  Biographical Mean
1     41.9      present tense              77.8          35.9
2     29.7      adverbs                    95.6          65.9
3     24.6      past tense                 43.7          68.4
4     16.3      prepositions               106.2         122.6
5     13.0      nouns                      179.3         192.4
6     12.5      contractions               13.4          0.9
7     10.1      second person pronouns     10.7          0.6
8     8.0       first person pronouns      30.1          22.1
9     7.1       attributive adjectives     59.2          66.4
10    4.4       private verbs              18.0          13.6
11    4.2       BE as main verb            28.4          24.2
12    3.6       type/token ratio           51.5          55.2
13    3.3       demonstrative pronouns     4.2           0.9
14    2.7       pronoun IT                 10.3          7.6
15    2.7       predictive modals          6.0           3.3
16    2.5       nominalisations            18.0          20.6
17    2.3       analytic negation          8.5           6.2
18    2.2       emphatics                  6.4           4.2
19    2.2       non phrasal coordination   4.6           2.4
20    2.0       that deletion              3.2           1.2
Table 8.6: Twenty Syntactic Features Characteristic of Biography Ranked by Positive Association with Biographical Genre.

Rank  Distance  Feature Name                  Non-bio Mean  Biographical Mean
1     24.6      past tense                    43.7          68.4
2     16.3      prepositions                  106.2         122.6
3     13.0      nouns                         179.3         192.4
4     7.1       attributive adjectives        59.2          66.4
5     3.6       type/token ratio              51.5          55.2
6     2.5       nominalisations               18.0          20.6
7     1.4       agentless passives            8.4           9.9
8     1.4       perfect aspect verbs          9.1           10.6
9     1.3       phrasal coordination          3.5           4.9
10    1.3       split auxiliaries             5.2           6.6
11    1.2       infinitives                   15.6          16.9
12    0.9       demonstratives                9.7           10.7
13    0.7       synthetic negation            1.8           2.6
14    0.4       WH relatives: obj. position   1.4           1.9
15    0.4       suasive verbs                 2.7           3.2
16    0.35      WH relatives: subj. position  2.05          2.4
17    0.3       WH relatives: pied pipes      0.6           1.0
18    0.3       third person pronouns         33.9          34.3
19    0.2       BY passives                   0.6           0.9
20    0.1       adv. sub. - other             0.9           1.1
Table 8.7: Twenty Syntactic Features Characteristic of Biography Ranked by Negative Association with Biographical Genre.

Rank  Distance  Feature Name              Non-bio Mean  Biographical Mean
1     -41.9     present tense             77.8          35.9
2     -29.7     adverbs                   95.6          65.9
3     -12.5     contractions              13.4          0.9
4     -10.1     second person pronouns    10.7          0.6
5     -8.0      first person pronouns     30.1          22.1
6     -4.4      private verbs             18.0          13.6
7     -4.2      BE as main verb           28.4          24.2
8     -3.3      demonstrative pronouns    4.2           0.9
9     -2.7      pronoun IT                10.3          7.6
10    -2.7      predictive modals         6.0           3.3
11    -2.3      analytic negation         8.5           6.2
12    -2.2      emphatics                 6.4           4.2
13    -2.2      non phrasal coordination  4.6           2.4
14    -2.0      that deletion             3.2           1.2
15    -1.9      possibility modals        5.9           4.0
16    -1.8      predictive adjectives     4.9           3.1
17    -1.7      DO as pro-verb            2.9           1.2
18    -1.6      adv. sub. - condition     2.5           0.9
19    -1.3      stranded prepositions     1.9           0.6
20    -1.2      place adverbials          3.2           2.0
stored dictionary, and assumed that any word ending in –ed and longer than
six letters was a past tense verb. The current work relied heavily on a standard
Hidden Markov model-based part-of-speech tagger, using a subset of the Penn
Treebank tagset.4
Ten features were chosen, the five features most characteristic of biography,
and the five features least characteristic. The five most characteristic features
were:
1. Past Tense: Identified by the part-of-speech tagger.
2. Preposition: Identified by the part-of-speech tagger.
3. Noun: Identified by the part-of-speech tagger.
4. Attributive Adjective: These are adjectives that fit into the pattern ADJ +
ADJ/N. For example, “big cat”, or “big scary cat”.
5. Nominalisation: These were nouns identified by the part-of-speech tagger, ending in –tion, –ment, –ness, or –ity.
4 The Perl module Lingua-EN-Tagger available from CPAN. http://www.cpan.org Accessed on 02-01-07.
The five least characteristic features were:
1. Present Tense: Identified by the part-of-speech tagger.
2. Adverb: Identified by the part-of-speech tagger.
3. Contraction: Identified using a gazetteer of common contractions.
4. Second Person Pronouns: Identified using a gazetteer.
5. First Person Pronouns: Identified using a gazetteer. Note that the biographical texts used by B IBER (1988) excluded autobiography, hence the
low frequency of first person pronouns.
8.4 Key-keyword Features
A further alternative for selecting biographical features involves using key-keywords. TRIBBLE (1998) adopted a "keyword" methodology for genre analysis that is much more straightforward to execute than the multi-dimensional method (see Section 2.1.2 on page 21 for more on the motivation for using the key-keyword method for genre analysis). As the key-keyword methodology is designed to select those features especially distinctive of a given genre, it can also (we hypothesise) be employed as a feature selection method for genre classification purposes. First, two corpora were constructed: a biographical corpus consisting of 383 short biographical documents from Wikipedia and Chambers Dictionary of Biography (see Section 4.3 on page 87 for a description of these corpora) and a reference corpus (the BROWN corpus, see Section 4.3 on page 87).5 It is important that attempts are made to make the reference corpus balanced (that is, containing text from various different sources), hence the use of the BROWN corpus, which, despite its roots in the 1960s, does cover a large number of text types (again, see Section 4.3 on page 87). Note that TRIBBLE (1998) found that the size of the reference corpus used is not of vital importance, a result also gained by XIAO and MCENERY (2005), who discovered that the one million word FLOB corpus6 and the 100 million word British National Corpus7 yielded a similar keyword list. This suggests that the one million word
BROWN corpus is a suitable choice for the task in terms of its size and balance. However, one important difference between the biographical and reference corpora is that the reference corpus is entirely composed of American English, whereas the biographical corpus is composed of British English (Chambers) along with other English variants, including American and British English
(Wikipedia biographies).
5 The biographical corpus consisted of 47,967 words taken from 383 documents. These documents were randomly selected from Wikipedia Biographies (194 documents used) and Chambers
Biographies (189 documents used). Both these sources of biographical data are described in Section 4.3 on page 87.
6 Freiburg-LOB corpus http://khnt.hit.uib.no/icame/manuals/flob/INDEX.HTM
Accessed on 02-01-07
7 British National Corpus http://www.natcorp.ox.ac.uk Accessed on 02-01-07
Two related methods for extracting key-keywords were used in this work.
First, the naive key-keywords method. Second, the WordSmith key-keywords
method. Note that the WordSmith method was used by TRIBBLE (1998).
8.4.1 Naive Key-keywords Method
The process of identifying “naive” key-keywords can usefully be divided into
two stages:
1. The most discriminatory one thousand biographical keywords were identified by comparing the biographical corpus (the Chambers and Wikipedia biographical documents) with a reference corpus (the BROWN corpus), using the feature selection method described in Section 2.5.3 on page 50. The 1000 most discriminatory unigrams identified by this method are referred to as "keywords" for the biographical genre.
2. The 1000 most discriminatory keywords identified in the first stage were re-ranked according to the number of biographical documents in which they occur (remembering that there are 383 biographical documents in total). The resulting ranking is the naive key-keyword8 ranking. For example, if the unigram "born" occurs in 320 biographical documents, and the unigram "married" occurs in 205 biographical documents, then the unigram "born" will be ranked above the unigram "married" in the key-keywords list. That is, the unigram "born" will have a higher key-keyword ranking than "married". The intuition here is that while a high-ranked keyword may occur in only one or two biographical documents, a high-ranked naive key-keyword is likely to appear in many biographical documents. (A sketch of this procedure is given below.)
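The following Python sketch illustrates the two-stage naive key-keyword procedure; the simple relative-frequency ratio used as a keyness score here is a stand-in for the feature selection measure described in Section 2.5.3, and the function arguments are illustrative.

    from collections import Counter

    def naive_key_keywords(bio_docs, ref_counts, ref_total, n_keywords=1000):
        # bio_docs: list of documents, each a list of lower-cased tokens
        bio_counts = Counter(t for doc in bio_docs for t in doc)
        bio_total = sum(bio_counts.values())

        # Stage 1: rank word types by a simple keyness score (relative-frequency ratio)
        def keyness(word):
            bio_rel = bio_counts[word] / bio_total
            ref_rel = (ref_counts.get(word, 0) + 1) / (ref_total + 1)   # add-one smoothing
            return bio_rel / ref_rel

        keywords = sorted(bio_counts, key=keyness, reverse=True)[:n_keywords]

        # Stage 2: re-rank the selected keywords by biographical document frequency
        doc_freq = Counter(t for doc in bio_docs for t in set(doc))
        return sorted(keywords, key=lambda w: doc_freq[w], reverse=True)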
Table 8.8 on the following page presents the twenty most frequent unigrams in
the biographical corpus, together with information about the number of biographical documents (that is, Chambers or Wikipedia biographies) in which the
unigram occurs. Table 8.9 on page 155 shows the 20 unigrams with the highest
naive key-keyword value (that is, of the most discriminatory 1000 keywords identified in the first stage, those 20 that appear in the most biographical documents). Note that column three of Table 8.8 and Table 8.9 refers to
the proportion of biographical texts in which the keyword occurs, and column
four gives the number of texts in which the keyword occurs (of which there
were 383 in total). Note also that ordinary function words appear high on both
lists (for example, “in”, and “and”). The word “in” is used disproportionately
frequently in the biographical texts to indicate the time of a biographically significant event (for example, “He died in 1964”) or the location of an event (“He
was born in London"). Of the 26,339 instances of "[iI]n" (that is, "in" or "In") in the BROWN corpus, only 607 (3%) were followed by a four-digit date. When the
8 We have named this method the “naive” method as it is less computationally intensive than
the WordSmith method.
Table 8.8: Unigrams in the Biographical Corpus Ranked by Frequency (with Additional Information about the Number of Biographical Documents in which the Unigram Occurs).

Rank  Unigram  % of Bio Docs in which   No. of Bio Docs in which
               Unigram is Present       Unigram is Present
1     the      97                       372
2     in       94                       361
3     of       87                       336
4     and      89                       344
5     he       78                       302
6     a        81                       312
7     was      83                       321
8     to       79                       270
9     his      63                       244
10    as       53                       204
11    for      49                       189
12    at       50                       193
13    on       42                       163
14    s        37                       145
15    with     44                       170
16    by       38                       148
17    she      17                       67
18    that     26                       102
19    from     39                       153
20    an       40                       154
biographical texts were analysed, the proportion of instances of "in" followed by four digits was 26% (504 out of 1983 instances).9 Additionally, "in" occurs more than twice as often in the biographical texts as in the reference (BROWN) corpus (4.24% and 2.09% respectively). The large discrepancy in the frequency of "in" is likely to arise — at least partially — from the increased use of the word "in" to associate an event with a year in biographical text.
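The count reported above can be reproduced in outline with a short script; the word-boundary patterns below are a slight simplification of the regular expressions given in the footnote, and the example sentence is illustrative.

    import re

    def in_followed_by_year(text):
        total = len(re.findall(r"\b[Ii]n\b", text))              # all instances of "in"/"In"
        with_year = len(re.findall(r"\b[Ii]n\s\d{4}\b", text))   # "in" followed by a four-digit year
        return with_year, total

    print(in_followed_by_year("He was born in London in 1780 and died in 1832."))
    # -> (2, 3)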
8.4.2 WordSmith Key-keywords Method
The process for identifying WordSmith key-keywords falls into two stages:
1. For each of the 383 biographical documents a keyword list was produced
(using the same keyness measure and the BROWN corpus as a reference corpus).10
9 The regular expressions used to identify "in", and "in" followed by a four-digit year, were "\s[Ii]n(\s|,|\.)" and "\s[Ii]n\s\d\d\d\d(\s|,|\.)", respectively.
10 Keywords were selected from each biographical document by calculating the keyness value for each word type in that document against the reference corpus. The average keyness value of all the word types in the biographical document was then calculated, and those word types with a value greater than the average were selected as keywords for that biographical document. This operation was performed using the AntConc concordancing software.
Table 8.9: Unigrams in the Biographical Corpus Ranked by Naive Key-keyness (with Additional Information about the Number of Biographical Documents in which the Unigrams Occur).

Rank  Unigram  % of Bio Docs in which   No. of Bio Docs in which
               Unigram is Present       Unigram is Present
1     in       94                       361
2     and      89                       344
3     was      83                       321
4     he       78                       302
5     his      63                       244
6     born     57                       222
7     as       53                       204
8     at       50                       193
9     an       40                       154
10    became   29                       112
11    after    28                       110
12    first    22                       86
13    also     22                       86
14    died     20                       78
15    she      17                       67
16    later    17                       66
17    known    16                       65
18    years    15                       61
19    her      15                       59
20    work     14                       57
2. A key-keyword list was produced by identifying those words that appeared as keywords in the greatest number of biographical documents: "A 'key key-word' is one which is 'key' in more than one of a number of related texts. The more texts it is 'key' in, the more 'key key' it is."11 This method can be contrasted with the naive key-keywords method. Instead of re-ranking the keywords according to the number of biographical documents in which they occur, the WordSmith method simply ranks words according to the number of documents in which they are key. For example, if the unigram "born" is a keyword in 100 biographical documents and the unigram "married" is a keyword in 40 biographical documents, then the keyword "born" will have a higher WordSmith key-keyword ranking than "married".12 (A sketch of this ranking is given below.)
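A Python sketch of the WordSmith-style ranking (as approximated with AntConc in this work) is given below; the per-document keyness score is again a stand-in relative-frequency ratio, and the document-average threshold follows the description in the footnotes.

    from collections import Counter

    def wordsmith_key_keywords(bio_docs, ref_counts, ref_total):
        # bio_docs: list of documents, each a list of lower-cased tokens
        key_in_docs = Counter()
        for doc in bio_docs:
            doc_counts = Counter(doc)
            doc_total = len(doc)
            # keyness of each word type in this document against the reference corpus
            scores = {w: (c / doc_total) / ((ref_counts.get(w, 0) + 1) / (ref_total + 1))
                      for w, c in doc_counts.items()}
            threshold = sum(scores.values()) / len(scores)     # document-average keyness
            key_in_docs.update(w for w, s in scores.items() if s > threshold)
        # rank words by the number of documents in which they are key
        return key_in_docs.most_common()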
11 WordSmith documentation:
http://www.lexically.net/downloads/version4/
Accessed on 01-05-07.
12 In his original work on genre analysis, TRIBBLE (1998) used the WordSmith suite of programs to identify key-keywords (this approach was also adopted by XIAO and MCENERY (2005)). The WordSmith program was not available for this work, but similar functionality was achieved using AntConc, a text analysis and concordancing tool developed by Laurence Anthony at Waseda University, Tokyo. For the naive key-keywords method, AntConc was used to identify biographical keywords against a reference corpus using the keyness-based feature selection method; the identified keywords were then post-processed using Perl scripts in order to rank them by the proportion of biographical texts in which they occurred. For the WordSmith key-keywords, AntConc was used to generate keyword lists for each of the 383 biographical documents (again using the same keyness measure and the BROWN corpus as a reference corpus). A Perl script was then used to identify those unigrams that were key in the greatest number of documents.
Table 8.10: Unigrams in the Biographical Corpus Ranked by WordSmith Key-keyness (with Additional Information about the Number of Biographical Documents in which the Unigrams Occur).

Rank  Unigram    % of Bio Docs in which   No. of Bio Docs in which
                 Unigram is Key           Unigram is Key
1     ii         6.8                      26
2     usa        5.0                      19
3     actress    3.4                      19
4     stub       2.9                      11
5     honour     2.1                      8
6     centre     2.1                      8
7     iii        1.8                      7
8     edinburgh  1.8                      7
9     video      1.6                      6
10    bbc        1.6                      6
11    barry      1.6                      6
12    albums     1.6                      6
13    yorkshire  1.3                      5
14    vols       1.3                      5
15    uk         1.3                      5
16    medal      1.3                      5
17    lionel     1.3                      5
18    iraq       1.3                      5
19    honour     1.3                      5
20    eng        1.3                      5
Table 8.10 shows twenty unigrams from the biographical corpus ranked by
WordSmith key-keyness. It is noticeable that the unigrams identified using
the WordSmith key-keyword method differ from those identified by the naive
key-keyword method. The lack of function words among the highest ranking
WordSmith key-keywords is striking, as is the appearance of unigrams that
are perhaps tied to the particular biographical corpora used, rather than the
biographical genre in general. For example, the unigrams “ii” and “iii” appear very high in the key-keyword list because of the convention of referring
to monarchs and emperors using Roman numerals (for instance, “Selim II”
or “Mahmud II” in Chambers biographies). Similarly the unigram “stub” — a
word used by Wikipedia to indicate that an entry is a short summary — appears
as a keyword in 2.9% of biographical documents. Place names (“edinburgh”,
“uk” and “yorkshire”) also appear on the list, as well as a single personal name
("lionel"). Notably, intuitively biographical words such as "born", "married" and "died" do not occur in the list.
8.5 Conclusion
This chapter has described various feature sets developed for this research
work. It is important to understand the feature sets as they are referenced
extensively in Chapters 9 and 10, where a series of experiments compares the
performance of different feature sets on the biographical sentence classification
task.
CHAPTER 9

Automatic Classification of Biographical Sentences
This chapter presents a series of experiments based on the gold standard data
described in Chapter 5 and validated in Chapter 6 on page 124. The gold standard data, together with the bio features feature extraction program and
the WEKA machine learning environment (see Section 4.2 on page 86) provides
a test-bed for assessing the classification power of different feature sets for the
biographical sentence classification task.
This chapter is divided into four sections. The first section describes the common procedure for all experiments, and the remaining three sections each reflect a different research theme:
Syntactic Features — comparing the performance of syntactic and "bag-of-words" based feature representations for the biographical categorisation task (see page 160). This section addresses Hypothesis 2 ("bag-of-words" style sentence representations augmented with syntactic features provide a more effective sentence representation for biographical sentence recognition than "bag-of-words" style representations alone).
Lexical Features — analysing the performance of lexically based alternatives to the “bag-of-words” approach for the biographical categorisation
task (see page 165).
Exploring Keyness — assessing methods for identifying optimal lexically based features, especially the concepts of "keyness" and "key-keyness"
(see page 170).
9.1 Procedure
The bio features (see Section 4.2 on page 86) program was used to construct a feature matrix from the gold standard data of five hundred and one
sentences (see Chapter 5) for all the sentence representations used. The resulting feature matrices were fed to the WEKA machine learning environment. As it has been shown that the Naive Bayes and Support Vector Machine learning algorithms provide the best results on the gold standard biographical data using the 500 most frequent unigrams derived from the Dictionary of National Biography (see Chapter 7 on page 137), these two algorithms were used in all experiments, with a decision tree algorithm — the WEKA implementation of C4.5 — also used for data exploration purposes.
While this chapter describes work using features derived from the DNB (and
other corpora) with the ”gold standard” sentences used as training and test
data, other approaches are possible. We can usefully distinguish between:
(A) Corpus used to select features.
(B) Corpus used to train classifiers.
(C) Corpus used to test trained classifiers.
The current work primarily uses DNB data as (A) and the gold standard data
as (B) and (C). One alternative to this strategy is the use of the “gold standard”
data for all three categories (that is, using the gold standard data as a source of unigram features, as training data and as test data). However, this strategy has been avoided, as it was suspected that using unigrams derived from the gold standard corpus would artificially inflate accuracy. This intuition was found to be well grounded when the Naive Bayes algorithm was used to classify the gold standard data using a set of one hundred unigram features selected from all the unigrams1 in the gold standard data set. The gold standard data itself was used as a source of biographical and non-biographical instances. The result was, as expected, a classification accuracy higher than in all other experiments, at 83.90% (using 10 x 10 fold cross validation). This theme is explored further in Chapter 10, where we examine the portability of features identified by ZHOU ET AL. (2004).
A further alternative involves a separation between training and test data. For
example, using different sources for (A), (B) and (C). That is, derive features
from one corpus, train a classifier on another corpus, and finally test the trained
classifier on a third corpus.2 This approach has been shown to give promising results (although it must be stressed that this result is only provisional and requires further work). The 500 most frequent unigrams in the DNB
(A) were used as a feature set, in conjunction with a training set consisting of
the 1000 sentence sample from the filtered TREC corpus (TREC-F) described on
1 There are 3504 unigram types, and 11,245 tokens, in the "gold standard data".
2 Note that when this kind of approach is used, cross-validation cannot be conducted.
page 115 and the 1000 sentence sample of the Chambers Biographical Dictionary
(again described on page 115). The TREC-F sample sentences and the CHA-A sample sentences served as non-biographical and biographical training data, respectively, and functioned as data source (B) (that is, training data). The model
trained using the TREC-F/CHA-A data (using the 500 most frequent unigrams
in the DNB as features) was then tested on all the gold standard data, achieving a classification accuracy of 75.64%. Although this result is interesting, this
thesis is focused on the identification of appropriate features for biographical
sentence classification. The topic of exportable models may be a fruitful area for
future research.
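For illustration, the cross-corpus set-up sketched above can be written with scikit-learn standing in for WEKA; the feature list, sentences and labels below are toy placeholders for the DNB unigram features (A), the CHA-A/TREC-F training data (B) and the gold standard test data (C), not the actual corpora.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.metrics import accuracy_score

    dnb_unigrams = ["born", "married", "died", "appointed", "in"]   # stand-in for the DNB 500
    vectoriser = CountVectorizer(vocabulary=dnb_unigrams, binary=True)

    # (B) toy training data: biographical (CHA-A-like) and non-biographical (TREC-F-like)
    train_sents = ["He was born in 1780 and died in 1832.",
                   "The committee met to discuss the annual budget."]
    train_labels = [1, 0]

    # (C) toy test data standing in for the gold standard sentences
    gold_sents = ["She married in 1901.", "Prices rose sharply last week."]
    gold_labels = [1, 0]

    model = BernoulliNB().fit(vectoriser.transform(train_sents), train_labels)
    print(accuracy_score(gold_labels, model.predict(vectoriser.transform(gold_sents))))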
In line with good practice, the current chapter employs a 10 x 10 fold cross validation evaluation methodology (see Section 2.5.2). The corrected re-sampled t-test was used to compare algorithms for statistical significance (see Section 2.5.2 on
page 48 for a discussion of issues in, and methods for, assessing classification).
The Dictionary of National Biography was chosen as the main source for deriving
features (although others were used) as it is the largest corpus of biographical
text available for this work.
9.2 Syntactic Features
The text classification literature has consistently shown that the use of syntactic
features fails to improve classification accuracy (see Section 3.1.2 on page 57).
Indeed, SCOTT and MATWIN (1999) state that "it is probably not worth pursuing simple phrase based representations further". Contrary to this trend in the topic-based text classification field, there is some evidence that syntactic features are appropriate for genre classification, as syntactic features, rather than topical words — it is suggested — can capture the non-topical style of a text. SANTINI (2004a) gained encouraging results from the use of part-of-speech trigrams (that is, data was first part-of-speech tagged and then the sequences of three tags most characteristic of each genre were used as features). Also, STAMATATOS ET AL. (2000a) found that syntactic features (noun phrases, verb phrases, and so on) could improve accuracy for genre classification of modern Greek texts (Section 3.1.3 on page 59 describes this work more fully). It is important to emphasise that both STAMATATOS ET AL. (2000a) and SANTINI (2004a) are concerned with classifying at the document rather than the sentence level. Indeed, STAMATATOS ET AL. (2000a) states that a lower bound of one thousand words is desirable in order to increase accuracy. In this research we are concerned entirely with sentence classification, a different but related task.
This section – in line with the general hypothesis that “bag-of-words” style sentence representations augmented by syntactic features provide a more effective
representation for biographical sentence classification than “bag-of-words” style
representations alone — explores different syntactic and pseudo-syntactic feature sets, and compares them to the standard “bag-of-words” approach.
In this context, pseudo-syntactic features are word n-grams where n > 1. They are referred to as pseudo-syntactic features because it is hypothesised that bigrams, trigrams and so on provide a computationally inexpensive method for
capturing syntactic information that does not require complex processing (for
example, part-of-speech tagging, chunking and so on).
The feature sets used in this experiment are described in Chapter 8, but briefly
summarised below:
The 2000 most frequent unigrams from the Dictionary of National Biography. This provided the baseline.
The 2000 most frequent bigrams in the Dictionary of National Biography.
The 2000 most frequent trigrams in the Dictionary of National Biography.
Syntactic features (that is, features identified from a statistical analysis of
the data presented in BIBER (1988)).
Syntactic features and 2000 most frequent Dictionary of National Biography
bigrams.
Syntactic features and 2000 most frequent Dictionary of National Biography
trigrams.
The 2000 most frequent unigrams in the Dictionary of National Biography
augmented with fifty most frequent bigrams in the Dictionary of National
Biography.
Syntactic features and the 2000 most frequent Dictionary of National Biography unigrams.
The last two — unigrams and syntactic features, and unigrams and fifty bigrams – are included to facilitate the testing of the central hypothesis, that
“bag-of-words” style representations augmented by syntactic features (or in
the case of bigrams, pseudo-syntactic features) are better representations for
the biographical classification task than “bag-of-words” representations alone.
Note that Chapter 8 on page 144 comprehensively describes the feature sets
used.
9.2.1 Results
It can be seen in Table 9.1 on page 163 that unigrams alone perform well at
78.78%,3 with the success of n-grams (where n > 1) declining sharply (see Figure 9.1 on the next page for a comparison of the performance of unigram, bigram and trigram representations). It is notable that classification accuracy
3 Note that all percentages quoted were obtained using the Naive Bayes classification algorithm.
Figure 9.1: Comparison of the Performance of Unigrams, Bigrams and Trigrams.
Figure 9.2: Comparison of the Performance of Syntactic and Pseudo-Syntactic
Features.
Table 9.1: Performance of Syntactic and Pseudo-syntactic Features.

Feature Set                              Naive Bayes (%)  SVM (%)
2000 DNB Unigrams                        78.78            78.18
2000 DNB Bigrams                         69.08            71.28
2000 DNB Trigrams                        57.98            61.45
Biber Features                           69.84            66.61
Biber Features and DNB Unigrams          80.68            77.72
Biber Features and DNB Bigrams           74.07            72.42
Biber Features and DNB Trigrams          64.54            69.30
2000 DNB Unigrams and 50 DNB Bigrams     79.18            77.30
Figure 9.3: Experimental and Null Hypotheses — Syntactic Features.
Experimental Hypothesis: There is a difference between "bag-of-words" style feature representations augmented with syntactic
features and “bag-of-words” style representations alone for the
biographical categorisation task.
Null Hypothesis: There is no difference between the performance of
“bag-of-words” style representation and “bag-of-words” style
representations augmented by syntactic features.
achieved using the most frequent 500 unigrams in the DNB — reported in
Chapter 7 — yielded a result of 80.66%, almost 2% higher than that achieved
using four times as many unigram features.
It is notable that the two feature representations that augment “bag-of-words”
style representations with some syntactic representation — or in the case of bigrams, pseudo-syntactic representations — fare better than the “bag-of-words”
baseline (that is, unigrams). The resulting accuracy scores were 80.68% for
unigrams augmented with syntactic features, and 79.18% for unigrams augmented with pseudo-syntactic features. Neither of these accuracy scores, however, reaches a significance level that would allow strong conclusions to be
drawn (using the two tail corrected re-sampled -test) against the baseline unigram performance of 78.78%. In other other words, if we have the Experimental hypothesis and null hypothesis presented in Figure 9.3 (regarding syntactic
features) and 9.4 on the next page (regarding pseudo-syntactic features) then
we are not entitled to reject the null hypothesis in either case on the results
presented here.
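For reference, the corrected re-sampled t-test referred to throughout this chapter is usually formulated as follows (this is the Nadeau and Bengio style correction, in the form implemented by WEKA's paired corrected tester; the notation is added here for exposition and does not appear elsewhere in the thesis):

\[
  t = \frac{\bar{d}}{\sqrt{\left(\frac{1}{k\,r} + \frac{n_{\mathrm{test}}}{n_{\mathrm{train}}}\right)\hat{\sigma}^{2}_{d}}}
\]

where \bar{d} and \hat{\sigma}^{2}_{d} are the mean and variance of the k \times r per-fold accuracy differences between the two representations being compared (here k = r = 10), and n_{\mathrm{test}}/n_{\mathrm{train}} is the ratio of test to training instances in a single fold. The correction term inflates the variance estimate to account for the overlap between training sets across folds, which is why accuracy differences of one or two percentage points can fail to reach significance.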
Figure 9.4: Experimental and Null Hypotheses — Pseudo-Syntactic Features.
Experimental Hypothesis: There is a difference between “bag-of-words” style feature representations augmented with pseudo-syntactic features (in this case, bigrams) and “bag-of-words” style
representations alone for the biographical categorisation task.
Null Hypothesis: There is no difference between the performance of
“bag-of-words” style representations and “bag-of-words” style
representations augmented by pseudo-syntactic features (in this
case bigrams).
9.2.2 Discussion
While the difference between the performance of the two feature sets augmented by syntactic features was not statistically significant compared to the
unigram baseline, the performance of the unigram and syntactic features representation was almost 2% better than the unigram representation alone. Recall that the syntactic features consist of only ten features (including past tense
and attributive adjectives; see Section 8.3 on page 148 for a complete list). These
results are consistent with Santini (2004a) and Stamatatos et al. (2000a) in that they suggest that there is a small accuracy gain in using syntactic features (although Santini (2004a) and Stamatatos et al. (2000a) did not report whether the differences were statistically significant). Note that the syntactic features (that is, Biber features based on an analysis of Biber (1988)’s data) performed better than the pseudo-syntactic features (80.68% and 79.18%, respectively).
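To make the nature of these syntactic features concrete, the following sketch (not the thesis's implementation; NLTK's default tagger and only two of the ten features are used for illustration) counts past tense verbs and attributive adjectives in a sentence from part-of-speech tags.

    # Sketch only: NLTK's default tagger stands in for whatever tagging was used
    # in the thesis, and only two illustrative Biber-style features are counted.
    # Requires: nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")
    import nltk

    def syntactic_counts(sentence):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        past_tense = sum(1 for _, tag in tagged if tag == "VBD")
        # attributive adjective approximated as an adjective immediately before a noun
        attributive = sum(
            1 for (_, t1), (_, t2) in zip(tagged, tagged[1:])
            if t1.startswith("JJ") and t2.startswith("NN")
        )
        return [past_tense, attributive]

    print(syntactic_counts("He attended the local grammar school and later married."))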
Unlike Fürnkranz (1998), we found that classification accuracy for n-grams markedly decreased when n > 1 (see Figure 9.1 on page 162). Fürnkranz (1998) saw trigrams as the optimal n-gram representation, with sequences greater than 3 resulting in a decrease in classification accuracy. The lack of success of
trigrams in the current work can perhaps be attributed to the nature of the corpus from which the trigrams were derived. The Dictionary of National Biography
contains much information that is specific to the culture in which it was produced. For instance, several of the most frequent trigrams refer to the British
monarchy and particular British institutions (for example, “of the king” and
“the british museum”) (see Section 8.3 on page 147).
This experiment has shown that, unlike the case of topic orientated text classification, in biographical text classification, “bag-of-words” style representations
augmented with syntactic features perform better than “bag-of-words” repre-
sentations alone. The difference was not, however, statistically significant using the corrected re-sampled t-test. It remains an open question whether this kind of
small increase in accuracy can be gained for genre classification more generally,
or whether it is confined to the special case of biographical text classification.
Also, other approaches to genre classification have focused on document classification, whereas sentence classification has been at the centre of the current
research. This work does indicate, however, that the claim that syntactic features are unhelpful for text categorisation (made by Moschitti and Basili (2004), and Scott and Matwin (1999)) may apply only to topical categorisation and
not to tasks (like genre orientated classification) where the stylistic elements of
a text are important.
9.3 Lexical Methods
This section explores whether the choice of frequent lexical items from a biographical corpus (in this case the 2000 most frequent unigrams in the Dictionary
of National Biography) produces better accuracy for the biographical classification task than other lexeme based methods. Three alternative lexeme based
methods are compared to a baseline — used in the previous section — of the
2000 most frequent unigrams in the Dictionary of National Biography.
The first alternative representation is based on the intuition that function words
can capture the non-topical content of text. Function words have been shown
to be suboptimal in the authorship attribution research tradition (see Section 2.2.3 on
page 27) compared to the use of synonym pairs (for example “while”/”whilst”),
and it has been suggested that this is because function words are characteristic of genre rather than individual authorial style within a genre (Holmes and Forsyth, 1995). Three hundred and nineteen function words were used as
feature representations.4
The second alternative representation requires the use of stemming: the reduction of inflected word forms to their stem (root) form (see Section 8.1 on
page 144). Stemming is a commonly used technique in the computational linguistics and information retrieval research traditions (Witten et al., 1999), and the Porter algorithm is a widely used stemming algorithm (Porter, 1980).
Stemming allows inflected variants of the same stem (root) word to be represented by one feature. For example, instead of the two separate features “married” and “marry”, one feature will represent both unigrams (using the Porter
stemmer, this single feature is “marri”). This reduction of inflected variants to
a canonical form (it is suggested) will provide better classification accuracy for
the biographical categorisation task, as key biographical words (for example, “work/worked”, “son/sons”, “live/lived/living”, and so on) will be represented by a single feature, and not “diffused” throughout the feature matrix.

4 The list of English function words is available from the University of Glasgow, Department of Computer Science: http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_word Accessed on 02-01-07.

Table 9.2: Performance of Alternative Lexical Methods.

Feature Set                               Naive Bayes (%)   SVM (%)
2000 DNB Unigrams (baseline)                    78.78        78.18
319 Function Words                              75.43        73.59
2000 DNB Unigrams (stemmed)                     79.93        78.92
1713 DNB Unigrams (no function words)           72.37        76.94
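To illustrate the conflation described above (using NLTK's implementation of the Porter algorithm; the thesis's own stemming set-up is described in Chapter 8):

    # Sketch: the Porter stemmer conflates inflected variants to a single form,
    # e.g. "married" and "marry" both reduce to "marri".
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["married", "marry", "marries", "work", "worked", "living"]:
        print(word, "->", stemmer.stem(word))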
Recent work in topic orientated text classification has shown that stemming
produces no advantages when compared to non-stemmed representations (for
example, Toman et al. (2006)). This result may not hold, however, for the genre classification task, or for the special case of biographical classification.
A third approach involves — in contrast to the first approach — the removal
of non-topical function words. This approach is commonly used in the topic
orientated text classification community where it is referred to as stopwording
(see Section 8.1), based on the intuition that topic neutral function words are unlikely to contribute to classification accuracy. In the case of biographical classification, however, where the genre of the text is the target of the feature representation, classification accuracy (it is hypothesised) is likely to reduce with the
removal of stopwords, compared to a baseline which includes those functional
stopwords.
Four feature sets were used in this experiment. They are summarised below,
and described more extensively in Section 8.1:
The 2000 most frequent unigrams from the Dictionary of National Biography. This provided the baseline.
319 function words.
The 2000 most frequent unigrams from the Dictionary of National Biography in stemmed form.
The 1713 most frequent unigrams from the Dictionary of National Biography with function words removed (that is, stopworded).
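Constructing the last of these sets is a simple set difference (the numbers imply that 287 of the 2000 DNB unigrams appear in the function word list); as a sketch, with illustrative variable names:

    # Sketch: remove function words from the baseline DNB vocabulary.
    def stopworded(dnb_unigrams, function_words):
        banned = set(function_words)
        return [w for w in dnb_unigrams if w not in banned]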
9.3.1 Results
It can be seen from the data presented in Table 9.2 and Figure 9.5 on the following page that the stemmed DNB unigrams provided the best performance: 79.93%, compared to the baseline DNB unigram representation of 78.78%. This accuracy improvement, however, is not statistically significant with respect to
Figure 9.5: Comparison of the Performance of Differing Lexical Representations.
the corrected re-sampled t-test, hence the null hypothesis (presented in Figure 9.6 on the next page) cannot be rejected.
The use of function words alone does not improve classification accuracy compared to the baseline for the biographical classification task. Rather, accuracy
actually decreases from 78.78% in the case of Dictionary of National Biography
unigrams, to 75.43% for function words (see Table 9.2 on the preceding page).
The null hypothesis presented in Figure 9.7 on the next page is rejected, but
only because the use of function words alone decreases accuracy at a statistically significant level compared to the unigram baseline.
The absence of function words in a feature representation identical in other respects to the DNB unigram representation (that is, the DNB unigram feature set
with the function words removed) was shown to reduce categorisation accuracy
(compared to the original 2000 feature DNB representation). This difference was shown to be statistically highly significant using the corrected re-sampled t-test. Therefore, it is acceptable to reject the null hypothesis presented in Figure 9.8 on page 169.
Figure 9.6: Experimental and Null Hypotheses — Stemming.
Experimental Hypothesis: There is an accuracy difference between the
performance of stemmed and plain unigrams (derived from a biographical corpus) for the biographical sentence classification task.
Null Hypothesis: There is no accuracy difference between the performance of stemmed and plain unigrams (derived from a biographical corpus) for the biographical sentence classification task.
Figure 9.7: Experimental and Null Hypotheses – Function Words.
Experimental Hypothesis: There is an accuracy difference between
function word based features and frequent unigrams (derived
from a biographical corpus) for the biographical sentence categorisation task.
Null Hypothesis: There is no difference between the performance of
a function word based feature representation and frequent unigram based representations (derived from a biographical corpus)
for the biographical sentence categorisation task.
9.3.2 Discussion
These results show that, for the biographical classification task, the use of content neutral function words produces less accurate results than the use of unigrams derived from a biographical corpus (75.43% and 78.78%, respectively).
This could be for a number of reasons. Perhaps the presence of a few archetypal biographical words (for example, “born”, “died”, “married”, and so on)
is more strongly associated with biographical text than the use of a particular
biographical style that can be identified using function words. In other words,
while function words may do some of the work in biographical classification
(particularly prepositions for identifying place and time; see Section 2.1.2 on
page 11), archetypal biographical words are – it is suggested – helpful for identifying difficult cases. It is notable that the differences between the two accuracy scores is only 3.3%, a small difference when we consider that the function
word feature set consists of only 319 features, and the frequent unigram feature
set consists of 2000 features.
Further evidence for the view that function words are important for the bi-
Figure 9.8: Experimental and Null Hypotheses — Stopwords.
Experimental Hypothesis: There is a difference between the performance of a feature set based on the 2000 most frequent unigrams
in the Dictionary of National Biography with all function words removed, and the original, unmodified 2000 unigram representation, for the biographical categorisation task.
Null Hypothesis: There is no difference between the performance of a
feature set based on the 2000 most frequent unigrams in the Dictionary of National Biography with all function words removed and
the 2000 most frequent unigrams in the Dictionary of National Biography, for the biographical categorisation task.
ographical categorisation task is provided by the performance of the “stopworded” feature set (that is, the feature set that contains the most frequent
2000 unigrams in the Dictionary of National Biography minus function words).
The “stopworded” feature set had the worst performance compared to the baseline (72.37% and 78.78%, respectively — see Figure 9.5 on page 167). The “stopworded” feature set performed worse than the function word feature set (72.37%
and 75.43%, respectively) despite the fact that the “stopworded” feature set
contained 1713 features, and the function word feature set only 319 features.
Cumulatively, this work would tend to support the suggestion made by Holmes and Forsyth (1995) that function words are important for genre classification. It is possible that function words are important for capturing the stylistic content of text; if so, this result supports Hypothesis 2, that syntactic features — broadly understood — are important for genre classification. This claim would, however, require further investigation, as the scope of this research is confined to
biographical sentence classification.
The best result was gained through stemming (79.93%). This accuracy was not, however, significantly better than the baseline (78.78% — gained using two
thousand frequent DNB unigrams). One possible reason for the slight increase
in accuracy achieved by the stemming algorithm is that the baseline feature set
consists of many inflected forms of the same base word (for example, “act”, “acted”, “acting”) which are reduced in the stemmed feature set, making for a
more compact and efficient representation where concepts are less “diffused”
through the feature matrix.
9.4 Keywords
Tribble (1998) identified a methodology for selecting genre specific key-keywords. An overview of the basic feature selection technique is provided in Section 2.1.2 on page 21, and the feature sets used are described in Section 8.4 on page 152.
Two related methods for identifying key-keywords are used in this work. First,
the naive key-keywords method. Second, the WordSmith key-keywords method
(note that this is the method used by Tribble (1998) and Xiao and McEnery
(2005)). The important difference between the naive and WordSmith key-keyword
methods is that the naive method ranks keywords according to the number of
biographical documents in which the keyword occurs, and the WordSmith
method ranks keywords according to the number of biographical documents
in which the word is key.5 For the naive key-keyword method, if the keyword “born” occurs in forty-five biographical documents, it will be ranked
above “marry”, which occurs in twenty-five biographical documents. For the
WordSmith key-keyword method, if “lived” is a keyword in fifteen biographical documents, it will be ranked above “educated”, if “educated” is a keyword
in only four biographical documents.6 For each key-keyword identification
method, the five hundred top key-keywords (ranked by key-keyness) are retained.
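The difference between the two rankings can be sketched as follows (a schematic reconstruction, not the thesis's code: documents are treated as collections of tokens, and is_keyword stands in for the per-document keyness test against the reference corpus described in Section 8.4).

    # Sketch: the naive ranking counts documents containing each keyword, while
    # the WordSmith-style ranking counts documents in which the word is *key*.
    from collections import Counter

    def naive_key_keywords(documents, keywords, top_n=500):
        doc_freq = Counter(w for doc in documents for w in keywords if w in doc)
        return [w for w, _ in doc_freq.most_common(top_n)]

    def wordsmith_key_keywords(documents, is_keyword, top_n=500):
        key_doc_freq = Counter(
            w for doc in documents for w in set(doc) if is_keyword(w, doc)
        )
        return [w for w, _ in key_doc_freq.most_common(top_n)]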
Feature selection is a commonly used technique in machine learning (Witten and Frank, 2005), and it has been shown that aggressive feature selection increases classification accuracy for some kinds of text classification tasks (Yang and Pedersen, 1997). It is hypothesised that key-keywords based methods
will provide more genre representative features than the use of either frequent
unigrams, or derived keywords, alone. Note that feature selection was not
performed on the gold standard data. Rather, features were identified using
a corpus constructed from Wikipedia and Chambers data, in order that unigram features characteristic of the biographical genre in general could be identified.
The key-keyword methodology was utilised by Tribble (1998) (and subsequently validated and explored by Xiao and McEnery (2005)) as a method of genre analysis that avoids the statistical and computational overheads of multidimensional analysis (see Section 2.1.2 on page 11).7 However, the method can easily be applied to feature identification for text classification, as the aim of using the method is the same: identifying those features most representative of
5 The naive key-keyword method requires significantly less processing than the WordSmith
method, as for the WordSmith method a distinct keywords list must be generated for each biographical document.
6 These examples are for explanatory purposes only and do not describe actual frequencies.
7 Note that Tribble (1998) applied the WordSmith key-keywords function to the genre problem. The software has existed since 1996 (see http://www.lexically.net/publications/publications.htm Accessed on 01-05-07).
a given genre. In the case of Tribble (1998), this analysis is performed to cast
light on genre differences, and in the current context, the analysis is performed
to identify those unigram features most characteristic of the biographical genre
to facilitate machine learning.
Note that all the feature sets used in this experiment were derived from biographical documents from Wikipedia and The Chambers Biographical Dictionary
— the biographical corpus — described in Section 8.4 on page 152. The Brown corpus was used as a reference corpus. The feature sets used are fully described
in Chapter 8 on page 144, but summarised briefly below:
The 500 most frequent unigrams from the biographical corpus.
The 500 most discriminatory keywords identified by the keyword selection algorithm.
The 500 most discriminatory key-keywords identified using the naive key-keywords method.
The 500 most discriminatory key-keywords identified using the WordSmith
key-keywords method.
The 500 most frequent unigrams are included as a baseline against which to
test the performance of the keywords and key-keywords representations. The
Dictionary of National Biography was not used as a biographical corpus because
of the need for biographical documents of similar lengths. Dictionary of National
Biography entries vary considerably in length.
9.4.1 Results
Table 9.3 and Figure 9.9 on the following page show that the 500 most frequent unigram feature set performed at 81.25%. The keyword feature set achieved
76.86%. The WordSmith key-keywords and naive key-keywords achieved
68.34% and 78.92%, respectively. The difference between the 500 frequent unigram and the 500 naive key-keywords feature sets was not statistically significant. The difference between the 500 frequent unigrams and the WordSmith
method was significant however, with the WordSmith key-keywords feature
set performance significantly worse than the frequent unigram feature set. For
the WordSmith key-keywords, the null hypothesis presented in Figure 9.10 on
the next page can be rejected. This was a surprising result, as it was expected that both the keyword feature set and the two key-keyword feature sets would achieve better results than the simple frequent unigram based representation. Indeed, the frequent unigram representation outperforms both the keyword feature set and the WordSmith key-keywords feature set at a statistically significant level (using the corrected re-sampled t-test).8
8 Note that the two-tailed test was used despite the expectation that the keyword and key-keyword features would perform better than the baseline.
Figure 9.9: Comparison of the Performance of Keywords, Key-Keywords, and
Frequencies.
Table 9.3: Performance of Keyword and Key-keyword Features Relative to a Baseline.

Feature Set                     Naive Bayes (%)   SVM (%)
500 Frequent Unigrams                 81.25        76.07
500 Keywords                          76.86        76.90
500 Naive Key-Keywords                78.92        78.32
500 WordSmith Key-Keywords            68.34        63.11
Figure 9.10: Experimental and Null Hypotheses – Key-Keywords.
Experimental Hypothesis: There is a difference between the performance of key-keyword based features and frequent unigrams (derived from a biographical corpus) for the biographical categorisation task.
Null Hypothesis: There is no difference between the performance of
key-keyword based features and frequent unigrams (derived
from a biographical corpus) for the biographical categorisation
task.
Figure 9.11: Comparison of Partial Decision Trees for Each Feature Set.
[Figure content not reproducible from the extracted text: four partial decision trees, one for each feature set (Frequent Words, Naive Key-Keywords, WordSmith Key-Keywords, Keywords), with top-level nodes including “school”, “university”, “york”, “college”, “son”, “married”, “wife”, “career” and “best”.]
9.4.2 Discussion
These results show that, for the biographical categorisation task at least, the
use of key-keywords reduces classification accuracy. It is important to note that
feature selection was performed using external data (Wikipedia and Chambers
as a biographical corpus, and the Brown corpus as a reference corpus), in
order to avoid artificially inflating classification accuracy.
In order to gain insight into the differing performance of the four feature sets,
and the surprising success of the frequency feature set, the C4.5 decision tree
algorithm (see Section 2.5.1 on page 39 for more on decision trees generally)
was used to explore decision points in the four feature sets (although it is important to emphasise that the Naive Bayes algorithm does not depend on these
decision points). Figure 9.11 on the page before shows that — for the top levels and with the exception of the WordSmith key-keywords representation —
the trees are similar, with the major difference between the top performing
frequency feature set, and the keywords and naive key-keyword feature sets,
being that “school” is used as the root node of the frequency tree, and does not
occur in the top part of the other trees. The WordSmith key-keyword tree is
very different from the other three trees as there is little overlap between the
features selected by the WordSmith method, and those selected by the alternatives.
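A rough equivalent of this inspection with an off-the-shelf toolkit (scikit-learn's CART-based decision tree here, rather than the WEKA C4.5 implementation used in the thesis) is sketched below.

    # Sketch: fit a shallow decision tree on 0/1 word-presence features and print
    # its top levels to see which features act as early decision points.
    from sklearn.tree import DecisionTreeClassifier, export_text

    def show_top_splits(X, y, feature_names, depth=3):
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
        print(export_text(tree, feature_names=list(feature_names)))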
A partial explanation for the surprising results is that the “school” feature is a
key discriminator in the gold standard biographical data, and while frequent enough in the Wikipedia and Chambers data to warrant inclusion in the most frequent 500 unigrams, did not occur sufficiently frequently compared to the reference corpus to occur in the keywords list or in either of the key-keyword lists. The differences between American and British English may be crucial here. The term “school” is often used in American English to describe what in British English would be described as a “university”. Additionally, the word
“school” occurs frequently in compounds like “high school” and “elementary
school”. The Brown corpus — the reference corpus used in this work — is
a general corpus of American English and hence contains a higher proportion
of this extended sense of “school” (for example, “high school”, “elementary
school”) than a corpus of British English.
It is notable that there is a substantial difference between the frequency based
and naive key-keywords feature sets. Three hundred and fourteen words appear in the frequency list that do not appear in the naive key-keywords list (see
Section 8.4 for a list of features). While some biographically important function
words appear in the naive key-keywords list — for example, the preposition
“in”, the connective “and” and the pronoun “he” — many are absent. For example, “the”, “of” and “to” appear in the frequency list but not in the naive
key-keywords list. Similarly, words that we would intuitively regard as biographical appear in the frequency list — words like “lived” and “children” —
but do not appear in the naive key-keywords list.
The difference between the frequency based feature set and the WordSmith
key-keyword feature set is even more marked than the difference between the
frequency based feature set and the naive key-keywords feature set. 437 words
occur in the frequency based feature set that do not occur in the WordSmith
key-keywords feature set. Biographically relevant function words (like “the”
and “of”) are missing from the WordSmith feature set, as are more obviously
biographical words like “born” and “children”.
There are a number of possible reasons why both the key-keyword feature sets
failed to provide a better (in terms of classification accuracy) representation
than the simple frequency list:
The differences between British and American English (discussed above).
The gold standard corpus is drawn from sources of British English, as
was the set of documents from which the frequency list was derived.9 Yet
the reference corpus used by the feature selection algorithm consisted
of American English. This may have affected the keywords selected by the keyword selection algorithm, from which the key-keywords were in turn selected.
The biographical corpus, consisting of the Wikipedia and Chambers data,
while large enough to provide a “biographical” frequency list, was not
large or varied enough to counter the inclusion of ostensibly non-biographical
unigrams (for example, “neoclassical”, or “lanarkshire”) that occurred towards the top of both key-keyword lists.
It is possible that the number of features used was too low for the benefits
of the key-keyword approaches to be clear. Perhaps if more features were
used in each case, key-keywords might outperform the simple frequency
list approach.
The frequency and keyword lists were derived from biographical documents rather than biographical sentences, whereas the classification task
involved the classification of biographical sentences (more specifically, biographical sentences identified using the annotation scheme outlined in
Chapter 5). The non-biographical sentences in the biographical documents counted equally with the biographical sentences in the frequency
calculations. It is possible that the key-keywords method discarded many
features that are characteristic of biographical sentences. This possibility
is weakened, however, if we consider the high proportion of biographical sentences in Wikipedia and Chambers (85%+).
It is possible that the key-keyword methods are capturing corpus specific
features, rather than genre specific features, and that frequency lists derived from corpora of a given genre of interest provide a better insight
into that genre. In other words, a frequency list derived from a corpus
of a given genre may reflect that genre’s characteristics better than key-keywords, which are too specific to the topic orientated idiosyncrasies of
the corpus.
It is possible that, for the WordSmith key-keyword feature sets, the biographical texts may have been too short to generate expected frequencies
greater than or equal to five for each feature (necessary to ensure the reliability of the feature selection). This problem is addressed in the
“further work” section in the concluding chapter.
9 Wikipedia contains a variety of national types of English. See:
http://en.wikipedia.org/wiki/Wikipedia%3AManual_of_Style Accessed on 02-01-07.
Another interpretation of the results is that the nature of the current task
— sentence classification rather than document classification — may not
be suitable for a key-keyword approach as genre analysis techniques are
only appropriate at the document level. Indeed, the Systemic Functional
Linguistics tradition (see Section 2.1.1 on page 8) holds that single sentences cannot be described as having a genre. Rather (according to System Functional Linguistics theory), genre is a phenomenon of the discourse level.
On the basis of the work presented in this section, key-keyword methodologies
are not suitable techniques for the identification of unigram features for biographical sentence classification, as simple frequency counts provide better
performance. This result is surprising, however, as the opposite result — that
key-keywords would prove to be a better feature set than frequent unigrams —
was expected. The topic needs further work, however, before definitive conclusions can be drawn.
9.5 Conclusion
This chapter has reported on investigations into feature sets for biographical
sentence classification. The investigation has centred on three themes: the utility of syntactic representations, the utility of non-standard lexical representations, and the utility of “keyword” based methods for the biographical
sentence classification task. The most important findings are:
“Bag-of-Words” style features augmented by syntactic features increase
classification accuracy for the biographical sentence classification task,
compared to the use of “bag-of-words” features alone (although not at a
statistically significant level).
Stemming increases classification accuracy compared to the use of plain
frequencies (although not at a statistically significant level).
The use of key-keyword based methods provides lower classification accuracy than the use of frequent unigrams alone.
The next chapter examines the portability of feature sets for the biographical classification task.
Chapter 10
Portability of Feature Sets
This chapter explores the portability of the biographical features identified by
Zhou et al. (2004), who identified 5062 unigram features from the University of Southern California biographical corpus (USC) (see Section 4.3.6 on
page 94), and assessed these features using the USC corpus as a test/training
set. Classification accuracy for the USC derived features on the USC corpus of
biographical sentences was very high at 82.45%. This chapter explores whether
those 5062 unigrams are portable for use in classifying other biographical sentence corpora, in this case the “gold standard” data constructed as part of this
work (see Chapter 5 on page 101). The issue of whether unigram features ought
to be derived from the same corpus that is used for testing and training data,
is also addressed. The chapter is divided into five sections: motivation, experimental procedure, results, discussion and a brief conclusion.
10.1 Motivation
The identification of a feature set that performs well in a variety of different biographical sentence classification situations is important if we are to have confidence applying that feature set to the biographical sentence classification task
generally. Zhou et al. (2004) tested various feature sets (using the USC corpus as a test/training set) for the binary biographical sentence classification task
using the Naive Bayes classification algorithm.1 The feature sets used included
bigrams and trigrams (all derived from the USC corpus). The best performing
feature set consisted of all those unigrams that occurred within the biographically tagged clauses of the USC corpus. These unigrams were used on the
intuition that they would provide exemplary biographical unigrams and that
limiting unigrams to those which occur in biographical clauses would reduce
1 Note that the USC ten class biographical annotation scheme can be reduced to a binary scheme
simply by regarding each sentence which contains a tagged biographical clause as biographical, and
a sentence that contains no such clause, as non-biographical.
the number of features in the feature set that do not contribute to classification accuracy. One danger in such an approach, however, is the possibility that
the unigrams harvested from the USC biographical clauses are too specific to
that corpus, and will not “port” well for classifying other biographical data.
The USC corpus, while it contains biographical text, and was designed as a
biographical corpus, is limited to short web biographies of only a few major
historical persons (for example, Marilyn Monroe, Martin Luther King, and so
on). It is possible that web biographies form a sub-genre of short biographies that does not represent the entire range of short biographies adequately. Additionally, the use of only a handful of biographical subjects could result in derived
features that are too specific to those individuals. For example, Monroe is included in the 5062 unigram features, yet it is not obvious how the inclusion
of a “Monroe” unigram feature would aid a general purpose biographical sentence classifier.
In this chapter, in order to test the portability of the 5062 USC derived features
identified by Zhou et al. (2004), we use these 5062 unigram features in conjunction with the Naive Bayes classification algorithm to classify the gold standard biographical sentences developed as part of the current research project
and described in Chapter 5 on page 101. In order to provide a point of contrast against which we can judge the performance of the USC derived features,
we use a frequency list of 5062 unigrams derived from biographical dictionaries.
10.2 Experimental Procedure
Our first step required identifying all those unigrams found within tagged biographical clauses in the USC corpus, which was achieved using UNIX text
processing utilities. See Figure 10.1 on the following page for an illustration
of the biographical unigram extraction process. The second step involved the
identification of a baseline against which the USC features could be assessed.
The baseline feature set consisted of the 5062 most frequent unigrams from a
set of texts constructed from two biographical dictionaries. The text collection
consisted of 320,000 word tokens and was collected from the following two
sources:
100,000 word tokens from the Chambers Biographical Dictionary (see Section 4.3.2 on page 89).
220,000 words from the Dictionary of National Biography. Only entries of
less than six hundred words were used, on the intuition that these entries
would contain less historical and political background information, and
more explicitly biographical material. See page 87 for a description of the
Dictionary of National Biography.
Figure 10.1: Biographical Unigram Extraction from the USC Corpus

Extract from USC Corpus:
Martin Luther King, Jr., <bio>(January 15, 1929 - April 4, 1968)</bio> <bio>was born</bio> Michael Luther King, Jr., but later had his name changed to Martin. His grandfather began the family’s long tenure as pastors of the Ebenezer Baptist Church in Atlanta, serving from 1914 to 1931; his father has served from then until the present, and from 1960 until his death Martin Luther acted as co-pastor. <edu>Martin Luther attended segregated public schools in Georgia, graduating from high school at the age of fifteen</edu>; <edu>he received the B. A. degree in 1948 from Morehouse College</edu>, a distinguished Negro institution of Atlanta from which both his father and grandfather had been graduated.

Extracted Unigram Features:
january, 15, 1929, april, 4, 1968, was, born, martin, luther, attended, segregated, public, schools, in, georgia, graduating, from, high, school, at, the, age, of, fifteen, he, received, BA, degree, 1948, morehouse, college, his
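The extraction step itself is straightforward; a small Python sketch is given below as a stand-in for the UNIX pipeline actually used, assuming the clause markup takes the <bio>...</bio> and <edu>...</edu> form shown in the figure (the real corpus format may differ).

    # Sketch: collect the unigrams occurring inside biographically tagged clauses.
    # Tag names follow Figure 10.1; they are assumptions about the corpus markup.
    import re

    def biographical_unigrams(text, tags=("bio", "edu")):
        unigrams = set()
        for tag in tags:
            for clause in re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.S):
                unigrams.update(re.findall(r"[a-z0-9]+", clause.lower()))
        return unigrams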
A much larger text collection could be used as the basis for the frequency
counts (the Dictionary of National Biography consists of almost thirty four million word tokens), but only a small subset of the Dictionary of National Biography
was used in order that a high proportion of non-Dictionary of National Biography biographical text — text from the Chambers Biographical Dictionary — could
be included in the corpus. This decision was made to prevent the resulting frequency list reflecting the idiosyncrasies of the Dictionary of National Biography, rather than being representative of short biographies more generally. A frequency list from the two biographical corpora was obtained and the most frequent 5062 unigrams retained, to create an equal number of features to those derived from
the USC corpus. Each feature set was then used in a 10 x 10 stratified cross
validation using the Naive Bayes learning algorithm2 on the gold standard biographical data developed as part of this research project (see Chapter 5 on
page 101).
The WEKA implementation of the Naive Bayes learning algorithm was used.
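For orientation, the cross-validation set-up can be sketched as follows (scikit-learn is used here as a stand-in for the WEKA experimenter; X is the sentence-by-feature matrix for one of the two 5062-feature sets and y the gold standard labels).

    # Sketch: ten repetitions of stratified 10-fold cross-validation with Naive
    # Bayes, returning the mean accuracy over the 100 train/test splits.
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from sklearn.naive_bayes import MultinomialNB

    def mean_cv_accuracy(X, y):
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
        scores = cross_val_score(MultinomialNB(), X, y, cv=cv, scoring="accuracy")
        return scores.mean()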
Note that in this chapter we are interested in the portability of feature sets
for the biographical sentence classification task, rather than the portability of
Figure 10.2: Comparison of the Performance of Unigrams Derived from USC
Annotated Clauses and Biographical Dictionary Unigram Frequency Counts.
trained classifiers (models).
10.3 Results
The mean results of the 10 x 10 fold stratified cross validation are presented
in Table 10.1 on the next page. Note that the means for the two feature sets
are almost identical for both the Naive Bayes and SVM algorithm. The accuracy score for each run of the 10 fold stratified cross validation is shown in
Figure 10.2, where it can be seen that the mean figures reported do not mask
highly deviated results. Table 10.1 shows that the classification accuracies for
each feature set are almost identical (only 0.03% separates them). The difference between the two feature sets was subject to statistical testing, using the
corrected re-sampled t-test, and it was found that there was no statistically significant difference between the two feature sets (see Figure 10.3 on the next
page for the experimental and null hypotheses).
2 The Support Vector Machine (SVM) algorithm was also used.
Table 10.1: Classification Accuracies of the USC and DNB/Chambers Derived Features on Gold Standard Data.

Feature Set      Mean Accuracy Naive Bayes (%)   Mean Accuracy SVM (%)
USC Features                 76.61                       79.33
DNB/Chambers                 76.58                       79.32
Figure 10.3: Experimental and Null Hypotheses: USC and Biographical Dictionary Derived Features
Experimental Hypothesis: There is a difference in classification accuracy between a feature set based on 5062 unigrams derived from
biographical clauses in the USC corpus and the most frequent
5062 unigram features in a sample of the DNB and Chambers biographical dictionaries.
Null Hypothesis: There is no difference in classification accuracy between the 5062 unigram feature set derived from biographical
clauses in the USC corpus and the most frequent 5062 unigram
features in a sample of the DNB and Chambers biographical dictionaries.
10.4 Discussion
The results gained in this chapter show that the feature identification strategy
adopted by Zhou et al. (2004) — using only those unigrams that appear in
biographical clauses in the USC corpus as features — provides a similar level
of portability to the use of frequent unigrams derived from the Dictionary of
National Biography. In other words, the features identified by Zhou et al. (2004), when ported for use on other biographical data, perform at an almost
identical accuracy level as “plain” frequent unigrams which can be identified
automatically from biographical corpora and require no intensive annotation
effort. There are at least four possible reasons for the fact that these “hand
identified” unigram features do not provide superior results:
1. The USC corpus consists of numerous biographies of the same small set
of individuals (for example, Marilyn Monroe, Einstein and so on). This
means that many person specific unigrams (for example; names, birthplaces and so on) would appear repeatedly in the biographical clauses,
reducing the variability of the resulting frequency list. That is, if we have
one hundred different individuals, we could conceivably have one hundred different birth places. In contrast, if we have one hundred biogra-
phies of one individual, we are likely to have only one birthplace, thus
reducing the number and variety of features.
2. All biographies used in the USC corpus are harvested from the web. It
is possible that the particular constraints imposed by web publishing fail
to reflect the qualities of short biographies more generally.
3. The USC corpus is too small (at approximately 170,000 word tokens) and
the number of annotated biographical clauses too few to provide a list of
representative biographical unigram features. It is possible that in order
to gain better features, and thus improve classification accuracy, it would be necessary to increase the size of the corpus, which in turn would require more annotation effort. It is notable that the text sample taken from DNB/Chambers was also relatively small (although approximately twice the size of the USC corpus). It is possible that increasing the size of both the USC corpus and the sample from DNB/Chambers may affect classification accuracy.
4. It is possible that the inconsistent biographical tagging employed in the
USC corpus reduced the quality of the derived feature set for the purposes of biographical sentence classification. That is, in some biographies, only biographical words are tagged rather than clauses, resulting
in a unigram feature set that perhaps excludes biographically important unigrams (see Figure 4.1 on page 95 for some examples of this inconsistent annotation, taken from the Curie section of the USC Corpus).
The results obtained were surprising, as it was initially thought that there was
likely to be some increase in classification accuracy using Zhou et al. (2004)’s labour intensive feature identification process, compared to a simple unigram frequency list derived from biographical dictionaries. It was also thought that the very high accuracy score — 82.42% — achieved by Zhou et al. (2004) on the USC corpus might be reduced when the feature set was tested on the gold standard data. That is, it was expected that when the 5062 features derived from USC biographical clauses were applied to another corpus of biographical data, the classification performance would drop, but not to the extent that it would be almost identical to that of the “baseline” biographical dictionary frequencies feature set.
On the face of it, the almost equal performance of the two feature sets — the USC derived features and the biographical dictionary frequencies — could be
seen to indicate that there is some upper ceiling on the performance of unigram
features for biographical sentence classification. This theory is belied, however,
when we consider that previous chapters have shown that we can achieve classification accuracy above 76.2% on the gold standard data with unigram based
methods. For example, a feature set consisting of only five hundred frequent
unigrams from a corpus of Wikipedia/Chambers biographies achieved an accuracy of 81.25% on the gold standard data using the Naive Bayes learning
algorithm (see Table 9.3 on page 172).
10.5 Conclusion
This chapter has compared the best performing feature set identified by Zhou et al. (2004) to an equally sized feature set consisting of frequent unigrams derived from a sample of the Dictionary of National Biography and the Chambers Biographical Dictionary. When the two feature sets were compared using 10 x 10 fold cross validation, using the gold standard corpus developed in Chapter 5 on page 101 and the Naive Bayes algorithm, the performance of the feature sets was almost identical (a difference of only 0.03% for Naive Bayes and 0.01% for the SVM). This suggests that the strategy adopted by Zhou et al. (2004) for the identification of appropriate biographical features, while it delivers high classification accuracy for
the USC biographical corpus, does not endow any additional benefits above
and beyond the use of straightforward unigram frequencies derived from biographical dictionaries, when applied to alternative biographical data.
Chapter 11
Conclusion
This thesis presented and explored the general hypothesis that biographical writing can be reliably identified at the sentence level using automatic methods. This concluding chapter summarises the thesis in terms of contributions
made, before outlining areas for possible future work.
11.1 Contributions
The general claim that biographical writing can be identified at the sentence level
using automatic methods is broken down into two sub-hypotheses:
Hypothesis 1 Humans can reliably identify biographical sentences without
the contextual support provided by a discourse or document structure.
Hypothesis 2 “Bag-of-words” style sentence representations augmented with
syntactic features provide a more effective sentence representation for
biographical sentence recognition than “bag-of-words” style representations alone.
Hypothesis 1 is addressed in Chapters 5 and 6, while the machine learning
chapters — Chapters 7, 8, 9 and 10 — are concerned with Hypothesis 2 and the general hypothesis that biographical writing can be reliably identified at the sentence level using automatic methods.
The contributions made by the thesis can usefully be divided into two main
groups, reflecting the two sub-hypotheses.
The main hypothesis (and two sub-hypotheses) provides a framework for the
thesis, but other research questions are addressed within that framework (for
example, the utility of the key-keywords methodology; see page 21).
11.1.1 Hypothesis 1, Annotation Scheme and Human Study
An annotation scheme for biographical sentences was developed (Chapter 5).
The scheme was heavily influenced by existing schemes (like the Text Encoding Initiative biographical scheme, and the biographical guidelines used to
construct the Dictionary of National Biography). The new scheme was specifically designed to identify the kind of biographical sentences that occur in short
biographical summaries (like Wikipedia biographical entries). It is demonstrated with numerous examples that the scheme delivers excellent coverage
for the texts of interest. The annotation scheme is also validated by the human study reported in Chapter 6, where it is shown that there is a good level
of agreement between annotators asked to classify sentences according to the
scheme.
A biographical corpus was developed as part of this work (Chapter 5), based
on the new annotation scheme. The corpus, although not large, is constructed
from various sources, including news text from the Guardian newspaper and
extracts from the STOP corpus. As with the annotation scheme, the biographical
corpus is annotated at the sentence level.
A human study (Chapter 6) was conducted which involved an online experiment with twenty-five participants. The study demonstrated that human classifiers can agree whether a sentence is biographical or non-biographical (given
the annotation guidelines developed in Chapter 5) with good reliability. That is,
agreement between participants in the study on the status of sentences as biographical or non-biographical was good, using an appropriate variant of the
Kappa agreement statistic.
The cumulative force of Chapters 5 and 6 is to provide strong evidence in
support of Hypothesis 1 (Humans can reliably identify biographical sentences
without the contextual support provided by a discourse or document structure). Chapter 5 describes a set of clear guidelines for identifying biographical
sentences (that is, the annotation scheme) and Chapter 6, validates that decision procedure, showing that people are able to identify biographical sentences
with good reliability.
11.1.2 Hypothesis 2, Automatic Biographical Sentence Classification
Chapter 7 explores the performance of six learning algorithms using a 10 x 10
cross validation methodology (see Section 2.5.2) employing “gold standard”
data derived from the biographically annotated corpus described in Chapter
5 (and validated in the human study reported in Chapter 6). The six different algorithms were Naive Bayes, a Support Vector Machine classifier, the C4.5
decision tree algorithm, the Ripper rule learning algorithm, the “One Rule”
algorithm, and a baseline that classified all test data as belonging to the most
frequent class in the training data (the “Zero Rule” algorithm). On the basis
of the experimental work in this chapter, Naive Bayes was the best performing algorithm, but not at a statistically significant level compared to the second
most accurate algorithm, the Support Vector Machine classifier. It should be
noted, however, that the Naive Bayes algorithm performed significantly better than the other learning algorithms tested, apart from the Support Vector Machine classifier. Additionally, the two most successful algorithms (SVM
and Naive Bayes) are used in all the machine learning experiments, and while
Naive Bayes performs better in most instances, there are some feature sets
where the Support Vector Machine classifier performs better (for example, trigrams derived from the Dictionary of National Biography — see Table 9.1 on
page 163).
Chapter 9 explores a core theme of the thesis, that topic neutral syntactic features are useful for biographical sentence classification. The thesis has characterised biographical sentence classification as a genre classification problem,
where topic neutral features (in this case syntactic features) are useful. The
work reported in Chapter 9 shows that syntactic features (identified empirically from the data produced by Biber (1988)) increase classification accuracy,
albeit not at a statistically significant level.
Chapter 9 provides some limited support for the contention that n-grams (where n > 1) increase classification accuracy for genre classification tasks, as n-grams provide a low effort strategy for encoding syntactic data (hence the description “pseudo-syntactic features”). This support is weak, however. The difference between the baseline unigram representation and the same representation augmented by bigrams was less than 1%; far too small to be judged significant using the corrected re-sampled t-test. Additionally, this chapter suggests that Scott and Matwin (1999)’s contention that it is “probably not worth pursuing simple phrase based representations further”, while it may apply to topical text categorisation, does not apply to biographical sentence classification.
Chapter 9 provides strong support for the view that non-topical features are
important for the biographical classification task. A baseline feature set of
2000 frequent unigrams was tested against the same feature set with all function words removed. The difference between the performance of the two feature
sets was highly significant. This result is in line with the view of biographical sentence classification as a genre classification task, where topic neutral
stylistic features (like function words) are very important. If function words
were irrelevant and only topic related words important, then there would be
no substantial difference between the classification performance of the two feature sets. Chapter 9 shows that for the biographical sentence classification task,
stemming (that is, reducing morphologically complex words to a canonical form)
slightly improves classification accuracy, compared to the use of “plain” unigrams. The difference was not statistically significant, however.
Chapter 9 suggests that the key-keyword based methods (described in Section 8.4 on
page 152) do not provide the optimal feature selection method for the biographical classification task, as unigram frequencies derived from a biographical corpus performed better. The method was originally developed as a genre
analysis tool, yet the paucity of topic neutral features (that is, function words) generated by the key-keywords method suggests that its usefulness as a genre analysis tool may be overstated.
Chapter 10 shows that the lexical feature selection method adopted by Zhou et al. (2004) performs at a near identical level to a feature set consisting of frequent unigrams automatically derived from biographical dictionaries. Zhou et al. (2004)’s feature set was derived from the USC corpus, which, as the reader will recall, is annotated at the clause level for biographical information. Zhou et al. (2004) derived unigram features from biographical clauses alone. That is, the only unigrams used were those that occurred within biographically tagged clauses. Zhou et al. (2004) achieved very high classification accuracy with this method using USC test data. Chapter 10 shows, however, that when the features identified by Zhou et al. (2004) were used to classify the gold standard data created for this work, classification accuracy is almost identical to that of simpler, frequency list based unigram feature sets, which can be derived automatically and do not require an extensive annotation effort.
11.2 Future Work
Taking the work reported in this thesis as a starting point, this section suggests
areas for possible future research. These fall into five broad areas:
1. Implementations of a biographical sentence classifier as a module within
a wider system.
2. Improving binary biographical sentence classification.
3. Extensions of the biographical sentence classification techniques described
in this thesis both to other genres, and to whole document classification.
4. Extending the use of empirically identified syntactic features to other
text classification problems (for example, gender and age based classification).
5. Investigating methods of genre analysis in the light of issues raised by
the current work.
Each of these five areas for future work is now examined in turn.
11.2.1 Biographical Sentence Classifier Module
A biographical sentence classifier based on the optimal performing feature
set/classification algorithm combination (that is, unigrams and syntactic features) trained using the data created as part of this project (and perhaps additional data) could be used as part of a biographically orientated Multiple
Document Summarisation system. Of course, a biographical sentence classifier would be only a small component of an effective Multiple Document Summarisation system, as redundancy removal, temporal ordering of output and
so on, would still be required (see Chapter 3 on page 53).
A biographical sentence classifier could serve as a useful tool in the context
of journalistic research, where it is often important to identify biographical
sentences in vast amounts of text. A biographical sentence classifier could
be incorporated into a system that highlights sentences of biographical interest in electronic texts, allowing a journalist or researcher to identify sentences
of interest without reading an entire — perhaps lengthy — article or document.
11.2.2 Improving Biographical Sentence Classification
It is possible that a larger corpus of training data may improve results. Of
course, this would require a significant annotation effort. It would be interesting to see what kind of improvements (if any) were gained by trebling or
quadrupling the volume of training data.
It is possible that increasing or decreasing the number of features used may
increase classification performance. Note that a feature set of 500 unigrams
produced the greatest accuracy in this study (see Table 9.3 on page 172). It is
also possible that performance could be improved using a mixed model feature
set consisting of unigrams and syntactic features.
It is possible that the ratio of biographical to non-biographical sentences in the
training data may need to be changed in order to construct an effective model
for situations in which biographical sentences are very sparse. The current
training data consists of 50% biographical sentences and 50% non-biographical
sentences. It would be of interest to assess the performance of different ratios of biographical to non-biographical training data on varying sources of
text.
11.2.3 Extensions to the Biographical Sentence Classification
Task
Instead of biographical sentence classification, similar methods (that is, the use
of unigrams augmented by empirically identified syntactic features) could be
used for biographical document classification. As genre is primarily a discourse level, rather than sentence level, phenomenon according to Systemic Functional Grammar theory, and to a lesser extent Multi-dimensional Analysis (see Chapter 2), it is hypothesised that the use of syntactic features should increase classification accuracy to a higher level than that achieved for sentence classification.
Documents could be given a biographical “score” reflecting their biographical
content which could be useful in the context of search engine results.
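One simple realisation of such a score is the proportion of a document's sentences that the sentence-level classifier labels biographical. The sketch below assumes that a trained classifier and a sentence splitter are available; both names are placeholders rather than references to any particular implementation.

    # Sketch only: score a document by the fraction of its sentences
    # that the sentence-level classifier labels biographical.
    def biographical_score(model, split_into_sentences, document_text):
        sentences = split_into_sentences(document_text)
        if not sentences:
            return 0.0
        labels = model.predict(sentences)
        return sum(1 for label in labels if label == 1) / len(sentences)

A search engine front end could then rank or annotate documents by this value.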
Instead of the binary classification analysed in the current work, the fine grained
biographical annotation scheme and tagged corpus developed during this work
and described in Chapter 5 could be used to assess the feasibility of classifying
sentences according to biographical type (that is, into one of the six biographical categories: key information, fame, character, relationship, education and work).
11.2.4 Other Text Classification Tasks
Although not directly applicable to biographical or genre classification, the use
of syntactic features as indicative of non-propositional (stylistic) content has
been important in this thesis. There are several directions in which stylistic
classification could be developed; the traditional case is stylometry (authorship attribution) outlined in Chapter 2. Two interesting applications are gender
classification (already explored to a certain extent by ARGAMON ET AL. (2003)),
and author age text classification (that is, the attempt at discerning stylistic features that discriminate between younger and older people). Both these applications would require the construction of an appropriate corpus.
11.2.5 Genre Analysis
Section 2.1.2 on page 21 suggests that the key-keyword method, used by TRIBBLE (1998) as an alternative to Biber-style multi-dimensional analysis, does not
capture the distinctive features of a given genre as function words are underrepresented. It would be worthwhile to explore the usefulness of the key-keyword
approach in other genre classification tasks (for example, the identification of
news text) in order to test whether similar results were obtained across different genres. Additionally, different statistical methods (that is, feature selection
methods) could be used in conjunction with a larger biographical corpus consisting of lengthier biographical documents in order to more fully explore the
limits of key-keywords based techniques. It would also be useful to repeat the
key-keyword experiments with a multi-genre reference corpus consisting of
British rather than American English. This work could utilise the WordSmith
software.
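For concreteness, the sketch below gives a generic implementation of the log-likelihood keyness statistic on which keyword (and hence key-keyword) lists are commonly based, comparing a word's frequency in a target corpus with its frequency in a reference corpus. It is not a description of the WordSmith software itself; the inputs are assumed to be plain word-frequency dictionaries and total token counts.

    # Sketch only: Dunning-style log-likelihood keyness of a word in a target
    # corpus relative to a reference corpus.
    import math

    def keyness(word, target_freq, target_total, ref_freq, ref_total):
        a = target_freq.get(word, 0)
        b = ref_freq.get(word, 0)
        expected_a = target_total * (a + b) / (target_total + ref_total)
        expected_b = ref_total * (a + b) / (target_total + ref_total)
        ll = 0.0
        if a:
            ll += a * math.log(a / expected_a)
        if b:
            ll += b * math.log(b / expected_b)
        return 2 * ll

A key-keyword analysis would then count, for each word, the number of individual documents in which its keyness exceeds a chosen threshold, rather than relying on a single corpus-level comparison.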
11.3 Conclusion
In summary then, this thesis has addressed the issue of whether biographical
writing can be reliably identified at the sentence level using automatic methods, using first a human study, which established that people could perform
the task with good agreement, before going on to consider whether the task
could be performed automatically using machine learning algorithms and an
appropriate feature representation. The later chapters of the thesis focused on
exploring possible feature representations for the task, and weak evidence was
discovered that “bag-of-words” style unigram features, augmented by syntactic features, perform better than “bag-of-words” style features alone. In other
words, syntactic (stylistic) features may well be useful for biographical sentence
classification.
APPENDIX A
Human Study: Pilot Study
A.1 Introduction
This appendix describes in detail the pilot web-based human study experiment reported in Section 6.3 on page 130. Recall that the experiment was designed to determine human ability in distinguishing between biographical and
non biographical sentences from a number of different data-sources (including
The Dictionary of National Biography and the TREC corpus). Fifteen participants
were involved in the study. After reading the provided instructions, the participants were invited to categorise 100 sentences as either core biographical,
extended biographical, or non-biographical. The first section reproduces task
instructions, the second section presents the 100 sentences used in the study,
and the final section sets forth the classification data collected.
A.2 Task Description
The online questionnaire should take less than 20 minutes to complete; that includes time for reading the instructions and answering 100 short, simple questions.
The task is to classify each of the 100 sentences below into one of three categories: 1) core biographical, 2) extended biographical, 3) non biographical.
Each question consists of a sentence, which the participant classifies as belonging to one of the three categories. The three categories are described in turn below.
A.2.1 Core Biographical Category
If a person is mentioned, is that person the subject of the sentence? Is the
sentence about that person? Is the purpose of the sentence to inform us about
that specific individual? Is the sentence designed to give information about
who the person is? Relevant information here could be details of birth and
death dates, education history, nationality, employment, achievements, marital
status, number of children, etc. If the central purpose of a sentence is to convey
information about an individual, then that sentence can be classified as core
biographical. Note that the person doesn’t have to be mentioned by name,
“he” or “she” is adequate, as long as it’s clear from the context that all the
“he”-s, “she”-s, “him”-s and “her”-s refer to the same person. Here are three
examples of sentences that have been classified as core biographical:
He was jailed for a year in 1959 but, given an unconditional pardon,
became Minister of National Resources (1961), then Prime Minister (1963), President of the Malawi (formerly Nyasaland) Republic
(1966), and Life President (1971).
His intellect, wit and love of France are reflected in his third novel,
Flaubert’s Parrot (1984), in which a retired doctor discovers the
stuffed parrot which was said to have stood upon Gustave Flaubert’s
desk.
Ann West was born at New Scone, Perthshire, Scotland, on 17 May
1825, the daughter of Mary Brough and her husband, John West, a
cotton handloom weaver.
A.2.2 Extended Biographical Category
If a person is mentioned, is that person incidental to the meaning of the sentence? Is the sentence about something else (say, an event or organisation)
and the person just mentioned in passing? The distinction between extended
biographical sentences and core biographical sentences is that in extended sentences, while a person is mentioned (either by name or by “he”, “she”, “him”
or “her”), the sentence isn’t about them. Here are two examples of sentences
that have been classified as extended biographical.
“This new consumer is a pretty empowered person,” said Wendy
Everett, director of a study commissioned by the Robert Wood Johnson Foundation.
At last year’s Conference on Retroviruses and Opportunistic Infections, Dr. David Ho and others from the Aaron Diamond AIDS
Research Center at Rockefeller University presented evidence that
the virus probably first infected humans in the 1940s or early ’50s.
A.2.3 Non Biographical Category
Non biographical sentences are easy to identify because they don’t contain the
names of people or references to people (“he”, “she”, “him”, “her”). Here are
two examples of sentences that have been classified as non biographical:
Of the 6 million notebooks Taiwan turned out last year, Quanta produced 1.3 million sets, accounting for about 8 percent of the world
output.
A.3 Task Questions
1. He lies buried in an obscure corner of the Little Neck burial-ground at Bullock’s Cove, Swansey, Rhode Island.
2. The Long Parliament tried to have him restored in 1641–2, but without effect,
and from 1644 onwards Sir Edmond Prideaux [q.v.], later attorney-general under the Commonwealth, was somewhat precariously ensconced as postmaster-general.
3. Catastrophic coverage paying all costs would kick in after $4,000 in annual
out-of-pocket spending by a beneficiary.
4. Born in Widnes, Lancashire, he studied at the universities of Liverpool and
Cambridge
5. He shot dead six people and wounded another seven.
6. Daoud Mohammed, a 28-year-old soldier, was resting, surrounded by dozens
of Kalashnikov rifles, rocket launchers and boxes of ammunition.
7. He was taken prisoner by the Japanese when Singapore fell and died in a
prison camp in Formosa.
8. He was consular chaplain to the British residents at Monte Video from 6 May
1854 to 31 December 1858.
9. He must have survived his father, if at all, only a short time, as his widow
married Robert de Ros in 1191, and the date of his father’s death being uncertain it may be doubted whether he succeeded to Annandale.
10. Born in Karlsruhe, he developed a two-stroke engine from 1877 to 1879, and
founded a factory for its manufacture, leaving in 1883 when his backers refused
to finance a mobile engine.
11. On 3 May 1823 he was admitted commoner of St. Edmund Hall, Oxford.
12. His mother, who came of a Yorkshire family, was a foundation member of the Independent Labour Party and the British Communist Party, a Cooperator, and a member of the Ashton and District Weavers’ Association until
she died.
13. Young walked to force Buford home, and Sosa added two insurance runs
with a double to right. Rick Aguilera got his 17th save, while Daniel Garibay
(2-3) was the winner.
14. His family moved to England during the Franco-Prussian War, and settled
there in 1872.
15. He moved to Paris in 1829.
16. In 1679 he brought an accusation against the Duchess of Richmond, which
on investigation proved to be false, and he was forbidden to attend the court.
17. It would eliminate some 240 miles of levees and canals as well as construct above ground reservoirs, underground aquifers and develop new wetlands.
18. He entered the corps of Royal Military Surveyors and Draughtsmen as
cadet on 20 Aug. 1808, and became a favourite pupil of John Bonnycastle, the
mathematician.
19. Johnson attributed this in large part to President Clinton’s silence on the
matter until recently.
20. After the Revolution he became curator of paintings at the Hermitage Museum, but in 1928 settled in Paris.
21. Bove said he would appeal any sentence and vowed to continue his battle
internationally
22. In 1649 and 1651 he was charged with conveying money, letters, and intelligence to the Royalists overseas, and acquitted on both occasions.
23. Of his sons, Henry St. Clair is noticed separately; another son, J. Murray
Wilkins, was the last rector of Southwell collegiate church before it became a
cathedral.
24. Born in Philadelphia, Pennsylvania, he worked as a journalist and magazine editor before turning to fiction.
25. Clinton had appointed Ward to a judgeship in 1989, and Ward also was a
Democratic state representative when Clinton was Arkansas governor.
26. He took part in local religious and philanthropic work, edited a controversial magazine, the Watchman’s Lantern and in 1849 entered the Liverpool
town council.
27. He argues strenuously against the mass, and inveighs against the medieval
practice of regarding the mass as a vicarious and solitary sacrifice, at each celebration, of the one atoning death, but always holds that Christ is present with
all His benefits in the sacrament, that the elements of bread and wine are not
bare and naked signs of the body and blood of Christ.
28. Here he played the title-part in Cyrano de Bergerac; but his excursions into
romance were not appreciated by the public.
29. The Post, saying it had obtained a copy of the report, said in Sunday
editions that the 200-page document makes a direct correlation between the
vulnerability of things like the Lincoln Memorial and Washington Monument
and funding for the U.S. Park Police, the law enforcement arm of the park service.
30. The profits from this enterprise enabled him to set up his own small ironworks at St Pancras in London
31. He was the adopted son of the astrologer, William Lilly, who constantly
makes reference in his works to Coley’s merit as a man and as a professor of
mathematics and occult science.
32. At times he speaks as an eye-witness, especially in his account of the foreign
expeditions in which he took part. He quotes at some length the speeches of
the king, the petitions or remonstrances of the parliament, and other original
documents.
33. He devised equations which enabled both the thermal energy and that due
to baroclinicity to be calculated for a developing cyclone.
34. Two people have been killed and at least another 80 injured after a terrace
at an island winery in Lake Erie collapsed this afternoon.
35. It also could serve as a motto for the Tour, still trying to recover from a
doping scandal that nearly did in the 1998 competition and sullied the image
of a beloved summer ritual.
36. Had Curry been found guilty of the sexual assault charge, he would have
faced a possible 20-year-term.
37. Dubthach Maccu Lugir, 5th cent termed in later documents mac hui Lugair, was chief poet and brehon of Laogaire, king of Ireland, at the time of St.
Patrick’s mission
38. He was born in Örebro, a place he frequently satirized in later life, taking revenge for the humiliation he had suffered as a stout and painfully shy
youth.
39. Clergy living in concubinage within his diocese were to be deprived of
their benefices; all candidates for ordination were to take a vow of chastity; the
unworthy were to be excluded from ordination; charity and hospitality were
enjoined on rectors; tithes were to be paid regularly; detainers of tithes were
to be severely punished (cf. Ann. Tewkesbury, pp. 148, 149); vicars were to be
priests and hold only one cure; non-residence was condemned; deacons were
forbidden to hear confessions, impose penances, or baptise, save in emergencies; confirmation was to follow one year after baptism.
40. President Clinton was sued Friday by an Arkansas Supreme Court committee seeking to strip him of his law license.
41. This is probably the better likeness, bearing witness to his son-in-law’s
description of him he was of a fair, fresh, ruddy complexion, temperate in his
diet, fasting often.
42. The information about the Horman case was contained in a release of 505
previously classified documents, most from State Department files.
43. In early studies of the North Sea plaice population he noted its remarkable constancy, despite the high natural mortality rates of the early stages of
fish.
44. The draft program calls for a minimum of 5 percent growth annually,
which would lead to a 150 percent increase in the gross domestic product by
2010.
45. Fidel Castro’s government launched a new series of demonstrations Saturday in the wake of Elian Gonzalez’s return, calling out more than 300,000
people from across eastern Cuba to protest U.S. policies that it says harm this
island’s citizens.
46. 1. A Synoptical Table of British Organic Remains, 1830, 8vo and 4to, in
which, for the first time, all the known British fossils were enumerated.
47. Children are washed infrequently in basins.
48. As a blindfold player he was not surpassed even by Blackburne, and as an
analyst he probably had no equal.
49. Pelling was a stout defender of the Anglican church against both Roman
catholics and dissenters.
50. Superseded in papal favour by the sculptor Alessandro Algardi, Bernini
concentrated on private commissions, the most famous of which is the Cornaro
Chapel in the Church of Santa Maria della Vittoria
51. Besides his wife and niece, survivors include a brother-in-law, two sisters-in-law, and 17 other nieces and nephews
52. An interest in gunshot wounds led him to treat the wounded from the
Battle of Corunna (1809), and after Waterloo he organized a hospital in Brussels.
53. Airlines last year staved off legislative action by promising to treat customers better and to be more forthright with passengers all the way through
their travel experience.
54. Whitehead’s last imprisonment was at the Poultry Compter, London, whither
the lord mayor, Sir Robert Jefferies, sent him on 11 Feb. 1685, for preaching at
Devonshire House.was given to the world in an anonymous tract, Thoughts
on General Gravitation, and Views thence arising as to the State of the Universe.
55. While Clinton has been campaigning across New York for a year, Lazio
didn’t formally join the race until May 20, the day after Republican Mayor
Rudolph Giuliani quit the contest because of prostate cancer.
56. Grants with these objects in view were made by the commission.
57. He was buried at Brompton cemetery on 26 June, when most of the prominent British chess players were represented at his graveside.
58. Wills was an unusually brilliant conversationist, and some of his more
ambitious poems show much of the dramatic power which descended to his
son, William Gorman Wills.
59. Those with incomes between 135 percent and 150 percent of poverty –
about $12,600 for an individual and $16,900 for a couple – would have their
monthly premiums subsidized on a sliding scale.
60. That was among the recommendations included in the four-member panel’s
report aimed at improving the agency’s personal search procedures.
61. Government budget cuts were partly to blame for the high numbers, the
report said.
62. Beneath the debate over policy differences, though, lie competing political
calculations.
63. He was educated at a school in Ayr and at the university of Edinburgh.
64. She is now receiving more attention from feminist critics, in the light of her
continual artistic struggle with the question of female experience.
65. He founded the Congress of Roman Frontier Studies in 1949, and was Professor of Roman-British History and Archaeology at Durham (1956-71), and
became founder-chairman of the Vindolanda Trust in 1970.
66. The state Legislature in 1995 allowed for the creation of charter schools,
which are outside the control of local boards of education and are free of many
state mandates and regulations.
67. Her conventional education at home was relieved by holidays with relations in Germany, during one of which visits she met Prince Aribert of Anhalt.
68. He was, however, prevented from proceeding further than Tirwill (probably Turovli on the Dwina), where he was imprisoned in irons for thirty-six
days, probably at the instigation of rival traders and ambassadors from Danzig,
Lubeck, and Hamburg, who, moreover, prevailed upon the king of Poland to
stop all traffic through his dominions of the English trading to Muscovy.
69. Witnesses reported seeing Carolyn Waldron reading a book and standing
in the middle of the platform moments before she fell onto the tracks about 2
a.m., police spokesman Alan Krawitz said.
70. Of several essays read by him before the Royal Irish Academy, one on the
Spontaneous Association of Ideas was said by Archbishop Richard Whately to
overturn Dugald Stewart’s theory on the same subject.
71. The villagers said they feared meeting Herrero inside of Eloxochitlan, a PRI
stronghold
72. His practical training started at his father’s mill, where he was given a lathe
and built small working steam engines.
73. He left a widow, Elizabeth, and three children, all under age.
74. His reputation as a preacher grew rapidly.
75. Twelve minutes earlier, Shui had won a penalty kick when she was hooked
by Simone Gomes.
76. His grandfather, William Blackman Ellis, artist, naturalist, and taxidermist,
who took him as a child for walks in Arundel Park, taught him much about
the flora and fauna of the area, and this background of a love of nature and
of skill in craftsmanship doubtless sowed the seed in him of a passion to perfect such love and skills in himself and, through teaching, to develop them in
others.
77. She first went to New York City in 1936, refused the offer of a staff position on Life magazine, and thereafter saw her work included in important
exhibitions in the USA.
78. The word was the Supremes were getting back together.
79. Economists cautioned that the durable-goods data tend to be volatile, but
worried that the report signified that the manufacturing sector has not cooled
off as much as many analysts had believed.
80. During his six-year term, Zedillo has overseen a series of democratic reforms, the most important of which was his decision to abandon the longstanding practice of having the outgoing president handpick his successor.
81. After seven months of bitter emotions and plenty of political heat, the case
of Elian Gonzalez finally was resolved under long-standing rules on parents’
rights and immigration law – rules that some say need to change.
82. Pete Harnisch, just off the disabled list, got his first victory of the season
and drove in the go-ahead runs with a bases-loaded single Friday night as the
Cincinnati Reds beat the Arizona Diamondbacks 5-4.
83. Nothing is known of his education except that he did not lay claim to any
degree.
84. The central element of the design, the sculpture depicting The Ecstasy of
Saint Theresa, is one of the great works of the Baroque period.
85. Born in Jagtvejen, Copenhagen, Denmark, she went to Tasmania with her
parents in 1891.
86. In reality however, his stature was tall.
87. The legislation also gives the government the power to force other local
councils to accept their share of refugee claimants to alleviate the pressure on
port towns.
88. Yet Harry Potter is the type of fad a mother can love.
89. The report said changing people’s behavior saves more lives than spending
money on expensive institutions and equipment.
90. Born in Grantchester, Cambridgeshire, England, the son of geneticist William
Bateson, he studied physical anthropology at Cambridge, but made his career
in the USA.
91. Medicare would operate a standard prescription drug benefit, the same
for everyone, with some help from benefit management companies that many
private health plans use.
92. With respect to his great work it has been pointed out that in his specific
definitions he was loose and unsystematic, but that passages in his prefaces
and descriptions are fine, and at the same time simple and natural.
93. Democrats also complained that such a broad bill was probably unconstitutional and doomed for repeal. The GOP version included year-round nonprofit
activity, whereas Democrats wanted to limit it to activity within a month or two
of an election.
94. Attack dogs, though, tend to be favored by neo-Nazis and other young
toughs, usually in low-income areas where the dogs are brandished like weapons.
95. But British bands Oasis and the Pet Shop Boys pulled out of their scheduled
Saturday night appearances
96. The Assembly, meanwhile, accepted a complex sex crimes bill containing
some provisions that its liberal Democratic members have had problems with
philosophically in the past.
97. Gaunt, Elizabeth, 1685, executed for treason, was the wife of William
Gaunt, a yeoman of the parish of St. Mary’s, Whitechapel. She was an anabaptist, and, according to Burnet, spent her life doing good, visiting gaols,
and looking after the poor of every persuasion.
98. It is said, but on no very certain authority, that he learnt engraving in
Denmark from Simon van den Passe, and in Holland from Hendrik Hondius,
and that he followed Hondius’s two sons to England.
99. Born in Sheffield, he started his career as a goalkeeper with Chesterfield and
Leicester City but was transferred to Stoke City because Peter Shilton (1949–)
was also on the Leicester staff.
100. When the Parliament was dissolved by military force, Allen was one of
the opponents bitterly attacked by Cromwell, and he was arrested by the army
for a short time.
A.4 Participant Responses
This section presents the responses of the fifteen participants. Rows represent
questions (that is, the one hundred sentences listed in the previous section)
and columns represent participants’ responses, with c corresponding to the
core biographical category, e to the extended category, and n referring to the
non-biographical category.
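For readers who wish to recompute agreement from this matrix, the sketch below gives a generic implementation of Fleiss' kappa for multiple raters and categorical labels. It is offered only as an illustration of how the raw responses can be processed, not as a record of the exact agreement calculation reported in Chapter 6.

    # Sketch only: Fleiss' kappa for a matrix of categorical judgements,
    # one list per sentence containing the labels ("c", "e" or "n")
    # assigned by each participant.
    from collections import Counter

    def fleiss_kappa(rows):
        categories = sorted({label for row in rows for label in row})
        raters = len(rows[0])
        counts = [Counter(row) for row in rows]
        # Observed agreement per sentence, then averaged.
        p_i = [(sum(c[k] ** 2 for k in categories) - raters) / (raters * (raters - 1))
               for c in counts]
        p_bar = sum(p_i) / len(rows)
        # Chance agreement from the overall category proportions.
        totals = Counter(label for row in rows for label in row)
        p_e = sum((totals[k] / (len(rows) * raters)) ** 2 for k in categories)
        return (p_bar - p_e) / (1 - p_e)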
Table A.1: Pilot Study Data. Rows are Questions and Columns are Participants 1 to 15 (left to right).

Q1: c e c c c c c c c c c c c c c
Q2: e e c c e c c c c e c c c c c
Q3: n n n n n n n n n n n n n n n
Q4: e e e c c c c c c c c c c c c
Q5: e n e c c e c c c c c c c c c
Q6: c c c c e e c c e e e c c c c
Q7: c n e c c c c c c e c c c c c
Q8: c e e c e c c c c c c c c c c
Q9: e e e c e c c c c c c c c c c
Q10: c e e e c c c c c c c c c c c
Q11: e n e c c c c c c c c c c e e
Q12: e n e e e c c c c e c e e e c
Q13: e c c e e e e e n e e c e c e
Q14: n e e c e c c c c e c e c c e
Q15: c e e c c c c c c c c c c c c
Q16: e e e c c c c c c e c c c c c
Q17: n n n n n n n n n n n n n n n
Q18: e e e c c c c c c c c c c c c
Q19: e c c c e e e e e e e e e e e
Q20: c e e c c c c c c c c c c c c
Q21: c c c c c e c e e c e c c c e
Q22: c e e c c c c c c e c c c c c
Q23: e c e e e e c e c e c e c c e
Q24: c e e c c c c c c c c c c c c
Q25: c e c e e e c c c e c e c e e
Q26: c e c c c c c c c c c c c c c
Q27: e e e e e e c e e e c e e e e
Q28: e e e c c e c c c c c c c c c
Q29: n n n n e n n n e e n n n n n
Q30: c e c e e e c c c e c c c c c
Q31: e e e e e c c c c e c e c c c
Q32: c e e c c e c c c c e c c c e
Q33: c e e e e e c c c c c c c c c
Q34: n n n n n n n n e n n n e n n
Q35: n n n n n n n n n n n n n n n
Q36: c c c c c e c c c e c c c c e
Q37: c c c c c c c c c c c c c c c
Q38: c e e c c c c c c c c c c c c
Q39: n n e n e n e n c e n e e n n
Q40: c c c c e e c c e c c c c c c
Q41: e n e c c e c e c n c e c c e
Q42: e e e n n n e n n n n n e n n
Q43: n n n e e n c e e c e e e c e
Q44: n n n n n n n n n n n n n n n
Q45: n c e n e e e n e e n e e n n
Q46: n n n n n n n n n n n n n n n
Q47: n n n n e n n n n e n n e n n
Q48: e e e c c e c c c c c c c c e
Q49: e c e c c c c c c c c c c c c
Q50: c n c e e e c c c e c c c c c
Q51: e e e e e c c c c e e e c c e
Q52: c e e c c c c c c c c c c c c
Q53: n n n n n n n n n n n n n n n
Q54: e c c e e e c c c e c c c c c
Q55: e c c e e e c e e n c c e c e
Q56: n n n n n n n n n c n n n n n
Q57: c e e c c c c c c c c c c c c
Q58: c c c c c e c c c e c c c c c
Q59: n n n n n n n n n n n n n n n
Q60: n n n n n n n n n n n n n n n
Q61: n n n n e n n n n e n n n e n
Q62: n n n n n n n n n n n n n n n
Q63: c e e c c c c c c e c c c c c
Q64: c e e c c e c c e n c c c c e
Q65: c e e c c c c c c e c c c c c
Q66: n n n n n n n n n n n n n n n
Q67: c e e c e e c c c e c c c c c
Q68: e e e e e e c c c n c c c c e
Q69: e c c c e e c e e e c e c e c
Q70: c e e e e e c c c c e e c e n
Q71: e e c n c n c e e e e e e e e
Q72: c e e c c c c c c e c c c c c
Q73: c e e e c c c c c e c c c c e
Q74: c e e c c c c c c n c c c c c
Q75: c c c e e e c e e e c c c c c
Q76: e e e c e e c c c n c c c c e
Q77: c e e c e c c c c e c c c c c
Q78: n n c n e n e e e n n e c e e
Q79: n n n n n n n n n e n n n n n
Q80: c c e e e e c c e e c c c c c
Q81: e c c e e e e e e e n e c c n
Q82: c c c e e e c c e n c c c c c
Q83: c e e c c e c n c e c c c c c
Q84: n n c n e n n n e n n n n n n
Q85: c e e c c c c c c e c c c c c
Q86: c e n c c e c c c e c c c c e
Q87: n c n n n n n n n c n n n n n
Q88: n c n c e n c n n c n n e n n
Q89: n n n n n n n n c e n n n c n
Q90: c e e c c c c c c c c c c c c
Q91: n n n n n n n n n c n n n n n
Q92: c e e e c e c c c e c e c c e
Q93: n n n n n n n n n n n n n n n
Q94: c n n n e n n n n c n n n n n
Q95: e e n c e n e n n n n n e e n
Q96: n n n n n n n n n n n n n e n
Q97: c c c c c c c c c e c c c c c
Q98: c e e c e c c c c n c c c c e
Q99: c e e c c c c c c c c c c c c
Q100: e c c c c e c e c e c c c c e
APPENDIX B
Human Study: Main Study
This appendix provides further information about the human study described
in Section 6.4 on page 132. The first section reproduces the instructions provided to annotators. These instructions were provided as an HTML page and
a printable PDF file. The second section lists the sentences used, divided into
five sets of approximately one hundred sentences. The third section presents
tables of agreement data for the sentences.1
B.1 Instructions to Participants
B.1.1 Introduction
This study involves assessing human ability to judge whether a sentence is
biographical or non-biographical when that sentence is shown out of context.
You will be presented with 100 sentences, some of which are biographical, and
some of which are non-biographical, and asked to make a judgment about each
one. It is important that you read this document carefully, and perhaps refer to
it while making your decisions. If you think that a sentence does not fall into
any of the categories given, then please mark it as non-biographical.
A sentence is biographical if it contains information from one (or more) of
the six biographical categories mentioned on page 203. If it does not contain information from one of these six categories then it is to be marked
non-biographical. For example, the following sentence contains information about place of residence and education: “Born in England, he studied
at Cambridge, before becoming a naturalized American citizen and living in
New York City for most of his adult life.”
A sentence is biographical for these purposes if and only if it contains
biographical information according to the guidelines given on page 203
(that is, the six biographical categories). Just because a sentence contains
information about someone, doesn’t mean that it is biographical (according to this scheme).
1 This
data is available electronically by emailing [email protected]
A sentence is biographical if it contains biographical information. For instance, “Gordon Brown, the Chancellor of the Exchequer, attended a meeting
of European Finance Ministers today”, is biographical simply on the basis
that it provides job information about Gordon Brown (that is, that he is
Chancellor of the Exchequer). Similarly, a sentence is biographical even if
the biographical information (often embedded in a clause) is only a very
small part of the sentence (for example, “Former Daily Mirror journalist James Hipwell says voicemail hacking has long been widespread in tabloid
newspapers, and is lifting the lid on the dubious journalistic practices he observed during his time at the paper”).
The referent of the biographical information does not have to be explicitly named in the sentence — "he" or "she" is enough (for example, "She
attended Manchester University in the late 1960’s”).
Remember that a sentence can belong to one or more biographical categories. For instance, “Steven Irwin, noted Australian naturalist and television
presenter has been killed in a tragic accident off the Australian coast” gives information about nationality, death and job role (that is, key information,
and work information).
Remember that a sentence must refer to an individual to be biographical.
“He lost his life in a road traffic accident” is biographical. “12 people were
seriously hurt" is non-biographical for these purposes.
Events that happen to a person after that person is dead are non-biographical.
(With the exception of honours awarded after death, like, for instance, the
Victoria Cross.)
Remember, the sentence may contain information about a person, but
unless that information falls into the six categories mentioned (that is,
key life facts, fame, character, relationships, education and work) then it
is not biographical. Whimsical or anecdotal information about a person,
unless it falls into one or more of the six biographical categories, is to be
classed as non-biographical. For example, “He then saw a tank, which was
carrying his substantial winnings blown to pieces before his eyes” would count
as non-biographical as it does not fall under any of the six biographical
categories.
Major surgery or abiding health concerns are to be classed as biographical.
Remember, biographically relevant information (for example, job titles)
may be contained (or buried) in very long sentences.
The day-to-day activities of politicians – meetings attended, conferences
addressed, etc. — are not to be considered biographical unless the sentences contain information that is biographical according to the six classes
identified (for example, the sentence mentions a job title or award).
B.1.2 Six Biographical Categories
Key Life Facts
These are central key facts about a person’s life, common to all people.
Information about date of birth, or date of death, or age at death.
Names and alternate names (for example, nicknames).
Place of birth: “Orr was born in Ann Arbor, Michigan but was raised in
Evansville, Indiana”.
Place of death: “He died of a heart attack while holidaying in the resort town
of Sochi on the Black Sea coast”.
Nationality: “He became a naturalized citizen of the United States in 1941”.
Cause of death: “He died of a heart attack in Bandra, Mumbai”.
Longstanding illnesses or medical conditions: “He stepped down from the
position on grounds of poor health in February 2004”.
Place of residence: “Sontag lived in Sarajevo for many months of the Sarajevo
siege”.
Physical appearance: “With his movie star good looks he was a crowd favourite”.
Major threats to health and wellbeing (for example, assassination attempts,
car crashes).
Fame
What a person is famous for. This kind of information can be broadly positive (for example, awards, prizes, honours) or negative (for example, scandal,
jail terms, and so on). For example: “His study of Dalton won him the Whitbread
prize” or “In 1976, heroin landed him in Los Angeles county jail”.
Character
Attitudes, qualities, character traits, and political or religious views. For example, “He was raised Catholic, the faith of his mother” or “Jones is recalled as a gentle
and unassuming man.”
Relationships
Information concerning relationships with intimate partners or sexual orientation. Relationships with parents, siblings, children, social acquaintances or
friends. For example: “His mother died when he was eleven” or “Nine people testified against him at the trial, including another wife he tried to set on fire”.
Education
Institutions attended, dates, evaluative judgements on time in education, educational choices, qualifications awarded. For example: “Corman studied for his
master’s degree at the University of Michigan, but dropped out when two credits short
of completion.”
Work
References to positions, starting jobs, resigning from jobs, job titles, affiliations
(for example, employer or organizations), personal wealth, areas of interest,
lists of publications, films, and so on. For example: “He returned to England
in 1967 to work for the offshore pirate radio station Wonderful Radio, London”, or,
“Gordon Brown, British Chancellor of the Exchequer”.
B.1.3 Example Sentences
Here is a list of pre-classified sentences (with a brief statement for each sentence explaining why it has been classified this way). There are 10 example
sentences.
1. A year later, on the evening of 7 June 1954, he killed himself.
biographical:
non-biographical
Explanation: This sentence refers to the date of a person’s death — a key life
event. It doesn’t matter that “he” is used instead of a proper name.
2. With an awareness of death and the miracle of life at the foundation of
his work, Saul Bellow’s novels brought him huge success, and both the
Nobel and Pulitzer Prize.
biographical:
non-biographical
Explanation: This sentence is biographical because it makes clear that a person — Saul Bellow — is the recipient of a prize. This counts as a fame event.
3. His ability to exude an almost violent enthusiasm, talk extremely loudly
and, seemingly, live a charmed life grabbing some of the world’s most
poisonous creatures out of the bush, spawned a growing cult for “red in
tooth and claw” wildlife television.
biographical:
non-biographical
Explanation: This sentence describes the qualities and abilities (“violent enthusiasm”, “talk extremely loudly”) of a person (“His”) and is thus a character
sentence and characterised as biographical.
4. During spring 1953 he was also being invited to the Greenbaums’ house
from time to time, for Franz Greenbaum, whom the Manchester intellectual establishment did not consider a very respectable figure, was not
bound by the strict Freudian view of relations between therapist and
client.
biographical:
non-biographical
Explanation: This sentence describes the relationship between “he” and Franz
Greenbaum and is hence biographical. Also, more importantly, "Franz Greenbaum" is described as someone not considered to be a respectable figure; a character sentence.
5. “I’ve come clean ’cause I need help.”
biographical:
non-biographical
Explanation: This sentence does not contain biographical information according to our definition. Although it does tell you something about the individual
(“I need help”) this information falls under none of the six categories (key, fame,
character, relationships, education and work).
6. The probation period ended in April 1953.
biographical:
non-biographical
Explanation: This example sentence does not contain information about an individual, although it mentions a “probation period” this could easily apply to a
company or organization, rather than a person.
7. It contained samples from a mysterious growth on the wall.
biographical:
non-biographical
Explanation: This example sentence does not contain information about an individual and is not biographical.
8. While away in Canada, John had a letter from Alan.
biographical:
non-biographical
Explanation: This sentence is not biographical according to the six biographical categories. While it does refer to an event (that is, John receiving a letter),
this is not in itself noteworthy with respect to the six biographical categories.
9. In his acceptance lecture, Bellow criticised modern writers for presenting
a limited and distracted picture of mankind.
biographical:
non-biographical
Explanation: This sentence is not biographical. Although it mentions Saul Bellow, it does not fall within the six categories outlined.
10. McClaren has also chosen to omit senior internationals David James and
Sol Campbell, while Theo Walcott and Scott Carson have been sent on
Under-21 duty.
biographical:
non-biographical
Explanation: This sentence is biographical as it contains work information (“senior internationals, David James and Sol Campbell”). This kind of construction
(that is, “job title, name” or “name, job title”) is particularly important, especially in news text.
B.2 Sentences
Sentences are divided into five sets of 100 (see Section 6.4 on page 132 for a
description of the methodology used). Those sentences are derived from the
small biographical corpus described in Chapter 5 and retain their biographical
tags. Note that tags were removed when the sentences were presented to study
participants, and also that the absence of tags indicates that the sentence was
not classified as biographical using the scheme developed in Chapter 5.
B.2.1 Set One
1. He became pointed, intimate.
2. If you look at the names in Norfolk, there’s a lot that are the same.
3. I saw what he meant.
4.
work I was invited by Sean but sat deliberately out of the limelight
with my friend the English comedian Eric Sykes, who was also in the
picture. /work
5. Former president Amin Gemayel, a sharp critic of Hizbullah, described
parts of the speech as dangerous.
6.
relationships work I was particularly delighted when I came
across the following piece of dialogue in a book of dialogue-criticism (Invitation to Learning, New York, 1942) mainly by the American scholars
and writers, Huntington Cairns, Allen Tate (soon to be one of my closest
friends) and Mark van Doren. /work /relationships
7. It's been superb, for myself and my family.
8. This could be Lisa’s chance.
9.
key . Home was Croydon, where she lived with her divorced mother
in a council flat, supported by social security, supplemented occasionally
by haphazard maintenance payments from her father, who was in the
Merchant Navy and had not been seen since Val was five. /key
10. To a hole in one.
11. Earlier, I made a lot of what I thought were beautiful shots with much
backlighting and many effects, absolutely none of which were motivated
by anything in the film at all.
12.
relationships Recently I asked Evelyn Waugh’s eldest son, Auberon
(Bron) if he remembers his father’s reaction on getting those proofs of The
Comforters while in the middle of writing his Pinfold. /relationships
13. The curtain was about to go up.
14.
character She was disappointed.
/character
15.
fame For more than 20 years Ronnie Barker was one of the leading
figures of British television comedy. /fame
16.
character In this poem we see their shared Jewishness, and the “irreverence” (as some would see it) they each had for the Tradition — at
least for that view of it which some espoused; we also see a shared disdain for rabbinic (and priestly) logic, to them both a form of mental death.
/character
17. "Trust Peach to exaggerate them."
18.
education And so he had acquired an old-fashioned classical education, with gaps where teachers had been made redundant or classroom
chaos had reigned. /education
19.
fame With an awareness of death and the miracle of life at the foundation of his work, Saul Bellow’s novels brought him huge success, and
both the Nobel and Pulitzer Prize. /fame
20. Fatherly men patted her head admiringly and older ladies frowned at the
sight of her.
21.
character key Reports claimed that the elfin figured star’s weight
plunged terrifyingly until she tipped the scales at a mere five stones.
/key /character
22. But that was just not there.
23. It’s true what they say about that.
24. Under the new regime we would be the first province to disenfranchise
them.
25. Margaret had failed by four votes to win outright.
26.
key I took refuge first at Aylesford in Kent at the Carmelite monastery,
and next at nearby Allington Castle, near Maidstone, a Carmelite stronghold
of tertiary nuns. /key
27. She's going out of her wits.
28. I worked 26 years in the same job and had to quit.
29. If he is happy, the boy thought, then I’m glad he went.
30. This, of course, made me feel very cheerful.
31.
key In the middle of 1955, before I had finished my first novel, I moved
back to London, fully restored and brimming with plans. /key
32.
character work Alan Maclean, who was the best-liked editor in
London, asked me to write a novel for his firm; they would commission it
(a thing unheard of, for first novels, in those days). /work /character
33.
education In 1986 he was twenty-nine, a graduate of Prince Albert
College, London (1978) and a PhD of the same university (1985). /education
34.
work It was to be directed by the great French director René Clément. /work
35. In the past, Mr Blair has proclaimed himself as the change-maker.
36.
fame In the two short years that followed her first record Kylie became one of the entertainment phenomena of the 1980s. /fame
37.
relationships We were very attached to each other, there in the
office at 50 Old Brompton Road, with one light bulb, bare boards on the
floor, a long table which was the packing department, and Peter always
retreating to his own tiny office to take phone calls from his uncles; one
of them worked at Zwemmer the booksellers and gave us intellectual
advice, and the other was a psychiatrist. /relationships
38. "Working on his conference speech," came the reply.
39. Peach felt a little awed by its grand rooms and suspicious of those fat
babies that Lais called cherubs peeking down at her from the ceilings.
40. It is a cabinet ’we’ . . .
41. I don’t care about the Superdome.
42.
key A year later, on the evening of 7 June 1954, he killed himself.
/key
43. Increasingly, Kate felt depressed by Toby's sexual behaviour, which disgusted and bewildered her.
44. It’s just something I’ve eaten.
45.
relationships By spring 1953 he was also being invited to the Greenbaums’ house from time to time, for Franz Greenbaum, whom the Manchester intellectual establishment did not consider a very respectable figure,
was not bound by the strict Freudian view of relations between therapist
and client. /relationships
46. You're asking me if I'm an accessory after the fact to first-degree murder.
47.
education She attended the Actors Studio in New York, famous for
teaching an intense style known as the Method, beloved of actors such as
Marlon Brando and James Dean. /education
48.
key Something strange was not surprising, because, foolishly, I had
been taking dexedrine as an appetite suppressant, so that I would feel
less hungry. /key
49.
work relationships I found a friend in Father Frank O’Malley,
a kind of lay-psychologist and Jungian. /relationships /work
50.
work The examination of Robin's PhD thesis, on the logical foundations of physics, had to be postponed since Stephen Toulmin, the philosopher of science, had decided he could not undertake it after all. /work
51.
education The wrath of her disappointment had been the instrument of his education, which had taken place in a perpetual rush from
site to site of a hastily amalgamated three-school comprehensive, the
Aneurin Bevan school, combining Glasdale Old Grammar School, St Thomas
a Beckett’s C of E Secondary School and the Clothiers’ Guild Technical
Modern School. /education
52.
work Minutes of the University Council show that this had been decided by January or February 1953. to appoint him to a specially created
Readership in the Theory of Computing when the five years of the old
position ran out on 29 September. /work
53.
fame From 1944 and the publication of Bellow’s first novel Dangling
Man, the writer and teacher produced a body of work that ensured his
position as one of America's most powerful voices. /fame
54. All she wore was 11 beads, and eight of them were perspiration.
55.
character He was a small man, with very soft, startling black hair
and small regular features. /character
56.
education Star Trek’s impact became apparent when he was awarded
an honorary doctorate in Engineering from the Milwaukee School of Engineering, after half the students there said that Scotty had inspired them
to take up the subject. /education
57. Several other motorists who had refused to pay "bail" were also given
their keys back.
58.
work Franco was the Fascist Dictator of Spain at the time.
/work
59. He lags it up well, but the US pair are made to pay moments later when
Donald takes advantage of Garcia’s accurate iron by rolling in a 12-footer.
60. Shame on you, Tony Blair.
61.
relationships Father O’Malley and his cousin Teresa Walshe had
found the place for me. /relationships
62. And it will be good for us all.
63.
character She thought him “obsessed by sex”
/character
64. "I take one 10mg tablet each night and I feel about 60% better."
65.
work I have been in over seventy-three films in thirty years and by
the time you read this it will probably be seventy-six. /work
66. So Douglas and John should be released from their obligation to me and
allowed to stand, since either had a better chance than I did.
67. This is the way I work.
68.
character A keen and powerful debater, he was not amused at the
dreariness of the Executive's meetings, the small talk and the administration (never his strong point). /character
69.
work In 1953 on my return from Edinburgh, feeling desperately weak,
I wrote a review, in the Church of England Newspaper, of T.S. Eliot’s play
The Confidential Clerk which was first performed at the Edinburgh Festival. /work
70. I looked up and Bardot was grinning as she dusted breadcrumbs from
her hands.
71.
key Blackadder, a Scot, believed British writings should stay in Britain
and be studied by the British. /key
72. Kate said, "Oh, do take it off, Toby!"
73.
fame In 1988 alone she sold a remarkable £25 million worth of records
around the world, earning herself around £5 million. /fame
74. He suggested that the raids could even have been timed to distract attention from criticisms of the government's stance on the Lebanon crisis.
75.
education After the war, Doohan spent two years studying acting at New York City’s Neighborhood Playhouse, where he later taught.
/education
76. The two men had spoken "almost daily" in August as Mr Blair supported
the push for a UN resolution to end the Israeli offensive.
77. Panic ensued as such brands as Watney’s Red Barrel, Worthington E and
Whitbread Tankard rapidly dominated the market.
78.
fame She was garnering awards from Japan to Israel and Ireland, embarking on a movie career that seemed certain to lead to Hollywood stardom — she had even achieved the final confirmation of her status as a
member of the elite band entitled to call themselves superstars, a wax
image of herself at Madame Tussaud's /fame
79. There is no doubt that a majority of her cabinet colleagues were glad to
see the back of her, but they gave their advice freely and frankly.
80.
education key Born in Bedford in 1929, Barker went to school
in Oxford, became an architecture student and even toyed with the idea
of becoming a bank manager, the archetypal middle-class profession he
would later parody so effectively in his comic sketches. /key /education
81.
relationships About my grandmother Flory Zogoiby, Epifania da
Gama’s opposite number, her equal in years although closer to me by
a generation: a decade before the century’s turn Fearless Flory would
haunt the boys' school playground, teasing adolescent males with swishings of skirts and sing-song sneers, and with a twig would scratch challenges into the earth- step across this line. /relationships
82. "And we are fighting for the possibility that good and decent people
across the Middle East can raise up societies based on freedom, and tolerance, and personal dignity.”
83.
fame He received the National Book Award, his first of three, in 1954
for The Adventures of Augie March and, 10 years later, his international
reputation was assured with Herzog. /fame
84.
fame This brought him a Pulitzer Prize in 1975, and a year later, Bellow became the seventh American writer to receive the Nobel Prize for
Literature. /fame
85. "But this also will have to be checked into the hold."
86.
key character Wilmot was a dissolute courtier at the Restoration court of Charles II, a lecher and a drunk, but also a poet who could
treat himself and his world with satiric coolness and who helped to establish the tradition of English satiric verse and assisted Dryden in the
writing of Marriage-a-la-Mode. /character /key
87. Throughout the 1950s, Hilda worked tirelessly to better the condition of
African women, despite being banned from 28 organisations.
88.
relationships One person who encouraged this development was
Lyn Newman, who became another of the small group of human beings
whom Alan could trust. /relationships
89.
fame He was much loved and admired for his appearances in the
long-running series The Two Ronnies, with Ronnie Corbett, as prison inmate Fletcher, in the series Porridge, and as Arkwright, the bumbling,
stuttering, sex-obsessed shopkeeper in Open All Hours. /fame
90.
fame It so happened that in 1954, in the crucial months of my illness,
my name was beginning to flourish in the literary world. /fame
91. "I have great respect for David; he was a fantastic captain, a great player
and still is.”
92. That was a Saturday.
93.
key Through the psychedelic years he was a schoolboy in a depressed
Lancashire cotton town, untouched alike by Liverpool noise and London
turmoil. /key
94.
education He had done what was hoped of him, always, had four
A's at A Level, a First, a PhD. /education
95. "No," George smiles, and his family burst into tears.
96.
relationships The house was owned by Mrs Lazzari (Tiny), a
wonderful Irish widow who had been married to an Italian cellist (“so
I understand the Artist") /relationships
97. "Are you saying my sister is going to die?"
98. And at every turn is the ubiquitous Fred Scuttle, constantly at our service
with his peaked cap crazily askew, eager eyes blinking madly through
wire-rimmed glasses and fingers enthusiastically splayed in a ragged
salute.
99.
work The last three Best Actor Academy Awards have been won
by British stars: Daniel Day Lewis as a horribly deformed man in My
Left Foot; Jeremy Irons as a man on trial for the alleged attempted murder of his wife in Reversal of Fortune, and Anthony Hopkins, in 1992,
who played Hannibal Lecter the homicidal cannibal in The Silence of the
Lambs /work
100. "If you lit a match in our kitchen, it'd go up with a roar."
101.
character education work He worked in menial jobs to
pay his way through college where he studied journalism and became a
radical. /work /education /character
B.2.2 Set Two
1. Kendra and Maliyah were joined at mid-torso, with some shared organs
and just two legs.
2. Just a teensy little kiss, he said.
3. Along with using their laptops on board, business travellers have become
used to taking all their luggage on board with them, so they can get to
their meetings as quickly as possible, without having to wait to collect
their bags.
4.
work I worked at Peter Owen’s three days a week, and at home wrote
stories and my second novel, a kind of adventure story, Robinson. /work
5. He has a terrific sense of humour and he carries on running jokes from
the day before.
6.
character Her eyes, according to Alan Watkins of the Observer, took
on a manic quality when talking about Europe, while her teeth were such
as 'to gobble you up'. /character
7. Allardyce is aware that he and Redknapp will be under intense scrutiny
and he called a meeting of his senior players yesterday to discuss the
fall-out from the Panorama programme.
8. Ms Hewitt will not present it as a cut in investment, but that may be the
consequence of blocking uneconomic schemes.
9.
fame In that year, he picked up the English and European footballer
of the year awards. /fame
10.
work I was secretary, proof-reader, editor, publicity girl; Mrs Bool was
secretary, office manager and filing clerk; and Erna Horne, a rather myopic thick-lensed German refugee, was the book keeper. /work
11.
work Alan had been urged to look for new young talent, and got
my address from Tony Strachan, who was then working at Macmillan.
/work
12.
relationships key They had seen one some months earlier, a
puppy of fourteen weeks with a beautiful smoky fur, belonging to Raymond’s wife, Charlotte (the Raymond Greenes were then living in Oxford where Raymond had a medical practice), and this led Greene to buy
one /key . /relationships
13. But they all miss the point.
14.
relationships He lived with Val, whom he had met at a Freshers’
tea party in the Student Union when he was eighteen. /relationships
15.
character While Benny’s deliberately low key arrival to start making a new series may seem somewhat incongruous for a millionaire entertainer whose programmes are shown all over the world, it is typical
of the workaday beginnings from which a Benny Hill Show is produced.
/character
16. John said that he now doubted whether I could get the support of the
Cabinet.
17. “It has obviously caused a lot of offence and for that I unreservedly apologise,” he said but added: “Words like inbreeding and outbreeding are
very professional, genetic terms.”
18. Writing in The Times, Dunkley commented: “When I was about four my
mother managed to reduce me to an almost hysterical fit of giggling by
promising to show me her new water otter and then producing a kettle.”
19.
work It came from Alan Maclean, the fiction editor of Macmillan, London, a much larger publisher than any I had so far dealt with. /work
20.
character key Blackadder, a Scot, believed British writings should
stay in Britain and be studied by the British. /key /character
21. Fielding caught the unspectacular tab, leaving a twenty on the plate.
22.
key In 1930 he had begun writing the biography of John Wilmot, the
second Earl of Rochester, who was born on to April 1647 and died at the
age of thirty-three. /key
23.
education key Nicknamed Robin at school, he attended Aberdeen
Grammar School before studying English Literature at Edinburgh University. /key /education
24. "It is difficult to describe how it feels to get someone back who you were
told you had lost for ever.”
25. “To talk again.”
26. We were there for six days and only got out on medical grounds because
of the baby.
27. Here was a steadfast friend but, as I quickly saw, one in the deepest distress.
28. “Especially Maman and Gerard.”
29. I thought it was a reference to Through the Looking Glass, where Humpty
Dumpty says Inpenetrability.
30.
fame In 2004 Barker was honoured with a Bafta tribute award and
celebration evening for his contribution to comedy. /fame
31. Woods finally gets a birdie putt to drop, before Harrington sparks wild
celebrations among the crowd by following him straight in.
32.
fame Together with Denis Law and Bobby Charlton, Best formed a
triumvirate that inspired Manchester United to League Championships
in 1965 and 1967 and the European Cup in 1968. /fame 33.
work While waiting for my novel to appear, I worked part time at
Peter Owen the publisher. /work 34.
fame He is widely regarded as one of the greatest players to have
graced the British game. /fame
35. Now how do we market this.
36. There is an enormous amount of pressure on me.
37.
character He was to turn to Catholicism and make a death-bed repentance. /character
38. Shall I let you in?’
39. But don’t despair, my friends.
40.
work relationships I found a friend in Father Frank O’Malley,
a kind of lay-psychologist and Jungian. /relationships /work 41.
relationships character ’Benny’s not the kind to sweep up
to the studio in a huge limousine like some showbiz superstar,’ explained
Dennis Kirkland, a former floor manager who has been producing the
show for seven years and is one of Benny’s’ few close friends. /character /relationships
42. Nothing else of this correspondence has survived; my suspicion is that it
probably held the most revealing and sophisticated psychological comment that he ever put into letters.
43.
fame At 18, he won the first of 37 international caps for Northern
Ireland and was being hailed as the new Stanley Matthews. /fame
44. But she fervently hopes that you will give Lisa a chance and that you will
not be disappointed.
45.
fame 1984: Jailed for drink-driving offence /fame
46. What is she hoping for?
47. Donald finds the green in two to leave Garcia with a 25-footer for birdie.
2.50pm Betting update: The US, who drifted out as long as 11-2 during
the morning’s play, have come back in to 7-2, with Europe drifting out to
4-11 from 1-6 at one point earlier today.
48. “You must have had a very bad mother, if you do something like this,”
she told Mr Diop before stomping off.
49. Others, including Professor Kemp, are sure that the Mona Lisa was not
cut down.
50. I feel dizzy as well as sick. Why don’t you lie down? I’ve been lying
down.
51.
fame In the late 1970s he was three times the British Academy’s best
light entertainment performer, and in 1975 he took the Royal Television
Society’s award for outstanding creative achievement. /fame
52. He added: There has been a lot more intelligence.
53. character He was, as always, charming, thoughtful and loyal. /character
54. fame Although he rarely missed a game in his early career, he started causing problems at Old Trafford, and in 1971 was suspended for a fortnight for failing to catch a train for a game at Chelsea. /fame
55. Once more Kate hit the bathroom floor.
56. Oh, how sad! Toby disappeared into the bathroom and emerged about
ten minutes later.
57. She was trying to hide the elation, but I could see it there.
58. “This is a common threat to all of us and we should respond with a common purpose and a common solidarity and common cause,” he said.
59. Play Dirty was due to be shot in a town called Almeria in southern Spain,
and as there was no airport there I had to fly to Madrid and then take a
train.
60.
education Charles himself was educated at St Albans school and
read natural sciences and law at Trinity College, Cambridge. /education
61. “I crushed it up and gave it to him in a bottle with a soft drink,” Sienie
recalls.
62.
work On that occasion, while I was seeing my agent, Tiny wandered
off by herself; she came back bringing with her for lunch my friend, Joe
McCrindle, owner and editor of Transatlantic Review, who had visited at
Baldwin Crescent. /work
63. It seems odd.
64. I said, “George”, and he said, “What?”
65.
fame Manchester United football legend George Best will be remembered for his dazzling skill on the pitch, and for his champagne lifestyle
away from it. /fame 66.
education relationships Hawthorne, one of identical twins,
was educated in Belfast, at the Methodist college and Queen’s University.
/relationships /education 67.
character Poor Geoffrey had just been unlucky in his seating for
Robin Maxwell-Hyslop had always been a man to avoid. /character 68.
education But John’s first love was sailing: he was educated at the
Nautical College in Pangbourne. /education 69.
relationships This might appear to indicate that Blackadder and
Cropper worked harmoniously together on behalf of Ash. /relationships 70.
work In the course of that year the proofs went round among literary people, one of whom was Gabriel Fielding, a very good novelist; his
real name was Alan Barnsley, a medical doctor, practising in Maidstone.
/work 71.
work I was particularly delighted when I came across the following
piece of dialogue in a book of dialogue-criticism (Invitation to Learning,
New York, 1942) mainly by the American scholars and writers, Huntington Cairns, Allen Tate (soon to be one of my closest friends) and Mark
van Doren. /work
72. However, those caught up in the Superdome misery, many from the city’s
mostly black and poorer areas, appear largely ambivalent about its reopening.
73.
education relationships Before long, her mother moved to
Ilfracombe, Devon, where Patricia went to school aged three, having already taught herself to read from newspapers.
/relationships /education
74. The probation period ended in April 1953.
75. The economy is very bad.
76.
education Either side of the second world war, John went to Corpus
Christi College, Cambridge, where he graduated with an honours degree
in economics. /education 77.
relationships key He visited Hinchingbrook House, home of
the Earls of Sandwich, one of whom had married one of Rochester’s
daughters. /key /relationships 78.
character I stressed his stamina, his integrity and his ministerial experience. /character
79.
education key She went on to secretarial college in London, defying her father, who did not want her to leave home, and in 1959 got a
job at JWT, soon becoming a copywriter. /key /education
80. It was reported that she disturbed the prowler when she arrived back
unexpectedly at her family’s Melbourne home.
81. He don’t half talk a lot of . . . nonsense.
82. And he said, “Yes.”
83. However, many - including Mr Olmert - have questioned whether such
a policy would work in the long term.
84. Sometimes it gets to me, I give so much time and energy to everyone else,
that there is nothing left for me. That is when I think What about me?
85. Over the next two hours or so, each Cabinet minister came in, sat down
on the sofa in front of me and gave me his views.
86. During her days on Neighbours, she recalled how people were only too
willing to vent their jealousies publicly.
87.
key character Wilmot was a dissolute courtier at the Restoration court of Charles II, a lecher and a drunk, but also a poet who could
treat himself and his world with satiric coolness and who helped to establish the tradition of English satiric verse and assisted Dryden in the
/key writing of Marriage-a-la-Mode. /character 88.
work relationships key Val left him for the first time since
they had set up house, and went briefly home. /key /relationships /work
89. He took Stilnox in 1999 and reported an improvement in balance, coordination, speech and hearing.
90.
key Alas not now, since Sir Hugh died in 1987.
/key 91.
education work Nykvist studied photography, and spent a year
at Cinecitta in Rome, before joining the Swedish production company
Sandrews in 1941 as assistant director of photography. /work /education 92.
character Producer Kirkland well remembers that the origins of one
Benny Hill Show lay in the arrival on his desk of a dog-eared piece of
cardboard covered with what looked like Egyptian hieroglyphics. /character 93.
work Matter-of-fact Hugo Manning, a night-journalist who worked
on Reuters, and also a poet and amateur philosopher, was a great source
of moral support. /work 94.
relationships But his favourite walking companion was always
Hugh, though in later years it was mostly for visits to secondhand bookshops. /relationships 95.
fame Best was a footballing genius. /fame
96. Since silence was also Roland’s only form of aggression they would continue in this way for days, or, one terrible time when Roland directly
criticised Male Ventriloquism, for weeks.
97.
work education After studying at Columbia University he served
in the US army in Germany during the war and, returning to Columbia,
took another degree, in journalism. /education /work
98. On 10 May Alan sent a letter to Maria Greenbaum, describing a complete
solution to a solitaire puzzle, and ending: I hope you all have a very nice
holiday in Italian Switzerland.
99. At least 41 people were killed when a concrete building collapsed in Jinxiang, an industrial town in Cangnan county close to where the typhoon
hit land.
100. I said to Robert: It’s going to be all right, isn’t it? She is so like he said.
B.2.3
Set Three
1.
work She acquired an IBM golfball typewriter and did academic typing at home in the evenings and various well-paid temping jobs during
the day. /work 2.
relationships I now had a short talk with Alan Clark, Minister of
State at the Ministry of Defence, and a gallant friend, who came round
to lift my spirits with the encouraging advice that I should fight on at all
costs. /relationships 3.
key But the accompanying champagne and playboy lifestyle degenerated into alcoholism, bankruptcy, a prison sentence and, eventually, a
liver transplant. /key 4.
key Many of his novels were set in Chicago where his poor Russian-Jewish parents moved when he was a child. /key
5. By now I was in hysterics and Bardot noticed this and probably thought
I was laughing at her.
6.
character Well I have my problems too, sister, but I don’t have yours,
I’m not allergic to the twentieth century. /character
7. He smiled in innocent self-reproach, then swung sternly and made the
reverse V-sign at the watchful waiter.
8. A second report published yesterday by the London Resilience Forum,
representing the emergency authorities, concluded that not a single life
was lost because of poor planning.
9. Ibid., 12 June 1931.
10. The story was eventually made into a movie starring Johnny Depp and
Benicio Del Toro.
11.
fame Arthur Miller was America’s foremost post-war playwright. /fame
12. We made our way to Green Park and as we were sitting there Roderick
suddenly said: What are we going to do, Noelle? Do? I mean, how much
longer are we going on meeting like this?
13.
work relationships character Benny’s not the kind to
sweep up to the studio in a huge limousine like some showbiz superstar,’
explained Dennis Kirkland, a former floor manager who has been producing the show for seven years and is one of Benny’s’ few close friends.
/character /relationships /work
14. Steinberg, introducing Klein’s brilliant novel The Second Scroll, draws
attention to, the obsessive theme of the discovered poetry (of New Israel)
is the miraculous, and the key image necessary to explain the remarkable
vitality, the rebirth evidenced in every aspect of life, is the miracle.
15. Many Blairites would agree with that assessment.
16. Val did very badly.
17. There’s no economic management in this country.
18.
work Significantly, he adduced the work of the American poet Wallace Stevens at this point, a man torn between the profession of law and
the poetic muse, whose view of lost faith and a disconnected tradition imbued his poetry with a wistfulness and a challenge that was taken very
seriously by Leonard and Layton; or, perhaps, viewed by them as a satisfactory replacement. /work 19.
character He had a way with words, and perhaps this had too easily convinced me that he and I always put the same construction upon
them. /character 20.
work key In 1930 he had begun writing the biography of John
Wilmot, the second Earl of Rochester, who was born on to April 1647
and died at the age of thirty-three. /key /work
21. That was for him to continue to fight for a place.
22. “The study suggests that with increasing sea surface temperatures, we
can expect more intense hurricanes,” Dr Gillett added.
23.
fame She was one of the most celebrated actresses of the 1960s and
1970s, winning five Academy Award nominations and an Oscar itself for
her role in The Miracle Worker. /fame 24.
fame Then, in 1984, he was convicted of drink-driving and assaulting
a policeman, and was jailed for 12 weeks. /fame 25.
work This was the year in which Leonard was elected president of
McGill’s Debating Society. /work 26.
relationships Peter was the Thatcherite brother of the ’wet’ Charlie Morrison, and son of the John Morrison who had been the chairman
of the ’22 when I was first elected in 1959. /relationships
27.
work He was now essentially unemployed, scraping a living on part-time tutoring, dogsbodying for Blackadder and some restaurant dishwashing. /work
28. “Is there a lot of fist-pumping, and yeah-shouting?” asks Ravi Motha.
29.
relationships key Although divorced four times, Saul Bellow’s own much-stated belief in the miracle of life was reinforced when
his fifth wife made him a father again at the age of 84. /key /relationships
30. He paused. Antonio Pisello, he said, Tony Cazzo — from Staten Island.
31.
education As king, he was instrumental in promoting the University of the South Pacific, and from 1970 was its first chancellor. /education 32.
fame A year later, he was dropped from the team again for failing
to attend training, and was ordered to leave the house he had built in
Cheshire and move into lodgings near Old Trafford. /fame 33.
education He was then sent to Australia to study: first at Newington College, in Stanmore, New South Wales, then at Sydney University
(1938-42), where he read arts and law - and became the first Tongan in
history to graduate. /education
34. But as a former Chief Whip - and how often in recent days had I wished
that he still held that office - he knew that support for me in the Cabinet
had collapsed.
35. “I was born and raised in New Orleans but I don’t want to go back, not
to the city and definitely not to the Superdome.”
36.
fame Yet, ask anyone their memory of Anne Bancroft and it’s the image of the bored housewife in The Graduate listening to Dustin Hoffman
asking the question “Mrs Robinson, you’re trying to seduce me, aren’t
you?” /fame 37.
work key As well as citing a decline in health for his reason for
retiring, Barker said he always felt he should quit while he was ahead,
and he had no further ambitions. /key /work
38. Next, I seemed to realize that this word-game went through other books
by other authors.
39. Have a good night.
40. On one occasion he refused to shake her hand, and on another lost his
temper and swore to “rope her” - the choice of verb was not lost on female voters - “like a heifer”.
41. Since the 7/7 bombings in London last year, ministers in the Home Office
had been “very actively engaged” in discussing with members of Muslim
communities the threat facing “all of us” and had already acted on nine
of the 12 points outlined in an anti-terrorism plan drawn up after the
London bombings, Mr Reid said.
42.
education work She kept in touch with Cornwall and strongly
supported projects for the county’s regeneration; was a vice president
of the History of Advertising Trust, a member of the Monopolies and
Mergers Commission, and on the council of Brunel University (they gave
her an honorary doctorate in 1996, the year she was made an OBE); was
involved with National Trust Enterprises, the English-Speaking Union
and the then Administrative Staff College, Henley ... the list goes on and
on. /education /work
43.
character But he was a man of the Left.
/character
44. Television critic Chris Dunkley groaned with displeasure when Benny
insisted on reviving one particular old chestnut in one of his early shows
for Thames Television.
45.
character His formative years were also steeped in his Jewish heritage, but he turned from this “suffocating orthodoxy” to enjoy the works
of such writers as Mark Twain and Edgar Allan Poe. /character 46.
work It seemed incongruous to them that the sweet teenager they
knew was not only surviving life in the toughest of all trades, but that
she was also winning something of a reputation as a tough cookie, a determined career girl refusing to be deflected from her dreams. /work 47.
relationships work Francis Maude, Angus’s son and Minister
of State at the Foreign Office, whom I regarded as a reliable ally, told me
that he passionately supported the things I believed in, that he would
back me as long as I went on, but that he did not believe I could win.
/work /relationships
48. They had the lights on us and it was so hot we were pouring water on to
the heads of the elderly to try to keep them cool, but they were passing
out.
49. “The money used has nothing to do with housing or individual allocations, so we’re not competing.”
50. The time came when nobody would cross the lines she went on drawing, with fearsome precision, across the gullies and open spaces of her
childhood years.
51.
fame Beyond the glare of stardom and the Pulitzer Prize which he
won for Salesman, he sought to provoke his audience into questioning
society and authority. /fame 52.
character I am addicted to the twentieth century.
53.
fame Arthur Miller became famous overnight /fame /character 54. Airey Neave and Margaret Thatcher have come to see me and we’re absolutely agreed that there should be no increase in your licence fee unless
you put things right...
55.
work The Ash Factory was funded by a small grant from London University and a much larger one from the Newsome Foundation in Albuquerque, a charitable Trust of which Mortimer Cropper was a Trustee.
/work
56.
character Able to deliver the great tongue-twisting speeches required
of his characters, Barker pronounced himself “completely boring” without a script. /character 57.
fame His work continued to be loved and admired in the UK and in
1995 his Broken Glass won the prestigious Olivier Award for best play.
/fame
58. McGinley decides to have a crack at the 16th green in two from the first
cut of the right rough, but comes up short to find a watery landing spot.
59.
key relationships His personal life became increasingly more
difficult, with bouts of alcoholism, bankruptcy and the failure of his first
marriage. /relationships /key
60. Fifty-six people, including the bombers, died in the attacks, with more
than 700 injured.
61. As they saw it, she was unlikely to defeat Michael in a second ballot.
62.
work relationships key Home was Croydon, where she
lived with her divorced mother in a council flat, supported by social security, supplemented occasionally by haphazard maintenance payments
from her father, who was in the Merchant Navy and had not been seen
since Val was five. /key /relationships /work
63. “I will stay with her,” said Leonie, walking to the severe white door behind which her granddaughter lay.
64. Perhaps I should have done.
65. At that time, the southern states’ rigid segregation laws, which had been
in force since the end of the Civil War in 1865, demanded separation of
the races on buses, in restaurants and other public areas.
66. By the time the proofs came to him in mid-August 1931 he still had not
found a satisfactory title and, in some desperation, chose one previously
put forward by his publisher for his preceding novel, a suggestion he had
not then taken up - Rumour at Nightfall .
67. She swept Peach off and plunged her into a bath of cool water, gradually
adding ice until the coolness penetrated Peach’s very bones.
68. On the surface it is a good action story, based on fact, with a moral to it
and some controversy.
69.
character You know, the thing I want more than anything else —
you could call it my dream in life — is to make lots of money. /character
70. One cabinet minister involved in bridge building between Mr Brown and
Tony Blair during the past fortnight put the challenge for Mr Brown like
this: What will matter is the language in which he speaks about Tony ...
71. But I was glad to have someone unambiguously on my side even in defeat.
72. education After graduation she spent a year at the University of Texas, Austin, to acquire a teaching certificate and taught history and social studies for one year in state schools. /education
73. character Barker was a man of contradictions. /character
74. fame Indeed, it is claimed that he once lost more than $6m in one night at the Monte Carlo casino. /fame
75. He is linking giving up Hizbullah’s weapons to regime change in Lebanon
and ... to drastic changes on the level of the Lebanese government, Mr
Gemayel said.
76.
education key relationships She was born the only child
of hard-working, blue-collar parents, Ona and Cecil Willis, in the small
town of Lakeview in east central Texas, and attended high school in Waco.
/relationships /key /education 77.
education key Born in Bedford in 1929, Barker went to school
in Oxford, became an architecture student and even toyed with the idea
of becoming a bank manager, the archetypal middle-class profession he
would later parody so effectively in his comic sketches. /key /education 78.
work After Solomon’s desertion, Flory took over as caretaker of blue
ceramic tiles and Joseph Rabban’s copper plates, claiming the post with a
gleaming ferocity that silenced all rumbles of opposition to her appointment. /work 79.
education key Evans spent his infancy in Aberkenfig, Glamorgan, went to school in Suffolk, did his national service and, in 1953,
when he was 23, emigrated to New Zealand as a labourer.
/key /education
80. Last month Mr Bruce warned the campaign had stagnated and said the
country needed a coalition of angry people committed to slaying the
dragon immediately.
81.
character He was modest about his writing skills and often submitted his scripts under pseudonyms, in order for them to be judged on
their own merits. /character
82. Your last letter arrived in the middle of a crisis about ’Den Norske Gutt’,
so I have not been able to give my attention yet to the really vital part
about theory of perception....
83.
education work She left school at the age of 14, determined to
make a career as a dancer. /work /education
84. The materials used to make the overhead bins in airlines have been strengthened and so heavier items can be stored without harming the passengers,
said Mr Bowden.
85.
work relationships key They had seen one some months
earlier, a puppy of fourteen weeks with a beautiful smoky fur, belonging
to Raymond’s wife, Charlotte (the Raymond Greenes were then living in
Oxford where Raymond had a medical practice), and this led Greene to
buy one /key . /relationships /work
86.
work relationships His father was a minor official in the County Council. /relationships /work
87.
education His education as crown prince had begun at a school run
by the Free Wesleyan Church and continued at Tupou College, where, as
an academically bright teenager, he obtained his leaving certificate at the
age of 14. /education
88. The Ordeal of Gilbert Pinfold was the result, published in the summer of
1957.
89.
relationships work Ronnie Barker first worked with Ronnie
Corbett in The Frost Report and Frost on Sunday, programmes for which
he also wrote scripts. /work /relationships
90. He told his people, these forces are participating in joint exercises with
Saudi Arabia.
91. She recorded in her diary: “The fish shop sells china on one side and flies
on the other.”
92. I suppose reading must come in quite handy at times like these.
93. Fifteen of the hijackers were thought to have been Saudi nationals.
94.
fame character He claimed that the experience made him turn
over a new leaf, but in 1990 millions watched his infamous drunken performance on the Wogan television chat show /character . /fame 95. Don’t you think that is fascinating? Roderick said he did. Mr Claverham has something very ancient in his own home, I told Lisa. They have
found remains of a Roman settlement on the land. How wonderful! cried
Lisa.
96.
education character Her speaking ability revealed itself early
on and she entered Baylor University on a debating scholarship /character .
/education
97. He was also chatting with family and friends.
98. The draw is 12s. 2.48pm - Montgomerie/Westwood v Campbell/Taylor
a/s (4) The American pairing hit back nicely with a birdie at the par-five.
99. She would hold the beautiful earrings for her or slide sparkling rings on
to Lais’s white fingers, touching the long lacquered nails wonderingly,
her mouth copying Lais’s pout as she applied the lovely shiny red lipstick.
100. Linex liked the way I was thinking, but he said that you’d never get the
punters in and out quickly enough.
101.
work She worked in the City and in teaching hospitals, in shipping
firms and art galleries. /work
B.2.4
Set Four
1.
fame In 1999 Rosa Parks was awarded the Congressional Medal of
Freedom. /fame
2. For sheer incident it almost rivals the Arnold story.
3.
character A slight figure, 5ft 8in tall and weighing 10 stone, he
dazzled the crowds with his skill. /character 4.
work He anticipated Swift in his “Satyr Against Mankind” with its
scathing denunciation of rationalism and optimism, contrasting human
perfidy and the instinctive wisdom of the animal world. /work
5. They formed anagrams and crosswords.
6. ’And he never wastes anything.
7.
key James Montgomery Doohan (he shared a name with his most famous character) was not, in fact, a Scot but a Canadian. /key 8.
fame He is best-known for his 1972 account of a drug-addled Nevada
trip, Fear and Loathing in Las Vegas. /fame 9.
fame Rosa Parks was arrested for her refusal to give up her bus seat /fame
10. Last August, he said Mr Blair should learn the lessons of Iraq and make
a pledge to the party conference that he would not launch anymore preemptive strikes.
11.
work relationships character ”Benny’s not the kind to
sweep up to the studio in a huge limousine like some showbiz superstar,”
explained Dennis Kirkland, a former floor manager who has been producing the show for seven years and is one of Benny’s’ few close friends.
/character /relationships /work 12.
work He had taken the risk of giving up a secure and promising career with The Times; the risk of accepting a salary from his publishers on
the understanding that he would produce saleable novels; the risk, financially forced on him, of removing himself from the London literary scene
and into the country. /work
13. “She’s trying to fight it- she’s ignored me since the first day of shooting.”
14. No evidence.
15. What about my teeth? she asked, thinking of her mother.
16. There are also complaints that taxpayers’ money allegedly being sent to
Hizbullah in Lebanon would be better spent at home.
17.
character He also had a reputation as something of a diplomat,
who understood the intricacies of foreign policy, especially the importance of the Saudi dynasty’s relationship with the United States. /character 18.
relationships key George Best was born in Belfast, the son of
a shipyard worker. He was spotted by a Manchester United scout while
still at school. /key /relationships
19.
key As a captain in the Royal Canadian Artillery Regiment, he lost a
finger on the first morning of the D-Day landings in Normandy. /key
20. Another granny has been mob-raped in her sock by black boys and skinheads.
21.
character education work He worked in menial jobs to
pay his way through college where he studied journalism and became a
radical. /work /education /character 22.
relationships And in 2000, aged 80, Doohan boldly went into fatherhood for the seventh time when his then 43-year-old wife gave birth
to a daughter, Sarah. /relationships 23.
character But soon the shy, unworldly boy from Belfast was caught
up in the trappings of fame. /character
24. While fellow professionals may be understandingly tolerant of a comic
who can give old material a new shine, those less personally concerned
with the difficulties of creating new scripts are not always so charitable.
25. Five traffic policemen were taken for questioning, while one was reported
to have run away.
26. Several days previously I had counseled caution - Michael was doing a
trawl of those who were committed to him.
27. The audience applauded.
28. Nick Ridley, no longer in the Cabinet but a figure of more than equivalent
weight, also assured me of his complete support.
29. His eyes glittered strangely in his masklike makeup: Kate thought he
looked like a novelette villain. You’ve been reading too much Barbara
Cartland, she told herself.
30. He was following the route taken by the Parliamentary Army during the
Civil War - “over the final ridge of the Cotswolds, to Chipping Norton”,
and his personal experience of this journey appears in the opening of his
biography — the level wash of fields. . . divided by grey walls, lapping
round the small church and rising to the height of the gravestones in a
foam of nettles before dwindling out against the black rise of Wychwood.
31. Snuff movies — now this is evidence. And then his manner, the force
field he gave off, it changed, not for long.
32. It’s as simple as that.
33. Being in Northern Ireland, he was not closely in touch with parliamentary opinion and could not himself offer an authoritative view of my
prospects.
34. There is also the matter of Mr Brown’s personal style.
35.
relationships work King Fahd, who ascended the Saudi throne
in 1982, was one of seven sons of the founder of Saudi Arabia, King
Abdel-Aziz, and his favourite wife, Hassa. /work /relationships
36. Only mine drinks.
37. Maliyah was to begin kidney dialysis in preparation for receiving a kidney from her mother in three to six months.
38. Was I driving like a twat? asked Hammond, before walking gingerly to
the toilet.
39.
relationships He admired her courage, he said, but it did not last.
/relationships
40. Re: 4.44pm.
41. He was real shy.
42.
fame Her Oscar nominations were for The Pumpkin Eater (1964),
The Graduate (1967), The Turning Point (1977) and Agnes of God (1985).
/fame
43. One particular example of this, the American novel Finistre Fritz Peters,
Finistre (Gollancz, 1951). which had appeared in 1951, was much admired by Alan.
44. Ten per centum.
45.
key In March 2000 he spent several weeks in hospital with a liver
problem, almost certainly a result of his drinking. /key
46. Toms, sitting 12 feet to the right of the hole in three, misses what he expected to have for a win, and Europe are dormy two. 5.04pm - Casey/Howell
4up (10) Apologies for the lack of coverage on this match, but with Europe well in control, Sky have deemed it too boring to spend any broadcast time in the past 45 minutes or so.
47.
character The young Fahd was known as a technocrat and political wheeler-dealer. He knew about internal security and about how the
kingdom needed to be defended from within. /character
48. Bellow always said that of all his heroes, Henderson most resembled himself, and the book remained one of his favourites.
49. At least I hope we are. I hope so, too.
50.
character In due course his liberal views caught him in the McCarthy anti-communist witch-hunt. /character
51. Such ideas are buzzing through Benny’s brain months before a new show
goes into production.
52. I called on Janet Dare.
53. Remember, once you are here as a French citizen you may find it impossible to leave.
54.
work Max Bygraves, another entertainer who has watched Benny develop over more than 40 years, is similarly impressed, “he’s a very fine
comedian,” says Max. /work
55. I have to learn things or sign cards for fans. If I lived alone I would nearly
go crazy.
56.
fame Found guilty of breaking the law which required black people
to give up their bus seats to whites, Rosa Parks was fined $14. /fame
57. I felt, too, that the novel as an art form was essentially a variation of a
poem.
58. Kylie ran sobbing out of the studios and did what was still the most natural thing in the world for a 20-year-old girl — she ran home to mummy.
59.
relationships His personal life became increasingly more difficult, with bouts of alcoholism, bankruptcy and the failure of his first marriage. /relationships 60. The foreigners in Ottawa constitute an ominous threat to the integrity
and autonomy of our province.
61.
work The bearer of the ill-tidings was her newly appointed PPS Sir
Peter Morrison. /work
62. Typically, he had made a game out of the difficult process of breaking the
ice, so that when among friends, notably Robin and his friend Christopher Bennett, they would share what Alan chose to call ’sagas’ or ’sagaettes’.
63. There was chaos in several of the committee rooms which were packed
with Tory MPs, enraged that the press had been the first to know.
64. She has told colleagues that financial stringency would become even more
necessary under a Gordon Brown premiership, after five years of rapid
growth in the health service budget reaches an end in March 2008.
65. Mason and Ford also reckoned without Hawthorne’s note-taking.
66.
key relationships Though he had married again in 1995 and
had gained regular employment on television and as an after-dinner speaker,
his alcoholism continued to plague his mind and body. /relationships /key
67. She’d be still protecting the people of the city she loved, defending the
nation she loved, keeping it from harm.
68.
key He was born in New York in 1915. His father owned a garment
factory but faced financial ruin after the Great Crash of 1929. /key 69.
key fame She was born Rosa Louise McCauley on 4 February,
1913, in Tuskegee, Alabama, family illness interrupted her high school
education, but she graduated from the all-African American Booker T
Washington High School in 1928, and attended Alabama State College in
Montgomery for a short time. /fame /key
70. Carefully she sniffed the fruit, dug her nails in the skin, peeled it in one
long length; then she took a whole day to eat it, sucking each segment
carefully, savouring the fragrant juice that spurted into her mouth.
71. James Father Jim Deshotels, 50, is a nurse and Jesuit priest who tended to
the Superdome’s injured and sick refugees for five days.
72. I am certain I did not convert any colleague who might have been watching the box; but Michael did remain the public’s favourite throughout
both elections.
73. Using a card keyboard, she spells out answers to questions I have for her.
74.
character He claimed that the experience made him turn over a
new leaf, but in 1990 millions watched his infamous drunken performance on the Wogan television chat show /character 75.
work relationships key They had seen one some months
earlier, a puppy of fourteen weeks with a beautiful smoky fur, belonging
to Raymond’s wife, Charlotte (the Raymond Greenes were then living in
Oxford where Raymond had a medical practice), and this led Greene to
buy one /key /relationships /work
76.
fame Bancroft won an Oscar for her role in The Miracle Worker /fame
77. These were amazing admissions: practically all western interviews with
Chinese leaders before and since have been bland and dull, but Fallaci
got Deng to speak extraordinarily frankly by Chinese standards.
78.
work relationships Peter was the Thatcherite brother of the
’wet’ Charlie Morrison, and son of the John Morrison who had been the
chairman of the ’22 when I was first elected in 1959. /relationships /work
79. Israel has been carving out a five mile deep security zone north of the
Lebanese border over the past fortnight, but Wednesday’s security cabinet decision authorised the armed forces to extend the zone as far as the
Litani River, 18 miles north of the border, and beyond.
80.
work ’Along with Buckingham Palace and the Tower of London, our
Teddington studios appear to be well and truly on the American tourist
route,’ laughs Dennis Kirkland, producer and director of the Benny Hill
Show. /work
81. This will put all NHS trusts on the same footing as foundation hospitals,
whose investments are rigorously supervised by Monitor, their regulator.
82. (He was also a stringent scholar.)
83.
fame Bancroft secured parts on Broadway, and in 1958, won her first
Tony opposite Henry Fonda in Two for the Seesaw. /fame 84.
key Born in Vancouver, British Columbia, in 1920, his early life, like
that of his contemporaries, was dominated by World War II. /key
85. I was almost sure that he would be, all the same.
86. He tried to convince me that the Cabinet were misreading the situation,
that I was being misled and that with a vigorous campaign it would still
be possible to turn things round.
87. It has been a very moving experience.
88.
fame relationships His split and eventual divorce from his
wife - with Mr Cook revealing an affair with his secretary to his wife
Margaret as they prepared to head off on holiday after a phone call from
Downing Street - caused a welter of embarrassing headlines. /relationships /fame 89.
work Comedian Bob Monkhouse, who has no mean memory for jokes
himself, recalls writing for the radio show Calling All Forces back in 1951
when Benny, then an up-and-coming comic, appeared with film star Diana Dors and funnyman Arthur Askey. /work
90. It appeared that Cranley Onslow, the admirable chairman of the ’22, had
read out the results in the wrong committee room.
91.
key His liver was said to be functioning at only 20%.
/key
92. Research has shown that some 345 children in Norfolk suffer from type 1
diabetes - more than double the 160 predicted cases for the county.
93. No.
94.
relationships Following their breakup in 1961, Miller married the
renowned photographer Inge Morath, whom he met on the set of the film
The Misfits, which he wrote and which starred Monroe. /relationships 95.
relationships Many were astonished when he married Marilyn
Monroe in 1956. /relationships
96. I had a longing to do so, and a burning curiosity to see Lady Constance
even more than the Roman remains.
97.
work key In 1930 he had begun writing the biography of John
Wilmot, the second Earl of Rochester, who was born on to April 1647
and died at the age of thirty-three. /key /work 98.
character His heyday occurred during the swinging sixties, and,
with his good looks, he brought a pop star image to the game for the first
time. /character 99.
character He had a reputation as a playboy in his youth, with allegations of womanising, drinking and gambling to excess. /character
100. You need earnestness and all you’ve got to succeed in this profession, I
can assure you.
B.2.5
Set Five
1.
fame On previous occasions, Irwin, known worldwide for his Discovery Channel programmes, was allegedly killed by a black mamba and
a komodo dragon. /fame 2.
work Mrs Thatcher’s supporters, and especially her ’Court’, have found
it hard to come to terms with her resignation. /work
3. She’s not bad actually.
4.
character A natural political loner, he had built up a base of support on the backbenches that he had, perhaps, never enjoyed as a cabinet
minister. /character
5. Their first visitor, Hugh Greene, must have had impressed upon him the
primitiveness and isolation of the young couple’s living conditions: “We
haven’t too much room.”
6.
relationships His split and eventual divorce from his wife - with
Mr Cook revealing an affair with his secretary to his wife Margaret as
they prepared to head off on holiday after a phone call from Downing
Street - caused a welter of embarrassing headlines. /relationships
7. HSBC said the changes to its overdraft rules were designed to bring greater
clarity about what an overdraft service is, how customers apply for an
overdraft and how fees are charged, though it conceded they were also
in part about helping to reduce its bad debts.
8.
work He appeared in several more plays, and also broke into radio.
He was in 300 editions of The Navy Lark as A B Johnson. /work
9. He raced through the alleys of the Jewish quarter down to the waterfront
where cantilevered Chinese fishing nets were spread out against the sky;
but the fish he sought did not leap out of the waves.
10.
work The Thatcherites listened to their lost leader and voted for “dear
John” in ignorance of the fact that his political hero was Iain Macleod.
/work 11.
work However, they only really discovered each other in 1960 after
Bergman had become one of the world’s leading directors, though Nykvist
had co-shot the expressionistic Sawdust and Tinsel seven years previously. /work
12. They argue that litigation could follow if schools became too involved in
other areas.
13.
fame Ms Dworkin sparked international debate by arguing that pornography was a violation of women’s rights and a precursor to rape. /fame
14. In all 15 bodies had been recovered, with others thought to be trapped in
the wreckage of carriages left dangling in midair.
15.
key Mr Cook was born Robert Finlayson Cook on 28 February 1946 at
Bellshill, Lanarkshire. /key
16. I feel the same as I ever did, he said at the time, which is that I don’t
believe that a man has to become an informer in order to practice his
profession freely in the United States.
17. I will say that for her.
18. William Waldegrave, my most recent Cabinet appointment, arrived next.
19. To me, the Dome is a place I’ve been to twice to help people.
20. Being adequately provided for, he was able to book himself into a downtown hotel which cost him three dollars per night, though he often failed
to make it back to the hotel, finding the cosmopolitan and nocturnal life
of the town there entirely to his liking: consecration dismantled!
21. It was not an evasion, nor a disguised threat, nor a way of abandoning
my cause without admitting the fact.
22.
character Some in the media liked to picture her as tough and hard
and difficult, but she was soft and with a lovely voice and a good sense
of humour /character
23. Pyle remained bemused by obsessive adulation and his unwitting appropriation into the Canterbury scene: his lyrics to Richard Sinclair’s What’s
Rattlin quote One question we all dread/ What’s doing Mike Ratledge (a
reference to fans asking him about the Softs keyboard player, with whom
he never collaborated).
24.
relationships work Francis Maude, Angus’s son and Minister
of State at the Foreign Office, whom I regarded as a reliable ally, told me
that he passionately supported the things I believed in, that he would
back me as long as I went on, but that he did not believe I could win.
/work /relationships
25. I understood. Is she very bad? No.
26. In them, Riaan responds to questioning, nods and shakes his head, drinks
through a straw, often laughs and says, ’Hello.’
27.
fame But it was with Cries and Whispers (1972), for which Nykvist
won an Academy Award, that the real breakthrough came. /fame 28.
fame character key John Young, who has died aged 85, will
have a prominent place in the Brewers’ Hall of Fame, revered as the father of the real ale revolution, an iconoclast who believed in good traditional beer drunk in good traditional pubs. /key /character /fame
29. Really quite a treat, in many ways.
30. Roland did want this.
31.
fame work He first gained attention for his work on Barabbas (1953)
and Karin Mansdotter (1954) with Alf Sjoberg, Sweden’s most important
post-war director before the advent of Bergman. /work /fame
32. Val’s papers were bland and minimal, in large confident handwriting,
well laid-out. Male Ventriloquism was judged to be good work and discounted by the examiners as probably largely by Roland, which was doubly unjust, since he had refused to look at it, and did not agree with its
central proposition, which was that Randolph Henry Ash neither liked
nor understood women, that his female speakers were constructs of his
own fear and aggression, that even the poem-cycle, Ask to Embla, was
the work, not of love but of narcissism, the poet addressing his Anima.
33.
key Officially, after a relatively short break at the time, he resumed
many of his duties using a wheelchair and stick. /key
34. God was generous to us and granted us this victory against our enemy.
35. We sat up for two hours asking him questions and he answered all of
them.
36.
work Barker himself, however, was among many viewers who regarded his portrayal of Fletcher in Porridge as the best work he ever did.
/work
37. I slipped out near Times Square.
38. Once asked to reveal his favourite joke, he trotted out the well-worn tale
of the Member of Parliament who was visiting a mental home and was
amazed to discover that a beautifully designed flower bed was the work
of an inmate.
39. Robin’s interests were more uniformly distributed.
40.
work Chris and I had worked together for many years from the time
when he was Director of the Conservative Research Department until I
brought him into the Cabinet in 1989. /work
41. But Lais.
42.
relationships After an unhappy three-year marriage to builder
Martin May, Anne Bancroft married the comedian-director Mel Brooks
in 1964. /relationships
43. Yesterday’s announcement comes weeks after two of Britain’s biggest
credit card providers revealed they were tightening up their borrowing
rules.
44.
relationships key In a statement to the Aspen Daily News,
Thompson’s son, Juan, said: On February 20, Dr Hunter S Thompson
took his life with a gunshot to the head at his fortified compound in
Woody Creek, Colorado. /key /relationships
45. That’s collective responsibility.
46. Yes, I do have fundamental convictions . . . but we do have very lively
discussions because that is the way I operate.
47. He sat with his hands on his chin in a sage velvet wing chair on one side
of the fireplace, while she sat on the other, reporting to him twice a week.
48. HSBC is the latest big lender to amend its borrowing terms amid continuing concern about soaring consumer debt.
49. And I asked him, These movies — they exist? Sure.
50.
fame In 1996, she received the Presidential Medal of Freedom before
being awarded the United States’ highest civilian honour, the Congressional Gold Medal, in 1999. /fame
51. If no more a masterwork than the spin-off novel, it did - like Roots - alert
many people to something of which they were ignorant, especially when
eventually shown in Germany, where the statute of limitation on Nazi
criminals was then lifted.
52.
fame In 2001, Ms Dworkin won the American book award for writing
Scapegoat: the Jews, Israel and Women’s Liberation. /fame 53.
fame Irwin was criticised for holding his infant son near a crocodile
pool while feeding chickens to a four-metre long crocodile. /fame 54.
character As he said later: Mrs Parks was a married woman, she
was morally clean, and she had fairly good academic training. . . /character
55. relationships key Thompson’s son, Juan, found his body. /key /relationships
56. key His chosen successor, his half-brother Abdullah, is the head of
the National Guard, the tribal army largely responsible for the kingdom’s
internal security. /key
57. ’I ’ave a car outside and we will all go to my ’otel fur a drink.’
58. Her gifts of war came down to her from some unknown ancestor; and
though her adversaries grabbed her hair and called her Jewess they never
vanquished her.
59. I have sat there dumbstruck with admiration at the switches he has pulled.
60. All the copies I’ve seen are quite thinly painted; that is, the surface imitates what they think Leonardo did.
61. He said the government needed an overarching commitment to social
justice - not a leadership soap opera.
62.
relationships However, after eventually marrying his mistress,
Gaynor Regan, in a secret ceremony, many of Mr Cook’s troubles seemed
behind him as Labour approached the 2001 general election. /relationships
63. But Toby sat up, pouted and said in an odd, little-girl voice. Why can’t
Toby have nice things like you do? He pulled her on to the bed beside
him and murmured, “Toby loves looking pretty, Toby loves dressing up
like this, but promise it’s a secret between us, between two girl friends?”
64.
key He was diabetic, for many years a heavy smoker and suffered a
stroke in 1995. /key
65. Then he violently shoved her down the small flight of stairs that led off
their bedroom to the bathroom.
66.
fame In 1975 John Young was made a CBE to mark his work in brewing and for charity: he was chairman of the National Hospital for Nervous Diseases in Bloomsbury and raised millions of pounds to build new
wards and install modern equipment. /fame 67.
key She was born Anna Maria Louise Italiano in New York’s Bronx in
1931, and began acting as Anne Marno. But it was felt this name sounded
too ethnic, so she opted instead for Bancroft. /key
68. I was not surprised.
69.
character His ex-wife wrote a book in which she said of her former husband; his self-regard was easily punctured and his reaction was
protracted and troublesome. /character 70.
key Nicknamed Robin at school, he attended Aberdeen Grammar School
before studying English Literature at Edinburgh University. /key 71.
fame Later in the year he was attacked for allegedly filming too close
to penguins, seals and humpback whales in the Antarctic /fame
72. So they do that in the morning, then head here.
73. He walked unaided and didn’t need a wheelchair, the source said.
74.
key work Former Foreign Secretary Robin Cook, 59, has died after
collapsing while out hill walking in Scotland. /work /key
75. Haunting melodies are counterbalanced elsewhere by hard funk rhythms,
while lyrically the themes are often stark or angry, as Pyle alludes to the
political and cultural attitudes that led him to emigrate to France in the
early 1980s.
76.
relationships work Widowed in 1977, she founded the Rosa
and Raymond Parks Institute for Self Development a decade later, to develop leadership among Detroit’s young people. /work /relationships
77. But I had written off my next visitor, Malcolm Rifkind, in advance.
78. By the time the sale finishes on September 21, it is expected to have
smashed the UK’s eBay transaction record, the 103,000 raised by Margaret Thatcher’s handbag.
79.
character key On 1 December 1955, the 42-year-old seamstress,
and member of the Montgomery chapter of the National Association for
the Advancement of Colored People (NAACP), was sitting on a bus when
a white man demanded to take her seat. /key /character 80.
relationships key He was born in New York in 1915; his father owned a garment factory but faced financial ruin after the Great
Crash of 1929. /key /relationships
81. She didn’t mention it the next day, but that evening Toby, having had
rather a lot of brandy after the quiche aux pinards, said sarcastically, I
don’t think spinach tart is one of your stronger points, darling, and proceeded upstairs.
82. ’No jokes,’ warned the massive figure of Sir Peter Tapsell, another of
Michael’s inner circle.
83.
work However, he joined Aylesbury Repertory Company in 1948, while
still in his teens, before taking to the West End stage at the invitation of
Sir Peter Hall, where he appeared in Mourning Becomes Her in 1955.
/work 84.
character His stance on the Iraq war - and his resignation speech
- only enhanced his reputation as a man of principle and a great Parliamentarian. /character
85. At the bottom of Mud Lane was a pump, said to be haunted by the ghost
of a dancing bear.
86.
key The Labour MP for Livingston was considered one of the Commons’ most intelligent MPs and one of its most skilled debaters. /key
87. Marks and Spencer was rated the most ethical retailer, scoring 3.27 on a
scale from one to five.
88. No one yet knows exactly how a sleeping pill could wake up the seemingly dead brain cells, but Nel and Clauss have a hypothesis.
89. Professor Pacey referred to Layton as a poet of revolutionary individualism, and there can be no doubt that that individualism was a common
tie, and not merely religiously but in every way.
90.
relationships work Ronnie Barker first worked with Ronnie
Corbett in The Frost Report and Frost on Sunday, programmes for which
he also wrote scripts. /work /relationships 91.
key My digs in London were now 13 Baldwin Crescent, Camberwell,
in a less fashionable part than in my old Kensington haunts. /key 92.
character An austere and respected figure, Crown Prince Abdullah
is untainted by corruption, while being regarded by many as less enthusiastically pro-American than King Fahd. /character
93. It was like that shot in the arm they’d given her in the hospital, it made
her feel that she could do anything — and made her want to do something .
94.
character relationships After marrying Raymond Parks in
1932, she became involved in the NAACP, where she gained a reputation
as a militant and a feminist and was the driving force in campaigns to encourage black voter registration. /relationships /character 95.
relationships He was the fourth of his siblings to be king. Two of
his brothers lost power violently - one was deposed in a coup; the other
was assassinated. /relationships 96.
character Her agent of 30 years, Elaine Markson, said: some in the
media liked to picture her as tough and hard and difficult, but she was
soft and with a lovely voice and a good sense of humour. /character
97. From the aspect of method, I could see that to create a character who
suffered from verbal illusions on the printed page would be clumsy.
98. As we walked Brigitte introduced us to her two friends.
B.3 Agreement Data
Note that agreement data presented in the following tables does not map directly to the questions presented in Section B.2.
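To make the layout of these tables concrete, the sketch below shows one way a chance-corrected agreement statistic such as Fleiss' kappa could be computed from rows of this form, in which each sentence receives a bio / non-bio judgement from five raters. It is an illustrative Python sketch only: the function name, the example ratings and the choice of Fleiss' kappa are assumptions for exposition, not the thesis's own procedure or data.

from collections import Counter

def fleiss_kappa(rows, categories=("bio", "non-bio")):
    # rows: one list of labels per sentence, e.g. ["bio", "bio", "non-bio", "bio", "bio"]
    n_raters = len(rows[0])
    n_items = len(rows)
    category_totals = Counter()   # how often each label was used over all sentences
    per_item_agreement = []       # proportion of agreeing rater pairs for each sentence
    for labels in rows:
        counts = Counter(labels)
        category_totals.update(counts)
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))
    p_observed = sum(per_item_agreement) / n_items
    p_expected = sum((category_totals[c] / (n_items * n_raters)) ** 2 for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

# Purely illustrative ratings (not taken from the tables below).
rows = [
    ["non-bio"] * 5,
    ["non-bio"] * 5,
    ["bio"] * 5,
    ["bio", "bio", "bio", "bio", "non-bio"],
]
print(round(fleiss_kappa(rows), 3))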
B.3.1
Set 1 Agreement Data
Table B.1: Agreement Data for Set 1
Sentence
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
Rater 1
non-bio
non-bio
non-bio
bio
bio
bio
non-bio
non-bio
bio
non-bio
non-bio
bio
non-bio
non-bio
bio
non-bio
non-bio
bio
bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
non-bio
bio
bio
bio
bio
non-bio
bio
bio
non-bio
non-bio
non-bio
non-bio
bio
Rater 2
bio
non-bio
non-bio
bio
bio
bio
non-bio
non-bio
bio
non-bio
non-bio
bio
non-bio
non-bio
bio
bio
bio
bio
bio
non-bio
bio
non-bio
non-bio
non-bio
bio
non-bio
non-bio
non-bio
bio
non-bio
bio
bio
bio
bio
non-bio
bio
bio
non-bio
non-bio
non-bio
non-bio
bio
Rater 3 Rater 4 Rater 5
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio
bio non-bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio non-bio non-bio
non-bio non-bio non-bio
bio
bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio
bio non-bio
bio
bio
bio
Sentence
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
Rater 1
non-bio
non-bio
bio
non-bio
bio
bio
bio
bio
bio
bio
bio
non-bio
bio
bio
non-bio
bio
non-bio
non-bio
bio
non-bio
non-bio
bio
bio
non-bio
non-bio
bio
bio
non-bio
bio
non-bio
bio
non-bio
bio
non-bio
non-bio
bio
non-bio
bio
bio
bio
bio
bio
bio
bio
Rater 2
non-bio
non-bio
bio
non-bio
bio
non-bio
bio
bio
bio
bio
bio
non-bio
non-bio
bio
non-bio
bio
bio
non-bio
bio
non-bio
non-bio
non-bio
bio
non-bio
non-bio
non-bio
bio
non-bio
bio
non-bio
bio
non-bio
bio
non-bio
non-bio
bio
non-bio
bio
bio
non-bio
bio
bio
non-bio
bio
Rater 3 Rater 4 Rater 5
non-bio non-bio non-bio
non-bio non-bio non-bio
bio non-bio
bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio non-bio
bio non-bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
non-bio
bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
bio non-bio
bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
continued on next page
Table B.1: continued from previous page
Sentence
87
88
89
90
91
92
93
94
95
96
97
98
99
100
Rater 1
bio
non-bio
bio
bio
bio
non-bio
bio
bio
non-bio
bio
non-bio
non-bio
bio
non-bio
Rater 2
bio
non-bio
bio
bio
non-bio
non-bio
bio
bio
non-bio
bio
non-bio
non-bio
bio
non-bio
Rater 3
bio
bio
bio
bio
bio
bio
bio
bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
Rater 4
bio
non-bio
bio
non-bio
bio
non-bio
bio
bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
Rater 5
bio
non-bio
bio
non-bio
non-bio
non-bio
bio
bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
B.3.2 Set 2 Agreement Data
Table B.2: Agreement Data for Set 2
Sentence
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
Rater 1
bio
non-bio
non-bio
bio
bio
bio
non-bio
non-bio
bio
bio
bio
bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
bio
bio
non-bio
bio
bio
Rater 2
bio
non-bio
non-bio
bio
bio
bio
non-bio
non-bio
bio
bio
bio
bio
non-bio
bio
bio
non-bio
bio
non-bio
bio
bio
non-bio
bio
bio
Rater 3 Rater 4 Rater 5
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio
bio
bio
non-bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio
bio
non-bio
bio
bio
bio
bio
bio
bio
bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio
bio non-bio
bio
bio
bio
bio
bio
bio
bio non-bio non-bio
bio
bio
bio
bio
bio
bio
continued on next page
Table B.2: continued from previous page
Sentence
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
Rater 1
non-bio
non-bio
non-bio
bio
non-bio
non-bio
bio
bio
bio
bio
bio
non-bio
non-bio
bio
non-bio
non-bio
bio
bio
non-bio
bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
bio
non-bio
non-bio
bio
bio
non-bio
Rater 2
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
bio
bio
bio
non-bio
non-bio
bio
non-bio
non-bio
bio
bio
non-bio
bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
bio
non-bio
non-bio
bio
bio
non-bio
Rater 3 Rater 4 Rater 5
non-bio non-bio non-bio
bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
bio non-bio non-bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
non-bio non-bio
bio
continued on next page
Table B.2: continued from previous page
Sentence
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
Rater 1
bio
non-bio
bio
bio
non-bio
bio
non-bio
non-bio
bio
bio
bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
bio
bio
bio
bio
non-bio
bio
bio
bio
non-bio
bio
non-bio
non-bio
non-bio
Rater 2
bio
non-bio
bio
non-bio
non-bio
bio
non-bio
non-bio
bio
non-bio
bio
bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
bio
bio
bio
bio
non-bio
bio
non-bio
bio
non-bio
bio
non-bio
non-bio
non-bio
Rater 3
bio
non-bio
non-bio
bio
non-bio
bio
non-bio
non-bio
bio
bio
bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
bio
bio
bio
bio
bio
bio
bio
non-bio
bio
non-bio
non-bio
non-bio
Rater 4
bio
non-bio
bio
bio
non-bio
bio
non-bio
non-bio
bio
bio
bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
non-bio
bio
non-bio
non-bio
bio
Rater 5
bio
non-bio
bio
bio
non-bio
bio
non-bio
non-bio
bio
bio
bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
non-bio
bio
non-bio
non-bio
non-bio
B.3.3 Set 3 Agreement Data
Table B.3: Agreement Data for Set 3
Sentence
1
2
3
4
Rater 1
bio
bio
bio
bio
Rater 2
bio
bio
bio
bio
Rater 3 Rater 4 Rater 5
bio
bio
bio
bio
bio
bio
bio
bio
bio
non-bio non-bio
bio
continued on next page
Table B.3: continued from previous page
Sentence
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
Rater 1
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
non-bio
bio
bio
bio
bio
bio
non-bio
bio
non-bio
bio
bio
bio
bio
bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
bio
bio
bio
bio
bio
bio
non-bio
Rater 2
non-bio
non-bio
non-bio
non-bio
non-bio
bio
bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
bio
bio
bio
non-bio
bio
bio
bio
bio
bio
bio
non-bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
non-bio
non-bio
bio
non-bio
bio
bio
bio
bio
bio
bio
non-bio
Rater 3 Rater 4 Rater 5
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio
bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio non-bio
bio
bio non-bio
non-bio
bio
bio
bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
non-bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
continued on next page
Table B.3: continued from previous page
Sentence
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
Rater 1
non-bio
non-bio
bio
non-bio
bio
non-bio
bio
non-bio
bio
non-bio
bio
non-bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
bio
non-bio
bio
bio
bio
bio
non-bio
bio
non-bio
bio
non-bio
non-bio
bio
bio
non-bio
bio
non-bio
non-bio
non-bio
Rater 2
non-bio
bio
bio
non-bio
bio
non-bio
bio
bio
bio
non-bio
bio
non-bio
non-bio
bio
bio
non-bio
non-bio
bio
non-bio
non-bio
bio
non-bio
non-bio
bio
bio
non-bio
non-bio
bio
bio
bio
bio
non-bio
bio
non-bio
bio
non-bio
bio
bio
bio
non-bio
bio
non-bio
non-bio
non-bio
Rater 3 Rater 4 Rater 5
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio non-bio
bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio
bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio
bio
bio
bio
bio non-bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
continued on next page
Table B.3: continued from previous page
Sentence
93
94
95
96
97
98
99
100
101
Rater 1
non-bio
bio
bio
bio
non-bio
non-bio
non-bio
non-bio
bio
Rater 2
non-bio
bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
bio
Rater 3
non-bio
non-bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
bio
Rater 4
non-bio
bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
bio
Rater 5
non-bio
bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
bio
B.3.4 Set 4 Agreement Data
Table B.4: Agreement Data for Set 4
Sentence
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
Rater 1
bio
non-bio
bio
bio
non-bio
non-bio
bio
bio
bio
non-bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
bio
non-bio
bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
Rater 2
bio
non-bio
bio
bio
non-bio
non-bio
bio
bio
bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
bio
bio
bio
bio
bio
bio
bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
Rater 3 Rater 4 Rater 5
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
non-bio
bio
bio
bio non-bio non-bio
non-bio non-bio non-bio
non-bio
bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio non-bio
bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio non-bio non-bio
continued on next page
Table B.4: continued from previous page
Sentence
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
Rater 1
non-bio
bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
bio
non-bio
non-bio
non-bio
bio
bio
bio
non-bio
bio
non-bio
bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
bio
non-bio
bio
bio
non-bio
non-bio
bio
non-bio
non-bio
non-bio
bio
non-bio
bio
bio
non-bio
bio
non-bio
Rater 2
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
bio
non-bio
non-bio
non-bio
bio
bio
bio
non-bio
bio
non-bio
bio
non-bio
non-bio
bio
non-bio
non-bio
non-bio
bio
non-bio
bio
non-bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
bio
bio
non-bio
bio
non-bio
Rater 3 Rater 4 Rater 5
bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio
bio non-bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
bio
bio non-bio
non-bio non-bio non-bio
bio
bio
bio
bio non-bio non-bio
bio non-bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio non-bio
bio
non-bio non-bio non-bio
non-bio
bio non-bio
non-bio non-bio non-bio
bio
bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio non-bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio
bio non-bio
non-bio
bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
non-bio non-bio
bio
bio
bio
bio
non-bio non-bio non-bio
continued on next page
Table B.4: continued from previous page
Sentence
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
Rater 1
non-bio
bio
bio
bio
non-bio
bio
non-bio
non-bio
non-bio
bio
bio
bio
non-bio
non-bio
non-bio
bio
bio
bio
bio
non-bio
non-bio
bio
bio
non-bio
bio
non-bio
bio
non-bio
Rater 2
non-bio
bio
non-bio
bio
non-bio
bio
non-bio
bio
bio
bio
bio
bio
non-bio
non-bio
non-bio
non-bio
bio
bio
bio
non-bio
non-bio
bio
bio
bio
bio
bio
bio
non-bio
Rater 3
non-bio
bio
non-bio
bio
non-bio
bio
non-bio
bio
non-bio
bio
bio
bio
non-bio
non-bio
non-bio
bio
bio
bio
bio
non-bio
non-bio
bio
bio
non-bio
bio
bio
bio
non-bio
Rater 4
non-bio
non-bio
bio
bio
non-bio
bio
non-bio
bio
non-bio
bio
bio
bio
non-bio
non-bio
non-bio
bio
bio
bio
bio
non-bio
non-bio
bio
bio
non-bio
bio
bio
bio
non-bio
Rater 5
non-bio
bio
non-bio
bio
non-bio
bio
non-bio
bio
non-bio
non-bio
bio
bio
non-bio
non-bio
non-bio
bio
bio
non-bio
bio
non-bio
non-bio
bio
bio
non-bio
bio
non-bio
bio
non-bio
B.3.5 Set 5 Agreement Data
Table B.5: Agreement Data for Set 5
Sentence
1
2
3
4
5
6
7
8
9
Rater 1
bio
non-bio
non-bio
bio
non-bio
non-bio
non-bio
bio
non-bio
Rater 2
bio
bio
bio
bio
bio
bio
non-bio
bio
non-bio
Rater 3 Rater 4 Rater 5
non-bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
continued on next page
Table B.5: continued from previous page
Sentence
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
Rater 1
bio
bio
non-bio
non-bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
bio
bio
non-bio
non-bio
bio
bio
non-bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
non-bio
non-bio
bio
non-bio
bio
non-bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
bio
bio
Rater 2
bio
bio
non-bio
bio
non-bio
bio
bio
non-bio
bio
non-bio
bio
non-bio
bio
bio
bio
non-bio
non-bio
bio
bio
non-bio
non-bio
bio
bio
bio
non-bio
non-bio
bio
non-bio
non-bio
non-bio
bio
non-bio
bio
non-bio
bio
non-bio
bio
non-bio
non-bio
non-bio
bio
non-bio
bio
bio
Rater 3 Rater 4 Rater 5
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio
bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio non-bio non-bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio non-bio non-bio
non-bio non-bio non-bio
bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio non-bio non-bio
bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
continued on next page
Table B.5: continued from previous page
Sentence
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
Rater 1
bio
bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
non-bio
bio
bio
bio
bio
non-bio
bio
bio
bio
bio
non-bio
bio
non-bio
bio
non-bio
non-bio
bio
bio
non-bio
non-bio
bio
bio
non-bio
non-bio
non-bio
non-bio
non-bio
bio
bio
bio
bio
bio
bio
bio
non-bio
Rater 2
bio
bio
bio
non-bio
non-bio
non-bio
non-bio
bio
bio
non-bio
bio
non-bio
bio
bio
non-bio
bio
bio
bio
non-bio
bio
bio
bio
bio
non-bio
non-bio
bio
bio
bio
bio
bio
bio
non-bio
bio
non-bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
Rater 3 Rater 4 Rater 5
bio
bio
bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio
bio non-bio
non-bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
non-bio
bio non-bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio
bio non-bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio non-bio
bio non-bio
bio
bio non-bio
bio
non-bio non-bio non-bio
bio
bio
bio
non-bio non-bio non-bio
non-bio non-bio non-bio
bio
bio non-bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
bio
bio
bio
bio
bio
bio
bio
bio
bio
non-bio non-bio non-bio
continued on next page
Table B.5: continued from previous page
Sentence
98
Rater 1
non-bio
Rater 2
non-bio
Rater 3
non-bio
Rater 4
non-bio
Rater 5
non-bio
APPENDIX C
Identifying Syntactic Features
This appendix presents data derived from the statistical data presented in BIBER (1988). See Section 8.3 on page 148 for a description of the methodology used. The first section presents features ranked by raw distance from the mean, and the second section sets forth features ranked by standard deviations from the mean. Descriptions of the individual features used by BIBER (1988) are given in Appendix D.
C.1 Distance From the Mean
This section reproduces in full the list of features most prevalent in, and characteristic of, the biographical genre according to the methodology described in Section 8.3 on page 148 of this thesis (based on data derived from BIBER (1988)). A
table listing results for all sixty-seven features (ranked by maximum distance
from the mean) is reproduced in Table C.1. Table C.2 shows all features ranked
numerically with respect to the mean for each feature.
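Using symbols introduced here purely for convenience (the notation is not taken from the thesis), the raw distance reported in Table C.1 for a feature $f$ is simply the frequency of $f$ in the biographical genre minus the mean frequency of $f$ across all genres:

$$ d_f = x_{f,\mathrm{bio}} - \bar{x}_f $$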
Table C.1: Sixty-seven Features Ranked by Distance from the
Mean (Irrespective of Whether the Distance is Positive or Negative)
Rank
1
2
3
4
5
6
7
8
9
10
Distance
41.922
29.700
24.690
16.340
13.027
12.522
10.136
8.072
7.163
4.486
Feature Name
present tense
adverbs
past tense
prepositions
nouns
contractions
second person pronouns
first person pronouns
attributive adjectives
private verbs
continued on next page
Table C.1: continued from previous page
Rank
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
Distance
4.263
3.695
3.354
2.754
2.740
2.586
2.318
2.259
2.213
2.036
1.918
1.895
1.722
1.609
1.427
1.409
1.386
1.354
1.318
1.290
1.250
1.136
1.131
0.968
0.827
0.795
0.686
0.572
0.554
0.490
0.477
0.468
0.427
0.350
0.327
0.309
0.309
0.290
0.240
0.231
0.227
0.222
0.195
0.181
Feature Name
BE as main verb
type/token ratio
demonstrative pronouns
pronoun IT
predictive modals
nominalisations
analytic negation
emphatics
non phrasal coordination
that deletion
possibility modals
predictive adjectives
DO as pro-verb
adv. sub. - condition
agentless passives
perfect aspect verbs
stranded prepositions
phrasal coordination
split auxiliaries
infinitives
place adverbials
discourse particles
THAT verb complements
demonstratives
adv. subordinator - cause
synthetic negation
necessity modals
public verbs
hedges
indefinite pronouns
time adverbials
WH relatives: obj. position
suasive verbs
WH relatives: subj. position
past prt. WHIZ deletions
WH relatives: pied pipes
third person pronouns
BY passives
existential THERE
WH questions
adv. sub. - concession
THAT relatives: subj. position
downturners
adv. sub. - other
continued on next page
Table C.1: continued from previous page
Rank
55
56
57
58
59
60
61
62
63
64
65
66
67
Distance
0.168
0.150
0.104
0.100
0.086
0.054
0.054
0.036
0.036
0.018
0.009
0.004
0
Feature Name
THAT relatives: obj. position
amplifiers
sentence relatives
past participle clauses
wordlength
present prt. WHIZ deletions
THAT adj complements
conjuncts
gerunds
WH clauses
split infinitives
SEEM/APPEAR
present participle clauses
Table C.2: Sixty-seven Features Ranked by Distance from the
Mean
Rank
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
Distance
24.69
16.34
13.02
7.16
3.69
2.58
1.42
1.40
1.35
1.31
1.29
0.96
0.79
0.46
0.42
0.35
0.30
0.30
0.29
0.18
0.15
0.08
0.05
0.05
0.03
0.00
Feature Name
past tense
prepositions
nouns
attributive adjectives
type/token ratio
nominalizations
agentless passives
perfect aspect verbs
phrasal coordination
split auxiliaries
infinitives
demonstratives
synthetic negation
WH relatives: obj. position
suasive verbs
WH relatives: subj. position
WH relatives: pied pipes
third person pronouns
BY passives
adv. sub. - other
amplifiers
wordlength
present prt. WHIZ deletions
THAT adj complements
conjuncts
SEEM/APPEAR
continued on next page
Table C.2: continued from previous page
Rank
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
Distance
0
-0.00
-0.01
-0.03
-0.10
-0.10
-0.16
-0.19
-0.22
-0.22
-0.23
-0.24
-0.32
-0.47
-0.49
-0.55
-0.57
-0.68
-0.82
-1.13
-1.13
-1.25
-1.38
-1.60
-1.72
-1.89
-1.91
-2.03
-2.21
-2.25
-2.31
-2.74
-2.75
-3.35
-4.26
-4.48
-8.07
-10.13
-12.52
-29.70
-41.92
Feature Name
present participle clauses
split infinitives
WH clauses
gerunds
past participle clauses
sentence relatives
THAT relatives: obj. position
downturners
THAT relatives: subj. position
adv. sub. - concession
WH questions
existential THERE
past prt. WHIZ deletions
time adverbials
indefinite pronouns
hedges
public verbs
necessity modals
adv. subordinator - cause
THAT verb complements
discourse particles
place adverbials
stranded prepositions
adv. sub. - condition
DO as pro-verb
predictive adjectives
possibility modals
that deletion
non phrasal coordination
emphatics
analytic negation
predictive modals
pronoun IT
demonstrative pronouns
BE as main verb
private verbs
first person pronouns
second person pronouns
contractions
adverbs
present tense
C.2 Standard Deviations from the Mean
An alternative approach to the problem of calculating features characteristic of the biographical genre provided by BIBER (1988) (see Section 8.3 on page 148) involves measuring the distance of the frequency of each feature for the biographical genre from the mean frequency of that feature for all genres, in terms of the number of standard deviations. This figure — the z-score (OAKES, 1998) — can be calculated by first subtracting the mean frequency for each feature across all genres from the frequency of that feature for the biographical genre, and then dividing the result by the standard deviation for all genres. (The calculation was performed using a Perl script.) A table listing results for all sixty-seven features (ranked by maximum distance from the mean in terms of standard deviations) is reproduced in Table C.3. Table C.4 shows all features ranked numerically with respect to the mean for each feature. Note that this alternative method produces results very similar to the method described in Chapter 6.
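With the same ad hoc notation as above (again introduced here only for convenience, not taken from the thesis), the z-score for a feature $f$ is:

$$ z_f = \frac{x_{f,\mathrm{bio}} - \bar{x}_f}{\sigma_f} $$

where $\bar{x}_f$ and $\sigma_f$ are the mean and standard deviation of the frequency of $f$ across all genres.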
Table C.3: Sixty-seven Features Ranked by Number of Standard
Deviations from the Mean.
Rank
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
Distance
1.60
1.46
1.27
1.19
1.13
1.09
1.05
1.01
1.01
0.93
0.89
0.89
0.86
0.82
0.79
0.79
0.78
0.69
0.69
0.68
0.68
0.63
0.62
0.62
0.61
0.61
Feature Name
adv. sub. condition
present tense
predictive adjectives
split auxiliaries
predictive modals
possibility modals
type/token ratio
second person pronouns
synthetic negation
demonstrative pronouns
necessity modals
past tense
emphatics
contractions
stranded prepositions
prepositions
DO as proverb
WH questions
phrasal coordination
past participle clauses
adv. subordinator cause
place adverbials
discourse particles
BE as main verb
pronoun IT
hedges
continued on next page
Table C.3: continued from previous page
Rank
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
Distance
0.59
0.58
0.58
0.57
0.57
0.55
0.53
0.52
0.50
0.50
0.50
0.48
0.48
0.45
0.45
0.44
0.43
0.42
0.40
0.39
0.33
0.31
0.31
0.30
0.29
0.29
0.26
0.22
0.21
0.20
0.20
0.19
0.15
0.08
0.05
0.04
0.04
0.01
0.01
0.01
0.00
Feature Name
non phrasal coordination
adv. sub. concession
WH relatives: pied pipes
that deletion
private verbs
analytic negation
sentence relatives
perfect aspect verbs
adv. sub. other
attributive adjectives
WH relatives: obj. position
downtoners
nouns
indefinite pronouns
THAT verb complements
demonstratives
suasive verbs
BY passives
infinitives
first person pronouns
THAT adj complements
WH relatives: subj. position
THAT relatives: obj. position
agentless passives
wordlength
existential THERE
THAT relatives: subj. position
nominalizations
adverbs
split infinitives
public verbs
time adverbials
past prt. WHIZ deletions
amplifiers
present prt. WHIZ deletions
WH clauses
conjuncts
third person pronouns
gerunds
SEEM/APPEAR
present participle clause
Table C.4: Sixty-seven features Ranked by Number of Standard
Deviations from the Mean.
Rank
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
Distance
1.19
1.05
1.01
0.89
0.79
0.69
0.58
0.52
0.50
0.50
0.50
0.48
0.44
0.43
0.42
0.40
0.33
0.31
0.30
0.29
0.22
0.08
0.05
0.04
0.01
0.01
0.00
-0.01
-0.04
-0.15
-0.19
-0.20
-0.20
-0.21
-0.26
-0.29
-0.31
-0.39
-0.45
-0.45
-0.48
-0.53
-0.55
Feature Name
split auxiliaries
type/token ratio
synthetic negation
past tense
prepositions
phrasal coordination
WH relatives: pied pipes
perfect aspect verbs
adv. sub. - other
attributive adjectives
WH relatives: obj. position
nouns
demonstratives
suasive verbs
BY passives
infinitives
THAT adj complements
WH relatives: subj. position
agentless passives
wordlength
nominalizations
amplifiers
present prt. WHIZ deletions
conjuncts
third person pronouns
SEEM/APPEAR
present participle clauses
gerunds
WH clauses
past prt. WHIZ deletions
time adverbials
public verbs
split infinitives
adverbs
THAT relatives: subj. position
existential THERE
THAT relatives: obj. position
first person pronouns
THAT verb complements
indefinite pronouns
downtoners
sentence relatives
analytic negation
continued on next page
Table C.4: continued from previous page
Rank
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
Distance
-0.57
-0.57
-0.58
-0.59
-0.61
-0.61
-0.62
-0.62
-0.63
-0.68
-0.68
-0.69
-0.78
-0.79
-0.82
-0.86
-0.89
-0.93
-1.01
-1.09
-1.13
-1.27
-1.46
-1.60
Feature Name
private verbs
that deletion
adv. sub. - concession
non phrasal coordination
hedges
pronoun IT
BE as main verb
discourse particles
place adverbials
adv. subordinator - cause
past participle clauses
WH questions
DO as pro-verb
stranded prepositions
contractions
emphatics
necessity modals
demonstrative pronouns
second person pronouns
possibility modals
predictive modals
predictive adjectives
present tense
adv. sub. - condition
APPENDIX D
Syntactic Features
This appendix provides a brief description of each of the sixty-seven features
identified by BIBER (1988). The descriptions (and many of the examples) for each of the features given below are based on Appendix II of BIBER (1988) and
the reader is referred to the source text for a fuller description.
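Many of the features below are detected with simple surface patterns: word lists (the "gazetteer approach" mentioned repeatedly), suffix matching, or part-of-speech patterns. Purely as a hypothetical illustration of this style of matching (it is not the code actually used in the thesis), a suffix pattern of the sort described under "Nominalisations" below might look like this in Perl:

```perl
#!/usr/bin/perl
# Hypothetical illustration only (not the thesis code): pick out tokens ending in
# -tion, -ness, -ment or -ity, the suffixes listed for the nominalisation feature.
use strict;
use warnings;

my @tokens = qw(the government announced its decision with happiness and ability);
my @nominalisations = grep { /(?:tion|ness|ment|ity)$/i } @tokens;
print scalar(@nominalisations), " nominalisation(s): @nominalisations\n";
```

The gazetteer-based features would be handled analogously, with membership in a fixed word list replacing the suffix test.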
Past Tense Any word identified as a past tense form in an electronic dictionary, or any word longer than six letters which ends in -ed.
Perfect Aspect “Have” indicates this feature.
Present Tense Any base form of a verb in an electronic dictionary.
Place Adverbials Gazetteer approach used (e.g. aboard, above, abroad, across,
ahead, etc.). Place adverbials with other major functions (e.g. in, on) are
excluded.
Time Adverbials Gazetteer approach used (e.g. afterwards, again, earlier, etc.).
First Person Pronouns Gazetteer approach used (I, me, we, us, my, our, myself,
ourselves).
Second Person Pronouns Gazetteer approach used (you, your, yourself, yourselves).
Third Person Personal Pronouns Gazetteer approach used (she, he, they, her, them, his, their, himself, herself, themselves).
Pronoun IT Gazetteer approach used.
Demonstrative Pronoun Gazetteer approach used, with the context also taken
into account (e.g. “this is silly”). Trigger words are: that, this, these and
those.
Indefinite Pronoun Gazetteer approach used (anybody, anyone, anything, everybody, everyone, everything, nobody, none, nothing, nowhere, somebody, someone, something).
Pro-verb DO Use of DO when it is not an auxiliary or part of a question.
Direct WH-question Clause or sentence beginning with a WH word (what, where, when, how, whether, why, whoever, whomever, whichever, wherever, whenever, whatever, however) followed by an auxiliary (e.g. "Who is").
Nominalisations All words ending in -tion, -ness, -ment, -ity.
Gerunds Verbal nouns (i.e. verbal forms serving nominal functions). These
were identified manually.
Total Other Nouns All nouns in the electronic dictionary.
Agentless Passives Clause in passive voice (e.g. "the cup was broken"). Part-of-speech patterns were used to identify this construction.
By-passives Clause in passive voice with agent (e.g. “the cup was broken by
Bob”). Part-of-speech patterns were used to identify this construction.
Be as Main Verb Gazetteer approach (e.g. am, is, are, etc.) augmented with
part-of-speech patterns.
Existential There Gazetteer approach used (e.g. “there are several possibilities”).
That Verb Complements For example, “I said that he went”.
That Adjective Complements For example, “I’m glad that you like it”.
WH-clauses For example, “I believed what he told me”.
Infinitives Identified using pattern matching (“to”) and part-of-speech patterns.
Present Participle Clauses For example, “Stuffing his mouth with cookies, Joe
ran out the door”. These forms were identified manually.
Past Participle Clauses For example, “Built in a single week, the house would
stand for fifty years”. These forms were identified manually.
Past Participle WHIZ Deletion Relatives WHIZ deletions are defined by BIBER (1988) as "[p]articipial clauses functioning as reduced relatives" (e.g. "The
solution produced by this process”). These forms were identified manually.
Present Participle WHIZ Deletion Relatives For example, “the event causing
this decline is . . . ”. These forms were identified manually.
That Relative Clauses on Subject Position For example, “the dog that bit me”.
Identified manually.
That Relative Clause on Object Position For example, “the dog that I saw”.
Identified manually.
WH Relative Clause on Subject Position For example, “the man who likes popcorn”.
WH Relative Clause on Object Position For example, “the man who Sally likes”.
Pied-Piper Relative Clauses Preposition followed by a WH-pronoun (e.g. who,
whom, which).
Sentence Relatives For example, “Bob likes fried mangos, which is the most
disgusting thing I’ve ever heard of”. Indicated by the occurrence of “which”
at the beginning of a clause.
Causative Adverbial Subordinators: because Clauses beginning with because.
Concessive Adverbial Subordinators: although, though Clauses beginning with
although or though.
Conditional Adverbial Subordinators: if, unless Clauses beginning with if or unless.
Other Adverbial Subordinators For example, since, while, insofar as, etc.
Total Prepositional Phrases Gazetteer approach used (e.g. against, amid, amidst,
etc.).
Attributive Adjectives For example, "the big horse". Identified using part-of-speech patterns.
Predicative Adjectives For example, "the horse is big". Identified using part-of-speech patterns.
Total Adverbs Any adverb that occurs in the electronic dictionary, or is longer
than five letters and ends in -ly.
Type/Token Ratio Number of different lexical items in the text expressed as a
percentage.
Word Length Mean length of words in a text.
Conjuncts For example, alternatively, altogether, conversely, furthermore, etc.
Downtoners Downtoners diminish the force of a verb. They are identified
using a gazetteer (e.g. almost, hardly, slightly).
Hedges Informal expressions of probability. They are identified using a gazetteer
(e.g. at about, maybe, something like, etc.).
Amplifiers Amplifiers enhance the force of a verb. They are identified using a
gazetteer (e.g. very, absolutely, enormously, etc.).
Discourse Particles For example, well, now, anyhow, anyway.
Demonstratives That, this, these, those.
Possibility Modals Can, may, might, would.
Necessity Modals Ought, should, must.
Predictive Modals Will, would, shall.
Public Verbs Verbs that refer to external actions, e.g. proclaim, protest, reply, etc.
Private Verbs Verbs that refer to internal actions, e.g. decide, conclude, understand, etc.
Suasive Verbs Verbs that persuade, e.g. ask, beg, propose, etc.
Seem/Appear Use of the verbs seem and appear indicates hedging in more formal or academic contexts.
Contractions All contractions (with possessives excluded).
Subordinator-that Deletion For example, “I think that he went to . . . ”.
Stranded Preposition For example, “the candidate that I was thinking of.”
Split Infinitives Insertion of Adverb(s) in infinitives.
Split Auxiliaries Insertion of Adverb(s) in auxiliaries.
Phrasal Coordination (Adverb OR Adjective OR Verb OR Noun) “and” (Adverb
OR Adjective OR Verb OR Noun).
Independent Clause Coordination Clauses can stand independently. Indicated
by use of the pattern “, and”.
Synthetic Negation no, neither, nor.
Analytic Negation not.
APPENDIX E
Ranked Features
Table E.1 on the following page lists the 100 features with the most discriminating power with respect to the Dictionary of National Biography (DNB) and TREC corpora (according to the feature ranking method described in Section 2.5.3 on page 50). A two megabyte sample of the DNB was used as a biographical corpus, and a 10 megabyte sample of the TREC corpus was used as a reference corpus. The features used consisted of the two thousand most frequent unigrams from the DNB, the two thousand most frequent bigrams from the DNB, and the two thousand most frequent trigrams from the DNB. Additionally, syntactic features derived from BIBER (1988) (described in Section 8.3 on page 148) and general biographical features (for example, family name, pronoun, and so on) were included.
Note that in Table E.1 the presence of an underscore ( _ ) in a feature name indicates that the feature is an n-gram (for example, "was_educated_south" refers to the trigram "was educated south"). The exception is those features prefixed with "feature", which refer to syntactic features (for example "feature_pronoun"), and the "past tense" and "present tense" features.
Table E.1: 100 Most Discriminating Features with Respect to the DNB and
TREC corpora, Calculated using the Feature Selection Method. Features
are Presented in Alphabetical Order.
and
and his
and in
and was
appoint
appointed
are
as a
at
at the
be
became
born
born at
born in
brother
by his
cambridge
charles
college
daughter
daughter of
de
educated at
edward
english
family
father
feature familyname
feature familyrelationship
feature forename
feature month
feature pronoun
feature title
feature year
feature yearspan
first
george
govern
harry
has
have
he
he became
he was
he was educated
henry
her
him
his
his father
his wife
in
in london
it
james
john
king
london
not
of
of his
of john
of the
oxford
past tense
present tense
publish
published
royal
said
school
she
sir
son
son of
st
that
the government
their
they
thomas
university
was
was a
was appointed
was born
was born at
was born in
was educated
was educated at
where
where he
which he
wife
william
would
year
years
year he
APPENDIX F
Coverage of New Annotation Scheme
This appendix reproduces in full the annotated documents referenced in Chapter 5. It is made up of two main sections. First, the longer biographical texts from various sources are reproduced. Second, the short Wikipedia biographies are presented.
F.1 Four Biographies from Various Sources
Four annotated biographies are reproduced in this section, using the annotation scheme described in Chapter 5. The four biographical subjects and their
respective sources are:
Ambrose Bierce (Chambers Biographical Dictionary).
Philip Larkin (Dictionary of National Biography (old)).
Alan Turing (Dictionary of National Biography (old)).
Paul Foot (Wikipedia).
F.1.1 Ambrose Bierce
key Bierce, Ambrose Gwinnett, 1842-1914 /key . work US short-story
writer and journalist /work . work Born in Meigs County, Ohio, he grew
up in Indiana and fought for the Union in the Civil War. /work work key In the UK from 1872 to 1875, he wrote copy for Fun and other magazines, and in 1887 joined the San Francisco Examiner.
/key /work work He wrote Tales of Soldiers and Civilians (1892) and his most celebrated story, An Occurrence at Owl Creek Bridge, which is a haunted, near-death fantasy of escape, influenced by Edgar Allan Poe and in turn influencing
Stephen Crane and Ernest Hemingway. /work work He compiled the
much-quoted Cynic’s Word Book (published in book form 1906), now better known as The Devil’s Dictionary. /work key He moved to Washington DC, and in 1913 went to Mexico to report on Pancho Villa’s army and disappeared. /key
F.1.2 Philip Larkin
relationships key Larkin, Philip Arthur 1922-1985, poet, was born in
Coventry 9 August 1922, the only son and younger child of Sydney Larkin,
treasurer of Coventry, who was originally from Lichfield, and his wife, Eva
Emily Day, of Epping. /key /relationships education He was educated at King Henry VIII School, Coventry (1930-40), and St John’s College,
Oxford, where he obtained a first class degree in English language and literature in 1943. /education work Bad eyesight caused him to be rejected for military service, and after leaving Oxford he took up library work, becoming in turn librarian of Wellington, Shropshire (December 1943-July 1946),
assistant librarian, University of Leicester (September 1946-September 1950),
sub-librarian of the Queen’s University of Belfast (October 1950-March 1955),
and finally taking charge of the Brynmor Jones Library, University of Hull,
for the rest of his life. /work character Larkin, while always courteous and pleasant to meet, was solitary by nature; he never married and had
no objection to his own company; it was said that the character in literature
he most resembled was Badger in Kenneth Grahame’s The Wind in the Willows. /character character A bachelor, he found his substitute for family life in the devotion of a chosen circle of friends, who appreciated his dry wit
and his capacity for deep though undemonstrative affection. /character character His character was stable and his attitude to others considerate,
so that having established a friendship he rarely abandoned it. /character relationships Most of the friends he made in his twenties were still attached
to him in his sixties, and his long-standing friend and confidante Monica Jones,
to whom he dedicated his first major collection The Less Deceived (1956), was
with him at the time of his death thirty years later /relationships . work Larkin was a highly professional librarian, notably conscientious in his work,
and an active member of the Standing Conference of National and University
Libraries. /work work In the limited time this left him he did not undertake lecture tours, very rarely broadcast or gave interviews, and produced
(compared with most authors) very little ancillary writing; though his lifelong
interest in jazz led him to review jazz records for the Daily Telegraph, 1961-71. /work work Some of the reviews were collected in All What Jazz
(1970) /work . work In his forties he discovered a facility for book reviewing, of which he had previously done very little, and a collection of his
reviews, Required Writing (1983) reveals him as an excellent critic; though perhaps “reveal” is not the right word, for a decade earlier he had done much
to influence contemporary attitudes to poetry with his majestic and in some
quarters highly controversial Oxford Book of Twentieth-Century English Verse
(1973), prepared with the utmost care during his tenure of a visiting fellowship
at All Souls College in 1970-1. /work work He spent much time working on behalf of his fellow writers, as a member of the literature panel of the
Arts Council, helping to set up and then guide its National Manuscript Collection of Contemporary Writers in conjunction with the British Museum, and
serving as chairman for several years of the Poetry Book Society. /work work He was chairman of the Booker prize judges in 1977. /work To
this Dictionary he contributed the notice of Barbara Pym. Larkin’s early ambition was to contribute both to the novel and to poetry. work His first novel
Jill (1946), published by a small press (which paid him with only a cup of tea)
and not widely reviewed, did little to establish him, though its merits were
recognized when it was reprinted in 1964 and 1975; but the second, A Girl in
Winter (1947), attracted the attention of discerning readers, and the only reason he did not write more novels was that he found he could not, though he
tried for some five years before giving up and working entirely in poetry, an
art he loved but did not regard as necessarily “higher” than fiction. /work The poet, he said, made a memorable statement about a thing, the novelist
demonstrated that thing as it was in actuality. “The poet tells you that old age
is horrible, the novelist shows you a lot of old people in a room”. Why the second became impossible to him, and the first remained strikingly possible, it is
useless to speculate. work As a poet, Larkin’s early work, written when he
was about twenty, already shows a fine ear and an unmistakable gift; but the
breakthrough to an individual, and perfectly achieved, manner came some ten
years later, in the poems collected in The Less Deceived. /work work From that point on, his work did not change much in style or subject matter
throughout the thirty years still to come, in which he produced two volumes,
The Whitsun Weddings (1964) and High Windows (1974), plus a few poems
still uncollected at his death. /work There were surprises, but then there
had been surprises from the start, for Larkin’s range was much more varied
than a brief description of his work could hope to convey. He was restlessly
alive to the possibilities of form, and never seemed constricted by tightly organized forms like the sonnet, the couplet, or the closely rhymed stanza, nor
flaccid when he moulded his statement into free verse. It is instructive to pick
out any one individual poem of Larkin’s and then look through his work for
another that seems to be saying much the same thing in much the same manner. As a rule one finds that there is no such animal. Most poets repeat themselves; he did not, and this should qualify the frequently repeated judgement
that his output was small. Both in prose and verse, Larkin’s themes were those
of quotidian life: work, relationships, the earth and its seasons, routines, holidays, illnesses. He worked directly from life and felt no need of historical or
mythological references, any more than he needed the cryptic verbal compressions that were mandatory in the modern poetry of his youth. Where modern
poetry put its subtleties and complexities on the surface as a kind of protective
matting, to keep the reader from getting into the poem too quickly, Larkin always provides a clear surface, one feels confident of knowing what the poem is
about at the very first reading, and plants his subtleties deep down, so that the
reader becomes gradually aware of them with longer acquaintance. key The poems thus grow in the mind until they become treasured possessions;
this would perhaps account for the sudden explosion of feeling in the country at large when Larkin unexpectedly died at the Nuffield Hospital, Hull, 2
December 1985 (he had been known to be ill but thought to be recovering),
and the extraordinary number who crowded into Westminster Abbey for his
memorial service on St Valentine’s Day 1986. /key education Philip
Larkin was an honorary D.Litt. of the universities of Belfast, 1969; Leicester, 1970; Warwick, 1973; St Andrews, 1974; Sussex, 1974; and Oxford, 1984.
/education fame He won the Queen’s gold medal for poetry (1965),
the Loines award for poetry (1974), the A. C. Benson silver medal, RSL (1975),
the Shakespeare prize, FVS Foundation of Hamburg (1976), and the Coventry
award of merit (1978) /fame . fame In 1983 Required Writing won the W.
H. Smith literary award. /fame fame In 1975 he was appointed CBE and
a foreign honorary member of the American Academy of Arts and Sciences.
/fame fame education St John’s College made him an honorary fellow in 1973, and in 1985 he was made a Companion of Honour. /education /fame
F.1.3 Alan Turing
relationships work key Turing , Alan Mathison 1912 - 1954 , mathematician , was born in London 23 June 1912, the younger son of Julius Mathison Turing, of the Indian Civil Service, and his wife, Ethel Sara, daughter of
Edward Waller Stoney, chief engineer of the Madras and Southern Mahratta
Railway. /key /work /relationships relationships G. J. and G. G.
Stoney were collateral relations. /relationships character education He was educated at Sherborne School where he was able to fit in despite his
independent unconventionality and was recognized as a boy of marked ability and character. /education /character education He went as a mathematical scholar to King’s College, Cambridge, where he obtained a second class in part i and a first in part ii of the mathematical tripos (1932-4).
/education education He was elected into a fellowship in 1935 with
a thesis “On the Gaussian Error Function” which in 1936 obtained for him a
Smith’s prize. /education fame In the following year there appeared his
best-known contribution to mathematics, a paper for the London Mathematical Society “On Computable Numbers, with an Application to the Entscheidungsproblem” a proof that there are classes of mathematical problems which
cannot be solved by any fixed and definite process, that is, by an automatic machine. /fame fame His theoretical description of a “universal” computing machine aroused much interest. /fame work After two years (1936-8) at Princeton, Turing returned to King’s where his fellowship was renewed.
/work fame work But his research was interrupted by the war during which he worked for the communications department of the Foreign Office;
in 1946 he was appointed O.B.E. for his services. /work /fame work The war over, he declined a lectureship at Cambridge, preferring to concentrate on computing machinery, and in the autumn of 1945 he became a senior
principal scientific officer in the mathematics division of the National Physical Laboratory at Teddington. /work work With a team of engineers
and electronic experts he worked on his “logical design” for the Automatic
Computing Engine (ACE) of which a working pilot model was demonstrated
in 1950 (it went eventually to the Science Museum). /work work In
the meantime Turing had resigned and in 1948 he accepted a readership at
Manchester where he was assistant director of the Manchester Automatic Digital Machine (MADAM). /work He tackled the problems arising out of the
use of this machine with a combination of powerful mathematical analysis and
intuitive short cuts which showed him at heart more of an applied than a pure
mathematician. work In “Computing Machinery and Intelligence” in Mind
(October 1950) he made a brilliant examination of the arguments put forward
against the view that machines might be said to think. /work He suggested
that machines can learn and may eventually “compete with men in all purely
intellectual fields” fame In 1951 he was elected F.R.S., one of his proposers
being Bertrand (Earl) Russell. /fame work The central problem of all
Turing’s investigations was the extent and limitations of mechanistic explanations of nature and in his last years he was working on a mathematical theory
of the chemical basis of organic growth.
/work key But he had not
fully developed this when he died at his home at Wilmslow 7 June 1954 as
the result of taking poison. /key key Although a verdict of suicide was
returned it was possibly an accident, for there was always a Heath-Robinson
element in the experiments to which he turned for relaxation: everything had
to be done with materials available in the house. /key character This
self-sufficiency had been apparent from an early age; it was manifested in
the freshness and independence of his mathematical work; and in his choice
of long-distance running, not only for exercise but as a substitute for public
transport. /character character An original to the point of eccentricity,
he had a complete disregard for appearances and his extreme shyness made
him awkward. /character character But he had an enthusiasm and a
humour which made him a generous and lovable personality and won him
many friends, not least among children. /character relationships He
was unmarried /relationships .
F.1.4 Paul Foot
work key Paul Mackintosh Foot (November 8, 1937 - July 18, 2004)
was a British radical investigative journalist, political campaigner, author, and
long-time member of the Socialist Workers Party (SWP). /key /work relationships Paul Foot was the son of Hugh Foot, later Lord Caradon, who
was governor of Cyprus during the independence battle with Britain in the
1950s, and later represented the United Kingdom at the United Nations from
1964-1970. /relationships relationships Paul Foot was the nephew of
former leader of the Labour Party Michael Foot. /relationships education He was educated at Shrewsbury School and University College, Oxford. /education work He first joined the International Socialists, organisational forerunner of the SWP, when he was a cub reporter in Glasgow in the early 1960s.
/work work He wrote for Socialist Worker throughout his career and
was its editor in the late 1970’s until 1980 when he moved to the Daily Mirror.
/work work He left the Mirror in 1993 when the paper refused to print
articles critical of their management. /work work Latterly he returned
to Private Eye; he also wrote for The Guardian. /work work He fought
the Birmingham Ladywood by-election in 1977 for the SWP and was a Socialist
Alliance candidate for several offices from 2001 onwards. /work fame In the Hackney mayoral election in 2002 he came third, beating the Liberal
Democrat candidate into fourth. /fame work He stood in the London
region for the RESPECT coalition at the 2004 European elections /work .
fame He was Journalist of the Year in the What The Papers Say Awards
in 1972 and 1989, Campaigning Journalist of the Year in the 1980 British Press
Awards, won the George Orwell Prize for Journalism in 1994, won the Journalist of the Decade in the What The Papers Say Awards in 2000, and the James Cameron Special Posthumous Award in 2004. /fame fame His best known work was in the form of campaign journalism, including his exposure
of corrupt architect John Poulson and, most notably, his prominent role in the
campaigns to overturn the convictions of the Birmingham Six and the Bridgewater Four, which succeeded in 1991 and 1997 respectively. /fame work He took a particular interest in the conviction of Abdel Basset Ali al-Megrahi
for the Lockerbie bombing, firmly believing Megrahi to have been a victim of
a miscarriage of justice. /work work He also worked tirelessly, though
without success, to gain a posthumous pardon for James Hanratty, who was
hanged in 1962 for the A6 murder. /work work His books are Immigration and Race in British Politics (1965), The Politics of Harold Wilson (1968) The
Rise of Enoch Powell (1969) Who Killed Hanratty? (1971) Red Shelley (1981)
The Helen Smith Story (1983) Murder on the Farm, Who Killed Carl Bridgewater? (1986) Who Framed Colin Wallace? (1989) Words as Weapons (1990)
Articles of Resistance (2000) and The Vote: How It Was Won, and How It Was
Undermined (2005). /work key He died of a heart attack while waiting
at Stansted Airport to begin a family holiday in Ireland. /key work A
special tribute issue of the Socialist Review magazine, of which he was on the
editorial board for 19 years, collected together many of his articles. Private Eye
issue 1116 included a tribute to Foot from the many people whom he worked with over the years. /work work On October 10, 2004 – three months after Foot’s death – there was a full house at the Hackney Empire in London for
an evening’s celebration of the life of this much-admired and respected campaigning journalist. /work
F.2 Four Biographies from Wikipedia
Four annotated biographies are reproduced in this section, using the annotation scheme described in Chapter 5. All four examples are short biographies of people who died in December 2005, harvested from Wikipedia. The four biographical subjects are:
Jack Anderson.
Kerry Packer.
Richard Pryor.
Stanley Williams.
F.2.1 Jack Anderson
fame work key Jackson Northman Anderson (October 19, 1922 December 17, 2005) was an American newspaper columnist and is considered one
of the fathers of modern investigative journalism /key . /work /fame work fame Anderson won the 1972 Pulitzer Prize for National Reporting for his investigation on secret American policy decision-making between
the United States and Pakistan during the 1971 Indo-Pakistan War of 1971.
/fame /work fame work Jack Anderson was a key and often
controversial figure in reporting on J. Edgar Hoover’s apparent ties to the
Mafia, Watergate, the John F. Kennedy assassination, the search for fugitive exNazi Germany officials in South America and the Savings and Loan scandal.
/work /fame fame He discovered a CIA plot to assassinate Fidel
Castro, and has also been credited for breaking the Iran-Contra affair, though
he has said the scoop was “piked” because he had become too close to President Ronald Reagan /fame . character Anderson was a crusader against
corruption /character . character Henry Kissinger once described him as
“the most dangerous man in America.” /character key Anderson was
diagnosed with Parkinson’s disease in 1986. /key work In July 2004,
at the age of 81, Anderson retired from his syndicated column, “Washington
Merry-Go-Round.” /work key relationships He died of complications from Parkinson’s disease, survived by his wife, Olivia, and nine children.
/relationships /key A few months after his death, the FBI attempted
to gain access to his files as part of the AIPAC case on the grounds that the
information could hurt U.S. government interests.
F.2.2 Kerry Packer
key Kerry Francis Bullmore Packer AC (17 December 1937 - 26 December
2005) was an Australian publishing, media and gaming tycoon. /key character work fame He was famous for his outspoken nature, wealth, expansive
business empire and clashes with the Australian Taxation Office and the Costigan Commission. /fame /work /character fame At the time of
his death, Packer was the richest and one of the most influential men in Australia. /fame fame In 2004 Business Review Weekly magazine estimated
Packer’s net worth at AUD 6.5 billion ($6.5 billion; about USD 4.7 billion).
/fame
F.2.3 Richard Pryor
key Richard Franklin Lennox Thomas Pryor III (December 1, 1940 December 10, 2005) was an American comedian, actor, and writer. /key character fame Pryor was a gifted storyteller known for unflinching examinations
of race and custom in modern life, and was well-known for his frequent use
of colorful language, vulgarities, as well as such racial epithets as “nigger,”
“honky,” and “cracker”. /fame /character He reached a broad audience with his trenchant observations, although public opinion of his act was
often divided. fame He is commonly regarded as one of the most important
stand up comedians of his time: Jerry Seinfeld called Pryor “The Picasso of our
profession.” /fame work His catalog includes such concert movies and
recordings as Richard Pryor: Live and Smokin’ (1971), That Nigger’s Crazy
(1974), Bicentennial Nigger (1976), Richard Pryor: Wanted Live In Concert
(1979) and Richard Pryor: Live on the Sunset Strip (1982). /work work He also starred in numerous films as an actor, usually in comedies such as the
classic Silver Streak, but occasionally in the noteworthy dramatic role, such as
Paul Schrader’s film Blue Collar. /work work He also collaborated on
many projects with actor Gene Wilder. /work fame He won an Emmy
Award in 1973, and five Grammy Awards in 1974, 1975, 1976, 1981, and 1982.
/fame fame In 1974 he also won two American Academy of Humor
awards and the Writers Guild of America Award. /fame
F.2.4 Stanley Williams
key fame Stanley Tookie Williams III (December 29, 1953 – December
13, 2005), was an early leader of the Crips, a notorious American street gang
which had its roots in South Central Los Angeles in 1969. /fame /key key fame In December 2005 he was executed for the 1979 murders of Albert Owens, Yen-Yi Yang, Tsai-Shai Lin, and Yee-Chen Lin. /fame /key key fame While in prison, Williams refused to aid police investigations
with any information against his gang, and was implicated in attacks on guards
and other inmates as well as multiple escape plots. /fame /key character fame In 1993, Williams began making changes in his behavior, and became
an anti-gang activist while on Death Row in California. /fame /character fame Although he continued to refuse to assist police in their gang investigations, he renounced his gang affiliation and apologized for the Crips’ founding, while maintaining his innocence of the crimes for which he was convicted.
/fame fame He co-wrote children’s books and participated in efforts
intended to prevent youths from joining gangs.
/fame fame A 2004
biographical TV-movie entitled Redemption: The Stan Tookie Williams Story featured Jamie Foxx as Williams. /fame key On December 13, 2005,
Williams was executed by lethal injection amidst debate over the death penalty
and whether his anti-gang advocacy in prison represented genuine atonement.
/key
A PPENDIX G
The Corrected Re-sampled t-test
This appendix describes the corrected re-sampled t-test, as presented by B OUCKAERT and F RANK (2004).1 The test was implemented in the Perl programming language.2
G.1 An Outline of the Corrected Re-sampled t-test
Let $a_i$ be the accuracy of algorithm A measured on run $i$, and $b_i$ the accuracy of algorithm B measured on run $i$, where $i = 1, \ldots, n$. Let $n$ be the total number of runs, $n_1$ the number of instances used for training and $n_2$ the number of instances used for testing. $x_i = a_i - b_i$ (that is, $x_i$ equals the difference between the accuracy of Algorithm A and Algorithm B on run $i$). $\hat{\sigma}^2$ is the estimate of the variance of the differences (that is, the square of the standard deviation of the differences).

Equation G.1 shows the $t$ statistic for the corrected resampled t-test.3

$$ t = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i}{\sqrt{\left(\frac{1}{n} + \frac{n_2}{n_1}\right)\hat{\sigma}^{2}}} \qquad \text{(G.1)} $$

B OUCKAERT and F RANK (2004) point out that the difference between the corrected resampled t-test and the “standard” t-test is the substitution of $\frac{1}{n}$ in the “standard” statistic by $\frac{1}{n} + \frac{n_2}{n_1}$ in the corrected version.
1 The exposition follows B OUCKAERT and F RANK (2004) closely.
2 The code is available at http://www.dcs.shef.ac.uk/ mac/t-test.pl
3 Note that the test is used in conjunction with the Student $t$ distribution with $n - 1$ degrees of freedom.
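The script referred to in the footnote is not reproduced here, but the following minimal sketch shows how the statistic of Equation G.1 might be computed in Perl, the implementation language mentioned above. The subroutine name corrected_resampled_t and the accuracy figures are hypothetical illustrations, not the thesis's own code.

#!/usr/bin/perl
# A minimal sketch of the corrected re-sampled t-test statistic (Equation G.1).
# @a and @b hold the per-run accuracies of algorithms A and B; $n1 and $n2 are
# the numbers of training and testing instances used on each run.
use strict;
use warnings;

sub corrected_resampled_t {
    my ($a_ref, $b_ref, $n1, $n2) = @_;
    my @x = map { $a_ref->[$_] - $b_ref->[$_] } 0 .. $#{$a_ref};  # per-run differences x_i
    my $n = scalar @x;                                            # total number of runs
    my $mean = 0;
    $mean += $_ for @x;
    $mean /= $n;
    my $var = 0;                                                  # estimate of the variance of the differences
    $var += ($_ - $mean) ** 2 for @x;
    $var /= ($n - 1);
    # The correction: 1/n in the standard statistic becomes (1/n + n2/n1).
    return $mean / sqrt((1 / $n + $n2 / $n1) * $var);
}

# Hypothetical example: ten runs, each with 900 training and 100 testing instances.
my @acc_a = (0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80, 0.81, 0.79);
my @acc_b = (0.77, 0.78, 0.80, 0.76, 0.79, 0.77, 0.81, 0.78, 0.77, 0.76);
printf "t = %.3f\n", corrected_resampled_t(\@acc_a, \@acc_b, 900, 100);

The resulting value is then compared against the Student $t$ distribution with $n - 1$ degrees of freedom in the usual way.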
A PPENDIX H
Factor Analysis
This appendix provides further details on the process of factor analysis (FA)
used by B IBER (1988), as it was felt that a lengthy digression on statistics in
the main body of the thesis would be inappropriate. Note that this appendix
is only designed to give a flavour of the strengths and weaknesses of the technique.
FA has traditionally been grouped as a technique within multivariate statistics.
Multivariate statistics — as the name suggests — looks at the patterns of relationships between variables. Other techniques, along with FA, belonging to the
multivariate statistics group are cluster analysis and multidimensional scaling
(C HATFIELD and C OLLINS, 1980).
FA examines correlations between variables, and seeks to describe observed
correlations in terms of underlying factors. Factors – in this sense – are hypothetical variables on which individuals (individual people, individual texts,
and so on) can differ. FA has been used extensively in psychology to explore the underlying dimensions of personality; the so-called “Big Five” personality types (C OOLICAN, 2004).1 Often in this kind of research, participants were presented with a list of several hundred “trait adjectives” (for example, talkative, quiet). Correlations between all these trait-adjective variables were then calculated, forming a matrix, and factors were then identified using matrix algebra. The factors were then named (that is, interpreted by the researcher) by identifying the particular characteristics of the variables that defined them. For example, if we were conducting personality research and identified a factor which had the positive variables diligent, tidy, punctual, and frugal, and the same factor had the negative variables tardy, messy, spendthrift, and lazy, we might want to call the factor “Responsible vs Irresponsible.”2
There are two main stages in FA. First, constructing a correlational matrix, and
second, manipulating that matrix to identify factors.
1 The “Big Five” types are: Extraversion, Agreeableness, Conscientiousness, Neuroticism and
Openness to Experience.
2 The earliest use of FA — S PEARMAN (1904) — was in identifying dimensions of intelligence
rather than personality.
H.1 Constructing a Correlational Matrix
B IBER (1988) identified sixty-seven features and investigated the frequency of these features across different genres (see Figure 2.4 on page 17 for a selection of the features used and Appendix D on page 259 for a more comprehensive listing). Four hundred and eighty-one text documents were used (that is, all the genres). The correlation between each variable pair was then placed in a correlation matrix (that is, a 67 × 67 matrix, consisting of 4489 correlations).
A correlation is a measure of the extent to which a change in one random variable corresponds to a change in another random variable. There are two aspects to consider when talking about correlations: the strength of the correlation and its direction (positive or negative). The correlation between two random variables can range from +1 to -1. A +1 correlation is a perfect positive correlation (that is, the two variables always appear together). A -1 correlation is a perfect negative correlation (that is, they never occur together). A 0.8 correlation is a strong, positive correlation (that is, the two variables are likely to appear together). A -0.8 correlation is a strong, negative correlation (that is, the two variables are unlikely to appear together). It is assumed that features that frequently occur together share one (or more) communicative function (that is, that correlations indicate some underlying linguistic dimension, rather than just being correlations).
Individual correlations were calculated using Pearson’s correlation, a statistic that assumes the data are normally distributed.3 The procedure (described in O AKES (1998), which this account follows closely) involves identifying the following quantities from the genre frequency tables:
1. The sum of all values of variable 1 ($\sum x$).
2. The sum of all values of variable 2 ($\sum y$).
3. The sum of all squares of variable 1 ($\sum x^2$).
4. The sum of all squares of variable 2 ($\sum y^2$).
5. The sum of the products of the two variables over all data pairs ($\sum xy$).
6. The number of documents ($N$).
Equation H.1 (again based on O AKES (1998)) shows how Pearson’s correlation coefficient can be easily computed from these quantities:

$$ r = \frac{N\sum xy - \sum x \sum y}{\sqrt{\left[N\sum x^{2} - \left(\sum x\right)^{2}\right]\left[N\sum y^{2} - \left(\sum y\right)^{2}\right]}} \qquad \text{(H.1)} $$
3 The correlation matrix used by B IBER (1988) is published as an appendix to that book.
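To make the computation concrete, the following minimal sketch implements Equation H.1 in Perl (the language used for the test script of Appendix G). The feature-frequency figures and the subroutine name pearson_r are hypothetical rather than drawn from B IBER (1988).

#!/usr/bin/perl
# A minimal sketch of Pearson's correlation coefficient (Equation H.1),
# computed from the six quantities listed above.
use strict;
use warnings;

sub pearson_r {
    my ($x_ref, $y_ref) = @_;
    my $n = scalar @$x_ref;                        # number of documents (N)
    my ($sx, $sy, $sxx, $syy, $sxy) = (0) x 5;
    for my $i (0 .. $n - 1) {
        $sx  += $x_ref->[$i];                      # sum of x
        $sy  += $y_ref->[$i];                      # sum of y
        $sxx += $x_ref->[$i] ** 2;                 # sum of the squares of x
        $syy += $y_ref->[$i] ** 2;                 # sum of the squares of y
        $sxy += $x_ref->[$i] * $y_ref->[$i];       # sum of the products xy
    }
    return ($n * $sxy - $sx * $sy)
         / sqrt(($n * $sxx - $sx ** 2) * ($n * $syy - $sy ** 2));
}

# Hypothetical per-document frequencies of two features in five documents.
my @feature_one = (12, 30, 7, 22, 15);
my @feature_two = (10, 28, 5, 25, 14);
printf "r = %.3f\n", pearson_r(\@feature_one, \@feature_two);

Repeating this calculation for every pair of the sixty-seven features would populate the 67 × 67 correlation matrix described at the start of this section.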
H.2 Identifying Factors from a Correlational Matrix
There are numerous methods available for performing factor analysis (see C ATTELL (1979) for a review of techniques and applications). The method used in
B IBER (1988) was principal factor analysis (also known as common factor analysis). Note that as B IBER (1988) did not provide extensive details of the factor
analysis technique used, this account is rather generic. Further, FA is a complex technique and detailed descriptions of its operation are largely skirted
around in the literature concerned with Multidimensional analysis (see L EE
(1999)). A good, book length treatment, aimed at non-mathematicians is K LINE
(1994).
Factor analysis can be broken down into several steps:
1. Produce a correlation matrix (see above).
2. Identify (extract) factors using matrix manipulation techniques. There are two approaches to identifying factors: one uses geometry (that is, treating the relationships between coefficients as angles) and one uses matrix algebra (this is the method implemented in SPSS and other statistics software (K LINE, 1994)); a minimal sketch of the matrix-algebra route is given after this list.
3. Identify the optimum number of factors. Any number of factors (less than or equal to the number of variables) can be extracted, but if the number of factors is equal to the number of variables, then the explanatory power of the identified factors is questionable. Normally, the first few factors are most important, and the remaining factors are discarded. There is no single technique for identifying a cut-off point for the number of factors, but various heuristic techniques have been developed. Once the number of factors has been decided upon, they are interpreted and named.
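The sketch below illustrates the matrix-algebra route mentioned in step 2. It is not the procedure used by B IBER (1988), who relied on SPSS (see the following paragraph); it simply applies power iteration to a small, hypothetical correlation matrix to approximate the first (unrotated) factor.

#!/usr/bin/perl
# Illustrative sketch only: approximating the dominant factor of a small,
# hypothetical 3 x 3 correlation matrix by power iteration.
use strict;
use warnings;

my @R = (
    [ 1.0,  0.8, -0.3 ],
    [ 0.8,  1.0, -0.2 ],
    [-0.3, -0.2,  1.0 ],
);

my @v = (1, 1, 1);                       # arbitrary starting vector
for (1 .. 100) {                         # repeatedly multiply by R and renormalise
    my @w = (0, 0, 0);
    for my $i (0 .. 2) {
        $w[$i] += $R[$i][$_] * $v[$_] for 0 .. 2;
    }
    my $norm = sqrt($w[0] ** 2 + $w[1] ** 2 + $w[2] ** 2);
    @v = map { $_ / $norm } @w;
}

# The converged vector approximates the first eigenvector of R; its sign
# pattern (which variables pull together, which pull apart) is the kind of
# information used when interpreting and naming a factor.
printf "first factor direction: %5.2f %5.2f %5.2f\n", @v;

In practice a statistical package extracts all the factors at once and applies a rotation (such as varimax) before the factors are interpreted.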
B IBER (1988) used statistical software (SPSS) to produce the correlation matrix and perform the FA. SPSS automatically identifies the optimal number of
factors using a heuristic, although this decision can be overridden by the researcher if required. SPSS also provides several options for rotating the matrix,
including the varimax method employed by B IBER (1988).
Bibliography
A DAMNAN (c 690). Life of St Columba. 1995, Penguin, London.
A MIS , M. (1985). Money. Penguin, London.
A RGAMON , S., K OPPEL , M., F INE , J., AND S HIMONI , A. (2003). Gender, Genre
and Writing Style in Formal Written Texts. Text, 23(3).
A RISTOTLE (c 340 BC). Aristotle on the Art of Fiction: An English Translation of the
Poetics. 1962, Cambridge University Press.
A RMSTRONG , E. (1991). The Potential of Cohesion Analysis in the Analysis
and Treatment of Aphasic Discourse. Clinical Linguistics and Phonetics, 5(1).
A RTSTEIN , R. AND P OESIO , M. (2005). Bias Decreases in Proportion to the
Number of Annotators. In The Proceedings of the 10th Conference on Formal
Grammar and the 9th Meeting on Mathematics of Language, pages 141–150.
ATKINSON , D. (1992). The Evolution of Medical Research Writing from 1735
to 1985: The Case of the Edinburgh Medical Journal. Applied Linguistics,
13:337–374.
A UDEN , W. (1935). Collected Poems. 1991, Vintage, London.
B AL , M. (1985). Narratology: Introduction to the Theory of Narrative. University
of Toronto Press, Toronto.
B ATTLES , M. (2004). Library: An Unquiet History. Vintage, London.
B EDE (c 700). The Age of Bede. 1998, Penguin, London.
B IBER , D. (1988). Variations Across Speech and Writing. Cambridge University
Press.
B IBER , D. (1989). A Typology of English Texts. Linguistics, 27:3–43.
B IBER , D., C ONRAD , S., R EPPEN , R., B YRD , P., AND H ELT, M. (2002). Speaking
and Writing in the University: A Multidimensional Comparison. TESOL
Quarterly, 36:9–48.
B IBER , D. AND F INEGAN , E. (1986). An Initial Typology of English Texts. In
A ARTS , J. AND M EIJS , W. (editors), Corpus Linguistics 2, pages 19–46. Rodopi.
B IBER , D. AND H ARED , M. (1994). Linguistic Correlates of the Transition to
Literacy in Somali: Language Adaptation in Six Press Registers. In B IBER ,
D. AND F INEGAN , E. (editors), Sociolinguistic Perspectives on Register, pages
183–216. Oxford University Press.
B LACK (2004). Who’s Who 2004. A & C Black, London.
B LAIR -G OLDENSOHN , S., E VANS , D., H ATZIVASSILOGLOU , V., M C K EOWN ,
K., N ENKOVA , A., PASSONNEAU , R., S CHIFFMAN , B., S CHLAIKJER , A.,
AND S IDDHARTHAN , A. (2004). Columbia University at DUC 2004. In Document Understanding Conference-2004.
B OESE , E. S. AND H OWE , A. E. (2005). Effects of Web Document Evolution on Genre Classification. In Proceedings of the 14th ACM Conference on Information and Knowledge Management.
B ORKO , H. AND B ERNICK , M. (1963). Automatic Document Classification.
Journal of the Association for Computing Machinery, 10(2):151–161.
B OSWELL , J. (1791). The Life of Samuel Johnson, LL.D. 2005, Penguin, London,
London.
B OUCKAERT, R. AND F RANK , E. (2004). Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms. In Advances in Knowledge
Discovery and Data Mining, pages 3–12. Springer, Berlin.
B ROADFIELD , A. (1946). Philosophy of Classification. Grafton, London.
B URGES , C. J. C. (1998). A Tutorial on Support Vector Machines for Pattern
Recognition. Data Mining and Knowledge Discovery, 2(2):121–167.
C ARDIE , C. (1997). Empirical Methods in Information Extraction. AI Magazine,
18(4):65–80.
C ARLETTA , J. (1996). Assessing Agreement on Classification Tasks: The Kappa
Statistic. Computational Linguistics, 22(2).
C ATTELL , R. (1979). The Scientific use of Factor Analysis in the Life Sciences.
Plenum Press, New York.
C HAMBERS (2004). Chambers Biographical Dictionary. Chambers-Harrap, Edinburgh.
C HATFIELD , C. AND C OLLINS , A. (1980). Introduction to Multivariate Analysis.
Chapman & Hall, London.
C IRAVEGNA , F. (2001). Adaptive Information Extraction from Text by Rule Induction and Generalization. In 17th International Joint Conference on Artificial Intelligence. Seattle.
C LEGG , B. (2007). The Man Who Stopped Time: The Illuminating Story of Eadweard Muybridge: Pioneer Photographer, Father Of The Motion Picture, Murderer.
Joseph Henry Press, Washington.
C OHEN , J. (1960). A Coefficient of Agreement for Nominal Scales. Educational
and Psychological Measurement, 20:37–46.
C OHEN , W. W. (1995). Fast Effective Rule Induction. In Proceedings of the 12th
International Conference on Machine Learning. Morgan Kaufmann, Tahoe City,
CA.
C OOLICAN , H. (2004). Research Methods and Statistics in Psychology. Hodder
Arnold, London.
C ORTES , C. AND VAPNIK , V. (1995). Support Vector Networks. Machine Learning, 20:273–297.
C OWIE , J. AND L EHNERT, W. (1996). Information Extraction. Communications
of the ACM, 39(1):80–91.
C OWIE , J., N IRENBURG , S., AND M OLINO -S ALGADO , H. (2001). Generating
Personal Profiles. Technical report, New Mexico State University.
C RAGGS , R. AND M C G EE W OOD , M. (2005). Evaluating discourse and dialogue coding schemes. Computational Linguistics, 31:289–295.
C RAIG , H. (1999). Authorial Attribution and Computational Stylistics: If You
Can Tell Authors Apart, Have You Learned Anything About Them? Literary
and Linguistic Computing, 14.
C SOMAY, E. (2002). Variation in Academic Lectures. In R EPPEN , R., F ITZ MAURICE , S., AND B IBER , D. (editors), Using Corpora to Explore Linguistic
Variation, pages 203–224. John Benjamins, Amsterdam.
D ENNETT, D. (1992). Consciousness Explained. Penguin, London.
D I E UGENIO , B. (2000). On the Usage of Kappa to Evaluate Agreement on
Coding Tasks. In LREC-2000.
D I E UGENIO , B. AND G LASS , M. (2004). The Kappa Statistic: A Second Look.
Computational Linguistics, 30(1):95–101.
D IETTERICH , T. G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7):1895–1923.
D UBROW, H. (1982). Genre: The Critical Idiom. Methuen, London.
D UNNING , T. (1993). Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1):61–74.
E DMUNDSON , H. P. (1969). New Methods in Automatic Extracting. Journal of
the ACM, 16(2):264–285.
E GGINS , S. (1994). An Introduction to Systemic Functional Linguistics. Frances
Pinter, London.
E LLISON , J. (1967). Computers and the Testaments. In Computers in Humanistic
Research. Prentice-Hall, New Jersey.
E MAM , K. E. (1999). Benchmarking Kappa: Interrater Agreement in Software
Process Assessment. Empirical Software Engineering, 4:113–133.
FABER , R. AND H ARRISON , B. (2002). The Dictionary of National Biography:
A Publishing History. In M YERS , R., H ARRIS , M., AND M ANDELBROTE , G.
(editors), Lives in Print: Biography and the Book Trade in the Middle Ages to the
21st Century, pages 171–92. Oakwell Press & The British Library, London.
F ENG , D. AND H OVY, E. (2005). Handling Biographical Questions with Implicature. In Proceedings of the Conference on Human Language Technology and
Empirical Methods in Natural Language Processing. HLT-2005. Morristown, NJ.
F ERGUSSON , J. (2000). Death and the Press. In G LOVER , S. (editor), The Penguin
Book of Journalism: Secrets of the Press. Penguin, London.
F INN , A. AND K USHMERICK , N. (2003). Learning to Classify Documents According to Genre. In Proceedings of the IJCAI-03 Workshop on Computational
Approaches to Style Analysis and Synthesis.
F ISHER , R. (1922). On the Interpretation of χ² from Contingency Tables, and
the Calculation of P. Journal of the Royal Statistical Society, 85:87–94.
F LEISS , J. (1971). Measuring Nominal Scale Agreement Among Many Raters. Psychological Bulletin, 76:378–382.
F ORMAN , G. (2002). Choose Your Words Carefully: An Empirical Study of
Feature Selection Metrics for Text Classification. In Proceedings of PKDD-02,
6th European Conference on Principles of Data Mining and Knowledge Discovery,
pages 150–162. Helsinki.
F ORMAN , G. (2003). An Extensive Empirical Study of Feature Selection Metrics
for Text Classification. Journal of Machine Learning Research, 3:1289–1305.
F ÜRNKRANZ , J. (1998). A Study Using n-gram Features for Text Categorization. Technical Report OEFAI-TR-9830, Austrian Research Institute for Artificial Intelligence.
G AIZAUSKAS , R., G REENWOOD , M., H EPPLE , M., R OBERTS , I., S AGGION , H.,
AND S ARGAISON , M. (2003). The University of Sheffield’s TREC 2003 QA
Experiments. In Proceedings of the 2003 Text Retrieval Conference (TREC-2003).
G ITTINGS , R. (1978). The Nature of Biography. Heinemann, London.
G ROVE , W. M., A NDREASEN , N., M C D ONALD -S COTT, P., K ELLER , M., AND
S HAPIRO , R. (1981). Reliability Studies of Psychiatric Diagnosis. Theory and
Practice. Archive of General Psychiatry, 38:408–413.
G UYON , I. AND E LISSEEFF , A. (2003). An Introduction to Variable and Feature
Selection. Journal of Machine Learning Research, 3:1157–1182.
H ALLIDAY, M. (1961). Categories of the Theory of Grammar. Word, 17(3).
H ALLIDAY, M. (1966). Some Notes on ”Deep” Grammar. Journal of Linguistics,
2(1).
H ALLIDAY, M. AND H ASAN , R. (1976). Cohesion in English. Longman, London.
H ALLIDAY, M. AND M ATTHIESSEN , C. (2004). An Introduction to Functional
Grammar. Hodder Arnold, London, third edition.
H ERODOTUS (c 440 BC). The Histories. 2003, Penguin, London.
H OLMES , D. AND F ORSYTH , R. (1995). The Federalist Revisited: New Directions in Authorship Attribution. Literary and Linguistic Computing, 10(2):111–
127.
H OLTE , R. C. (1993). Very Simple Classification Rules Perform Well on Most
Commonly Used Datasets. Machine Learning, 11(1):63–90.
H ONEYBONE , P. (2005). J. R. Firth. In C HAPMAN , S. AND R OUTLEDGE , C.
(editors), Key Thinkers in Linguistics and the Philosophy of Language. Oxford
University Press.
H OVY, E. (2003). Text Summarization. In M ITKOV, R. (editor), Oxford Handbook
of Computational Linguistics, pages 583–598. Oxford University Press.
J OACHIMS , T. (1998). Text Categorization with Support Vector Machines:
Learning with Many Relevant Features. In Proceedings of ECML-98, 10th European Conference on Machine Learning.
J OHANSSON , S., L EECH , G., AND G OODLUCK , H. (1978). Manual of Information To Accompany the Lancaster-Oslo/Bergen Corpus. Technical report,
Norwegian Computing Centre for the Humanities.
J OHN , G. AND L ANGLEY, P. (1995). Estimating Continuous Distributions in
Bayesian Classifiers. In Proceedings of the Eleventh Conference on Uncertainty
in Artificial Intelligence, pages 338–345.
J OHNSON , S. (1781). Lives of the Poets. 2006, OUP.
J UOLA , P., S OFKO , J., AND B RENNAN , P. (2006). A Prototype for Authorship
Attribution Studies. Literary and Linguistic Computing, 21.
K ARLGREN , J. (2004). The Wheres and Whyfores for Studying Textual Genre
Computationally. In Proceedings of the AAAI Fall Symposium of Style and Meaning in Language, Art and Music.
K ARLGREN , J. AND C UTTING , D. (1994). Recognizing Text Genres with Simple
Metrics using Discriminant Analysis. In Proceedings of the 15th International
Conference on Computational Linguistics - Volume 2.
K ENNY, A. (1982). The Computation of Style: An Introduction to Statistics for
Students of Literature and Humanities. Pergamon, Oxford.
K ESSLER , B., N UNBERG , G., AND S CH ÜTZE , H. (1997). Automatic Detection of
Text Genre. In Proceedings of the Thirty-Fifth Annual Meeting of the Association
for Computational Linguistics and Eighth Conference of the European Chapter of
the Association for Computational Linguistics.
K ILGARRIFF , A. AND R OSE , T. (1998). Measures for Corpus Similarity and
Homogeneity. In 3rd Conference on Empirical Methods in Natural Language Processing, Granada, Spain.
K IM , S., A LANI , H., H ALL , W., L EWIS , P. H., M ILLARD , D. E., S HADBOLT,
N. R., AND W EAL , M. J. (2002). Artequakt: Generating Tailored Biographies
with Automatically Annotated Fragments from the Web. In Proceedings of
the Semantic Authoring, Annotation and Knowledge Markup Workshop in the Fifteenth European Conference on Artificial Intelligence. Lyon.
K IM , Y.-J. AND B IBER , D. (1994). A Corpus-Based Analysis of Register Variation in Korean. In B IBER , D. AND F INEGAN , E. (editors), Sociolinguistic
Perspectives on Register, pages 157–181. Oxford University Press.
K LINE , P. (1994). An Easy Guide to Factor Analysis. Routledge, London.
K RAEMER , H. (1992). Measurement of Reliability for Categorical Data in Medical Research. Statistical Methods in Medical Research, 1:183–99.
K RIPPENDORFF , K. (1980). Content Analysis: An Introduction to its Methodology.
Sage, Los Angeles.
L ABOV, W. (1975). The Boundaries of Words and their Meanings. In B AILEY,
C. AND S HY, R. (editors), New Ways of Analysing Variation in English. Georgetown University Press, Washington, D.C.
L AKOFF , G. (1987). Women, Fire, and Dangerous Things: What Categories Reveal
About the Mind. University of Chicago Press, Chicago.
L ANDIS , J. AND K OCH , G. (1977). The Measurement of Observer Agreement
in Categorical Data. Biometrics, 33:159–174.
L EE , D. (1999). Modelling Variation in Spoken and Written English. Ph.D. thesis,
Lancaster University.
L EECH , G. N. AND S HORT, M. (1981). Style in Fiction: A Linguistic Introduction
to English Fictional Prose. Longman, London.
L EWIS , D. D. (1992a). Feature Selection and Feature Extraction for Text Categorization. In HLT ’91: Proceedings of the Workshop on Speech and Natural
Language, pages 212–217. Association for Computational Linguistics, Morristown, NJ, USA.
L EWIS , D. D. (1992b). Representation and Learning in Information Retrieval.
Ph.D. thesis, Department of Computer Science, University of Massachusetts,
Amherst, US.
L INNAEUS , K. (1735). Systema Naturae. Leyden.
L OVE , H. (2002). Attributing Authorship. Cambridge University Press.
L UHN , H. P. (1958). The Automatic Creation of Literature Abstracts. IBM
Journal of Research and Development, 2(2).
M AI , J.-E. (2004). Classification in Context: Relativity, Reality and Representation. Knowledge Organization, 1:39–48.
M ALINOWSKI , B. (1935). The Language and Magic of Gardening. George & Allen,
London.
M ANI , I. (2001a). Automatic Summarization. John Benjamins, Amsterdam.
M ANI , I. (2001b). Summarization Evaluation: An Overview. In Proceedings of
the Second NTCIR Workshop on Research in Chinese and Japanese Text Retrieval
and Text Summarization. Tokyo.
M ANI , I. AND B LOEDORN , E. (1999). Summarizing Similarities and Differences Among Related Documents. In M ANI , I. AND M AYBURY, M. (editors),
Advances in Automatic Text Summarization, pages 357–379. MIT Press, Cambridge, MA.
M ANNING , C. D. AND S CH ÜTZE , H. (1999). Foundations of Statistical Natural
Language Processing. The MIT Press, Cambridge, MA.
M ARCU , D. (1999). The Automatic Construction of Large-Scale Corpora for
Summarization Research. In Proceedings of the 22nd Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval, pages
137–144.
M ARTIN , J., M ATTHIESSEN , C., AND PAINTER , C. (1996). Working with Functional Grammar. Edward Arnold, London.
M ARTIN , J., M ATTHIESSEN , C., AND PAINTER , C. (1997). Working with Functional Grammar. Arnold Press, London.
M AUROIS , A. (1929). Aspects of Biography. Cambridge University Press.
M AYBURY, M. AND M ANI , I. (2001). Automatic Summarization - A Tutorial.
Technical report, MITRE Corp.
M C C OLLY, W. AND W EIER , D. (1983). Literary Attribution and Likelihood
Ratio Tests. Computers and the Humanities, 17:65–75.
M C E NERY, A. AND O AKES , M. (2000). Authorship Attribution. In D ALE , R.,
S OMERS , H., AND M OISL , H. (editors), Handbook of Natural Language Processing. Dekker, NY.
M C E NERY, T. AND W ILSON , A. (1996). Corpus Linguistics. Edinburgh University Press.
M C K EOWN , K. (1998). Generating Patient-Specific Summaries of Online Literature. In Proceedings of 1998 Spring Symposium Series Intelligent Text Summarization. Stanford.
M C K EOWN , K., B ARZILAY, R., E VANS , D., H ATZIVASSILOGLOU , V., K LA VANS , J., S ABLE , C., S CHIFFMAN , B., AND S IGELMAN , S. (2002). Tracking
and Summarizing News on a Daily Basis with Columbia’s Newsblaster. In
Proceedings of the Second Human Language Technology Conference. San Diego.
M C K EOWN , K., K LAVANS , J., H ATZIVASSILOGLOU , V., B ARZILAY, R., AND
E SKIN , E. (1999). Towards Multidocument Summarization by Reformulation: Progress and Prospects. In Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial
Intelligence Conference.
M EYER ZU E ISSEN , S. AND S TEIN , B. (2004). Genre Classification of Web Pages.
In Proceedings of the 27th German Conference on Artificial Intelligence. Ulm.
M ILLARD , D. E., M OREAU , L., D AVIS , H. C., AND R EICH , S. (2000). FOHM:
A Fundamental Open Hypertext Model for Investigating Interoperability between Hypertext Domains. In Proceedings of the UK Conference on HyperText.
M INSKY, M. (1974). A Framework for Representing Knowledge. Technical
Report Memo 306, MIT-AI Laboratory.
M ITCHELL , T. (1997). Machine Learning. McGraw-Hill International, Singapore.
M OESSNER , L. (2001). Genre, Text Type, Style, Register: A Terminological
Maze? European Journal of English Studies, 5:131–138.
M ORTON , A. (1965). The Authorship of Greek Prose. Journal of the Royal Statistical Society, 128(2):169–233.
M ORTON , A. (1978). Literary Detection. Bowker, London.
M OSCHITTI , A. AND B ASILI , R. (2004). Complex Linguistic Features for Text
Classification: A Comprehensive Study. In Proceedings of the 26th European
Conference on Information Retrieval Research, pages 181–196.
M OSTELLER , F. AND WALLACE , D. (1984). Applied Bayesian and Classical Inference: The Case of the Federalist Papers. Addison-Wesley, Reading, MA, second
edition.
N ADEAU , C. AND B ENGIO , Y. (2003). Inference for the generalization error.
Machine Learning, 52(3):239–281.
O AKES , M. (1998). Statistics for Corpus Linguistics. Edinburgh University Press,
Edinburgh.
O AKES , M., G AIZAUSKAS , R., F OWKES , H., J ONSSON , A., WAN , V., AND
B EAULIEU , M. (2001). Comparison Between a Method Based on the ChiSquare Test and a Support Vector Machine for Document Classification.
In Proceedings of the ACM Special Interest Group on Information Retrieval (SIGIR01).
OUP (2003). New Dictionary of National Biography: Notes for Contributors. Oxford
University Press.
OUP (2004). Dictionary of National Biography. Oxford University Press.
PAPE , S. AND F EATHERSTONE , S. (2005). Newspaper Journalism: A Practical
Introduction. Sage, London.
P LUTARCH (c 100). Roman Lives: A Selection of Eight Lives. 1999, Penguin, London.
P ORTER , M. (1980). An Algorithm for Suffix Stripping. Program, 14.
Q UINLAN , J. R. (1988). Simplifying Decision Trees. In G AINES , B. AND B OOSE ,
J. (editors), Knowledge Acquisition for Knowledge-Based Systems, pages 239–252.
Academic Press, London.
Q UINLAN , R. (1993). C4.5: Programs for Machine Learning. Morgan Kauffman,
San Mateo, CA.
R ADEV, D. (1999). Generating Natural Language Summaries from Multiple Online
Sources. Ph.D. thesis, Columbia University, New York City.
R OPER , W. (1550). Life of St Thomas More. 2001, Ignatius Press, San Francisco, CA.
R OSCH , E. H. (1973). Natural Categories. Cognitive Psychology, 4(3).
R OTHERY, J. (1991). Developing Critical Literacy: An Analysis of the writing task
in a year 10 reference test. DSP, Sydney.
S ANTINI , M. (2004a). A Shallow Approach to Syntactic Feature Extraction for
Genre Classification. In 7th Annual CLUK Research Colloquium.
S ANTINI , M. (2004b). State-of-the-Art on Automatic Genre Identification. Technical
Report ITRI-04-03, Information Technology Research Institute, University of
Brighton.
S CHANK , R. C. AND A BELSON , R. P. (1977). Scripts, Plans, Goals and Understanding. Lawrence Erlbaum Associates, Hillsdale, N.J.
S CHIFFMAN , B., M ANI , I., AND C ONCEPCION , K. J. (2001). Producing Biographical Summaries: Combining Linguistic Knowledge with Corpus Statistics. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics. Toulouse.
S COTT, M. (2005). WordSmith Tools: Online Manual.
URL: http://www.lexically.net/downloads/version4/wordsmith.pdf
Accessed on 05-02-07.
S COTT, S. AND M ATWIN , S. (1999). Feature Engineering for Text Classification.
In Proceedings of the 16th International Conference on Machine Learning. Bled, SL.
S COTT, W. (1955). Reliability of Content Analysis: The Case of Nominal Scale.
Public Opinion Quarterly, 19:127–141.
S EBASTIANI , F. (2002). Machine Learning in Automated Text Categorization.
ACM Computing Surveys, 34(1):1–47.
S EMINO , E. AND S HORT, M. (2004). Corpus Stylistics. Routledge, London.
S HANNON , C. E. (1948). A Mathematical Theory of Communication. The Bell
System Technical Journal, 27:379–423 623–656.
S HELSTON , A. (1977). Biography. Methuen, London.
S HORT, M. (1996). Exploring the Language of Poems, Plays and Prose. Longman,
London.
S P ÄRCK J ONES , K. (1999). Automatic Summarizing: Factors and Directions.
In M ANI , I. AND M AYBURY, M. (editors), Advances in Automatic Text Summarization, pages 1–12. MIT Press, Cambridge, MA.
S PEARMAN , C. (1904). The Proof and Measure of Association Between Two
Things. American Journal of Psychology, 15:88–103.
S TAMATATOS , E., FAKOTAKIS , N., AND K OKKINAKIS , G. (2000a). Automatic
Text Categorization in Terms of Genre and Author. Computational Linguistics,
26(4):471–495.
S TAMATATOS , E., FAKOTAKIS , N., AND K OKKINAKIS , G. (2000b). Text Genre
Detection Using Common Word Frequencies. In Proceedings of the 18th conference on Computational Linguistics. Morristown, NJ, USA.
S VARTVIK , J. (editor) (1990). The London-Lund Corpus of Spoken English: Description and Research. Lund University Press: Lund, Sweden.
TAN , C., WANG , Y., AND L EE , C. (2002). The Use of Bigrams to Enhance Text
Categorization. Information Processing Management, 38(4):529–546.
TAYLOR , J. (2003). Linguistic Categorization. Oxford University Press.
T EICH , E. (1999). Systemic Functional Grammar in Natural Language Generation:
Linguistic Description and Computational Representation. Cassell, London.
T HUCYDIDES (c 411 BC). History of the Peloponnesian War. 1970, Penguin, London.
T OMAN , M., T ESAR , R., AND J EZEK , K. (2006). Influence of Word Normalization on Text Classification. In Proceedings of the International Conference on
Multidisciplinary Information Sciences and Technologies. Madrid.
T RIBBLE , C. (1998). Writing Difficult Texts. Ph.D. thesis, University of Lancaster.
U NIVERSITY OF A UCKLAND P RESS (1998). Dictionary of New Zealand Biography, volume 1-5. Auckland University Press/New Zealand Department of
Internal Affairs.
V OORHEES , E. M. (2001). Overview of the TREC 2001 Question Answering
Track. In Proceedings of the 2001 Text Retrieval Conference (TREC-2001).
WALES , K. (editor) (1989). Dictionary of Stylistics. Longman, London.
WATKINS , R. (2003). Vertical Cup-to-Disc Ratio: Agreement between Direct
Ophthalmoscopic Estimation, Fundus Biomicroscopic Estimation, and Scanning Laser Ophthalmoscopic Measurement. Optometry and Vision Science,
80:454–459.
W HITELAW, C. AND A RGAMON , S. (2004). Systemic Functional Features in
Stylistic Text Classification. In AAAI Fall Symposium on Style and Meaning in
Language, Art, Music and Design.
W ILKS , Y. AND C ATIZONE , R. (1999). Can We Make Information Extraction
More Adaptive? In Proceedings of the SCIE99 Workshop, Proceedings of the
SCIE99 Workshop. Rome.
W ITTEN , I. AND F RANK , E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, second edition.
W ITTEN , I., M OFFAT, A., AND B ELL , T. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco.
W ITTGENSTEIN , L. (1953). Philosophical Investigations. Translated by G.E.M. Anscombe. Blackwell, Oxford.
W OOD , A. (1691). Athenae Oxonienses. Bennett, London.
W YNNE , M. (2004). Writing a Corpus Cookbook. Technical report, Oxford
Text Archive.
URL: http://ahds.ac.uk/litlangling/linguistics/IRCS.htm
Accessed on 01-02-07.
X IAO , R. AND M C E NERY, A. (2005). Two Approaches to Genre Analysis: Three
Genres in Modern American English. Journal of English Linguistics, 33:62–82.
YANG , Y. AND P EDERSEN , J. O. (1997). A Comparative Study on Feature Selection in Text Categorization. In Proceedings of ICML-97, 14th International
Conference on Machine Learning. Nashville, US.
YANGARBER , R. AND G RISHMAN , R. (2000). Machine Learning of Extraction
Patterns from Unannotated Corpora: Position Statement. In Proceedings of
the 14th European Conference on Artificial Intelligence: ECAI-2000 Workshop on
Machine Learning for Information Extraction.
Y ULE , G. (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.
Z HOU , L., T ICREA , M., AND H OVY, E. (2004). Multi-Document Biographical Summarization. In Proceedings of the Conference on Empirical Methods in
Natural Language Processing (EMNLP-2004). Barcelona, Spain.