Approaches to Automatic Biographical Sentence Classification: An Empirical Study

By Michael Ambrose Conway

Submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy at The University of Sheffield

Department of Computer Science

May 2007


Who's Who

A shilling life will give you all the facts:
How Father beat him, how he ran away,
What were the struggles of his youth, what acts
Made him the greatest figure of his day;
Of how he fought, fished, hunted, worked all night,
Though giddy, climbed new mountains; named a sea;
Some of the last researchers even write
Love made him weep his pints like you and me.

With all his honours on, he sighed for one
Who, say astonished critics, lived at home;
Did little jobs about the house with skill
And nothing else; could whistle; would sit still
Or potter round the garden; answered some
Of his long marvellous letters but kept none.

W. H. Auden (1935)


Solomon Grundy

Solomon Grundy,
Born on Monday,
Christened on Tuesday,
Married on Wednesday,
Took ill on Thursday,
Worse on Friday,
Died on Saturday,
Buried on Sunday.
That was the end
Of Solomon Grundy.

Traditional


In general, most biographical material about me on the Internet is seriously flawed, if not outright wrong, and I know other writers are experiencing the same problem with their own data - so it must be something to do with the way Google and Yahoo squeeze information and make it do odd tricks. Sometimes I'll be introduced onstage at book events by a speaker saying, "Mr Coupland is German and once did an advertisement for Smirnoff vodka. He collects meteorites and lives in Scotland in a house with no furniture." I know. What are you supposed to say when you hear something like this?

Douglas Coupland (www.coupland.com)


Abstract

This thesis addresses the problem of the reliable identification of biographical sentences, an important subtask in several natural language processing application areas (for example, biographical multiple document summarisation, biographical information extraction, and so on). The biographical sentence classification task is placed within the framework of genre classification, rather than traditional topic based text classification. Before exploring methods for doing this task computationally, we need to establish whether, and with what degree of success, humans can identify biographical sentences without the aid of discourse or document structure. To this end, a biographical annotation scheme and corpus were developed and assessed using a human study. The human study showed that participants were able to identify biographical sentences with a good level of agreement. The main body of the thesis presents a series of experiments designed to find, from a range of alternatives, the best sentence representations for the automatic identification of biographical sentences. In contrast to previous work, which has centred on the use of single terms (that is, unigrams) for biographical sentence representations, the current work derives unigram, bigram and trigram features from a large corpus of biographical text (including the British Dictionary of National Biography). In addition to the use of corpus derived n-grams, a novel characteristic of the current approach is the use of biographically relevant syntactic features, identified both intuitively and through empirical methods.
The experimental work shows that a combination of n-gram features derived from the Dictionary of National Biography and biographically orientated syntactic features yields a performance that surpasses that gained using n-gram features alone. Additionally, in accordance with the view of biographical sentence classification as a genre classification task, stylistic features (for example, topic neutral function words) are shown to be important for recognising biographical sentences.


Acknowledgements

First of all, I would like to thank my supervisor Rob Gaizauskas for his invaluable guidance and advice during the preparation and writing of this thesis. I would also like to thank Rob for his initial suggestion that the British Dictionary of National Biography would be an interesting corpus for examining the characteristics of biographical texts. The members of my thesis committee provided valuable feedback on my research plan, for which I would like to thank them. Finally, I would like to thank my family and friends for their support.


Contents

1 Introduction
  1.1 Hypotheses
  1.2 Applications
  1.3 Biographical Sentence Classification: The Current Situation
  1.4 Thesis Outline
    1.4.1 Background Chapters
    1.4.2 Human Biographical Text Classification
    1.4.3 Automatic Biographical Text Classification

2 Background Issues for Biographical Sentence Recognition
  2.1 Genre
    2.1.1 Systemic Functional Grammar
    2.1.2 The Multi-Dimensional Analysis Approach
  2.2 Stylistics and Stylometry
    2.2.1 Stylistics
    2.2.2 Stylistic Analysis
    2.2.3 Stylometrics: Authorship Attribution
  2.3 Biographical Writing
    2.3.1 Characteristics of Biographical Writing
    2.3.2 Development of Biographical Writing
  2.4 Classification
    2.4.1 Automatic Text Classification
  2.5 Machine Learning
    2.5.1 Learning Algorithms
    2.5.2 Evaluating Learning
    2.5.3 Feature Selection
  2.6 Conclusion

3 Review of Recent Computational Work
  3.1 Automatic Genre Classification
    3.1.1 Recent Work on Genre in the Computational Linguistics Tradition
    3.1.2 Feature Selection for Topic Based Text Classification
    3.1.3 Feature Selection for Genre Classification
  3.2 Systems that Produce Biographies
    3.2.1 The Summarisation Task
    3.2.2 Multiple Document Summarisation
    3.2.3 New Mexico System
    3.2.4 Mitre/Columbia System
    3.2.5 Southern California System
    3.2.6 Southampton System
    3.2.7 Other Relevant Work
  3.3 Conclusion

4 Methodology and Resources
  4.1 Methodology
  4.2 Software
  4.3 Corpora
    4.3.1 Dictionary of National Biography
    4.3.2 Chambers Biographical Dictionary
    4.3.3 Who's Who
    4.3.4 Dictionary of New Zealand Biography
    4.3.5 Wikipedia Biographies
    4.3.6 University of Southern California Corpus
    4.3.7 The TREC News Corpus
    4.3.8 The BROWN Corpus
    4.3.9 The STOP Corpus

5 Developing a Biographical Annotation Scheme
  5.1 Existing Annotation Schemes
    5.1.1 Text Encoding Initiative Scheme
    5.1.2 University of Southern California Scheme
    5.1.3 Dictionary of National Biography Scheme
  5.2 Synthesis Annotation Scheme
    5.2.1 Developing a New Biographical Scheme
    5.2.2 A Synthesis Biographical Annotation Scheme
    5.2.3 Assessing the Synthesis Annotation Scheme
  5.3 Developing a Small Biographical Corpus
    5.3.1 Text Sources
    5.3.2 Issues in Developing a Biographical Corpus
  5.4 Conclusion

6 Human Study
  6.1 Introduction
  6.2 Agreement
    6.2.1 Percentage Based Scores
    6.2.2 The KAPPA Statistic
  6.3 Pilot Study
  6.4 Main Study
    6.4.1 Motivation
    6.4.2 Study Description
    6.4.3 Results
    6.4.4 Discussion
  6.5 Conclusion

7 Learning Algorithms for Biographical Classification
  7.1 Motivation
  7.2 Experimental Procedure
  7.3 Results
  7.4 Discussion
  7.5 Conclusion

8 Feature Sets
  8.1 Standard Features
  8.2 Biographical Features
  8.3 Syntactic Features
  8.4 Key-keyword Features
    8.4.1 Naive Key-keywords Method
    8.4.2 WordSmith Key-keywords Method
  8.5 Conclusion

9 Automatic Classification of Biographical Sentences
  9.1 Procedure
  9.2 Syntactic Features
    9.2.1 Results
    9.2.2 Discussion
  9.3 Lexical Methods
    9.3.1 Results
    9.3.2 Discussion
  9.4 Keywords
    9.4.1 Results
    9.4.2 Discussion
  9.5 Conclusion

10 Portability of Feature Sets
  10.1 Motivation
  10.2 Experimental Procedure
  10.3 Results
  10.4 Discussion
  10.5 Conclusion

11 Conclusion
  11.1 Contributions
    11.1.1 Hypothesis 1, Annotation Scheme and Human Study
    11.1.2 Hypothesis 2, Automatic Biographical Sentence Classification
  11.2 Future Work
    11.2.1 Biographical Sentence Classifier Module
    11.2.2 Improving Biographical Sentence Classification
    11.2.3 Extensions to the Biographical Sentence Classification Task
    11.2.4 Other Text Classification Tasks
    11.2.5 Genre Analysis
  11.3 Conclusion

A Human Study: Pilot Study
  A.1 Introduction
  A.2 Task Description
    A.2.1 Core Biographical Category
    A.2.2 Extended Biographical Category
    A.2.3 Non Biographical Category
  A.3 Task Questions
  A.4 Participant Responses

B Human Study: Main Study
  B.1 Instructions to Participants
    B.1.1 Introduction
    B.1.2 Six Biographical Categories
    B.1.3 Example Sentences
  B.2 Sentences
    B.2.1 Set One
    B.2.2 Set Two
    B.2.3 Set Three
    B.2.4 Set Four
    B.2.5 Set Five
  B.3 Agreement Data
    B.3.1 Set 1 Agreement Data
    B.3.2 Set 2 Agreement Data
    B.3.3 Set 3 Agreement Data
    B.3.4 Set 4 Agreement Data
    B.3.5 Set 5 Agreement Data

C Identifying Syntactic Features
  C.1 Distance From the Mean
  C.2 Standard Deviations from the Mean

D Syntactic Features

E Ranked Features

F Coverage of New Annotation Scheme
  F.1 Four Biographies from Various Sources
    F.1.1 Ambrose Bierce
    F.1.2 Philip Larkin
    F.1.3 Alan Turing
    F.1.4 Paul Foot
  F.2 Four Biographies from Wikipedia
    F.2.1 Jack Anderson
    F.2.2 Kerry Packer
    F.2.3 Richard Pryor
    F.2.4 Stanley Williams

G The Corrected Re-sampled t-test
  G.1 An Outline of the Corrected Re-sampled t-test

H Factor Analysis
  H.1 Constructing a Correlational Matrix
  H.2 Identifying Factors from a Correlational Matrix


List of Figures

2.1 Relationship Between Genre and Register (based on Eggins (1994)).
2.2 Example Registers.
2.3 Possible and Impossible Registers.
2.4 Linguistic Characteristics of Dimensions in Biber (1988).
2.5 Two Examples of Stylistically Different Texts from (Amis, 1985) and (Dennett, 1992), Respectively.
2.6 Examples of "Deep" and "Contingent" Features Described in McEnery and Oakes (2000).
2.7 Inverted Pyramid for Biographies.
2.8 Genre Features Decision Tree.
2.9 Decision Tree Rules Example.
2.10 Example Decision Tree for 3000 Instances.
2.11 Rule Based Learning Example.
2.12 Constituents of a Contingency Table.
3.1 System Network from Whitelaw and Argamon (2004).
3.2 Conversion of Documents to Document Vectors.
3.3 Genre Categories Used in Karlgren and Cutting (1994).
3.4 New Mexico System (Cowie et al., 2001).
3.5 Sample Output from Schiffman et al. (2001).
3.6 System Architecture for Zhou et al. (2004) MDS System.
3.7 Sample Output from ARTEQUAKT System (Kim et al., 2002).
4.1 Discrepancies in Annotation Styles in the USC (Curie Biographies).
4.2 BROWN Corpus Hierarchy of Text Types.
4.3 Truncated Example Entry from the STOP Corpus (Semino and Short, 2004): Michael Caine's Autobiography.
4.4 Hierarchy of Texts Included in the STOP Corpus (Semino and Short, 2004).
5.1 Dictionary of National Biography Opening Schema.
5.2 Relationship Between the Dictionary of National Biography and University of Southern California Biographical Schemes.
5.3 Relationship Between New Synthesis Scheme, Text Encoding Initiative Scheme, and University of Southern California Scheme.
5.4 Entry for Alan Turing in the Dictionary of National Biography Annotated Using New Six Way Scheme.
5.5 Wikipedia Biography for Richard Pryor Annotated Using New Six Way Scheme.
5.6 Sources of Documents Used From the Literary Genres of the STOP Corpus.
5.7 Types of Document Used in Biographically Tagged Corpus.
7.1 Mean Performance of Learning Algorithms with 10 x 10 Cross-Validation on "Gold Standard" Data Using a Unigram Based Feature Representation.
7.2 Root Section of a C4.5 Decision Tree Derived From the Gold Standard Training Data.
9.1 Comparison of the Performance of Unigrams, Bigrams and Trigrams.
9.2 Comparison of the Performance of Syntactic and Pseudo-Syntactic Features.
9.3 Experimental and Null Hypotheses — Syntactic Features.
9.4 Experimental and Null Hypotheses — Pseudo-Syntactic Features.
9.5 Comparison of the Performance of Differing Lexical Representations.
9.6 Experimental and Null Hypotheses — Stemming.
9.7 Experimental and Null Hypotheses — Function Words.
9.8 Experimental and Null Hypotheses — Stopwords.
9.9 Comparison of the Performance of Keywords, Key-Keywords, and Frequencies.
9.10 Experimental and Null Hypotheses — Key-Keywords.
9.11 Comparison of Partial Decision Trees for Each Feature Set.
10.1 Biographical Unigram Extraction from the USC Corpus.
10.2 Comparison of the Performance of Unigrams Derived from USC Annotated Clauses and Biographical Dictionary Unigram Frequency Counts.
10.3 Experimental and Null Hypotheses: USC and Biographical Dictionary Derived Features.


List of Tables

2.1 Genres Used by Biber (1988).
2.2 Text Typology Derived by Biber (1989).
2.3 Example Training Sentence Representations.
2.4 Example Test Sentence Representation.
2.5 Contingency Table.
4.1 Descriptive Statistics for Biographical Corpora.
5.1 Coverage of New Annotation Scheme Using Different Sources.
5.2 Coverage of New Annotation Scheme on Short Wikipedia Biographies (Deaths in December 2005).
5.3 Percentage of Biographical Sentences Based on 1000 Sentence Sample.
5.4 Descriptive Statistics for Biographical Corpora.
5.5 Average Number of Biographical Tag Types per Text.
6.1 Raw Agreement Scores (Idealised Example Data).
6.2 Types of KAPPA; Methods for Calculating Expected Probability.
6.3 Idealised Data for KAPPA Example.
6.4 Inter-classifier Agreement Results.
6.5 KAPPA Scores for Each Sentence Set.
7.1 Six Learning Algorithms Compared Using "Gold Standard" Data and a Feature Representation Based on the 500 Most Frequent Unigrams in the DNB: 10 x 10 Fold Cross Validation.
7.2 Six Learning Algorithms Compared Using "Gold Standard" Data and a Feature Representation Based on the 500 Most Frequent Unigrams in the DNB: 100 x 10 Fold Cross Validation.
8.1 100 Most Frequent Unigrams from the DNB.
8.2 100 Most Frequent Bigrams from the DNB.
8.3 100 Most Frequent Trigrams from the DNB.
8.4 Syntactic Features Used by Biber (1988).
8.5 Twenty Syntactic Features Most Characteristic of Biography Ranked by Maximum Distance from Mean.
8.6 Twenty Syntactic Features Characteristic of Biography Ranked by Positive Association with Biographical Genre.
8.7 Twenty Syntactic Features Characteristic of Biography Ranked by Negative Association with Biographical Genre.
8.8 Frequent Unigrams in the Biographical Corpus.
8.9 Unigrams in the Biographical Corpus Ranked by Naive Key-keyness.
8.10 Unigrams in the Biographical Corpus Ranked by WordSmith Key-keyness.
9.1 Performance of Syntactic and Pseudo-syntactic Features.
9.2 Performance of Alternative Lexical Methods.
9.3 Performance of Keyword and Key-keyword Features Relative to a Baseline.
10.1 Classification Accuracies of the USC and DNB/Chambers Derived Features on Gold Standard Data.
A.1 Pilot Study Data.
B.1 Agreement Data for Set 1.
B.2 Agreement Data for Set 2.
B.3 Agreement Data for Set 3.
B.4 Agreement Data for Set 4.
B.5 Agreement Data for Set 5.
C.1 Sixty-seven Features Ranked by Distance from the Mean (Irrespective of Whether the Distance is Positive or Negative).
C.2 Sixty-seven Features Ranked by Distance from the Mean.
C.3 Sixty-seven Features Ranked by Number of Standard Deviations from the Mean.
C.4 Sixty-seven Features Ranked by Number of Standard Deviations from the Mean.
E.1 100 Features Identified for DNB and TREC Data.


Chapter 1: Introduction

1.1 Hypotheses

This thesis presents and defends the claim that biographical writing can be reliably identified at the sentence level using automatic methods. In other words, biographical sentences can be identified as such independently of the context in which they occur (that is, the document or surrounding text). As sub-hypotheses to this more general claim, the thesis explores two secondary hypotheses:

Hypothesis 1: Humans can reliably identify biographical sentences without the contextual support provided by a discourse or document structure.

Hypothesis 2: "Bag-of-words" style sentence representations augmented with syntactic features provide a more effective sentence representation for biographical sentence recognition than "bag-of-words" style representations alone.

The first hypothesis seeks to establish whether the research programme is feasible. That is, if humans can identify isolated biographical sentences without the aid of a supporting discourse structure, then it is likely that a machine learning algorithm will be able to perform the same task. Once we have established that humans are able to identify biographical statements with good agreement (see Chapter 6), we can move to Hypothesis 2, which claims that the topic orientated text representations commonly used in text classification (and information retrieval) research are, on their own, less useful for the biographical sentence classification task than a combination of topic orientated and stylistic (non-topical) features. Hypothesis 2 is designed to explore the usefulness of syntactic and stylistic features for genre classification.

The main hypothesis and its two sub-hypotheses provide a framework for the thesis, but other research questions are addressed within that framework (for example, the utility of the key-keywords methodology (see page 21) for identifying biographical features). The remainder of this introductory chapter provides some motivation for the research work, focusing on potential uses for a successful biographical sentence classifier, and then gives a chapter-by-chapter outline of the thesis, concentrating on how each chapter addresses the hypotheses identified.

1.2 Applications

Biographically orientated summarisation, and in particular biographically orientated multiple document summarisation, is dependent on the reliable identification of biographical sentences: sentences that may not be judged salient using standard information retrieval techniques and which may be "buried" in otherwise event orientated text. In situations where gigabytes of data are analysed for biographical sentences, more linguistically orientated natural language processing approaches to identifying biographical sentences — say, full parsing — are inappropriate due to the significant overhead of producing a linguistic representation of the document and its constituent sentences. In this situation, standard "bag-of-words" representations combined with highly focused syntactic features may prove a reliable and scalable solution, able to identify biographical writing hidden among potentially huge document collections.
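To make the intended kind of representation concrete, the sketch below maps a sentence to a dictionary of binary unigram features drawn from a fixed word list and augments it with two shallow syntactic cues. It is an illustration only: the vocabulary, the cue tests (crude surface approximations rather than the properly identified syntactic features used later in the thesis) and the example sentence are all invented for this sketch.

```python
# Illustrative sketch only: the vocabulary, the "syntactic" cues (approximated here
# by crude surface tests rather than the features actually used in the thesis)
# and the example sentence are invented for the example.

VOCABULARY = ["born", "educated", "married", "died", "published", "today", "company"]

SYNTACTIC_CUES = {
    # Very rough stand-ins for cues such as past-tense verbs and third person pronouns.
    "past_tense_cue": lambda s: any(tok.endswith("ed") for tok in s.lower().split()),
    "third_person_pronoun": lambda s: any(tok in {"he", "she", "his", "her"}
                                          for tok in s.lower().split()),
}

def featurise(sentence, use_syntax=True):
    """Map a sentence to a dictionary of binary features."""
    tokens = {tok.strip(".,;:!?\"'") for tok in sentence.lower().split()}
    features = {"w=" + word: int(word in tokens) for word in VOCABULARY}
    if use_syntax:
        for name, test in SYNTACTIC_CUES.items():
            features["syn=" + name] = int(test(sentence))
    return features

if __name__ == "__main__":
    # Example: a biographical-looking sentence.
    print(featurise("She was born in Vienna and educated at home."))
```

Feature dictionaries of this kind can be passed to any standard learner (for example, Naive Bayes); the feature sets actually used in the thesis are described in Chapter 8.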
Potential uses for a biographical sentence classifier could include:

- A sentence filter in a person orientated information extraction system. That is, sentences classified as biographical with respect to the person of interest could be selected for further processing.

- A research tool for journalists and writers seeking only biographical information from information retrieval results on the web. For example, a writer constructing a biography for a named historical individual may be interested in filtering out those sentences that are not biographical with respect to the individual of interest, even if this causes some textual incoherences/disfluencies.

- A module in a more general genre identification system. For example, a system that classifies web pages with respect to their primary information function, with the proportion of biographical sentences above a given threshold leading to the document being tagged as biographical as opposed to some other genre.

More potential uses of a biographical sentence classifier (along with areas of further research) are outlined in the concluding chapter.

1.3 Biographical Sentence Classification: The Current Situation

While recent work on identifying biographical writing has been carried out in the context of the development of architectures and algorithms for biographically orientated automatic summarisation systems (Schiffman et al., 2001; Kim et al., 2002; Cowie et al., 2001; Zhou et al., 2004), little work in the computational linguistics tradition has been directed towards the elaboration of a theoretical framework for the identification of biographical writing. Previous work has concentrated on heuristics for the identification of biographical phrases or sentences (for example, the assumption that apposition is a strong indicator of biographical clauses (Schiffman et al., 2001)), or on the construction of traditional domain specific ontologies (for example, a system that is limited to the domain of artists' biographies (Kim et al., 2002)). Zhou et al. (2004) describes a system developed at the University of Southern California that uses corpus evidence to construct biographical schemas (that is, to identify the kinds of information that form a biography). However, their sample is limited to 120 biographies of 10 individuals, collected as a preliminary study during the construction of a biographical multiple document summarisation system. Neither the text classification nor genre identification communities have focused research efforts directly on the identification of biographical sentences, although the identification of biographical sentences can easily be reframed as a sentence level text classification problem, and also as a genre identification problem if we accept that biographical writing constitutes a distinct genre.

1.4 Thesis Outline

The thesis can be divided into three parts. Chapters 2, 3 and 4 discuss background issues necessary for understanding the work as a whole. Chapters 5 and 6 address Hypothesis 1 (humans can reliably identify biographical sentences without the contextual support provided by a discourse or document structure).
Chapters 7, 8, 9 and 10 address Hypothesis 2 ("bag-of-words" style sentence representations augmented with syntactic features provide a more effective sentence representation for biographical sentence recognition than "bag-of-words" style representations alone) and also present results pertaining to the automatic classification of biographical sentences more generally. The concluding chapter summarises the results of the thesis. Additional material is presented in appendices, which are referenced when appropriate in the text of the thesis.

1.4.1 Background Chapters

These three chapters contain the background necessary to understand the thesis as a whole.

Chapter 2: Background Issues for Biographical Sentence Recognition

This chapter provides essential background to the central themes of the thesis. The first section explores the notion of genre, and discusses two recent theoretical approaches to the study of genre: Systemic Functional Grammar and Multi-Dimensional Analysis. The second section sets forth a survey of stylistics (a subdiscipline of linguistics relating to the study of style) and its statistically influenced intellectual cousin, stylometry. The third section presents a summary of the history of biographical writing in English, along with a discussion of some of the characteristics of biographical writing. The fourth section discusses classification theory, and especially issues associated with text classification. The fifth and final section reviews several of the important machine learning algorithms used in the work, along with a discussion concerning methods of evaluating the performance of learning algorithms.

Chapter 3: Review of Recent Computational Work

This chapter aims to describe recent computational work relevant to the biographical sentence classification task. The chapter is divided into two sections. The first section describes work in automatic genre classification (focusing on the different strategies used in topic categorisation and genre categorisation). The second section describes several working systems that produce biographies from unstructured data.

Chapter 4: Methodology and Resources

The chapter is divided into three sections. The first section describes the methodology used in the work. The second and third sections outline the two kinds of resources — software and corpora, respectively — used in the research.

1.4.2 Human Biographical Text Classification

These two chapters address the issue of whether people are able to reliably identify biographical sentences. The two chapters describe the formulation of a biographical annotation scheme (that is, a criterion for deciding whether a sentence is biographical or non-biographical), the development of a biographical corpus based on this scheme, and the utilisation of that scheme in a human study which assesses the extent to which participants agree on what is and is not a biographical sentence.

Chapter 5: Developing a Biographical Annotation Scheme

This chapter first reviews some existing annotation schemes for biographical text (including the scheme developed under the auspices of the Text Encoding Initiative). The second section describes a new synthesis annotation scheme, along with an initial assessment of that scheme. The final section describes a biographical corpus based on the new annotation scheme.

Chapter 6: Human Study

This chapter is divided into three main sections.
First, some necessary background on agreement statistics is presented, with special reference to the KAPPA statistic. Second, a pilot study is reported which uses a three way biographical sentence classification scheme. Finally, the main study is presented, which shows that the biographical annotation scheme developed in Chapter 5 yields good levels of agreement between annotators. The high levels of agreement obtained in this study support the claim made in Hypothesis 1, that people are able to distinguish reliably between biographical and non-biographical sentences.

1.4.3 Automatic Biographical Text Classification

Chapters 7, 8, 9 and 10 present results pertaining to the automatic classification of biographical sentences. The corpus of biographical sentences developed in Chapter 5 and validated in Chapter 6 forms the basis of a gold standard corpus of five hundred and one biographical and non-biographical sentences, which is utilised in these machine learning chapters as training and test data. While Chapters 5 and 6 established that people are able to reliably distinguish between biographical and non-biographical sentences, this part of the thesis addresses, among other issues pertinent to automatic biographical sentence classification, Hypothesis 2: "bag-of-words" style sentence representations augmented with syntactic features provide a more effective sentence representation for biographical sentence recognition than "bag-of-words" style representations alone.

Chapter 7: Learning Algorithms for Biographical Classification

This chapter compares the performance of five popular text classification algorithms when applied to the biographical sentence classification task. Each learning algorithm was tested using the gold standard biographical data and a unigram based feature set consisting of the 500 most frequent words in the Dictionary of National Biography. The chapter serves as a "first pass" over the data, allowing indicative conclusions to be drawn about the usefulness of different machine learning algorithms for the biographical sentence classification task. Later chapters (that is, Chapters 9 and 10) concentrate on varying the feature sets used, rather than the machine learning algorithms. The Naive Bayes algorithm generated the best results.

Chapter 8: Feature Sets

In order to provide the necessary background to Chapters 9 and 10, which focus on evaluating the performance of various feature sets, this chapter outlines the different feature sets used in the work. These features include standard features (including "bag-of-words" style features), biographical features, syntactic features, and (so-called) key-keyword based features.

Chapter 9: Automatic Classification of Biographical Sentences

This chapter is divided into three sections. Section one considers the performance of syntactic features versus "bag-of-words" style sentence representations for the biographical text classification task. It is discovered that augmenting a "bag-of-words" style representation with syntactic features improves classification accuracy, but not at a statistically significant level, lending limited support to Hypothesis 2. Section two considers the impact of function words on classification accuracy, in line with the intuition (borrowed from stylometry) that topic neutral function words are important in representing the non-topical content of a text.
It was discovered that accuracy decreases at a statistically significant level when function words are removed from feature sets. The third part of the chapter assesses the key-keyword methodology for selecting genre specific features and finds that this method, which was developed as an alternative to Multi-Dimensional Analysis (see Chapter 2), yields worse results than simply using frequent unigrams.

Chapter 10: Portability of Feature Sets

This chapter compares the performance of a feature set identified by Zhou et al. (2004) (using a method that required a considerable annotation effort) to a feature set derived automatically from frequent unigrams in biographical corpora. It is shown that both approaches yield almost identical classification accuracy scores, suggesting that there is little gain in using this labour intensive feature identification method.

Chapter 11: Conclusion

The concluding chapter is divided into two sections. First, contributions made by the thesis work are described. Second, possible future directions, based on the work conducted for this thesis, are outlined.


Chapter 2: Background Issues for Biographical Sentence Recognition

This chapter is designed to provide essential background to the central theme of the thesis, that biographical sentence classification can be viewed as a genre classification task where non-topical sentence representations are useful (hence the extended discussion of genre and stylistics — fields of study which stress the non-topical characteristics of text — in Sections 2.1 and 2.2, respectively). The chapter consists of five sections. First, different approaches to genre are discussed (as the thesis places biographical text recognition within the wider area of genre recognition). Second, a brief overview of stylistics is presented, as stylistics is traditionally concerned with non-topical criteria for distinguishing texts. Third, salient characteristics of biographical writing are outlined, including a brief history of the biographical genre. Fourth, an outline of classification theory is presented, concentrating on the particular problems associated with text classification. Fifth, a review of several of the important machine learning algorithms used in this work is presented, including a discussion of the important subject of how to evaluate learning.

2.1 Genre

Aristotle (c. 340 BC) was the first to develop a systematic theory of genre. Although his classification framework is rooted in the culture of antiquity, it remains important as it established a framework for subsequent developments. Dubrow (1982) borrows Whitehead's description of the history of philosophy as a series of footnotes to Plato, and characterises the history of genre studies as a series of footnotes to Aristotle. This section focuses on two active research traditions that have genre as a core concern: Systemic Functional Linguistics and Multi-Dimensional Analysis.

2.1.1 Systemic Functional Grammar

Systemic Functional Grammar (SFG), also known as Systemic Functional Linguistics, is a sociolinguistic theory of language use first developed by Michael Halliday in the 1960s (Halliday, 1961, 1966), and extended by Halliday and others throughout the 1970s, 1980s and 1990s (Halliday and Hasan, 1976; Martin et al., 1996). The current theory is presented comprehensively in the most recent edition of Introduction to Functional Grammar (Halliday and Matthiessen, 2004).
This section aims to provide a brief overview of SFG, before going on to give some details of SFG grammatical analysis and, finally, describing the importance of SFG to genre. Note that we do not leap into a discussion of the function of genre in SFG for two main reasons. First, SFG is a very dense and complex theory, and requires some exposition before an analysis of genre in relation to the theory can be helpful. Second, SFG uses an extensive technical vocabulary, sometimes with words (for example, the term "genre" itself) used in surprising ways.

A Brief Overview of Systemic Functional Theory

SFG is a functional theory of grammar, emphasising how language is embedded in a social context of use. This emphasis on language use distinguishes SFG from formal — Chomskyan — theory, which concerns itself with the ways in which minds shape and constrain possible grammars. SFG is a socio-linguistic theory of grammar, whereas formal linguistics is a psychological theory (Martin et al., 1997). In this way, SFG can be viewed as an intellectual descendant of the later Wittgenstein (Wittgenstein, 1953) and, more directly, of Firth's approach to linguistics (Honeybone, 2005). Malinowski's work in anthropology, which stressed the importance of language in negotiating, developing and consolidating human relationships (Malinowsky, 1935), was also very influential in the development of SFG.

The core concern of SFG (and its chief point of contrast with formal linguistics) is its preoccupation with the questions: What do people do with language, and how do they do it? These questions are grounded in a view of language based around four theoretical assumptions (articulated by Eggins (1994)):

1. Language use is functional.
2. The function of language is to make meanings.
3. Meanings are influenced by social and cultural contexts.
4. Language use is a process of making meanings by choosing. (Eggins (1994) uses the word semiotics to describe this process.)

The social orientation of SFG, and its view of language as a "strategic, meaning making resource" (Eggins, 1994, p. 1), mean that SFG does not only provide a method for grammatical analysis, but can aptly be applied to various practical problems. Eggins (1994) lists several areas in which SFG has been successfully applied, including language education (for example, Rothery (1991)), speech pathology (for example, Armstrong (1991)) and natural language generation (Teich, 1999). Ideas from SFG have also been applied to textual genre classification (for example, Whitelaw and Argamon (2004), described on page 54).

A distinctive feature of SFG is that it provides a theoretical structure at multiple levels of granularity. It is able both to provide fine grained grammatical analysis at the level of parts-of-speech within clauses and to interpret the broader communicative act in its social context, unified by the idea of meaning as a function of linguistic choice. These meanings — the meanings that a text produces — can be divided into three categories (Halliday and Matthiessen, 2004):

Ideational meaning: Roughly "aboutness". Ideational meaning is concerned with processes and relationships.

Interpersonal meaning: The social dynamic assumed in the text. Note that even expository text can be viewed as interpersonally meaningful if construed as a dialogue between writer and reader.
Textual meaning: The textual meaning (as opposed to ideational or interpersonal meaning) is that component of the text that is "about" the text itself; that is, how the text is organised. For example, are persons or abstract nouns referred to?

SFG is grounded in the social use of language and, as such, focuses (as we have mentioned) on entire texts rather than isolated sentences: according to SFG theory, communication has a beginning, a middle and an end, and can only properly be analysed in its entirety. A further feature of SFG, a natural development of the theory that meaning is a function of linguistic choice, is the system network, a formalism for representing linguistic choice (see Figure 3.1 on page 56 for an example of a system network).

Systemic Functional Grammar: Register and Genre

"Genre" and "register" are semi-technical terms within SFG (the terms were identified by Malinowsky (1935)). Register is typically defined as "the context of situation" and genre as "the context of culture" (Halliday and Matthiessen, 2004). What this means is that the context of a text can be described at two levels of abstraction: the lower level (register) describes the situation, and the higher level (genre) explains the situation in terms
The purpose of a text — its genre — can be discerned from register information (field, tenor and mode) only through a knowledge of culture. In other words, genre is a “meta” level of analysis above descriptive (situational) register information. For example, we may know the field, tenor, and mode of a situation (newspaper, shop worker - customer, face-to-face, respectively), but only because we know about shops, newspapers and customers (and the way these entities interact in a particular culture) can we correctly ascertain that the genre here is face-to-face routine transaction. E GGINS (1994) identifies two ways in which genre is “mediated” through register. First, register helps to “fill in the slots” for a particular genre. To use Eggin’s example, if we use “university essay” as a model genre, register serves the function of “filling in the specifics relevant to a particular situation of use of that genre” (E GGINS, 1994, p . In other words, while all instances of the genre share the same structure (statement of thesis, presentation of evidence, and so on), the “field” will change according to the discipline of the essay (anthropology, sociology, and so on). Second, E GGINS (1994) presents the concept of genre potential, which is described as all possible register configurations that are culturally possible (possible in a given culture). For instance, Figure 2.3 on the next page shows two register configurations, one of which belongs to the distance learning (or correspondence) course genre, and one which cannot belong to any genre (Karate cannot be learned through distance learning). Different genres are characterised by distinct schematic structures. These schematic structures are similar in structure to the frame and script structures associated with traditional artificial intelligence (for example M INSKY (1974) and S CHANK and A BELSON (1977)). E GGINS (1994) uses the example of the recipe genre. The form of a recipe is highly predicable (recipe title, ingredients, instructions) and any deviation from this norm is surprising to the reader. Not all genres are characterised by this ideational function. Spoken conversations can be primarily concerned with the consolidation of social relationships — the interpersonal function — rather than the transmission of ideas. 2.1.2 The Multi-Dimensional Analysis Approach Biber’s work on genre and text types – and especially his multi-dimensional methodology, where factor analysis is used to identify salient differences between 11 C HAPTER 2: B ACKGROUND I SSUES FOR B IOGRAPHICAL S ENTENCE R ECOGNITION Figure 2.2: Example Registers. Television weather forecast field: weather tenor: television presenter - anonymous audience mode: television broadcast Internet weather forecast field: weather tenor: corporate author - anonymous audience mode: electronic text Buying a newspaper at a newsagent field: newspaper tenor: shop worker - customer mode: face-to-face Figure 2.3: Possible and Impossible Registers. Possible register: accountancy distance learning course field: accountancy tenor: lecturer - student mode: written Impossible register: karate distance learning course field: karate tenor: lecturer - student mode: written genres and text types5 — began in the 1980s in collaboration with his doctoral supervisor, Edward Finegan (B IBER and F INEGAN, 1986) and was developed through the late 1980s (especially in the book length treatment Variations Across Speech and Writing (B IBER, 1988), which was based on Biber’s PhD thesis). 
The research programme focused on identifying dimensions of difference between texts in English (B IBER, 1988) and then using these dimensions as a basis for the development of an empirically based theory of text types (B IBER, 1989). 6 5 The notion of text types like the notion of genre does not have a settled definition and is used in different ways in different research traditions. See M OESSNER (2001) for a discussion of this and other terminological issues. 6 The multi-dimensional approach can be viewed as part of the development of corpus linguistics. Instead of intuitively analysing the features associated with different genre, multi- 12 C HAPTER 2: B ACKGROUND I SSUES FOR B IOGRAPHICAL S ENTENCE R ECOGNITION The themes and methodologies identified in Biber’s early work were taken up by other researchers and applied to new areas. For instances, K IM and B IBER (1994) focuses on empirically determining the underlying dimensions of variation in contemporary Korean using a corpus based approach and multidimensional analysis, and B IBER ET AL . (2002) uses a corpus of written and spoken academic discourse7 to identify prevalent registers in the hope that this can inform pedagogy for students of English as a second language. This section first explores the central distinction in multi-dimensional studies in variation; the difference between genre and text type, before going on to outline the broad methodology used in studies of linguistic variation. Then, the impact of the research programme developed by Biber and Finegan is described, with reference to work from various areas, including language teaching (as briefly mentioned above) and historical linguistics. Finally, some criticisms of, and potential substitutes for, the multi-dimensional approach are presented and assessed. For Biber, a typology of text types is any classification scheme for texts. So the traditional categorisation provided by Aristotle (into comedy, tragedy, epic poetry, and so on.) is a typology, as is the categorisation scheme given by traditional discourse theory, where modes of discourse are either narrative, descriptive, expositionary, or argumentative. Genre is one kind of typology, which Biber describes as a “folk-typology” (B IBER, 1989). By describing genre as a “folk-typology” system, Biber means that the genre of a text is classified using its most easily discernable attribute; its external format — whether it is a newspaper article, technical manual, novel, and so on — rather than its internal linguistic features. Biber presents a new typology of English, based on the empirically identified internal linguistic features of a text, rather than relying on either the hand-medown genre distinctions of folk-typology, or the intuitions of linguistics about which features of a text may be important, resulting in the discovery that the traditional distinctions of folk-typology do not map clearly to the classification scheme generated by his empirical study: Genre distinctions do not adequately represent the underlying text types of English... Texts within particular genres can differ greatly in their linguistic characteristics; for example, newspaper articles can range from extremely narrative and colloquial in linguistic form to extremely informational and elaborated in form. On the other hand, different genres can be quite similar linguistically; for example, newspaper articles and popular magazines can be nearly identical in form. 
    Linguistically distinct texts within a genre represent different text types; linguistically similar texts from different genres represent a single text type. (Biber, 1989, p. 6)

So, while genre may be a useful text typology insofar as it reflects the external characteristics and social functions of texts — whether they are newspaper articles, children's books, and so on — it fails to reflect the internal linguistic qualities of the text. A given genre may well contain texts of several different text types. For example, the genre of academic papers includes argumentative papers and survey papers, each with different purposes, structures and styles; yet despite their linguistic differences, both are included in the genre "academic papers". Similarly, argumentative texts may occur across a range of traditional genre categories. Examples here could include newspaper editorials, academic papers, and political pamphlets.

Biber's work is distinctive because it is empirically based. Instead of examining representative documents from each genre and attempting to extract the distinctive linguistic features common to all documents, Biber uses linguistic features to identify text categories, and to construct a typology:

    It should be noted that the direction of analysis here is opposite from that typically used in the study of language use. Most analyses begin with a situated or functional description and identify linguistic features associated with that distinction as a second step... The opposite approach is used here: quantitative techniques are used to identify the groups of features that actually co-occur in texts and afterwards these groupings are interpreted in functional terms. The linguistic dimension, rather than the functional dimension is given priority. (Biber, 1988, p. 13)

Biber's system involves two stages. First, the linguistic features of a collection of texts are analysed using statistical techniques, and a set of dimensions based on this statistical analysis is presented. Second, using cluster analysis, a typology of texts for English is proposed, based on the previously derived dimensions.

Multi-Dimensional Variation: Methodology

Stage One: Deriving Textual Dimensions

Biber (1988) identifies five dimensions of variation in English across twenty-three genres (see Table 2.1). Four hundred and eighty-one texts were selected from the Lancaster-Oslo-Bergen corpus (Johansson et al., 1978) and the London-Lund corpus of spoken English (Svartvik, 1990). Through a study of the linguistics literature, Biber identified sixty-seven features that
On the basis of the factors identified, Biber determined six dimensions of variation:8

1. Involved versus informational production — Involved text is informal and stresses interpersonal relationships. Examples of involved genres include personal letters, interview transcripts, and so on. Features strongly associated with involved text are present tense, contractions and second person pronouns. Informational text is primarily concerned with the transmission of information. It is formal, dense with nouns and prepositions, and lacks contractions and pronouns. Examples of informational genres would include academic papers and government reports.

2. Narrative versus non-narrative concerns — Narrative texts are characterised by extensive use of the past tense and third person pronouns. Narrative genres include fiction and biography. Non-narrative texts are associated with heavy use of the present tense and can include genres like academic papers, government documents, technical manuals, and so on.

3. Explicit versus situation dependent reference — Biber (1988) describes this dimension as a "dimension that distinguishes highly explicit and elaborated endophoric reference from situation dependent exophoric reference." In other words, explicit (endophoric) genres (like academic prose and government documents) are self-contained; they do not depend on extensive reference to an unexplained situation in order to be intelligible. Relative clauses are highly characteristic of explicit genres. Situation dependent genres, however, assume a high level of familiarity with the domain in question. Biber (1988) uses the example of a football commentary, which only makes sense because we have background knowledge about football and the kind of events that occur at football matches.

4. Overt expression of persuasion — Important features distinctive of persuasive writing are the occurrence of suasive verbs (for example, "should"). Persuasive genres include press editorials and argumentative academic prose. These contrast with genres whose function is solely to relay information (for example, technical manuals).

5. Abstract versus non-abstract style — Features that indicate abstract writing include the use of the passive voice and agentless verbs (characteristic of academic writing). Non-abstract style is simply the absence of the features that indicate abstraction.

6. Online informational elaboration — Here "online" refers to those genres that are characterised by the relay of information in real time (for example, prepared speeches, and so on). These "online" genres are dense with that-complements.

8 Biber (1988) identified six dimensions, but in later work this was reduced to five dimensions (Biber, 1989). The dropped dimension was online informational elaboration. The first three dimensions have both positive and negative features; the last three dimensions only have positive features.

Table 2.1: Genres Used by Biber (1988). "LOB" refers to the Lancaster-Oslo-Bergen Corpus (Johansson et al., 1978) and "London-Lund" refers to the London-Lund Corpus of Spoken English (Svartvik, 1990).

    Genre                        Number of Texts   Source        Spoken/Written
    Press reportage              44                LOB           Written
    Editorials                   27                LOB           Written
    Press reviews                17                LOB           Written
    Religion                     17                LOB           Written
    Skills and hobbies           14                LOB           Written
    Popular lore                 14                LOB           Written
    Biographies                  14                LOB           Written
    Official documents           14                LOB           Written
    Academic prose               80                LOB           Written
    General fiction              29                LOB           Written
    Mystery fiction              13                LOB           Written
    Science fiction              6                 LOB           Written
    Adventure fiction            13                LOB           Written
    Romantic fiction             13                LOB           Written
    Humour                       9                 LOB           Written
    Personal letters             6                 Biber         Written
    Professional letters         10                Biber         Written
    Face-to-face conversation    44                London-Lund   Spoken
    Telephone conversation       27                London-Lund   Spoken
    Public conversations         22                London-Lund   Spoken
    Broadcast                    18                London-Lund   Spoken
    Spontaneous speeches         16                London-Lund   Spoken
    Prepared speeches            14                London-Lund   Spoken

Linguistic features are associated with each "end" of a dimension, as determined by factor analysis (see Figure 2.4). For example, if we take the narrative versus non-narrative dimension, past tense is associated with the narrative end of the dimension and present tense with the non-narrative end of the dimension.

Figure 2.4: Linguistic Characteristics of Dimensions in Biber (1988).

    Dimension 1, Involved vs Informational Production.
      Positive features: private verbs, contractions, present-tense verbs, second person pronouns, analytic negation, WH-questions, adverbs.
      Negative features: prepositions, attributive adjectives.
    Dimension 2, Narrative vs Non-Narrative Concerns.
      Positive features: past-tense verbs, third person pronouns, public verbs, present participle clauses.
      Negative features: present-tense verbs.
    Dimension 3, Explicit vs Situation Dependent Reference.
      Positive features: phrasal co-ordination, nominalization.
      Negative features: time adverbials, place adverbials.
    Dimension 4, Overt Expression of Persuasion.
      Positive features: infinitives, predictive modals, necessary modals, split auxiliaries.
      No negative features.
    Dimension 5, Abstract vs Non-Abstract Style.
      Positive features: conjuncts, agentless passives, past participle clauses, other adverbial subordination.
      No negative features.

The Second Stage: Deriving a Textual Typology

Biber (1989) builds on his identified dimensions to create an empirically grounded typology of English texts. To derive the typology, he gave each text a "dimension score" based on the frequency of linguistic features associated with that dimension. The dimension score was determined for each dimension by counting the frequencies of features associated with one end of the dimension (for instance, in the case of the Narrative versus Non-Narrative dimension these would be features associated with the Narrative end of the dimension, like past tense verbs or third person pronouns) and subtracting the total number of features associated with the other end of the dimension (for instance, features associated with the Non-Narrative end of the dimension, like present tense verbs or time adverbials).
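The arithmetic of the dimension score can be made explicit with a short Python sketch. The feature groupings follow the description above, but the frequency values are invented, and Biber's published procedure also standardises each feature frequency before summing, a step omitted here.

    # Feature groupings for the Narrative versus Non-Narrative dimension,
    # following the description above; the per-1,000-word frequencies below
    # are invented for a single imaginary text.
    NARRATIVE_POSITIVE = ["past_tense_verbs", "third_person_pronouns", "public_verbs"]
    NARRATIVE_NEGATIVE = ["present_tense_verbs", "time_adverbials"]

    def dimension_score(freqs, positive, negative):
        # Sum the frequencies of the features at one pole of the dimension
        # and subtract the frequencies of the features at the other pole.
        return (sum(freqs.get(f, 0.0) for f in positive)
                - sum(freqs.get(f, 0.0) for f in negative))

    text_freqs = {"past_tense_verbs": 42.0, "third_person_pronouns": 31.5,
                  "public_verbs": 6.2, "present_tense_verbs": 18.0,
                  "time_adverbials": 4.1}

    print(round(dimension_score(text_freqs, NARRATIVE_POSITIVE,
                                NARRATIVE_NEGATIVE), 1))  # 57.6

A text with a strongly positive score on this dimension is markedly narrative; a strongly negative score indicates non-narrative concerns.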
A vector of the scores for each dimension was then subjected to cluster analysis, and the optimal grouping of the data resulted in eight distinct clusters. These clusters form the basis of Biber's text typology (see Table 2.2).

Table 2.2: Text Typology Derived by Biber (1989).

    Intimate interpersonal interactions    Informational interaction
    Scientific exposition                  Learned exposition
    Imaginative narrative                  General narrative exposition
    Situated reportage                     Involved persuasion

The text types identified do not map directly to genres. That is, the text types identified by internal linguistic characteristics do not map directly to the external genre categories of the corpus used by Biber (1988) (see Table 2.1 for a list of the genres used), which is in accordance with Biber's notion of text typologies as fully determined by the linguistic qualities of a text rather than by external function or location. Consistent with this, texts from the Biography genre were spread across four of the text types identified by Biber: scientific exposition (7%), learned exposition (29%), imaginative narrative (7%) and general narrative exposition (57%). Over half the biographical examples from the corpus came from the General narrative exposition text type. One of the goals of the current research is to identify those features that identify biographical writing across text types.

The Multi-Dimensional Analysis Approach: Further Applications

The multi-dimensional analysis approach to linguistic variation has been highly influential in the corpus linguistics research tradition (McEnery and Wilson, 1996), principally in the areas of synchronic studies, diachronic studies, studies of variation in languages other than English, and studies of variation applied to language learning. These four areas will be briefly described using one representative study for each area.

Synchronic Studies

Synchronic studies investigate linguistic variation between genres at a specific point in time. That is, synchronic studies of variation attempt to account for variation between registers (or genres) rather than the changes in a register over time. An example of this kind of study is Csomay (2002), who used multi-dimensional analysis to study a corpus of US higher education spoken lectures.9 The aim of the project was to explore how lectures differ with respect to educational level and degree of interactivity. Twenty-three linguistic features associated with academic lectures were identified (including a high frequency of nouns, use of attributive adjectives and the passive voice), and placed in dimensions similar to those of Biber (1988).
After subjecting the feature scores to cluster analysis, Csomay (2002) found that non-interactive lectures showed a similar pattern to academic prose, whereas lectures (at a similar level) with a highly interactive component were more akin to spoken discourse (for example, they had fewer passive constructions).

9 The 173 lectures were selected from the TK2-SWAL Corpus, a spoken and written corpus of US higher education discourse.

Diachronic Studies

Diachronic studies investigate variation within a single register (or perhaps groups of registers) over time. The multi-dimensional variation methodology has been particularly important in historical linguistics, of which an example is Atkinson (1992), who analysed the development of scientific writing between 1735 and 1985 in the Edinburgh Medical Journal. The journal was sampled at intervals of approximately 45 years (that is, 1735, 1774, 1820, 1864, 1905, 1945 and 1985), with ten articles taken from each period. Further, after an informal analysis of the articles, they were placed into one of five possible categories:

1. case reports (narratives of single cases)
2. disease reviews (summaries of knowledge with respect to a particular disease or ailment)
3. treatment reviews (summaries of possible treatments for a given disease or condition)
4. experimental reports (reports of individual scientific experiments or studies)
5. reproduced speeches (reproduced speeches of medical luminaries)

When subjected to multi-dimensional analysis, Atkinson (1992) found that the Journal had become progressively less narrative over its history across all five identified article types, and also that the language became less overtly persuasive over the same period, perhaps reflecting a more modern, "dispassionate" and objective scientific writing style.

Languages Other Than English

The use of the multi-dimensional methodology to analyse the extent of variation in Korean (Kim and Biber, 1994) has already been mentioned briefly on page 13. The study shows that the dimensions identified by factor analysis are not universal: Kim and Biber (1994) labelled one of the identified dimensions "honourific", reflecting the stress in Korean society on deferential social relationships, and its relatively recent widespread and officially sanctioned adoption of a non-logographic writing system.10

10 Hangul, the Korean alphabet, has been the official writing system since 1945. Before 1945, a logographic system based on Chinese characters was used.

Another example of multi-dimensional work on a language other than English is Biber and Hared (1994), who investigate the development of news registers in Somali, a language with only a very short written tradition.11 The aim of this work was to assess the impact of literacy in Somali by focusing on differences between written registers (news text) and spoken registers (for example, conversations). This was achieved by sampling a corpus of Somali news texts at three periods (roughly, 1974, 1979 and 1987) and identifying changes that occurred between these samples. Biber and Hared (1994) found that as the written registers matured, variation between registers decreased as linguistic use became standardised. Additionally, variation between written and spoken registers increased as the written registers became more firmly established.

11 Somali did not have a widespread written form until 1972, when a new orthographic system was imposed "top-down" by the Somali government.

Language Learning Applications

Two examples of multi-dimensional analysis applied to language learning issues have been briefly mentioned above (Csomay, 2002; Biber et al., 2002).
Biber et al. (2002) seeks to explore the diverse range of language tasks faced by international students at US universities, in order that English language instruction can be made more appropriate to a university setting. The multi-dimensional study showed that, unlike general English, university English is very strongly polarised between written and spoken genres: university written registers are uniformly characterised by informationally dense prose and an impersonal style, whereas spoken genres (even lectures) are characterised by fewer features of impersonal style and more features of involvement and interaction.

Multi-Dimensional Analysis: Criticism and Alternatives

This section briefly surveys some criticism of the multi-dimensional analysis approach to the study of linguistic variation, before going on to discuss an alternative approach to linguistic variation that avoids the computational and statistical "overhead" of multi-dimensional analysis.

Multi-Dimensional Analysis of Linguistic Variation: Some Criticisms

Lee (1999), as part of his PhD work, replicated Biber's original 1988 work on a larger scale, with a greater number of linguistic features (84 compared to Biber's 67), new feature identification programs, and a much larger corpus of contemporary English — a four million word subset of the British National Corpus. As part of the replication process, Lee (1999) analysed the multi-dimensional approach and "found it to be lacking in some respects. In particular, it does not seem to support the kinds of conclusions and claims that have been made for it" (Lee, 1999, p. 396). These criticisms fall into two main groups:

1. Sampling Texts: The representativeness of Biber's original corpus is not well understood, and hence any extrapolations of findings to the English language generally are statistically invalid. This is a principled objection to the multi-dimensional approach (and, it would seem, to most of corpus linguistics) and is not remedied by an increase in the size of the corpus — even a very large corpus (as used by Lee (1999)) is not known to be representative of written English as a whole.

2. Sampling Linguistic Features: The particular choice of linguistic features (and the exclusion of others) dictates the dimensions that are formed in factor analysis (see Appendix H on page 274 for a brief review of factor analysis). Lee (1999) suggests that feature selection should be an iterative process, rather than a matter of selecting features that intuitively seem important, as intuitive feature selection is likely to lead to the creation of factors "that are to some degree artefacts of the features chosen" (Lee, 1999). In other words, factors in multi-dimensional analysis will simply reflect the theoretical commitments made when selecting which features are relevant to genre; according to Lee (1999), this is an unacceptably circular methodology. In contrast, Lee suggests that the proper use of multi-dimensional analysis involves using the maximum number of features and identifying those which are most important in an iterative fashion.
The "Keyword" Approach

Tribble (1998) presents an alternative to the multi-dimensional method for the analysis of genre developed by Biber (1988), based on the keyword function available in the software tool WordSmith.12 Tribble (1998) suggests that the use of genre-specific keywords (as identified by WordSmith) serves as a low-effort alternative to multi-dimensional analysis. That is, instead of writing linguistically complicated feature extraction programs and then subjecting the frequencies of the identified features to factor analysis, a relatively complex statistical process, we can simply use empirically identified genre-specific keywords produced by the WordSmith program.

12 WordSmith consists of a concordancer and keyword extractor (as well as other tools). It is used widely in corpus linguistics research and lexicography. More details can be found at: http://www.lexically.net/ Accessed 05-08-06.

The keyword approach to genre analysis using WordSmith has three stages:

1. Given a corpus divided into genres of interest (henceforth a test corpus), a wordlist for each genre is created (a list of word types for each genre).

2. The wordlist from each genre in the test corpus is compared against a reference corpus, using a feature selection method (see page 50). This reference corpus ought to be large and there should be an attempt to make it representative. The frequency of each type in the test corpus wordlist (remember there is a test corpus wordlist for each genre represented) is compared to the frequency of that word in the reference corpus, and each type is ranked according to its "keyness".13

3. In order to improve the "keyness" of the genre keywords, key-keywords are used. These key-keywords are words that are keywords in more than one text from a particular genre. That is, those words that are only key in one text from a particular genre are not key-keywords for that genre. The central idea here is that by eliminating those words that are only key in one text, we are left with key-keywords that better reflect the "essence" of a given genre, rather than the specific topicality of individual texts.

Xiao and McEnery (2005) tests the keyword approach identified by Tribble (1998) against a Biber-style multi-dimensional methodology for the analysis of three genres in American English: conversation, presentational speech (for example, sermons, lectures, and so on) and academic prose. While Xiao and McEnery (2005) concludes that the keyword approach is not a substitute for a full multi-dimensional analysis, its power, at least on the data used in the study, is acknowledged, as the keyword method yields results broadly comparable with a Biber-style multi-dimensional analysis. Additionally, Xiao and McEnery (2005) emphasises the relative difficulty of the multi-dimensional approach, describing it as "very time consuming and computationally/statistically demanding" compared to the keyword approach, which uses "off-the-shelf" software and does not require programming or complex statistics.

The current work is not committed to using multi-dimensional analysis as a tool for exploring linguistic variation. Instead, Biber (1988)'s preliminary data, the frequency tables of linguistic features across genres, are used to identify those features especially characteristic of a particular genre (in this case, biographical writing). The usefulness of this data for research use is stressed by Lee (1999).
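The "keyness" statistic reported by WordSmith is based on a chi-square or log-likelihood comparison of frequencies (Scott, 2005). As an illustration of the idea, rather than a description of WordSmith's internals, the following Python sketch computes a log-likelihood style keyness score for a single word type; the word, frequencies and corpus sizes are invented.

    import math

    def keyness(freq_test, size_test, freq_ref, size_ref):
        # Log-likelihood "keyness" of a word type that occurs freq_test times
        # in a genre (test) corpus of size_test running words, and freq_ref
        # times in a reference corpus of size_ref running words.
        total = freq_test + freq_ref
        expected_test = size_test * total / (size_test + size_ref)
        expected_ref = size_ref * total / (size_test + size_ref)
        ll = 0.0
        if freq_test:
            ll += freq_test * math.log(freq_test / expected_test)
        if freq_ref:
            ll += freq_ref * math.log(freq_ref / expected_ref)
        return 2 * ll

    # e.g. "born" occurring 120 times in a 50,000-word biography corpus and
    # 300 times in a 1,000,000-word reference corpus: a high score indicates
    # that the word is "key" for the genre.
    print(round(keyness(120, 50_000, 300, 1_000_000), 1))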
2.2 Stylistics and Stylometry This section reviews the linguistic sub-discipline of stylistics, before contrasting stylistics with the allied discipline of stylometrics and its applications in authorship attribution studies. 13 “Keyness” is calculated using either the chi-square or log-likelihood feature selection methods (S COTT , 2005) described on page 50 of this document. 22 C HAPTER 2: B ACKGROUND I SSUES FOR B IOGRAPHICAL S ENTENCE R ECOGNITION 2.2.1 Stylistics Stylistics is the formal study of literary style. Style itself is, however, not a straightforward concept, as it is normally used to describe the non-propositional content of a text. That is, the aesthetic “residue” when propositional content has been removed. The particular linguistic choices — ways of expressing — associated with a given genre or writer are described as that genre or writer’s style. For instance, the obituary genre has a particular style characterised by the heavy use of euphemism. For example, “she is survived by three adult sons and a husband” and “she did not suffer fools gladly”, rather than “she is outlived by her husband and three adult sons” and “she was irritable and disagreeable”. L EECH and S HORT (1981) provides a seminal study of stylistics in a literary context. The authors are concerned with studying how certain linguistic choices create specific artistic effects. A very simple example here would be the very different aesthetic effects created by sentence length between, say, Henry James and Ernest Hemingway (sophistication versus power). The difficulty of distinguishing between different styles is repeatedly emphasised by L EECH and S HORT (1981). For them, style is a relational term; we can only define a style in relation to another style. Style is a relational term: we talk about “the style of x”, referring through “style” to characteristics of language use, and correlating these with extralinguistic x, which we may call the stylistic domain. The x (writer, period, and so on) defines some corpus of writings in which the characteristics of language use are to be found. But the more extensive and varied the corpus of writings, the more difficult it is to identify a common set of linguistics habits. L EECH and S HORT (1981, p. 11) The study of style is also relational with respect to the specific research question. For instance, if we are interested in a particular genre, we will try and focus on those features that are particular characteristics of the genre, and those features that differ between different authors working within the same genre will be discounted. L EECH and S HORT (1981) developed a methodology for comprehensively analysing texts (manually rather than automatically) for style. These features are divided into four categories (lexical, grammatical, figures of speech, and cohesion/content) described in more detail below: 1. Lexical Categories General: Is the vocabulary simple or complex? Is it descriptive or evaluative? Are any rare or specialised words used? Nouns: Are the nouns abstract or concrete? How are proper names used? 23 C HAPTER 2: B ACKGROUND I SSUES FOR B IOGRAPHICAL S ENTENCE R ECOGNITION Adjectives: How frequent are the adjectives and what are the attributes to which they refer? For example, evaluative or colour adjectives, and so on. Verbs: Are they transitive or intransitive? Do they refer to physical acts, speech acts? Adverbs: How frequent are adverbs? What function do they play? For example, time, degree, place, and so on. 2. 
Grammatical Categories Sentence Types: Are questions, commands and exclamations used in addition to propositional sentences. Sentence Complexity: How complex are the sentences? What is the average sentence length? Do the sentences vary greatly in length and complexity? Clause Types: What kind of dependent clauses are used: relative, adverbial or nominal? Clause Structure: Is there anything unusual about clause structure? Noun Phrase: Are noun phrases simple or complex? Are there sequences or adjectives? Is apposition used? Verb Phrases: Is the simple past tense used? If not, what kind of tenses are present? Other Phrase Types: Are prepositional phrases, adjective phrases or adverb phrases used? Minor Word Classes: How are function words used? For example, prepositions, auxiliaries, determiners, and so on. General: Any other unusual constructions (for example, superlatives, comparatives, and so on. 3. Figures of Speech Grammatical and Lexical Schemes: Is the technique of structural repetition used? For example, anaphora, parallelism. Phonological Schemes: Are rhyme or alliteration used? Tropes: Is there any obvious linguistic “deviation”? For example, neologisms and so on. 4. Content and Cohesion Cohesion: How does the text manage links between sentences? Are the links implicit, logical, or via coordinating conjunctions. How is repetition used? 24 C HAPTER 2: B ACKGROUND I SSUES FOR B IOGRAPHICAL S ENTENCE R ECOGNITION Context: How is speech represented, directly or indirectly? Is first or third person narrative used? Are there differences in style in the reported speech of different characters? 2.2.2 Stylistic Analysis Stylistics as envisaged by L EECH and S HORT (1981) and S HORT (1996) is designed to analyse small sections of literary text, in an attempt at bringing systematic techniques traditionally associated with linguistics to the literary study of English Literature. This enterprise is known as stylistic analysis. The technique is empirical in that it respects the primacy of text, and seeks to explore and describe the literary devices used in text. The identification of important features, however, relies on the skills and intuitions of an experienced reader. For example, if we compare two short text examples presented in Figure 2.5 on the next page (from Martin Amis’s 1985 novel Money and Daniel Dennett’s philosophical monograph Consciousness Explained, respectively) using only some of the analytic methods identified by L EECH and S HORT (1981) we can mark out clear differences between the texts. This identification of relevant features becomes more difficult however, if instead of using examples from radically different genres (contemporary prose fiction and philosophy in this instance) we choose texts from the same genre. The most striking differences between the two examples are their use of vocabulary: we can see that Example 2 uses several specialised, technical words (heterophenomenology, blindsight), and standard words on the edge of acceptability (“ultracautious”) whereas Example 1 relies on a non-technical vocabulary. Neither text contains concrete nouns, suggesting that both of them are designed to convey information about abstract qualities, rather than physical descriptions. Example 1 exhibits a more informal use of adverbs (“awful slowly”) and a more informal tone generally. Example 1 contains a rhetorical question, designed to create a conversational tone, whereas Example 2 concentrates on the clear exposition characteristic of academic writing. 
Both examples exhibit similar levels of sentence complexity. 2.2.3 Stylometrics: Authorship Attribution Stylometry is concerned with the statistical analysis of style.14 Although the stylistics techniques exemplified by L EECH and S HORT (1981) and S HORT (1996) are empirical in that they draw conclusions from the close analysis of data using a “checklist” type approach, the interpretation of the selected features relies on the skilled judgements of the reader. Stylometrics, can be thought of as a scientific sub-discipline of stylistics that has developed its own methodologies and techniques (WALES, 1989). Stylometrics is also distinguishable from stylistics 14 Literally “the measurement of style”. 25 C HAPTER 2: B ACKGROUND I SSUES FOR B IOGRAPHICAL S ENTENCE R ECOGNITION Figure 2.5: Two Examples of Stylistically Different Texts from (A MIS, 1985) and (D ENNETT, 1992), Respectively. EXAMPLE 1: Opera certainly takes its time, doesn’t it? Opera really lasts, or at least Otello [sic] does. I gathered that a second half would follow this one, and this one was travelling awful slowly through its span. The other striking thing about Otello [sic] is — it’s not in English. I kept expecting them to pull themselves together and start singing properly, but no: Spanish or Italian or Greek was evidently the deal. EXAMPLE 2: Here is a place where the ultracautious policies of heterophenomenology pays dividends. Both blindsight subjects and hysterically blind people are apparently sincere in their avowals that they are unaware of anything occurring in their blind field. So their heterophenomenological worlds are alike – at least in respect to their presumptive blind field. not only because of its methodology, but also because of its limited application. Traditionally, stylometrics has concerned itself primarily with the issue of authorship attribution. Stylometry is the science which describes and measures the personal elements in literary and extempore utterances, so that it can be said that one particular person is responsible for the composition rather than any other person who might have been speaking or writing at that time on the same subject for similar reasons. Stylometry deals not with the meaning of what is said or written but how it is being said or written. (M ORTON, 1978, p. 7) A further distinction is often made between stylometry, computational stylometry, and computational stylistics (M C E NERY and O AKES, 2000). Stylometry, although greatly facilitated by the use of computers, is not dependent on them. Indeed, early, painstaking work on stylometry (reviewed below) began well before the advent of digital computers. Computational Stylometry normally refers to automatic stylistic analysis using electronic texts, Computational Stylistics is slightly broader in scope and chiefly distinguished by its use of more complex features than traditional stylometry (W HITELAW and A RGAMON, 2004), and concern with providing a “bridge” between authorship attribution style stylometry and literary stylistics (C RAIG, 1999). 26 C HAPTER 2: B ACKGROUND I SSUES FOR B IOGRAPHICAL S ENTENCE R ECOGNITION Development of Stylometry The stylometric method for author identification was first outlined in 1851 in a letter by Augustus de Morgan (reported by K ENNY (1982)). de Morgan suggested that a dispute over the authorship of the Pauline Epistles could be settled by an analysis of the word lengths of the various Epistles. 
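The word-length analysis de Morgan proposed amounts to little more than a frequency distribution over word lengths, which can be sketched in a few lines of Python; the example text is arbitrary and the sketch is illustrative only.

    import re
    from collections import Counter

    def word_length_profile(text):
        # Relative frequency of each word length in a text: the kind of
        # profile de Morgan proposed comparing across candidate authors.
        words = re.findall(r"[A-Za-z]+", text)
        counts = Counter(len(w) for w in words)
        total = sum(counts.values()) or 1
        return {length: count / total for length, count in sorted(counts.items())}

    print(word_length_profile("Brevity is the soul of wit."))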
In 1887, Mendenhall (described in K ENNY (1982) and also O AKES (1998)) compared the frequency distributions of word lengths for Shakespeare and several other writers (including Bacon, J.S.Mill, and Marlowe), and discovered that the distributions for each writer had distinctive shapes. In the middle of the Twentieth Century, computers were inaccessible to most language researchers. Y ULE (1944) — who was the author of a standard textbook on general statistics according to L OVE (2002) — did however make use of the advances in statistics since the end of the Nineteenth Century. Instead of (impressionistically) comparing the appearance of frequency distributions (like Mendenhall), Y ULE (1944) used statistical tests based on lexical features (for example, total vocabulary size, use of unique nouns) in order to investigate the disputed authorship of the medieval religious text De Imitatione Christe, which had been variously attributed to both Thomas a Kempis and Jean Charlier de Gerson. Yule’s analysis strongly favoured Thomas a Kempis as author. M ORTON (1965), in the tradition established by de Morgan, analysed the most common word in the Pauline Epistles (“kai” – the Greek for “and”) as a proportion of the total words in each Epistle, finding that the Epistles fell into two distinct groups, indicating that St Paul authored only some of the texts traditionally attributed to him. However, when this method was applied to a corpus of Morton’s own essays, E LLISON (1967) (reported in O AKES (1998)) claims that the result suggests multiple authors, indicating that reliance on single connectives as features is not adequate to reliably distinguish between authors. More recently, there has been an attempt to create an experimental framework for authorship attribution studies, a major part of this effort is the creation of a software environment for (repeatable and perspicuous) author identification experimentation (JGAAP — Java Graphical Authorship Attribution Program) (J UOLA ET AL ., 2006). Standard Stylometric Feature M C E NERY and O AKES (2000) describe six easily computable features commonly used in stylometric studies (see Figure 2.6 on the following page). This work draws a distinction between features that are “contingent” or readily under the control of the author and reflective of the genre of the text (sentence length, word length, vocabulary richness, and relative frequencies of parts-ofspeech) and those authorial decisions that are made at the unconscious level 27 C HAPTER 2: B ACKGROUND I SSUES FOR B IOGRAPHICAL S ENTENCE R ECOGNITION Figure 2.6: Examples of “Deep” and “Contingent” Features Described in M C E NERY and O AKES (2000). C ONTINGENT F EATURES 1. Sentence Length has the advantage of being easy to compute, but has been shown to be an unreliable indicator of authorship. 2. Word length like sentence length is very easy to compute, but like sentence length, seems to indicate genre rather than a specific author. 3. Vocabulary Richness is a measure of variety used by the author. The most common and simplest to compute is the type/token ratio. That is, the number of types divided by the number of tokens in a given text. 4. Part-of-Speech Relative frequencies of nouns, verbs, and so on. This choice of feature requires that the text be either hand tagged for part-ofspeech categories — a laborious task — or tagged using an automatic part-of-speech tagger. D EEP F EATURES 1. Word Ratio including the relative frequency of synonyms (like while/whilst and since/because). 2. 
Letter Based that is, the frequency of different characters. This technique is more common in non-English alphabets. that may indicate differences in “deep” authorial style (ratios of the relative frequency of synonyms and character based frequencies). Authorship attribution studies often draw a distinction between the “deep” and “contingent” styles of an author. An author’s deep style remains constant over time (in adulthood) and between different genres. It cannot be easily disguised. The “contingent” style of an author is that which changes over the life course, and between different genres (for example, sentence length may vary with a particular author working in different genres, as may vocabulary richness). The features characteristic of “deep” style are less amenable to conscious control (M C E NERY and O AKES, 2000; L OVE, 2002). Figure 2.6 details common features used in authorship attribution studies and their status as “deep” or “contingent” based on M C E NERY and O AKES (2000). Authorship Attribution: The Federalist Papers This section briefly reviews some important work on the Federalist Papers, a series of seventy seven articles published in four New York newspapers in 1787 and 1788, which provide a challenging and frequently used test corpus for assessing different authorship attribution techniques. The papers were written to persuade the population of New York to ratify a new constitution for the 28 C HAPTER 2: B ACKGROUND I SSUES FOR B IOGRAPHICAL S ENTENCE R ECOGNITION United States, which, initially they failed to do, and the papers were republished as a book in 1788 (with eight extra essays). The papers were initially published using the pseudonym “Publius” and it is generally agreed that three men were responsible for all eighty five papers. These were, Alexander Hamilton, John Jay and James Madison (later the President of the United States). Of the eighty five papers, Madison and Hamilton both claimed authorship of twelve. The papers are especially useful for authorship attribution studies purposes because all three writers attempted to write in a single consistent style (as the person Publius), and the genre and subject matters is remarkably homogenous (18th Century American political rhetoric), providing a solid test of those features and statistical methods that discriminate texts solely on the basis of authorship. There is a central problem in comparing the efficacy of different feature sets (see Section 8 on page 144 for the difference between features and feature sets) for attributing authorship to those papers whose author is known. Different studies used not only different feature sets, but different classification algorithms, meaning that direct comparisons between feature sets are difficult to make. The following three approaches give a flavour of the kinds of techniques used: 1. M OSTELLER and WALLACE (1984), in a book length study used “marker words” as features, after discovering that sentence length failed to distinguish adequately between the authors (owing to the attempts of the authors to adopt a uniform style). These “marker words” were synonym pairs (for example, “upon/on” and “while/whilst”) based on an analysis of Hamilton and Addisons’ wider output outside the federalist papers. 
The approach used a multi-stage Bayesian methodology to classify the papers, that despite promising results, did not become popular in the authorship attribution community due to the statistical sophistication required to implement it successfully (H OLMES and F ORSYTH, 1995; M C E NERY and O AKES, 2000). M OSTELLER and WALLACE (1984) does however remain a landmark study in authorship attribution studies. 2. M C C OLLY and W EIER (1983) used the features identified by M OSTELLER and WALLACE (1984) and compared their performance to sixty-four context independent function words using a likelihood ratio approach. It was found that classification accuracy was much lower using function words, rather than the synonym pairs identified by M OSTELLER and WAL LACE (1984), indicating that the relative frequency of function words are more a result of genre constraints than specifically authorial style. 3. H OLMES and F ORSYTH (1995), use (among other methods) a genetic algorithm15 to compare the “marker words” identified by M OSTELLER and WALLACE (1984) and forty-nine frequent function words. Like, M C C OLLY 15 An algorithm that is based on the analogy of Darwinian evolution to develop new rules by a process of mutation and fitness testing. 29 C HAPTER 2: B ACKGROUND I SSUES FOR B IOGRAPHICAL S ENTENCE R ECOGNITION and W EIER (1983), it was found that use of function words were less successful as discriminating features than the “marker words” identified by M OSTELLER and WALLACE (1984). This result is consistent with the claim that the distribution of function words is dictated by genre rather than authorial style, and suggests that function words may be useful for genre classification (see the experimental work described in Chapter 9). 2.3 Biographical Writing This section briefly surveys some of the defining characteristics of biographical writing, before outlining the history of biographical writing (focusing on biographical writing in English). 2.3.1 Characteristics of Biographical Writing Biography as a genre is the history of the lives of individuals with its own literary form (S HELSTON, 1977; G ITTINGS, 1978; M AUROIS, 1929). A biography needs to be historical; a factually grounded narration describing the unfolding of events over time. A biography needs to be focused on an individual; other persons are considered only insofar as they relate to the individual of interest. It also needs to have a certain form, presenting certain kinds of information (birth and death dates, location of birth, education, and so on) in a given order, and retaining a chronological sequence.16 All these three conditions have to be met for a text to be described as biographical. For instance, a person’s dental records are about an individual (they satisfy the individuality criteria), and they also focus on that individual over time (they satisfy the historical criteria). Dental records, however accurate, do not satisfy the form criteria, in that they focus entirely on biographically less relevant information. To use another example, while the play Hamlet is directly concerned with recounting the life of an individual (the Prince of Denmark) and has a biographical form (describing events in the life of Hamlet), if fails to satisfy the history criterion (the events did not take place and the person does not and did not exist). 
Biographical texts use a version of what journalists’ term the “inverted triangle” for presenting information: “an old-fashioned device, the origins of which are unclear, but the rules of which stand the test of time.” (PAPE and F EATH ERSTONE , 2005). In standard newspaper writing practice, the first paragraph of an article is a short summary of the story; the essential facts. The next paragraphs will expand on these essential facts, providing background information 16 Often book length literary biographies deviate from this form at a superficial level – say, in the first few paragraphs of the book – but when the overall structure of the text is considered, the form chosen is almost always chronological. 30 C HAPTER 2: B ACKGROUND I SSUES FOR B IOGRAPHICAL S ENTENCE R ECOGNITION and perhaps analysis. The final paragraph will bring the article to a graceful end. This technique is useful for the following two reasons: 1. It allows the reader to read as little or as much as they want and still gain a coherent account. 2. If the article has to be edited, it can be done speedily simply by removing paragraphs from the end of the article, without compromising its meaning. In the case of biographical texts the pyramid is more prescribed; certain facts are obliged to appear at the widest part of the pyramid (see 3.2 on page 57). These facts include, dates of birth and death, place of birth, profession, notable achievements and significant events. Additionally, information about family (for example, marital status, details of children and parents) may be included. The first paragraph of the biography is an attempt at summarising the life, on which subsequent paragraphs elaborate. A rule of thumb is that the longer the biography, the more background information is presented; this is obvious if we think of a published biography of a politician. The skeleton facts can be presented in a single paragraph, the rest of the book is a deeper analysis of these facts, and the historical background necessary to enhance understanding. It is notable that very short biographies (one paragraph) consist entirely of these central biographical facts with no elaboration or padding. Sometimes these type of biographies reject the norms of published narrative English and adopt instead an information rich, restricted “telegraph language”, normally to a highly prescribed biographical scheme. For instance, the quotation below forms the beginning of a biography of George Washington, reproduced from the the website of the US Congress: WASHINGTON, George, (granduncle of George Corbin Washington), a Delegate from Virginia and first President of the United States; born at Wakefield, near Popes Creek, Westmoreland County, Va., February 22, 1732; raised in Westmoreland County, Fairfax County and King George County; attended local schools and engaged in land surveying; appointed adjutant general of a military district in Virginia with the rank of major in 1752; in November 1753 was sent by Lieutenant Governor Dinwiddie, of Virginia, to conduct business with the French Army in the Ohio Valley; in 1754 was promoted to the rank of lieutenant colonel and served in the French and Indian war, becoming aide-de-camp to General Braddock in 1755; appointed as commander in chief of Virginia forces in 1755; resigned his commission in December 1758 and returned to the management of his estate at Mount Vernon in 1759; served as a justice of the peace 1760-1774, and as a member of the Virginia house of burgesses 1758-1774. 
http://bioguide.congress.gov The inverted biographical triangle is illustrated best in longer, multi-paragraph 31 C HAPTER 2: B ACKGROUND I SSUES FOR B IOGRAPHICAL S ENTENCE R ECOGNITION biographies written in continuous prose rather than the “choppy”, truncated style of the previous example. The initial paragraph is used to give basic information, and the rest of the text expands on the these facts. This style can be seen below in the first paragraph of a piece on Charles Dickens: 1812-70, English author, born in Portsmouth, one of the world’s most popular, prolific, and skilled novelists. The son of a naval clerk, Dickens spent his early childhood in London and in Chatham. When he was 12 his father was imprisoned for debt, and Charles was compelled to work in a blacking warehouse. He never forgot this double humiliation. At 17 he was a court stenographer, and later he was an expert parliamentary reporter for the Morning Chronicle. His sketches, mostly of London life (signed Boz), began appearing in periodicals in 1833, and the collection Sketches by Boz (1836) was a success. Soon Dickens was commissioned to write burlesque sporting sketches; the result was The Posthumous Papers of the Pickwick Club (1836-37), which promptly made Dickens and his characters, especially Sam Weller and Mr. Pickwick, famous. In 1836 he married Catherine Hogarth, who was to bear him 10 children; the marriage, however, was never happy. Dickens had a tender regard for Catherine’s sister Mary Hogarth, who died young, and a lifelong friendship with another sister, Georgina Hogarth. http://www.bartleby.com It is important to note, however, that often in book length, or more literary biographical texts, the highly constrained “inverted pyramid” is likely not to apply, just as in longer feature articles in newspapers, the journalistic “inverted pyramid” is less likely to apply. 2.3.2 Development of Biographical Writing The earliest known biographies (or proto-biographies) are the classical histories of Herodotus and Thucydides (H ERODOTUS, c 440 BC; T HUCYDIDES, c 411 BC). These texts were primarily historical, but did contain person focused interludes. This tradition was continued with P LUTARCH (c 100). Similarly, many sacred texts contain biographical sections, which are, at least in intent, biographical according to the three characteristics of biography given above. That is, a biography should be historical, focused on an individual and be of a distinct literary form. The earliest English biographies were written in Latin and had religious subjects (for example, Adamnan’s Life of St Columba (A DAMNAN, c 690), or Bede’s Life of St Cuthbert (B EDE, c 700)). These lives of the saints — hagiographies — stressed the enthusiastic description of miraculous religious happenings above factual accuracy. By the sixteenth century however, a more familiar biograph- 32 C HAPTER 2: B ACKGROUND I SSUES FOR B IOGRAPHICAL S ENTENCE R ECOGNITION Figure 2.7: Inverted Pyramid for Biographies. INTRODUCTION PARAGRAPH Essential:birth and death dates, location of birth, profession, notable achievements, significant events. Optional: marital status, details of children, details of parents EXPANSION PARAGRAPH/S The next paragraphs will expand on the initial facts, usually with a chronological, narrative structure. BACKGROUND PARAGRAPH/S This group of paragraphs provides relevant background information CONCLUSION The piece is brought to a graceful conclusion. Sometimes an appendix (publications, etc.) 
ical form had emerged, written in English and concerned with presenting verified fact (a good example of this development is William Roper’s biography of Thomas More (R OPER, 1550)). The first attempt at a National Biographical Dictionary — the Athenae Oxonienses17 was completed in 1691 (W OOD, 1691). The Eighteenth Century saw the publication of Dr Johnson’s Lives of the Poets (J OHNSON, 1781), a work that follows the contemporary form of more literary biographies; basic information is presented, along with criticism and analysis of the biographical subjects’ achievements. Boswell’s Life of Johnson, first 17 The Athenae Oxonienses was subtitled “An Exact History Of All The Writers and Bishops Who have had their Education in The most ancient and famous University Of Oxford, From The Fifteenth Year of King Henry the Seventh, Dom. 1500, to the End of the Year 1690. Representing The Birth, Fortune, Preferment, and Death of all those Authors and Prelates, the great Accidents of their Lives, and the Fate and Character of their Writings.” 33 C HAPTER 2: B ACKGROUND I SSUES FOR B IOGRAPHICAL S ENTENCE R ECOGNITION published in 1791 (B OSWELL, 1791) in its accuracy, attention to detail and emphasis on character revealing incidents, exemplifies the literary biographical form. Perhaps the most common biographical subtype is the obituary. Obituaries are distinct from book length treatments as, although they obey the basic form of biographies, they are more anecdotal and selectively focused. They are also generally (though not necessarily) biased in favour of the subject. Obituaries have been part of British and American newspapers since the late nineteenth century. They concentrate on achievement, and in the British tradition, do not dwell on (or often even mention) the cause of death (F ERGUSSON, 2000). An important variation in British obituary writing is the anonymity of the writer; in The Guardian obituaries are signed, in The Times and The Telegraph they are anonymous. Currently, there are a number of forums for biographical writing; the traditional book length treatment, obituaries, encyclopedias and dictionaries of biography. Many of these resources are reproduced online, and there are also numerous websites that generate biographical sketches of various lengths, and in various domains. For example, the U.S government publishes biographical profiles of current and former congressmen online18 and various websites electronically publish short biographies of scientists.19 2.4 Classification As this thesis is concerned with biographical sentence classification it is important to introduce some important issues in classification theory and how these issues apply to text classification. According to the classical theory of classification — presented by Aristotle who we have previously mentioned in relation to literary genre on page 7 — there are a set of necessary and sufficient conditions for category membership. Also, any member of a category is an equally good member of that category. The classical (Aristotelian) theory was further entrenched when systematic classification with respect to biology was developed as part of a scientific biology in the early eighteenth century with the creation of Linneaus’ familiar taxonomy of genus and species (L INNAEUS, 1735). Linnaeus attempted to introduce a more formal system in place of (or parallel with) the more imprecise categories of folk taxonomy that could not support new developments in biology. 
The twentieth century saw a weakening of the classical theory of classification with the application of ideas from philosophy and psychology to the theory of classification (L AKOFF, 1987); particularly the later Wittgenstein’s the18 For example, http://bioguide.congress.gov/biosearch/biosearch.asp Accessed 05-08-6 19 For example, http://scienceworld.wolfram.com/ Accessed 05-08-06 34 C HAPTER 2: B ACKGROUND I SSUES FOR B IOGRAPHICAL S ENTENCE R ECOGNITION ory that the relationships between category members are better characterised in terms of “family resemblance” rather than common properties determining class membership. W ITTGENSTEIN (1953) used the example of “games” to illustrate the point; there are many ways an activity can be a game; it can be competitive or non competitive, indoor or outdoor, athletic or sedentary. Tiddly-winks, lacrosse, tennis, and bowls do not share any single common property that make them belong to the category “games”, yet we have no difficulty in identifying them as games. Rosch’s influential experimental work on human classification also undermined the classical theory; in a series of experiments R OSCH (1973) found that within categories, some members are universally judged to be better examples of a class than others — when applied to colours, this meant that there were “prototype” colours (that is, a “best” or “prototype” red colour, that is a better example of red than the other shades of red) — showing that not all members of a category are equally good representatives of that category, and undermining the classical view of category membership. It seems that the number of features shared between category members is pivotal here; the more features shared with other members of the category, the “better” the example is judged to be (R OSCH, 1973), providing empirical support for Wittgenstein’s philosophical intuition. For example, although football and solitaire both belong to the category “games”, football is a better example of a “game” than solitaire, as football has more features characteristic of games than solitaire (competitive, number of participants greater than one, a clear winner, and so on.). Rosch’s work formed the foundation of the prototype theory of categorisation (TAYLOR, 2003). Additional support for this abandonment of classical classification theory, with its emphasis on necessary and sufficient conditions for class membership, is provided by L ABOV (1975), who in a series of experiments utilising household receptacles demonstrated that there is no hard and fast boundary between neighbouring categories. L ABOV (1975) presented his participants with various containers; when the container was as deep as it was wide, and also had a handle, then participants called it a “cup”. When width was increased however, a higher proportion of participants called it a “bowl” and when depth was increased, more participants called it a “vase”. The important point to note here is that there is no strict rule that distinguishes “bowl” from “cup” or “cup” from “vase”. Prototype theory is a general theory of classification that can easily be applied to the special case of interest here; the classification of texts by genre. 
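The quantitative intuition at work here, namely that membership in a category is graded by the number of characteristic features an item shares with that category, can be made concrete in a small illustrative sketch; the feature inventory below is invented for the purpose and is not drawn from Rosch's materials.

    # Toy illustration of graded category membership: the more of the
    # category's characteristic features an item shares, the more
    # prototypical it is judged to be.
    GAME_FEATURES = {"competitive", "more_than_one_participant", "clear_winner",
                     "physical_activity", "played_for_enjoyment"}

    ITEMS = {
        "football": {"competitive", "more_than_one_participant", "clear_winner",
                     "physical_activity", "played_for_enjoyment"},
        "solitaire": {"clear_winner", "played_for_enjoyment"},
    }

    def prototypicality(item_features, category_features):
        # Proportion of the category's characteristic features the item shares.
        return len(item_features & category_features) / len(category_features)

    for name, features in ITEMS.items():
        print(name, prototypicality(features, GAME_FEATURES))

The same graded notion of membership carries over directly to the classification of texts by genre, as discussed next.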
For instance, if we are seeking to classify newspaper articles into two classes, editorial and reportage, then although many articles are likely to be prototypical editorial articles (for example, leaders) and many are likely to be prototypically reportage articles (for example, factual reports on the activities of politicians), many articles will also be “outlying” editorials (that is, less good examples of editorials like, for instance, advice columns) or “outlying” reportage (that is, less good examples of reportage, like business reports with extensive background 35 C HAPTER 2: B ACKGROUND I SSUES FOR B IOGRAPHICAL S ENTENCE R ECOGNITION material). The classification of texts at the document level (particularly physical books) has traditionally been a central concern for librarians, and a considerable literature and expertise has developed, particularly over the last one hundred and fifty years. Substantial effort was expended on the creation, update and elaboration of comprehensive library cataloguing systems in the late Nineteenth Century (for example, Dewey decimal) (B ATTLE, 2004). Traditionally, these classification schemes were thought of as analogous to Linnaeus’ taxonomy of natural kinds (B ROADFIELD, 1946), but library subject categories are not obviously natural kinds; we cannot classify a book as belonging to a given category in the same way that we could confidently classify a fruit with respect to an obvious quality, like its shape (L AKOFF, 1987). The classification task performed by librarians lacks a developed theoretical basis and can properly be described as a professional craft skill, not based in an integrated theory of classification, but rather classification with respect to a particular academic discipline’s cultural requirements (M AI, 2004). The focus of this thesis is however, automatic text classification; an area which has its own problems distinct from human approaches to text classification. Automatic text classification is the subject of the next section. 2.4.1 Automatic Text Classification The term automatic text classification (also categorisation, or categorisation20 ) has traditionally been used to describe a group or related tasks (S EBASTIANI, 2002): 1. The automatic assignment of texts to predefined categories. This is the dominant sense, and the one assumed here. 2. The automatic identification of a set of categories. This is a usage from the early days of text categorisation research (B ORKO and B ERNICK, 1963). 3. The identification of a set of categories, and the subsequent assignment of documents to those identified categories. This is more normally referred to as text clustering. 4. The general name for document categorisation tasks — that is, both what we are referring to as text classification and text clustering both belong to the more general class of text categorisation. This is the terminology used in Manning and Shütze’s textbook in statistical natural language processing (M ANNING and S CH ÜTZE, 1999). An important distinction exists between categorisation based entirely on the contents of documents, and categorisation where extra-document metadata is available to aid with the classification task (for example, where the document 20 These terms are used interchangeably. 36 C HAPTER 2: B ACKGROUND I SSUES FOR B IOGRAPHICAL S ENTENCE R ECOGNITION or text has been indexed). 
Automatic categorisation based entirely on the contents of the document itself is referred to as endogenous categorisation, whereas categorisation based on a document augmented by metadata is referred to as exogenous categorisation. Most research on automatic text categorisation focuses exclusively on endogenous categorisation. Classification tasks can also be distinguished with respect to their accommodation of overlapping categories: overlapping categories allow instances (the items to be classified) to belong to more than one category, while non-overlapping categories restrict each instance to a single category. Items can also be assigned to categories probabilistically (that is, each item is said to belong to each category with a certain degree of probability). Binary classification (of which biographical sentence categorisation is an example) is a special case of non-overlapping categorisation in which the number of categories is limited to two. Automatic text classification is used successfully in a number of different application areas; examples, of which there are many, include spam filtering, the automatic indexing of academic papers, authorship attribution studies and language identification.

2.5 Machine Learning

Since the early 1990s, machine learning techniques for automatic text categorisation have won popularity over knowledge engineering approaches to the task, partly because of the availability of the copious training data necessary for machine learning, partly because of the cost (and brittleness) of human expertise in framing task-specific rules, and partly because of a shift in theoretical emphasis towards empirical techniques (all these factors interact in precipitating the shift towards machine learning). General theoretical texts on machine learning include Mitchell (1997) and, focused more specifically on the classification of texts, Manning and Schütze (1999, chap. 16); more practically orientated texts include Witten and Frank (2005). This section provides enough background on machine learning for the reader to make sense of those research chapters (7, 8, 9 and 10) that rely on machine learning techniques. First, the six learning algorithms used in the research are outlined. Second, some comments on the evaluation of learning are presented. Finally, the notion of feature selection (that is, the selection of those features most useful for classification) is introduced, along with a description of the χ² (chi-squared) algorithm, a commonly used method for feature selection.

2.5.1 Learning Algorithms

Automatic classification relies on the use of learning algorithms. Learning algorithms use examples of correctly classified items in order to classify previously unseen items. (This "learning by example" is often referred to as supervised learning, in contrast to unsupervised learning, which first identifies categories and then sorts unseen instances into them; unsupervised learning is also referred to as clustering.) Most learning algorithms make no claim to psychological plausibility. Several learning algorithms were used at different points in this work, ranging from the very straightforward Zero Rule algorithm (which provides a baseline against which the performance of the other algorithms can be tested) to the more sophisticated C4.5 decision tree algorithm. This section describes the six algorithms used.

Zero Rule Learning Algorithm

The Zero Rule algorithm provides a baseline for the assessment of other, more sophisticated learning algorithms (the Weka implementation, ZeroR, was used). Zero Rule simply predicts that each example to be classified belongs to the most prevalent class in the training data. For example, if 51% of the training sentences are biographical and 49% non-biographical, then 100% of the test data will be classified as biographical, as biography forms the majority class in the training data.
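To make the baseline concrete, the following is a minimal Python sketch of a majority-class classifier; it is an illustration only (the function names are invented here, and this is not the Weka ZeroR implementation used in the thesis).

from collections import Counter

def train_zero_rule(labels):
    """Return the most frequent class label in the training data."""
    return Counter(labels).most_common(1)[0][0]

def predict_zero_rule(majority_class, instances):
    """Predict the majority class for every instance, ignoring its features."""
    return [majority_class for _ in instances]

# Example: 51% biographical training sentences -> everything is predicted biographical.
train_labels = ["bio"] * 51 + ["non-bio"] * 49
majority = train_zero_rule(train_labels)          # "bio"
print(predict_zero_rule(majority, range(3)))      # ['bio', 'bio', 'bio']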
One Rule Learning Algorithm

The One Rule algorithm is based on the intuition that very simple rules are capable of classifying with great accuracy (the Weka implementation, OneR, was used; One Rule can be conceptualised as a one-level decision tree). The feature with the highest accuracy on the training data is selected as the single feature used to classify the test data. For example, if the training data shows that the pronoun feature has the greatest classification accuracy, then that is the only feature used to classify the test data. Holte (1993) compared the performance of the One Rule algorithm with a variant of Quinlan's decision tree learning algorithm (Quinlan, 1988) on sixteen standard machine learning data sets, and found that classification accuracy suffered only slightly with the One Rule algorithm.

C4.5 Decision Tree Learning Algorithm

The C4.5 decision tree learning algorithm (Quinlan, 1993) is an extension of the ID3 algorithm (Quinlan, 1988); the Weka implementation of C4.5 (J48) was used. This section describes some relevant features of decision trees, before providing a brief overview of decision tree induction algorithms with particular reference to C4.5. The section draws heavily on Mitchell (1997).

Decision Tree Representation

In this context, decision trees are rule sets constructed from training data and used to provide a perspicuous decision procedure for the classification of new instances. The decision tree depicted in Figure 2.8 and the set of rules reproduced in Figure 2.9 are equivalent; the rules can be understood as paths through the tree, from the root (family name) to one of eight leaves.

Figure 2.8: Genre Features Decision Tree. [A binary tree whose root tests Family_Name; a yes answer at Family_Name, He, She, Her or Born leads directly to a Bio leaf, while the no branches descend through He, She, Her, Born and then Forename, with a final test on By. Forename = no and By = no lead to Non-Bio leaves; By = yes leads to a Bio leaf.]

Figure 2.9: Decision Tree Rules Example.
IF family name = yes THEN Biographical
IF family name = no AND he = yes THEN Biographical
IF family name = no AND he = no AND she = yes THEN Biographical
IF family name = no AND he = no AND she = no AND her = yes THEN Biographical
IF family name = no AND he = no AND she = no AND her = no AND born = yes THEN Biographical
IF family name = no AND he = no AND she = no AND her = no AND born = no AND forename = no THEN Non-biographical
IF family name = no AND he = no AND she = no AND her = no AND born = no AND forename = yes AND by = yes THEN Biographical
IF family name = no AND he = no AND she = no AND her = no AND born = no AND forename = yes AND by = no THEN Non-biographical

Consider Example 2.1. The root node of the decision tree in Figure 2.8 is family name. If we assume that "Smith" has been correctly identified as a family name, then the sentence will be classified as biographical.

(2.1) The prize winning painter was Bill Smith.

Consider Example 2.2. Beginning at the root node (family name), the first test is answered no. The sentence does not contain "he", and therefore the second test answers no, leading to the third test ("she"), which is again answered no, and likewise the fourth test ("her"). The fifth test is answered yes, reaching the node by, and the sentence is classified as non-biographical. These rules are presented in Figure 2.9.

(2.2) The prize winning painter was born in 1972.
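To show how such a tree functions as an executable decision procedure, the following Python sketch encodes the rule set of Figure 2.9 as a cascade of tests over binary sentence features. The dictionary-based feature representation and function name are assumptions made for illustration; this is not the representation used elsewhere in the thesis.

def classify_sentence(f):
    """Classify a sentence represented as a dict of binary genre features,
    following the rule paths of Figure 2.9 (root test: family_name)."""
    if f["family_name"]: return "biographical"
    if f["he"]:          return "biographical"
    if f["she"]:         return "biographical"
    if f["her"]:         return "biographical"
    if f["born"]:        return "biographical"
    if not f["forename"]: return "non-biographical"
    return "biographical" if f["by"] else "non-biographical"

# Example 2.1, "The prize winning painter was Bill Smith.", with "Smith"
# recognised as a family name and "Bill" as a forename:
example_2_1 = {"family_name": True, "he": False, "she": False,
               "her": False, "born": False, "forename": True, "by": False}
print(classify_sentence(example_2_1))   # biographical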
Decision Tree Induction

Decision trees are learned by a recursive procedure which branches on the "best" available feature. The procedure stops either when all training instances are correctly classified or when all features have been considered down a particular branch. Figure 2.10 shows an idealised binary decision tree that illustrates this. The root of the tree shows the feature which best predicts target class membership. As the features used are binary (yes or no), the instances are split into two groups according to whether Feature K has the value yes or no, and the test is then repeated for each of the descendant nodes of Feature K; for example, of the 1400 instances for which Feature K has the value yes, Feature N is the feature that provides the most powerful classification.

Figure 2.10: Example Decision Tree for 3000 Instances. [The root node Feature_K splits the 3000 instances into 1400 (yes) and 1600 (no); the yes branch is tested further on Feature_N (splitting into 500 and 900 instances) and the no branch on Feature_P (splitting into 750 and 850 instances), with each branch terminating in a Class A or Class B leaf.]

At each decision point the feature which predicts the target class most effectively must be selected. This can be done naively (for example, the One Rule algorithm discussed above simply selects the feature which agrees with the target class most often) or by using more sophisticated methods. Information gain is regarded as providing a better gauge of the classification potential of a feature than relying on accuracy alone (Mitchell, 1997). The information gain metric depends on the more basic concept of entropy, a quantification of the uncertainty among different possible outcomes (Shannon, 1948). In the case of binary classification, the entropy of a set of instances S with respect to a target classification can be calculated using Equation 2.3, where |S| is the number of instances in S, S_yes are the instances in S classified as "yes" with respect to the target classification, and S_no are the instances classified as "no". Entropy is 1 if equal numbers of instances belong to each class (maximum uncertainty) and 0 if all the instances belong to the same class.

(2.3) \mathrm{Entropy}(S) = -\frac{|S_{yes}|}{|S|}\log_2\frac{|S_{yes}|}{|S|} - \frac{|S_{no}|}{|S|}\log_2\frac{|S_{no}|}{|S|}

The information gain of a given feature is the forecast reduction in uncertainty when that feature is used to classify the data: the higher the information gain, the more discriminating the feature with respect to the classification task. The information gain for a binary-valued feature F with respect to a set of instances S is given by Equation 2.4, where S_{F=yes} are the instances in S for which F has the value "yes" and S_{F=no} are the instances for which F has the value "no".

(2.4) \mathrm{Gain}(S, F) = \mathrm{Entropy}(S) - \frac{|S_{F=yes}|}{|S|}\mathrm{Entropy}(S_{F=yes}) - \frac{|S_{F=no}|}{|S|}\mathrm{Entropy}(S_{F=no})
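A minimal Python sketch of these two quantities, assuming binary (True/False) class labels and binary feature values; the function names and the toy data are illustrative only.

import math

def entropy(labels):
    """Entropy of a collection of binary class labels (Equation 2.3)."""
    n = len(labels)
    p_yes = sum(labels) / n
    p_no = 1.0 - p_yes
    return -sum(p * math.log2(p) for p in (p_yes, p_no) if p > 0)

def information_gain(feature_values, labels):
    """Information gain of a binary feature with respect to the labels (Equation 2.4)."""
    n = len(labels)
    subset_yes = [c for f, c in zip(feature_values, labels) if f]
    subset_no = [c for f, c in zip(feature_values, labels) if not f]
    remainder = sum(len(s) / n * entropy(s) for s in (subset_yes, subset_no) if s)
    return entropy(labels) - remainder

# Toy data: does the sentence contain a family name, and is it biographical?
has_family_name = [True, True, True, True, False, False]
is_biographical = [True, True, True, False, False, False]
print(information_gain(has_family_name, is_biographical))  # roughly 0.46 bits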
Overfitting, where a decision tree is built around the idiosyncrasies of the training data and so fails to perform optimally on exposure to new data, is an endemic problem for decision trees. The C4.5 algorithm (Quinlan, 1993) post-processes the decision tree (post-pruning) in an attempt to address this problem. Decision trees are first converted to rules (as stated above, each path from root node to leaf constitutes a rule); in this representation, the antecedents of the rules are conjunctions of terms.

(2.5) IF (family name = yes) AND (he = no) AND (she = yes) THEN Biographical

Consider Example 2.5. If removing the first term (family name = yes), or both the first and second terms ((family name = yes) AND (he = no)), results in the rule classifying more accurately, then these initial terms are abandoned. In other words, if the third term (she = yes) alone classifies with greater accuracy than all three conditions together, the first two conditions are jettisoned. While some algorithms use a validation data set (a collection of instances distinct from the test and training data) for post-pruning, C4.5 uses statistical tests to identify those rules that are not contributing to classification accuracy and trims them.

Ripper Rule Learning Algorithm

Propositional rule learning algorithms produce IF . . . THEN . . . rules that account for the positive instances in a training set (the Weka implementation of the Ripper algorithm, JRip, was used). The learning process conventionally has two stages, rule production and rule post-processing:

1. Rule generation is an iterative process. First, the maximally discriminant feature with respect to the classification task is identified (often using the information gain metric described above) and converted into an antecedent of an IF . . . THEN . . . rule. For example, if the feature family name is identified as the feature with most potential, then the initial rule would be IF family name = yes THEN biographical = yes. The remaining features are then tested in conjunction with the established rule, and the feature which predicts biography best is selected as an additional antecedent (see Figure 2.11 below for an example). This is repeated until the desired level of performance is achieved. After each complete rule has been learnt, every instance successfully covered by that rule is removed from the training data, and the process of single rule learning begins again on the diminished training data. This continues until the desired number of rules has been produced, or all the data is correctly classified.

2. Like decision trees, rules derived by rule based learning are subject to overfitting. Heuristics can be used to identify and prune those rules (or parts of rules) that tend to reduce the accuracy of the rule set on unseen data.

Figure 2.11: Rule Based Learning Example. [Candidate single-antecedent rules for the target bio = yes (for example, IF the = yes THEN bio = yes, IF the = no THEN bio = yes, IF family_name = yes THEN bio = yes, IF family_name = no THEN bio = yes); the best candidate, IF family_name = yes THEN bio = yes, is then extended with a second antecedent such as date = yes, date = no, past_tense = yes or past_tense = no.]

The rule learning algorithm used in this work was Ripper (Repeated Incremental Pruning to Produce Error Reduction) (Cohen, 1995). Ripper's major innovation is in the post-pruning of rules: for each rule, a series of competing rules is constructed using heuristics, each competing rule is evaluated with respect to its classification accuracy on the data set, and the best rule is selected.
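The following Python sketch illustrates the flavour of the first part of stage 1 (greedy growth of a single rule); it is a simplified illustration of propositional rule learning in general, with invented function names and toy data, and does not reproduce Ripper's actual growing, covering and pruning heuristics. In a full sequential covering learner, the instances covered by the returned rule would then be removed and the process repeated.

def covers(rule, instance):
    """A rule is a dict mapping feature name -> required boolean value."""
    return all(instance[f] == v for f, v in rule.items())

def precision(rule, data):
    """(Proportion of covered instances that are positive, number covered)."""
    covered = [label for inst, label in data if covers(rule, inst)]
    return (sum(covered) / len(covered), len(covered)) if covered else (0.0, 0)

def grow_rule(data, features):
    """Greedily add the antecedent that most improves precision on the training data."""
    rule = {}
    while True:
        candidates = [dict(rule, **{f: v})
                      for f in features if f not in rule for v in (True, False)]
        if not candidates:
            return rule
        best = max(candidates, key=lambda c: precision(c, data))
        if precision(best, data) <= precision(rule, data):
            return rule                      # no candidate improves the rule
        rule = best
        if precision(rule, data)[0] == 1.0:
            return rule                      # rule now covers only positive instances

# Toy training data: (features, is_biographical)
data = [
    ({"family_name": True,  "past_tense": True},  True),
    ({"family_name": True,  "past_tense": False}, True),
    ({"family_name": False, "past_tense": True},  False),
    ({"family_name": False, "past_tense": False}, False),
]
print(grow_rule(data, ["family_name", "past_tense"]))
# {'family_name': True}  i.e. IF family_name = yes THEN biographical = yes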
Naive Bayes Learning Algorithm

The Naive Bayes classifier is a popular and computationally inexpensive tool for text classification (John and Langley, 1995; Mitchell, 1997; Witten and Frank, 2005); the Weka implementation (NaiveBayes) was used. It differs from the methods considered so far in that it does not produce human readable representations in the form of rules or trees. The algorithm is Bayesian in that it relies on Bayes' rule for calculating probabilities, and naive in that it assumes that the feature values are conditionally independent (and equally important) with respect to the classification task. Despite these simplifications the algorithm is used extensively because it is highly effective, easy to implement, and scales well to large data sets and large numbers of features. The Naive Bayes classifier is given in Equation 2.6 (based on Mitchell (1997)), where C is the set of all classes, c is a target class, and f_1, ..., f_n are the feature values that constitute an instance.

(2.6) c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{i=1}^{n} P(f_i \mid c)

The attractive simplicity of the technique is best illustrated with a simple example. Table 2.3 shows five training sentences represented with five features (family name, forename, "he", "she" and "born"). Table 2.4 shows the representation of the test sentence to be classified.

Table 2.3: Example Training Sentence Representations.
Family Name   Forename   "he"   "she"   "born"   class
yes           no         yes    yes     yes      biographical
yes           no         yes    no      no       biographical
yes           yes        no     no      no       biographical
yes           no         no     no      yes      non-biographical
no            yes        yes    yes     yes      non-biographical

Table 2.4: Example Test Sentence Representation.
Family Name   Forename   "he"   "she"   "born"   class
yes           no         no     yes     yes      ???

Instantiating Equation 2.6 with the data in Table 2.4 yields Equation 2.7.

(2.7) c_{NB} = \operatorname*{argmax}_{c \in \{bio,\, non\text{-}bio\}} P(c)\, P(familyname{=}yes \mid c)\, P(forename{=}no \mid c)\, P(he{=}no \mid c)\, P(she{=}yes \mid c)\, P(born{=}yes \mid c)

The required probabilities can be derived from the training data in Table 2.3: there is a 0.6 (3/5) probability of a sentence being biographical and a 0.4 (2/5) probability of a sentence being non-biographical, and the conditional probabilities are straightforward to calculate. For example, P(familyname = yes | biographical) = 3/3 = 1, and P(familyname = yes | non-biographical) = 1/2 = 0.5. The scores for the two classes given by Equation 2.7 are:

Biographical class: 0.6 × (3/3) × (2/3) × (1/3) × (1/3) × (1/3) ≈ 0.015
Non-biographical class: 0.4 × (1/2) × (1/2) × (1/2) × (1/2) × (2/2) = 0.025

As the non-biographical class yields the higher score, the test instance in Table 2.4 is classified as non-biographical. Note that the figures derived are not genuine probabilities, although probabilities can easily be obtained by normalising the scores (see Mitchell (1997)).
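The calculation above can be reproduced in a few lines of Python. This is a minimal sketch of the scoring in Equation 2.6 (estimating probabilities by simple counting, without smoothing); it is not the Weka NaiveBayes implementation used in the thesis, and the function names are illustrative.

from collections import defaultdict

def train_naive_bayes(instances, labels):
    """Estimate P(c) and P(feature = value | c) by counting."""
    priors, cond = defaultdict(float), defaultdict(float)
    for inst, label in zip(instances, labels):
        priors[label] += 1
        for feature, value in inst.items():
            cond[(feature, value, label)] += 1
    n = len(labels)
    return ({c: k / n for c, k in priors.items()},
            {key: k / priors[key[2]] for key, k in cond.items()})

def score(instance, cls, priors, cond):
    """P(c) times the product of P(f_i = v_i | c), as in Equation 2.6."""
    s = priors[cls]
    for feature, value in instance.items():
        s *= cond.get((feature, value, cls), 0.0)
    return s

# The training sentences of Table 2.3 (1 = yes, 0 = no) and the test sentence of Table 2.4.
train = [dict(zip(["family_name", "forename", "he", "she", "born"], row))
         for row in [(1, 0, 1, 1, 1), (1, 0, 1, 0, 0), (1, 1, 0, 0, 0),
                     (1, 0, 0, 0, 1), (0, 1, 1, 1, 1)]]
labels = ["bio", "bio", "bio", "non-bio", "non-bio"]
priors, cond = train_naive_bayes(train, labels)

test = {"family_name": 1, "forename": 0, "he": 0, "she": 1, "born": 1}
for cls in ("bio", "non-bio"):
    print(cls, round(score(test, cls, priors, cond), 4))   # bio 0.0148, non-bio 0.025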
A disadvantage of the Naive Bayes algorithm is that, because it assumes the feature values are independent of each other, it does not perform well on data sets that exhibit strong interdependence between features. When applied to textual data, however, the algorithm is highly competitive with decision tree based algorithms (Lewis, 1992b).

Support Vector Machine Learning Algorithm

Support vector machines (SVMs) are a comparatively recent innovation in the classification literature (Cortes and Vapnik, 1995), and there is evidence that SVMs perform well in text categorisation tasks (Joachims, 1998). The Weka implementation of the support vector machine learning algorithm (SMO) was used; see Witten and Frank (2005) for a simplified account of SVMs. As SVMs are an extension of traditional linear models, this section first describes some relevant features of linear models for text classification, before briefly describing the particular characteristics of SVMs. As SVMs are considerably more complex mathematically than the other algorithms considered, their treatment here is less comprehensive; the reader is referred to Burges (1998) for an extensive tutorial introduction to the use of SVMs for classification.

Linear Models

This section relies heavily on Witten and Frank (2005). Linear regression is a technique that can be used to classify numeric data, where the aim is "to express the class as a linear combination of the attributes with predetermined weights" (Witten and Frank, 2005); see Equation 2.8, where x is the predicted class value, a_1, ..., a_k are the feature values and w_0, ..., w_k are the weights.

(2.8) x = w_0 + w_1 a_1 + w_2 a_2 + \dots + w_k a_k

According to Witten and Frank (2005), "One way of looking at multi-response linear regression is to imagine that it approximates a numeric membership function for each class. The membership function is 1 for instances that belong to that class and 0 for other instances. Given a new instance we calculate its membership for each class and select the biggest."

Support Vector Machines

One obvious problem with straightforward linear regression is that it assumes the instance space can be characterised linearly, which is by no means clear for text categorisation tasks (Joachims, 1998). Support vector machines avoid this problem by mapping the original instance space into a new instance space that can be characterised linearly. SVMs determine a hyperplane that is optimally discriminatory with respect to the two classes; that is, a hyperplane that gives the greatest separation between the classes (Witten and Frank, 2005). (A hyperplane divides n-dimensional space; in one, two and three dimensions, hyperplanes are normally referred to using the more familiar terms point, line and plane respectively, but when the number of dimensions is greater than three the more general term hyperplane is used.) The support vectors are those instances closest to the hyperplane (that is, the instances that define it), of which there must be at least one for each class. This optimal hyperplane is referred to in the literature as the maximum margin hyperplane. We can think of SVMs as an optimisation of straightforward linear methods.
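Returning for a moment to the linear models that SVMs build on, the membership-function view can be made concrete with a small Python sketch of the scoring step only (the weights below are invented for illustration; fitting the weights, the maximum margin optimisation and the kernel mapping are not shown).

def linear_score(weights, features):
    """w0 + w1*a1 + ... + wk*ak for one class (Equation 2.8)."""
    return weights[0] + sum(w * a for w, a in zip(weights[1:], features))

def classify(per_class_weights, features):
    """Multi-response linear models: score each class and select the largest."""
    return max(per_class_weights,
               key=lambda cls: linear_score(per_class_weights[cls], features))

# Illustrative weights for two classes over three binary features
# (family name, third person pronoun, "born"); the values are invented.
weights = {
    "biographical":     [-0.2, 0.9, 0.6, 0.8],
    "non-biographical": [ 0.4, 0.1, 0.2, 0.0],
}
print(classify(weights, [1, 0, 1]))   # biographical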
2.5.2 Evaluating Learning

This section briefly surveys some issues in the evaluation of classifiers (here "classifier" refers to a particular combination of learning algorithm, feature set and data set). First, methods for assessing the accuracy of different classifiers are presented. Second, statistical methods for comparing the performance of different classifiers are described.

Assessing Accuracy

Accuracy is the percentage of instances correctly classified. The two standard methods for assessing the accuracy of classifiers are training/test evaluation and cross-validation. Both techniques require a separation of training and test data, as evaluating a classifier on the data that was used to train it is unlikely to reflect its accuracy on unseen data.

Training/Test Evaluation
In training/test evaluation, the data is divided into two groups, training data and test data, each stratified to reflect the class distribution in the entire data set. Normally between two thirds and three quarters of the data is used for training, and the remaining portion is used for testing the trained classifier.

Cross-Validation
Cross-validation is an extension of the simpler training/test method. The data is divided into k equally sized sections, or folds, each stratified to reflect the class distribution in the wider data set, and each fold serves in turn as test data for a classifier trained on the remaining folds. The final accuracy score is the mean accuracy over all k runs. For example, if stratified 4-fold cross-validation is adopted, the data is divided into quarters, each containing a class distribution that reflects the data set as a whole (that is, stratified). Each quarter is "held out" in turn while the classifier is trained on the remaining three quarters of the data set, and accuracy is the mean accuracy of the four runs. In this way, all the data is used to maximum effect as both training and test data. In order to obtain more reliable results, multiple cross-validations are performed: for instance, 10 x 4-fold cross-validation requires that 4-fold cross-validation be performed ten times, with the average of the ten runs giving a better estimate of classification accuracy. Note that for each of the ten runs the (stratified) data is randomly reallocated to folds.
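A minimal sketch of repeated stratified cross-validation, using scikit-learn rather than the Weka toolkit used in the thesis; the toy data and the choice of a Bernoulli Naive Bayes classifier are assumptions for illustration only.

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5))     # 100 sentences, 5 binary features (toy data)
y = np.array([1] * 60 + [0] * 40)         # 60 biographical, 40 non-biographical

cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=10, random_state=0)   # 10 x 4-fold
scores = cross_val_score(BernoulliNB(), X, y, cv=cv, scoring="accuracy")
print(scores.mean())                      # mean accuracy over the 40 train/test runs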
Statistical Significance

While accuracy scores are useful for comparing the performance of classifiers, we sometimes require a more reliable indicator that one classifier performs better than another. In this situation we must turn to statistical tests. The paragraphs below outline the context in which statistical tests are used (hypothesis testing) before reviewing some particular issues in evaluating classifiers with statistical tests.

Hypothesis Testing

Hypothesis testing requires the identification of two hypotheses, the null hypothesis and the experimental hypothesis (see Oakes (1998), a textbook covering basic and more advanced statistics applied to corpus linguistics). In the context of this research, the null hypothesis could be, for example, "Classifiers A and B are equally accurate", and the experimental hypothesis could be "There is a difference in classification accuracy between Classifier A and Classifier B." To put this another way, the null hypothesis claims that any observed difference between Classifiers A and B is merely due to chance, while the experimental hypothesis claims that the results of the two classifiers are drawn from different populations. Rather than attempting to prove the experimental hypothesis, the researcher's aim is (generally) to disprove the null hypothesis. The usual convention allows the rejection of the null hypothesis if the probability of observing the data, given the null hypothesis, is less than 0.05 (this cut-off point is referred to as a p-value). A Type 1 error is said to occur if the null hypothesis is rejected when it is in fact true, and a Type 2 error is said to occur when the null hypothesis is accepted when it is in fact false. Another way of putting this is that a Type 1 error occurs when a test indicates a difference where there is none, and a Type 2 error occurs when a test indicates no difference where a difference does exist. A one-tailed experimental hypothesis specifies the direction of the difference (that classifier A is better, or worse, than classifier B); a two-tailed hypothesis claims that there is a difference between the two classifiers, but does not specify its direction.

Evaluating Classifiers with Statistical Tests

Dietterich (1998) reviews the performance of five different statistical tests for comparing classifiers. Each test was used on three sets of data using two different machine learning algorithms. The five tests were:

1. The McNemar test.
2. A test for the difference of two proportions.
3. The resampled paired t-test.
4. The k-fold cross-validated paired t-test.
5. The 5x2cv paired t-test.

Dietterich (1998) concluded that the resampled t-test should never be used, as the likelihood of obtaining Type 1 errors is unacceptably high. Similarly, the k-fold cross-validated paired t-test should not be used, as it too produces too many Type 1 errors. Dietterich (1998) recommends the use of either the 5x2cv paired t-test or the less computationally expensive McNemar test. Nadeau and Bengio (2003) compare two further statistical tests against those identified by Dietterich (1998), and found that the corrected resampled t-test was associated with a much lower likelihood of generating Type 1 errors. Taking Dietterich (1998) and Nadeau and Bengio (2003) as a starting point, Bouckaert and Frank (2004) suggest that repeatability is an important criterion in selecting a statistic, alongside "appropriate Type 1 error and low Type 2 error". According to the empirical study presented in Bouckaert and Frank (2004), the corrected resampled t-test does not produce high Type 2 error (in contrast to the McNemar test) and, as reported by Nadeau and Bengio (2003), is associated with a much lower probability of Type 1 error than the uncorrected resampled t-test. Additionally, Bouckaert and Frank (2004) show that the corrected resampled t-test is highly repeatable, especially when used in conjunction with 10 x 10-fold cross-validation. On the basis of this research, the corrected resampled t-test, in conjunction with 10 x 10-fold cross-validation, was adopted as the significance test for evaluating classifiers in this work. (The Weka machine learning toolkit includes an implementation of the corrected resampled t-test; this implementation has its limitations, however, and a Perl implementation of the test was used for the bulk of this research.) The reader is referred to Appendix G for the formula for the corrected resampled t-test and details of the implementation used.
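The formula actually used in this work is given in Appendix G; as a rough orientation, the Python sketch below follows the form of the statistic as it is commonly stated following Nadeau and Bengio (2003), with illustrative input values.

import math
from statistics import mean, variance

def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t statistic for per-run accuracy differences between
    two classifiers (commonly stated form, after Nadeau and Bengio, 2003)."""
    n = len(diffs)                              # e.g. 100 differences for 10 x 10-fold CV
    correction = 1.0 / n + n_test / n_train     # replaces the naive 1/n variance term
    return mean(diffs) / math.sqrt(correction * variance(diffs))

# Example: accuracy differences (classifier A minus classifier B) over repeated
# cross-validation runs, with 9/10 of the data used for training in each run.
diffs = [0.02, 0.01, 0.03, 0.00, 0.02, 0.01, 0.04, 0.02, 0.01, 0.03] * 10
t = corrected_resampled_t(diffs, n_train=900, n_test=100)
print(round(t, 2))   # compare against a t distribution with len(diffs) - 1 degrees of freedom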
2.5.3 Feature Selection

Feature selection, the process of minimising the number of features necessary for a classification task without reducing accuracy, is conducted for two main reasons, the first pragmatic and the second more principled. (Feature selection methods can also be used to identify keywords representative of a corpus of interest, given a corpus of texts from the genre of interest and a more general, normally larger, reference corpus.)

1. Minimising processing times. Although some learning algorithms scale well to large numbers of features (for example, Naive Bayes and SVMs), others, such as decision trees, are not so flexible.

2. Removing "noise" from the data. Here "noise" refers to those features that lack discriminatory power with respect to the classification task. For example, the unigram "the" is removed in much topic orientated text classification, as it is regarded as unlikely to contribute to classification accuracy.

Yang and Pedersen (1997) showed that aggressive feature selection can increase classification accuracy for certain kinds of texts (newswire articles). They discarded 98% of the unigram features from the Reuters corpus and retained only the 2% identified as optimal by the chi-squared (χ²) method (discussed below); the result was an increase in classification accuracy, attributed to noise reduction. Witten and Frank (2005) provide an overview of feature selection in general, Forman (2002) surveys feature selection for text classification in particular, and Guyon and Elisseeff (2003) review various feature selection algorithms, particularly information-theoretic approaches to the selection problem.

The χ² feature selection algorithm was used extensively in the current work: it has proven success in text classification applications (Oakes et al., 2001; Kilgarriff and Rose, 1998), it is not computationally intensive, and it is straightforward to understand. In this context, the χ² algorithm is designed to identify those features that are most characteristic of one class with respect to a second class; so, in the biographical example, the algorithm identifies those features most characteristic of the biographical class through contrast with the non-biographical class. The following description of the algorithm is based largely on Oakes et al. (2001) and Oakes (1998).

The technique requires the construction of a contingency table for each feature. If we have two sets of sentences, one biographical (the BC) and one non-biographical (the NBC), the union of these two classes is the combined collection of sentences (the CC). A contingency table of observed frequencies is constructed for each feature f; its constituents are listed in Figure 2.12, and a representation of the contingency table is shown in Table 2.5.

Table 2.5: Contingency Table.
                  Biographical Class   Non-Biographical Class
Feature f         a                    b
Not feature f     c                    d

Figure 2.12: Constituents of a Contingency Table for feature f.
a = the frequency of instances of feature type f in the BC
b = the frequency of instances of feature type f in the NBC
c = the sum of the frequencies of instances of all feature types in the BC apart from instances of feature type f
d = the sum of the frequencies of instances of all feature types in the NBC apart from instances of feature type f

Once the observed frequencies have been calculated for each feature, the expected frequencies can be ascertained. The expected frequency of a feature is the frequency one would expect given the size of the corpus and the rarity of the word, and it is straightforward to calculate for each position in the contingency table: the expected frequency of the cell in row i and column j is the product of the row i total and the column j total, divided by the grand total N = a + b + c + d (Equation 2.9). For example, the expected frequency for cell a is given by Equation 2.10.

(2.9) E_{ij} = \frac{R_i \times C_j}{N}

(2.10) E_a = \frac{(a + b)(a + c)}{N}

The observed and expected frequency tables enable us to calculate the χ² statistic directly: for each of a, b, c and d, if O is the observed frequency and E is the expected frequency, then χ² is the sum of (O − E)²/E over the four table elements.
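A minimal Python sketch of this calculation for a single feature over the 2 x 2 table above; the counts in the example are invented for illustration, and this is the standard χ² computation rather than the exact implementation used in the thesis.

def chi_squared(a, b, c, d):
    """Chi-squared statistic for the 2x2 contingency table [[a, b], [c, d]]
    (feature frequency in BC/NBC versus all other feature tokens)."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [(a + b) * (a + c) / n,   # E_a: row total x column total / N
                (a + b) * (b + d) / n,   # E_b
                (c + d) * (a + c) / n,   # E_c
                (c + d) * (b + d) / n]   # E_d
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Example: the feature "born" occurs 120 times in the biographical sentences and
# 30 times in the non-biographical sentences (other feature tokens: 9880 and 9970).
score = chi_squared(120, 30, 9880, 9970)
print(round(score, 1), score > 3.84)   # 3.84 is the 95% critical value for one degree of freedom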
If the χ² value for a feature exceeds the critical value of 3.84 (for one degree of freedom), we can be 95% confident that the feature occurs more frequently in one of the two categories. This information can be used to rank features according to their discriminating power (that is, from most to least discriminatory), with a cut-off point chosen either by capping the number of features (for example, the 500 most discriminating features) or by selecting an appropriate significance level (for example, those features we are 95% confident occur more frequently in one of the classes). Note that features for which the expected frequency E of any table element is less than five should be discarded, as χ² becomes unreliable at low frequencies. Other methods for selecting features, which can handle low frequency features, include the log-likelihood method (Dunning, 1993), used widely in corpus and computational linguistics as it delivers good results with minimal data, and the Fisher exact test (Fisher, 1922), which also employs a contingency table and can be used with low frequencies. χ² was used in this work because we are primarily interested in those features that are maximally discriminatory within a very large population of features, and features that occur very infrequently (with an expected frequency of less than five) are unlikely to be helpful. (See Forman (2002) for an extensive empirical study comparing the performance of different feature selection methods for text classification.)

2.6 Conclusion

This chapter has elaborated on some background themes necessary for understanding the thesis as a whole, particularly the notions of style, genre, biography, classification and machine learning. The related notions of genre and style become important in later chapters as we investigate and assess possible representations for biographical genre classification; these investigations employ machine learning algorithms as a vital methodological tool. The next chapter reviews recent work in automatic text classification by genre and, in addition, surveys some systems that produce biographies.

Chapter 3
Review of Recent Computational Work

The first part of this chapter reviews recent literature relevant to automatic genre classification. The second part reviews some working systems that produce biographies, and relates them to the current work.
3.1 Automatic Genre Classification This section reviews recent work on the study of genre from the perspective of computational linguistics, then goes on to examine recent important literature in feature selection for topic orientated text classification. Finally, recent work on feature selection for genre classification is surveyed. 3.1.1 Recent Work on Genre in the Computational Linguistics Tradition Recent work on genre in the computational linguistics tradition is geared towards practical problems, rather than foundational questions. For a description of two theoretical perspectives on genre see Section 2.1.1 on page 8 and Section 2.1.2 on page 11, which describe Systemic Functional Grammar and the Multi-Dimensional Approach respectively. Research effort in the computational linguistics tradition is focused on finding the optimal method for distinguishing between genres computationally, rather than constructing a theoretical edifice (complete with rigorous definitions of genre and associated concepts). Recently, however, there have been attempts at placing the computational study of genre within the theoretical framework of computational stylistics (for example, K ARLGREN (2004) and A RGAMON ET AL . (2003)). The notion of stylistics (see Section 2.2 on page 22) gives us the basis from which we can ground a 53 C HAPTER 3: R EVIEW OF R ECENT C OMPUTATIONAL W ORK better understanding of genre. W HITELAW and A RGAMON (2004) (see page 54) is particularly interesting as it grounds the computational study of genre in the Systemic Functional Grammar research tradition. K ARLGREN (2004) defines a style as the result of an author’s “consistent and distinguishable tendency” to favour certain lexical and syntactic patterns in their writing, to structure material in a given way, and to write with a certain kind of audience in mind. In its turn, a genre (described again by K ARLGREN (2004) as “a vague but well established term”) is defined as a collection of documents that are stylistically consistent, accepted as belonging together by a sophisticated reader, familiar with the genre in question. For instance, leader articles in traditional UK broadsheet newspapers fulfil both these requirements; they are stylistically similar (use of persuasive language and argumentation) and also form a coherent grouping to those familiar with the genre. The two conditions tend to mirror each other; the more stylistically similar a group of documents are, the better they hang together as a genre. For instance, if we extend our newspaper leader example beyond traditional UK broadsheet newspapers, to include leaders in UK newspapers, we can see that although tabloid leaders (like broadsheet newspapers) contain argumentation and persuasive language, their syntactic structure is different (more contractions, shorter sentences, and so on). Hence, readers familiar with both text sources (tabloid and broadsheet) are less likely to recognise them as belonging to the same genre. W HITELAW and A RGAMON (2004) place genre recognition within the context of Systemic Functional Grammar (SFG) as SFG provides a mechanism for representing the stylistic meaning of the text (see Section 2.1.1 on page 8). The stylistic meaning of a text is that component of the text’s meaning that is non-topical (that is, syntactic patterns, document structure, and so on). In other words, the stylistic meaning is the residue meaning once the – in the terms of W HITELAW and A RGAMON (2004) — denotational meaning is removed. 
A distinctive feature of SFG is its emphasis on choice: language is characterised in terms of a text creator's choices between different constructions at different points in the writing process, and these choices are represented as system networks, a representation that allows each word or phrase to be tagged with the choices that led to its selection (see Figure 3.1 for an example of a system network presented by Whitelaw and Argamon (2004)). Note that, as system networks are primarily designed for language description by linguists, they are not optimised for computational representation. (Reading a system network from left to right, each choice can be understood as a disjunction that constrains subsequent choices until we arrive at the leaves of the tree, which are actual text elements. Taking Figure 3.1 as an example, we begin at the root, "conjunction", and choose one of three ways of expressing conjunction: "elaboration", "extension" or "enhancement". If we choose elaboration, the modes of expression under "extension" are no longer available to us; we have the choice of "apposition" or "clarification", and if we choose clarification, we have the choice of "rather", "in any case" or "specifically".) Whitelaw and Argamon (2004) concentrate on features likely to be revealing of genre (for example, different types of conjunction, pronouns and so on), using a computationally manageable representation based on system networks. The strength of the SFG approach for genre classification is that it recognises that documents function as documents, and are not merely strings of concatenated unigrams; the cost of this "whole document" approach is increased complexity. Whitelaw and Argamon (2004) give the example of how a letter (that is, a document belonging to the letter genre) can be identified using systemic features. If the token "Dear" appears near the beginning of a document, and a well-wishing phrase occurs later in the document (where well-wishing is a higher level construct including phrases like "yours sincerely" and "yours faithfully"), then this "Dear" plus well-wishing feature forms a useful attribute for genre categorisation, assuming that letters are one of the target genres. Stylistic features identified using the SFG approach were evaluated in the context of detecting fraudulent email (so-called "Nigerian" emails, which solicit money transfers to anonymous bank accounts). These Nigerian emails were contrasted with two other data sources (newswire text and texts from the British National Corpus sports and leisure domain) using the Pronoun Type systemic feature. The "Nigerian" emails differ from the other two categories in (to use SFG terminology) field (that is, topic: the "Nigerian" emails are about financial transactions) and tenor (that is, social relationships: the "Nigerian" emails attempt to establish a friendly rapport with the reader). Classification accuracy of 99.6% was achieved using a support vector machine learning algorithm; the performance of "bag-of-words" style approaches was not tested. Since the early 1990s the growth of the web has provided a new context for the study of genre. The synergy of technological development and cultural change has led to the rapid maturation of distinct genres that have no clear analogue outside the World Wide Web.
Examples here include the now traditional homepage, with its contact details, outlines of major professional or recreational activities, personal details and photographs. New web genres, like wikis and blogs, have also emerged. In the light of the particular constraints and opportunities provided by the web as a medium, it is unlikely that an optimal taxonomy of web genres will mirror that of established physical media. Attempts to provide a web genre taxonomy include Meyer zu Eissen and Stein (2004), which assesses the utility of the concept of genre in the context of information retrieval. Among other potential uses, search engines could use effective genre recognition to exclude documents belonging to genres of peripheral interest to the user's information need; for instance, a user searching for information on a certain model of car may wish to exclude web pages classified as advertising. Meyer zu Eissen and Stein (2004) conducted a user study, via questionnaire, which identified the most useful web genres from an information retrieval perspective: help, articles, discussion, shopping, portrayal (personal home pages, and so on), link collections and download sites. Another application of genre identification in information retrieval is suggested by Boese (2005): as some web genres consist of largely static content (for instance, personal homepages, which are normally updated infrequently) in comparison to highly dynamic genres (for example, newspaper home pages), spiders (search engine indexing agents) can be used more effectively if directed disproportionately towards pages with frequent content changes.

Figure 3.1: System Network from Whitelaw and Argamon (2004). [A system network for CONJUNCTION with three top-level choices: Elaboration (Apposition: "that is", "in other words", "for example"; Clarification: "rather", "in any case", "specifically"); Extension (Additive: "and", "or", "moreover"; Adversative: "but", "yet", "on the other hand"; Verifying: "besides", "instead", "or", "alternatively"); and Enhancement (Matter: "with regard to", "in one respect"; Spatio-Temporal: Simple "then", "next", "afterward" and Complex "soon", "meanwhile", "until now"; Manner: "similarly", "likewise"; Cause-Conditional: Causal "therefore", "consequently", "since" and Conditional "then", "albeit", "notwithstanding").]

3.1.2 Feature Selection for Topic Based Text Classification

The text classification community has expended a huge amount of research effort on exploring the most effective features for representing documents (an extensive bibliography on automatic text classification is maintained by Evgeniy Gabrilovich: http://www.cs.technion.ac.il/~gabr, accessed 02-01-07). A "raw" text document cannot be directly processed or interpreted by a machine learning algorithm; it must instead be mapped to a succinct, vector based representation, with each vector constructed from a value for each feature in the document (this value could be, for example, a raw frequency count, a binary indicator or a nominal value). All documents involved in the classification process, including those used for training and testing and those classified "in the wild", must be converted into this computationally tractable representation (Sebastiani, 2002) (see Figure 3.2). The simplest and most commonly used representation takes each word in the document collection as a feature, and constructs for each document a vector that reflects the presence or absence of each word: the so-called "bag-of-words" representation. For example, if a document collection contains the terms "elephant" or "gazelle", then each document will be represented by a document vector that includes "elephant" and "gazelle" features (again, see Figure 3.2).

Figure 3.2: Conversion of Documents to Document Vectors. [Documents are tokenised, frequency counts are collected for features such as "elephant" and "gazelle", and each document (doc 1, doc 2, doc 3, doc 4, ...) is output as a document vector of feature values such as 0 and 1.]
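A minimal Python sketch of this conversion, using binary presence/absence features over a toy vocabulary; it is a simplification for illustration only (whitespace tokenisation, no frequency weighting).

def build_vocabulary(documents):
    """Collect every distinct token in the collection as a feature."""
    return sorted({token for doc in documents for token in doc.lower().split()})

def to_vector(document, vocabulary):
    """Binary bag-of-words vector: 1 if the feature word is present, else 0."""
    tokens = set(document.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

docs = ["The elephant charged", "A gazelle grazed", "The elephant watched the gazelle"]
vocab = build_vocabulary(docs)
for doc in docs:
    print(to_vector(doc, vocab))
# Each row is a document vector over the features in vocab
# (compare the "elephant" and "gazelle" columns of Figure 3.2).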
An effective document representation will capture only those aspects of the documents relevant to the task at hand; it will "distill" the aspect of the document with maximum potential for classification. Since the early 1990s, however, more complex methods of representing documents have been assessed, including the use of word based n-grams (where n is greater than 1) and syntactic features (for instance, the use of noun phrases). An n-gram is a sequence of n items drawn from a larger sequence; 1-grams, 2-grams and 3-grams are normally referred to as unigrams, bigrams and trigrams respectively, while for n greater than three the generic term "n-gram" is usually retained. (Semantic representations, typically using hypernym or hyponym relations ascertained via WordNet, have also been used, but are not considered here.) An n-gram representation does not require the natural language processing tools (part-of-speech taggers, syntactic parsers, and so on) demanded by syntactic features; a simple word tokenizer is all that is required. n-gram features (where n > 1) can be thought of as pseudo-syntactic features, as they are able to partially represent common syntactic patterns without the need for computationally intensive syntactic analysis such as sentence parsing.

Fürnkranz (1998) points out that the number of possible n-gram types in a document collection increases exponentially with n. For example, the unigram type "the" is likely to occur with high frequency in any English language text; if we choose a bigram representation, then while the number of tokens remains essentially the same, the number of types increases with the number of distinct two word sequences beginning with "the". In general, the number of bigram types is potentially the square of the number of unigram types, and the number of trigram types potentially its cube. Identifying highly discriminatory bigrams in this situation is problematic. Fürnkranz (1998) suggests an algorithm for pruning features (that is, reducing the number of features while retaining those with high discriminatory power): a multi-pass strategy is used, with n-grams retained only if their constituent (n−1)-grams met a predetermined frequency threshold in the previous pass. In other words, for an n-gram feature "the authors are" to be included, the (n−1)-grams "the authors" and "authors are" must have been observed with a certain threshold frequency in the previous pass. The features were tested on two corpora (including Reuters) using the RIPPER algorithm (Cohen, 1995). This work indicates that while bigram and trigram features are more successful than unigram features alone, results begin to deteriorate once n exceeds three. (The area of feature selection has attracted much research attention; see Section 2.5.3 above, and Forman (2003) for a recent survey.)

In contrast to the frequency pruning method of feature reduction described above, Tan et al. (2002) use a combination of an information gain metric and frequency counts to select appropriate n-grams. Their feature selection algorithm first identifies those unigrams that have a frequency above a predetermined threshold in at least one of the categorisation groups, and then selects only those bigrams with a flagged unigram as a constituent. The resulting feature set (unigrams plus selected bigrams) was evaluated using the Naive Bayes algorithm on the Yahoo-Science and Reuters corpora; the bigram augmented representation improved classification performance compared with unigram features alone.
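A simplified Python sketch of this kind of constituent-based bigram filtering (illustrative threshold and toy data; it follows the general spirit of the pruning strategies described above rather than the exact procedure of either paper).

from collections import Counter

def select_bigrams(documents, min_unigram_freq=5):
    """Keep only bigrams both of whose constituent unigrams pass a frequency threshold."""
    unigram_counts = Counter(tok for doc in documents for tok in doc)
    frequent = {tok for tok, n in unigram_counts.items() if n >= min_unigram_freq}
    bigram_counts = Counter((a, b) for doc in documents for a, b in zip(doc, doc[1:]))
    return {bg: n for bg, n in bigram_counts.items()
            if bg[0] in frequent and bg[1] in frequent}

# Documents are pre-tokenised lists of words.
docs = [["he", "was", "born", "in", "york"],
        ["she", "was", "born", "in", "1841"],
        ["the", "committee", "was", "formed", "in", "1901"]]
print(select_bigrams(docs, min_unigram_freq=2))
# {('was', 'born'): 2, ('born', 'in'): 2}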
The use of syntactic phrasal representations, particularly noun phrases, would on the surface seem beneficial to text classification, as they allow more explicit representation of the conceptual content of a document, with less ambiguity (for example, the noun phrase "river bank" captures a different concept from either "river" or "bank" individually). This research theme has been pursued by a number of researchers (for example, Lewis (1992b)). In a comprehensive study analysing optimal representations in both an information retrieval and a text categorisation context, Lewis (1992b) found that the use of syntactic phrases as features (here "syntactic phrases" means noun phrases identified through part-of-speech tagging and the matching of selected part-of-speech categories) resulted in a deterioration in performance compared with the standard "bag of words" approach, using several different learning algorithms and the MUC-3 and Reuters corpora (see also Lewis (1992a)). The benefits and drawbacks of using syntactic phrases are also assessed by Moschitti and Basili (2004), who found "overwhelming evidence" that syntactic features failed to improve topic orientated classification accuracy. Two types of phrase were selected as features: proper nouns (identified using a capital-letter-sensitive grammar) and noun phrases (selected from each classification category). A subset of the Reuters newswire corpus was used for training and testing, and a variant of the support vector machine algorithm (Burges, 1998) was used for cross-validation. Moschitti and Basili (2004) report that phrasal representation is much less effective than a "bag of words" approach. Scott and Matwin (1999), in a series of experiments again using the Reuters newswire data (although a different subset from that used by Moschitti and Basili (2004)), reported that phrase based representations (in this case, noun phrases) failed to improve classification accuracy and concluded that "it is probably not worth pursuing simple phrase based representations further."

3.1.3 Feature Selection for Genre Classification

While the area of feature selection for text classification is vast, feature selection for genre categorisation remains a relatively under-researched area, despite a trickle of papers since around the mid-1990s.
Santini (2004b) provides a comprehensive survey of the state of the art in automatic genre classification, in addition to briefly reviewing the treatment of genre in the linguistics literature, using Biber's (1989) work on text typology as a starting point and emphasising the lack of agreement regarding basic terms (for example, the differing conceptualisations of "genre" in the literature). An early and much cited treatment of the area is provided by Karlgren and Cutting (1994), which locates genre classification within the wider goal of improving information retrieval (that is, post-processing information retrieval results with respect to genre). Discriminant analysis was used to classify a subset of documents from the Brown Corpus according to their allotted genre categories at three different levels of granularity (see Figure 3.3). On evaluation, classification accuracy was found to decrease with granularity (that is, the greater the number of genre categories, the lower the accuracy), leading the researchers to question the validity and usefulness (for these purposes) of the classification scheme. For example, in the Brown Corpus classification scheme used in the research, it is not obvious that there is a marked stylistic difference between the "mystery" and "adventure" genres.

Figure 3.3: Genre Categories Used in Karlgren and Cutting (1994). [The Brown Corpus categories at three levels of granularity: a top level split into informative, imaginative and miscellaneous texts; a middle level including press, fiction and non-fiction; and fine-grained categories such as reportage, editorials, reviews, religion, skills and hobbies, popular lore, belles lettres, general fiction, mystery, science fiction, adventure, romance, humour, government documents and scholarly articles.]

A distinctive quality of Karlgren and Cutting's (1994) approach is the use of syntactic features, based loosely on Biber's studies of text typology, but concentrating only on those features that can be reliably identified using a part-of-speech tagger (for example, prepositions and first person pronouns). The intuition here is that syntactic features are ideal for capturing the topic independent stylistic forms of a document, which is precisely what matters in genre classification. Kessler et al. (1997) stress the potential usefulness of effective genre recognition for a range of natural language processing applications. One of the examples discussed is word sense disambiguation: in some genres, particular word senses are unlikely to occur. For example, the word "pretty" is much more likely to have the sense "attractive" or "beautiful" in formal genres than in very informal, conversational genres, where "pretty" is typically used as a synonym for "rather". Two reasons are given for the lack of research attention directed at genre classification:

1. It was not until the mid 1990s and the rise of the World Wide Web that some kind of genre classification became desirable. Prior to that, classification techniques had been applied to detect topic, and genre had traditionally been identified through the source of documents.

2. Theoretical understanding of genre is limited, especially when compared with topic, which, although it has its own theoretical problems, has a more coherent and developed basis than genre (that is, there is agreement about what topic is, which cannot be said for genre). Indeed, even given a theoretical understanding of genre, it remains an open question whether techniques exist to identify genre specific features computationally.
K ESSLER ET AL . (1997) refers to features for text classification as “generic facets” which are indicated by “generic cues”. For Kessler, a facet “is simply a property that distinguishes a class of texts that answers to certain practical interests, and which moreover is associated with a characteristic set of computational or linguistic properties, whether categorised or statistical, which we will describe as “generic cues”. Four kinds of generic cues were used: 1. Structural cues (passives, nominalisations, syntactic features). 2. Lexical cues (for example, Mr, Mrs). 3. Character level cues (for example, punctuation). 4. Derivation cues (rates of lexical and character level features). Five genre categories were specified by K ESSLER ET AL . (1997): reportage, editorial, science and technology, legal, and fiction. Kessler concludes that their approach delivers reasonable classification accuracy, but it does not suggest that it is an improvement on existing, simpler methods. S TAMATATOS ET AL . (2000a) describes the use of various features on the classification by genre of a corpus of modern Greek texts. The corpus and genre classification scheme were prepared specifically for the experiment. A distinctive feature of the corpus is it is constructed from two hundred and fifty unfiltered web pages. The genres used were: press editorial, press reportage, academic prose, official documents, literature, recipes, curriculum vitaes, planned speech and broadcast news scripts. 61 C HAPTER 3: R EVIEW OF R ECENT C OMPUTATIONAL W ORK The corpus was split into equal sized test and training sets. Two classification algorithms were used: multiple regression and discriminant analysis. The textual representation is based on the stylometric characteristics of the texts. A natural language processing tool specifically designed for the processing of modern Greek text, SCBD, was used extensively in the study. The program first identifies sentences using a straightforward heuristic, then performs intersentential chunking using a multi-pass approach. The detected chunks are verb phrases, noun phrases, prepositional phrases and adverbial phrases. The final representation consists of twenty two features each consisting of the count of the number of stylistic features present. The features are divided into three categories; token level features, phrase level features and analysis level features. Token level features indicate the average length of sentences in the document, and density of punctuation. Phrase level features point to the relative number of phrase tokens in a text (for example, number of noun phrases identified divided by the total number of phrases in the document) or the average length of a given phrase type in a given document (for example, total number of words in noun phrases, divided by number of noun phrases). There are ten of these phrase based relational features. The distinctive part of this approach is shown in the analysis level, when information from the SCBD processing of each text is used in the construction of the feature representation. Nine features are identified here, including counts of the number of words left unanalysed after each pass of the chunking algorithm, and ratios of keywords to total number of words in the document. These features measure the syntactic complexity, and the ratio of unusual words in the document (respectively). The use of stylometric features alone produces classification accuracy of 88%. 
It is, however, difficult to compare this technique to a "bag of words" style representation. In contrast to the computationally intensive method of analysing Greek text explored in Stamatatos et al. (2000a), Stamatatos et al. (2000b) uses word frequency counts to identify those words that best serve as linguistic features. However, in contrast to previously employed methods for identifying candidate words (for example, Karlgren and Cutting (1994) and Kessler et al. (1997)), where genre categories are analysed to identify features, Stamatatos et al. (2000b) uses frequencies in unrestricted text (that is, texts that have not been pre-classified with respect to genre). Stamatatos et al. (2000b) used three feature sets on a corpus made up of four genres from the Wall Street Journal Corpus (editorial, letters to the editor, reportage, and sports news). The genres were identified using Wall Street Journal header types. There was an equal split between training and test data, and discriminant analysis was used as the classification algorithm. The three feature sets used were:
1. Most common words from each of the four genre categories in the Wall Street Journal Corpus.
2. Most frequent words from the British National Corpus.
3. Most frequent words in the British National Corpus, augmented by punctuation frequencies (97%).
Stamatatos et al. (2000b) found that feature set two was more successful than feature set one at accurately classifying the Wall Street Journal test data, but that feature set three (that is, the most frequent words in the British National Corpus augmented by punctuation frequencies) was most effective.

Like Karlgren and Cutting (1994), Finn and Kushmerick (2003) envisage genre recognition as important within an information retrieval framework, improving the user experience. The authors locate genre as an exclusively stylistic concept, independent of topic. Genre is concerned with "what kind of document it is, rather than what the document is about" (Finn and Kushmerick, 2003). The example of querying a search engine with "chaos theory" is given: someone preparing a research paper on astrophysics and a ten year old preparing a homework assignment require very different kinds of documents. Finn and Kushmerick (2003) used three feature sets in their work, together with a decision tree learning algorithm. The experiment attempted to distinguish between objective and subjective genres (that is, reportage and reviews respectively). The feature sets used were:
1. Bag-of-words — the standard text classification representation.
2. Part-of-speech statistics — each document is represented by thirty six features, one for each part-of-speech category. Instead of binary features, the features are percentages, reflecting the proportion of words of a given part-of-speech category in the document (for example, if 5% of the document's words are prepositions, the preposition feature will have the value 5).
3. Text statistics — average sentence lengths, frequency of function words, and so on.
Finn and Kushmerick (2003) found that classification performance could be enhanced using a "meta-classifier" combining models created from each feature set, but also found that no single text representation performed best over all genre categorisation tasks. Two classification scenarios were considered.
Part-of-speech based features proved most successful when attempting to distinguish between “objective” and “subjective” documents (for example, newspaper reportage and opinion pieces, respectively) at 84%. For the second classification task — assessing whether a review is either positive or negative7 — a unigram (“bag-of-words”) representation was most successful at 82%. 7 Classification according to the emotional content of a document is often referred to as sentiment analysis 63 C HAPTER 3: R EVIEW OF R ECENT C OMPUTATIONAL W ORK For M EYER ZU E ISSEN and S TEIN (2004), genre classification is defined as the discrimination of documents on the basis of “their form, style, or target audience”. Their approach uses a compact feature representation in conjunction with several classification algorithms to classify randomly selected web pages. Part of the work involved conducting a user study centred on the utility of classifying search engine results by genre. That is, whether post processing documents judged relevant by the search engine into different genre categories is an aid to information seekers. Different genre labels were presented to the participants who were asked to label them as “very useful”, “sometimes useful”, “not useful” and “don’t know”. The experiments’ eight highest scoring categories in terms of usefulness were: help, articles, discussion, shopping, portrayals of institutions, private portrayals, link collections and software downloads. M EYER ZU E ISSEN and S TEIN (2004) stresses that for the purposes of practical web genre classification, it must be possible to construct feature sets from web pages (documents) within the time constraints necessary for a “real time” system. Three levels of feature complexity are considered: 1. Low complexity: Character and word counts (for example, common words, punctuation, and so on.) 2. Medium complexity: Features that require dictionary look up, or make use of structural properties of HTML documents (for example, proportion of link tags to other kinds of tags). 3. High complexity: Features that are dependent on grammatical analysis or part-of-speech tagging. Two feature sets were contrasted, the first consisted of features from the low complexity and medium complexity categories, and the second makes use of features from each of the three complexity levels. The second feature set had the best accuracy, yielding accuracy levels of above 70%. S ANTINI (2004a) tested the use of part-of-speech trigrams for features as “trigrams are large enough to encode useful syntactic information, and small enough to be computationally manageable” (S ANTINI, 2004a). Ten genres from the British National Corpus were used in the work: conversation, interview, planned speech, public debate, academic prose, advertising, biography, instructional, popular lore, and reportage. Three feature sets of part-of-speech trigrams were used in conjunction with a Naive Bayes classifier, yielding encouraging results, although these results are not directly comparable to similar work because of differences in genre categories, training and test data, and classification algorithm. Classification accuracy as high as 82% was achieved. 
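To make the part-of-speech trigram representation concrete, the following Perl sketch counts tag trigrams in a single tagged sentence. The word/TAG input format and the Penn Treebank style tags are assumptions made for illustration; they are not taken from Santini (2004a), whose tagset, corpus handling and feature selection differ.

#!/usr/bin/perl
# Sketch of part-of-speech trigram extraction in the spirit of Santini (2004a).
use strict;
use warnings;

# A toy POS-tagged sentence ("word/TAG" tokens); the tags are assumed.
my $tagged = "He/PRP was/VBD born/VBN in/IN Paris/NNP in/IN 1841/CD ./.";

# Keep only the tags.
my @tags = map { (split m{/}, $_)[-1] } split ' ', $tagged;

# Count every consecutive triple of tags.
my %trigram_count;
for my $i (0 .. $#tags - 2) {
    my $trigram = join '_', @tags[$i .. $i + 2];
    $trigram_count{$trigram}++;
}

# Print trigram features, most frequent first.
for my $t (sort { $trigram_count{$b} <=> $trigram_count{$a} || $a cmp $b }
           keys %trigram_count) {
    print "$t\t$trigram_count{$t}\n";
}

In a real experiment the counts would be accumulated per document, with only the most frequent or most discriminative trigrams retained as features.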
While the use of syntactic representations has been shown not to improve classification accuracy in the context of general text classification (Lewis, 1992b; Moschitti and Basili, 2004; Scott and Matwin, 1999), it is possible that syntactic features may be especially useful for genre classification, as genre classification is less topic-orientated than most text classification tasks. The current experimental work on text representations for genre classification suggests that syntactic features may be useful (for example, Santini (2004a) and Stamatatos et al. (2000a)). It is, however, difficult to gain more than an impressionistic view of the issue, as the current experimental work uses different learning algorithms, different data-sets, different genre classification systems and different languages.

3.2 Systems that Produce Biographies

Most of the working systems reviewed here characterise the production of biographies from multiple documents as a summarisation problem8 (the Southampton ARTEQUAKT system — see page 78 — is a notable exception). Effective document summarisation has been a goal of computational linguistics and information retrieval work since the first digital computers (Luhn, 1958). A summary extracts what is most important from its source document, or as Spärck Jones (1999) puts it, a summary is "a reductive transformation of source text to summary text through content reduction by selection and/or generalisation of what is important in the source". Of course, what is judged important is heavily reliant on the context or task. This section first reviews some background issues in summarisation and its extension, Multiple Document Summarisation (MDS),9 as many biography production systems work within an MDS framework. Then, four systems that produce biographies are reviewed.

3.2.1 The Summarisation Task

Summaries and summarisation tasks can vary along a number of different dimensions (Mani, 2001a,b; Mani and Bloedorn, 1999; Maybury and Mani, 2001; Spärck Jones, 1999). The most important of these are compression rate, intended audience, relation to source document, function, coherence and language. The compression rate of a summary refers to the ratio of summary length to source document length, normally expressed in percentage terms. For example, a summary that is one tenth the length of the source document has a compression rate of ten percent; a summary that is nine tenths the length of the source document has a ninety percent compression rate.

The intended audience of a summary is important in determining appropriate content and is conventionally divided into two categories: user focused summaries, where either information or format is geared towards user interests, and generic summaries, which are not designed for a specific set of user interests.

8 Other approaches outside the summarisation paradigm exist, however. For instance, the biographical Question Answering system developed by Feng and Hovy (2005) produces answers to standard biographical questions from free text.
9 This section draws heavily on Mani (2001a), a textbook on summarisation.
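Returning to the compression rate defined above, the calculation itself is simple; the following Perl sketch makes the two worked examples explicit (the use of sentences as the unit of length is an assumption, since the definition applies equally to words or characters).

#!/usr/bin/perl
# Compression rate: summary length divided by source length, as a percentage.
use strict;
use warnings;

sub compression_rate {
    my ($summary_len, $source_len) = @_;
    die "source must be non-empty" unless $source_len > 0;
    return 100 * $summary_len / $source_len;
}

# A 20-sentence summary of a 200-sentence document.
printf "%.1f%%\n", compression_rate(20, 200);   # 10.0%
# A 180-sentence summary of the same document.
printf "%.1f%%\n", compression_rate(180, 200);  # 90.0%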
Summaries can differ in their relationship to the source document in two main ways; they can be made up of extracts from the source document (that is, extracted sentences) or they can take the form of the traditional abstract, which contains information about the source document that is not necessarily in the document itself. Abstracts, unlike extractive summaries, are normally coherent. The function of a summary is usually described as either indicative (to indicate whether the document is worth reading), informative (where all information in the document is captured at some level of detail), or critical (where an evaluative component is included in the summary). Summaries can also be judged on the dimension of coherence, with extractive summaries often less coherent than abstracts. While summaries are normally mono-lingual (that is, the summary is in the same language as the source document(s)), there is also the possibility that the source document(s) are summarised into a different language (that is, they are translated as part of the summarisation process). The need for automatic summarisation, while present through the late 1950s to late 1980s, only became a real pressure in the 1990s, with the exponential increase in online textual materials. Before the nineteen nineties, the obvious advantages of automatic summarisation over human summarisation — cost — was offset by summary quality concerns. Only when the volume of online text became overwhelming was significant work directed at improving basic methods which had moved on little since the work of L UHN (1958) and E D MUNDSON (1969). The high level architecture of a summarisation system is conventionally divided into three processing modules (M ANI, 2001a; M AYBURY and M ANI, 2001; S P ÄRCK J ONES, 1999; H OVY, 2003); analysis, transformation and generation. 1. Analysis — the initial stage, where a representation of the source document is constructed. 2. Transformation — the internal representation is manipulated to produce a representation of the summary. This module is most important in abstract summaries. Extractive summaries tend to conflate stages 1 and 2. 3. Generation — a natural language output is generated from the representation of the summary. In the context of automatic summarisation, the term “shallow processing” is 66 C HAPTER 3: R EVIEW OF R ECENT C OMPUTATIONAL W ORK usually used to refer to extractive systems, which, as discussed above, output sentences from the source document. To use the distinction made by S P ÄRCK J ONES (1999), extractive systems extract text rather than facts from a source document. In the three part summarisation architecture detailed above, systems that employ shallow processing conflate the first two stages and move directly to generation (stage 3). The very earliest attempts at automatic summarisation were shallow, relying on word frequency counts (L UHN, 1958). Later attempts, using word frequency counts augmented with some corpus evidence (E DMUNDSON, 1969), formed the framework for summarisation for many years (M ANI, 2001a) . 3.2.2 Multiple Document Summarisation Multi-document summarisation (MDS) is an extension of the traditional summarisation task, inheriting all the older task’s requirements, but bringing its own special problems. 
If summarisation can be defined as "a reductive transformation of source text to summary text through content reduction by selection and/or generalisation of what is important in the source" (Spärck Jones, 1999), then, extending Spärck Jones's (1999) definition, MDS can be described as a reductive transformation of source text to summary text (where "source text" is a collection of related documents) through content reduction by selection and/or generalisation of what is important in the source, while removing redundancy and possibly flagging inter-document differences and similarities. To put this another way, an MDS summariser can be seen as a particular kind of summariser, with the following distinguishing characteristics:

It takes as input a collection of related documents.

It removes redundancy (that is, repetitive information) from the summary. For example, news articles covering the same event will (probably) contain a great deal of repetitive information that ought only to appear once in an output summary.

It flags cross-document differences and similarities. For example, discrepancies between news articles can be flagged.

The nature of the MDS task, as well as some of the distinctive applications in which it might be employed (for instance, online news article summarisation versus traditional abstracting of single scientific papers according to a template), makes certain demands in addition to those explicit in the definition:

MDS can accept any number of documents greater than 1. It might be more appropriate to use different methods for summaries based on different sizes of source document collection. For example, a summary based on a two document collection may best be tackled using a radically different method from a summary based on a five hundred document collection.

Compression needs to be greater with MDS. For example, for documents of a fixed size (say two hundred sentences) a ten percent summary of a single document is twenty sentences long. A ten percent summary of a five hundred document collection — with documents of the same length again — is ten thousand sentences long; not a very useful summary. To gain a summary twenty sentences long from the five hundred documents, the compression rate would have to be 0.02 percent. This kind of compression rate would be very difficult to achieve with standard single document extractive or abstractive techniques.

Cross-document co-reference is an inescapable problem in MDS. Simple extractive techniques that sidestep co-reference are inadequate for MDS due to the need for high compression.

Redundancy of information is perhaps the central problem in MDS. We can see the extent of this problem if we imagine a naive attempt at MDS that involves simply feeding each source document through a single document summariser, and concatenating the results. The product of this process would be both extremely long — which is clearly of limited value in a summary — and highly repetitive. In order to produce a viable and useful summary, it is essential to remove some of this repetition, or at least minimise it to (application specific) acceptable levels. Mani (2001a) suggests four signals that indicate repetition when comparing two text elements,10 and where potential for eliminating elements exists:
1. Semantic equivalence: Two elements have exactly the same meaning (paraphrases).
2. Informational equivalence: Two elements contain the same information.
This is weaker than semantic equivalence.
3. Information subsumption: Text element A subsumes text element B if the information in B is contained in A.
4. String identity: Two elements consist of exactly the same string.

10 Text elements are sentences or clauses.

While techniques have been developed to identify redundancy, research on identifying and flagging inter-document differences is less well developed. An exception is Radev (1999), who has developed a system of twenty four relations that provides a structure for flagging differences (for example, CONTRACTION, REFINEMENT, and so on). Mani (2001a) suggests a general architecture for MDS systems consisting of five modules:
1. Selection: select text elements from the document collection using standard summarisation approaches.
2. Matching: match the extracted text elements to identify and remove redundancy.
3. Salience: select the most salient elements, then rank and output according to compression rates.
4. Reduction: use aggregation to reduce the text elements further and condense non-redundant information.
5. Generation: output the final summary, using natural language generation techniques.

While MDS is a well developed area of research, comparatively little work has been done in specifically biographical MDS. The usefulness of a functional biographical summariser when, for instance, quickly producing succinct reports on named individuals from news articles is clear (for example, McKeown et al. (1999)). While it is currently not possible to produce a biography of the same quality as a professionally written and published work, automatic biographical summarisation does hold out the possibility of speed gains when compared to humans sifting through large document collections. Great quantities of data can be quickly filtered to provide a useful and informative summary, where significant events and facts about a person's life are culled from source documents and presented in an orderly and appropriately succinct manner. Several attempts at building biography orientated MDS systems have been made in recent years; notable efforts (reviewed below) include the New Mexico system (Cowie et al., 2001), the Mitre/Columbia system (Schiffman et al., 2001), the Southern California system (Zhou et al., 2004) and the Southampton system (Kim et al., 2002), which differs from the preceding three as it is not presented within a biographical MDS framework.

3.2.3 New Mexico System

The system described by Cowie et al. (2001) is designed to aid in the classic information retrieval situation, where a user is searching for information about, say, Tony Blair and is engulfed by a huge quantity of hits (Google returns 788,000 hits for the string "Tony Blair"), many of which only mention Tony Blair incidentally. Even given that the information retrieval results are all highly relevant, the user is still required to read and synthesise a potentially huge number of documents to satisfy their information need. The problem is exacerbated when the results are in a language unfamiliar to the user. The time and effort required to construct a summary of the target's career is very high. Cowie's system aims to resolve these problems by automatically producing a "personal profile" (biography) for the query term from the retrieved documents (see Figure 3.4 on page 72).
The system outputs a personalised profile consisting of a chronologically ordered list of events, with links to source documents, quickly enough for the system to be used in “real time”. The system is designed as a three module pipeline, consisting of an information retrieval stage, a summarisation stage, and a merging and output stage: 1. Information retrieval — A collection of documents concerned with a given individual is gathered using standard information retrieval techniques. The documents are automatically filtered to exclude those not in the designated languages (English, Spanish or Russian). The user is given the opportunity to filter out obviously inappropriate documents. Those documents that are only incidentally related to the query person, or those documents that refer to a person who shares a name with the target, can be filtered at this stage. For example, if we are seeking to produce a biographical summary of Tony Blair, British Prime Minister, we are unlikely to be interested in documents pertaining to Tony Blair, New Zealand Cafe owner. 2. Summarisation — For each document in the collection, find a date for the document, select the most relevant chunks of text and determine a date to associate with each chunk (the default document date if no explicit date reference is made). If the source language is not English, translate to English (retaining both the text chunk date and a reference to the source document). 3. Merging and output — Each of the translated extracted text chunks is arranged in chronological order and output in HTML format with links to the respective source document. The system uses a query guided standard statistical text summarisation technique to extract text chunks from each document, that is, the summarising system is positively biased towards those sentences that contain the query term. The process follows six steps: 1. Language recognition — The system determines the source language using a character based gram model. 2. Low level text processing — HTML tags and extraneous text are stripped from the document. 3. Tokenisation – Sentence, word and paragraph tokenisation. 4. Dating — Get document date (normally this is either at the beginning or end of the text). 5. Sentence ranking — Each sentence is scored and ranked (in the original language) biased by the query term. The system tries to determine a 70 C HAPTER 3: R EVIEW OF R ECENT C OMPUTATIONAL W ORK date for the sentence using simple pattern matching. If a date cannot be determined, then the default document date is used. 6. Translation — If the source document is other than English, the extracted text chunks are translated to English and the dates transformed to a standardised format. The work is partly based on McKeown’s work on summarising medical literature for health professionals (M C K EOWN, 1998); work which was later generalised and extended to new domains (M C K EOWN ET AL ., 1999). Both M C K E OWN (1998) and M C K EOWN ET AL . (1999) emphasise the inadequacies of “key sentence” summarisation techniques when the aim is to produce one summary of multiple documents, rather than a single summary per document. The single document technique risks producing a highly repetitive (hence lengthy) summary that may at its most extreme present the same problems as a simple search engine query. The unpredictable nature of the World Wide Web news sources presents special challenges. 
Unlike the medical domain, where journal articles are highly conventional — often with a labelled conclusion flagging relevant sentences — news articles from disparate sources exhibit sometimes radically different styles and formats. It follows that document structure or formatting cannot be used as a scaffolding mechanism for the identification of relevant sentences. Further, it cannot be assumed that all Web documents are created equal. The system needs some way of assessing the relative authority (reliability, trustworthiness) of the document or document creator.

Cowie suggests a number of enhancements to the system.11

Cross document co-reference — As the query term is a simple name string, if the term is ambiguous between two (or more) well known people with that name, then we will have a contaminated (noisy) output.12 This problem is partially addressed in the system by including domain specific terms in the original IR query (for example, "Tony Blair" + politics). This mechanism is however potentially inadequate when dealing with individuals associated with more than one domain.

Establishing dates — The current system assumes that the date of publication is the date of any extracted sentence, unless a date is specifically referred to in the sentence. The addition of even simple temporal reasoning, able to manage terms like "yesterday" or "next week" in relation to the base date, would be a significant improvement.

Merging — The straightforward outputting of sentences could be enhanced to cope with potentially repetitive entries. Additionally, entries that are contradictory could be flagged at this stage as worthy of investigation by the user.

11 Personal communication.
12 Cowie uses the example of Berezovsky the politician versus Berezovsky the musician. Another obvious example is Freud the psychoanalyst and Freud the artist.

Figure 3.4: New Mexico System (Cowie et al., 2001).

The fact that encouraging (though unevaluated) results have been obtained using fairly simple techniques shows that automatic biography extraction from the World Wide Web is a viable technique and could be greatly improved by tailoring standard techniques to the specific task.

3.2.4 Mitre/Columbia System

The system — described in Schiffman et al. (2001) — uses corpus statistics and linguistic knowledge at different points in processing. It concentrates on selecting appropriate biographical descriptions from the source documents and removing redundancy, rather than producing a coherent summary. No attempt is made at temporally ordering the material selected for the output summary, and the final presentation uses canned text generation methods.

Figure 3.5: Sample Output from Schiffman et al. (2001).
EXAMPLE 1: Vernon Jordon is a presidential friend and a Clinton adviser. He is 63 years old. He helped Ms. Lewinsky find a job. He testified that Ms. Monical Lewinsky said that she had conversations with the president, that she talked to the president. He has numerous acquaintances, including Susan Collins, Betty Curries, Pete Domenici, Bob Graham, James Jeffords and Linda Tripp. (1,300 documents; 707,000 words; 607 Jordan sentences; 78 extracted sentences; 2 groups: friend, adviser.)
EXAMPLE 2: Victor Polay is the Tupac Amaru rebels' top leader, founder and organization's commander-and-chief. He was arrested again in 1992 and is serving a life sentence.
His associates include Alberto Fujimoiri, Tupac Amaru Revolutionary and Nestor Cerpa. 73 documents 38,000 words 24 Polay sentences 10 extracted appositives 3 groups: leader, founder and commander-in-chief. The system is basically extractive, with some post extractive smoothing and merging to improve coherence and reduce redundancy. Figure 3.5 gives two examples of system output, both examples are reproduced from S CHIFFMAN ET AL . (2001). Selected output descriptions have been “strung” together using a canned text generation system. Consider Figure 3.5, Example 1. The system used a newswire corpus of 13,000 documents concerned with the Clinton impeachment proceedings — the Clinton corpus — the corpus contains 607 sentences mentioning Vernon Jordan explicitly, and 82 descriptions, 78 of which are appositives (discussed below) and four relative clauses. Additionally, 65 sentences where Vernon Jordan is an (again, explicitly named) deep subject are present — although these are not used in further processing. Apposition is conventionally defined as a relation where a phrase or word appears next to a phrase or word of the same kind. The term is most frequently used to describe the relation between juxtaposed noun phrases (for example, “I’ve lost my dog, Wilbur”). It is clear that this relation can be exploited to gather information about named persons. Although not all the existing facts can be picked out by apposition, those that are should be reliable (low recall, 73 C HAPTER 3: R EVIEW OF R ECENT C OMPUTATIONAL W ORK but high precision in the language of information retrieval). Typical journalistic practices are useful here, with their traditionally heavy stylistic reliance on appositive phrases (for example “The US President, George Bush” or “The British Chancellor, Gordon Brown”). Appositive phrase detection (implemented with Finite State Automata) carry most of the extractive burden in the system. Relative clauses modify the head of a noun phrase, typically using a pronoun which shares a referent with that head (for example, “. . . who helped Lewinsky find a job”).13 If relative clauses can be identified accurately for named individuals, it is again clear that these can be harnessed for extracting relevant biographical facts (for example, “Gordon Brown, who became chancellor in 1997”). These two relatively shallow recognisers — implemented using finite state techniques — allow biographical information to be harvested from the document. Relying on the intuition that an important fact will be mentioned multiple times over a large document set, it seems likely to appear in an appositive description (or relative clause). Corpus statistics are used at several points in processing to help identify and rank the most suitable appositive descriptions for summarisation. For example, appositive phrases are clustered and ranked by analysing the corpus frequency of their head nouns. Additionally, the system utilises linguistic resources (in the form of WordNet) at several points in the processing. For example, when merging redundant extracted appositive phrases, if head nouns from two appositive phrases share a common parent below P ERSON in the WordNet concept classification system, then only the phrase containing the most frequent head noun is retained. This use of disparate resources justifies the claim that the system combines linguistic knowledge and corpus derived statistics. 
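The appositive patterns described above can be approximated, very crudely, in a few lines of code. The following Perl sketch is not the finite state machinery of Schiffman et al. (2001), which operates over part-of-speech tagged and name-tagged input; it merely illustrates the shape of a post-modifying appositive match over plain text, and the comments note an obvious case it misses.

#!/usr/bin/perl
# Deliberately crude appositive matcher for patterns of the form
# "Name, the description,". Illustration only; real systems use FSAs over
# tagged input and handle pre-modifying appositives as well.
use strict;
use warnings;

my @sentences = (
    "Gordon Brown, the British Chancellor, presented the budget.",
    # A pre-modifying appositive ("The US President, George Bush") is not
    # caught by the pattern below, which is one reason tagged input helps.
    "The US President, George Bush, arrived yesterday.",
    "He painted with them in the Barbizon district.",   # no appositive
);

for my $sentence (@sentences) {
    # Capture "Capitalised Name, the/a/an description," spans.
    while ($sentence =~ /\b((?:[A-Z][a-z]+\s){1,2}[A-Z][a-z]+),\s+((?:the|a|an)\s[^,]+),/g) {
        print "PERSON: $1\n  APPOSITIVE: $2\n";
    }
}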
The system follows a pipeline architecture and is built around the extraction of appositive phrases featuring named persons. Processing has four stages: 1. Pre-processing — Every document in the collection is tokenised at the sentence and word level, part-of-speech tagged (using the Alembic tagger14 ) tagged for named entities (only person names) and then parsed using the CASS parser15 ). Finite state automata are then used to locate and extract appositive phrases (these automata are designed to match both pre and post modifying appositive phrases, for example “Current US president George Bush” and “George Bush, the US President”). Additionally, 13 It is not clear whether S CHIFFMAN ET AL . (2001) distinguishes between person orientated restrictive relative clauses (for example, “The man who you see”. Note how the clause helps identify the referent) and person orientated non-restrictive relative clauses (for example, “Sarah, who got the job, was very happy”. Note how the clause provides additional information about the referent.) It seems likely that the non-restrictive type of relative clause is easier to identify heuristically. 14 http://www.mitre.org/technology/alembic-workbench/manual/AWB-content.html Accessed on 02-01-07 15 http://gross.sfs.nphil.uni-tuebingen.de:8080/release/cass.html Accessed on 02-01-07 74 C HAPTER 3: R EVIEW OF R ECENT C OMPUTATIONAL W ORK relative clauses are picked out at the pre-processing stage, but do not contribute to the final summary. 2. Cross-document co-reference — A named entity recogniser was used to sort the named person references from each document into bins (a bin for each distinct person). The program uses linguistic knowledge (special rules about name abbreviations) and a statistical measure of similarity of the immediate window of surrounding words between named persons with potentially identical referents. The output of this cross document co-reference stage of processing is that the set of extracted descriptions for each distinct person are grouped together. 3. Appositive processing — At this stage, there are already a set of (perhaps highly repetitive) appositive descriptions for each distinct individual. Appositive processing has several steps: The first stage of appositive smoothing involves removing duplicate descriptions, which given the large document collection, and the likely repetitive nature of competing news articles, could remove a high proportion of occurring phrases. Only one copy of a duplicated phrase needs to be retained. The second stage identifies errors in pre-processing by identifying phrases that do not seem to have a person as head. Even the most reliable named-entity taggers will make mistakes; identifying companies as people, and so on. The system employs a novel “person typing” program for identifying erroneous appositives. The program relies on WordNet for linguistic knowledge and implements the following rule for distinguishing between those appositive phrases that have persons as heads, and those which do not: A string (that is, head noun of an appositive phrase) refers to a person if at least 35% of senses of that string are descended from the synset for PERSON in WordNet. The third stage involves removing redundancy; a deep problem for all MDS systems but especially pertinent when dealing with journalistic material where multiple articles describe the same incident. 
Redundancy, as mentioned previously, occurs when multiple nonidentical strings contain the same information, or one string subsumes the information in another string. Again, the system uses WordNet and corpus statistics to help identify and merge repetitive descriptions. Additionally, a further stage of processing occurs when duplicates are eliminated and modifiers conjoined (for example, “British Prime Minister” and “Member of Parliament for Sedgefield” might be conjoined as “British Prime Minister and Member of Parliament for Sedgefield”). 75 C HAPTER 3: R EVIEW OF R ECENT C OMPUTATIONAL W ORK 4. Generation: Ranked descriptions are selected according to the desired compression rate, and a shallow canned text generation technique is used to “string” them together, resulting in occasionally incoherent, but intelligible output. While the system is successful in meeting its stated aims in that output text is relevant and coherent, several improvements could easily be made: Organising output in temporal order. No attempt is made at maintaining temporal order, an important requirement for conventional biographies that are expected to follow linear narrative conventions (see Section 2.3 on page 30). This temporal incoherence is a potential cause of confusion to a reader expecting linear narrative. A naive time stamping method (especially in journalism, where publication dates are usually flagged in conventional ways) could be one simple way of ordering output. Additionally, temporal information could form part of the ranking criteria for outputting descriptions, with more recent information having a higher weighting than older news. Linking descriptions to source documents. The extract smoothing and merging processes and the fact that reference to source documents are not retained, means that a user cannot easily locate the context or source of an extracted description. This could be important in a situation where a description is unusual, surprising or important and the reliability (or authority) of a source document needs to be determined. In a web based context, a link to the source document would be most appropriate, other summarisation contexts would require their own referencing methods. Mechanisms for resolving disagreement. There is no clear method to resolve and flag disagreement between descriptions. 3.2.5 Southern California System This biographically orientated MDS system developed by Z HOU ET AL . (2004) at the University of Southern California is highly relevant to the current thesis. As part of the development of the Southern California system, a biographical corpus was created (described in Section 4.3.6 on page 94). The system architecture is outlined in Figure 3.6 on the following page and can be divided into five steps: information retrieval, identification of biographical sentences, sentence merging, ranking of sentences and redundancy removal. The information retrieval stage of processing identifies those documents that contain a given person’s name (the target person), from a large document collection. 76 C HAPTER 3: R EVIEW OF R ECENT C OMPUTATIONAL W ORK Figure 3.6: System Architecture for Z HOU ET AL . (2004) MDS system. 
IR on person's name → identify biographical sentences → merge biographical sentences → rank biographical sentences → remove redundancy / restrict length → output biography.

The identification of biographical sentences stage takes as input those documents containing the target person's name, splits the documents at the sentence level, and classifies each sentence as belonging to one of ten categories: bio (birth dates, death dates, and so on), fame factor, personality, personal, social, education, nationality, scandal, work and finally none, which is simply the absence of a biographical category. The classification scheme is based on a "checklist" approach, with every biography made up of sentences selected from each of the nine biographical categories. Various sentence representations were chosen for input to the sentence classification module, including frequent n-grams. The Naive Bayes classifier was trained and evaluated on the biographically marked up corpus. The representation chosen for the final system architecture is not specified.

The next stage of processing is the merging of biographical sentences. All sentences containing a variant of the target person's name are added to the store of biographical sentences. Those sentences identified by the classifier that are direct quotations, or are less than five words long, are discarded, along with any duplicate sentences. After the sentences have been merged, the result is a list of distinct biographical sentences. The task is now to rank them according to their importance, and a variant of inverse document frequency was used to achieve this. Due to the potentially high degree of repetitive information characteristic of the MDS task (with multiple – perhaps very similar – input documents) there is a need for a redundancy removal filter. The system used a method outlined in Marcu (1999) to remove sentences until the desired compression is achieved.

The system's output does not reflect the ten way classification system developed at the classification stage; instead, a two way (biographical and non-biographical) classification system was adopted, with each of the original biographical categories (education, work, scandal and so on) subsumed in a single biographical category. Note that the non-biographical category is unchanged in the transition to a binary scheme. One way of further developing the system, suggested by the authors, is the utilisation of the fine grained classification scheme in the output summary. For example, general biographical information (birth dates, death dates, and so on) can be output in the first sentences, followed by fame factor sentences (explaining why the target subject is notable) and then more detailed information. Additionally, this kind of biographical summarisation system can be tailored according to user interest (for example, a user may be interested in the educational background of the target individual, and could request that a summary contain only this kind of information). From the point of view of this thesis, the most important element of the Southern California system is the experimental work on feature representation, which is explored, using the corpus data created by Zhou et al. (2004), in Chapter 10 on page 177.

3.2.6 Southampton System

The ARTEQUAKT system (Kim et al., 2002) uses a combination of linguistic resources, information extraction technology and knowledge engineering to produce user tailored biographies of artists.
These biographies are not designed to rival human created attempts in terms of quality, but are designed to be coherent and useful. In contrast to other biography creating systems examined here, the creators of the ARTEQUAKT system do not locate their work in a summarisation framework but rather consider it a generation system: generating biographies from knowledge extracted from a collection of documents rather than producing a biographically orientated MDS summary of those documents. While ARTEQUAKT shares its information extraction driven approach for acquiring facts from documents with self-described MDS summarisation systems (for example McKeown et al. (2002)), ARTEQUAKT's reliance on a rich ontology (beyond ad hoc use of WordNet for hyponym, synonym, and hypernym information) is the basis for its claim to be a biographical generation system.

From a user point of view, the ARTEQUAKT system employs a web interface. The system allows the user to tailor the biography produced along several parameters: artistic style, the painter's family background, their influences, and the extent of the user's interest in their paintings. Example output is given as a 150 word summary in Figure 3.7.

Figure 3.7: Sample Output from ARTEQUAKT System (Kim et al., 2002)
French Impressionist painter, born at Limoges. In 1854 he began work as a painter in a porcelain factory in Paris, gaining experience with the light, fresh colours that were to distinguish his Impressionist work and also learning the importance of good craftsmanship. His predilection towards light-hearted themes was also influenced by the great Rococo masters, whose work he studied in the Louvre. In 1862 he entered the studio of Gleyre and there formed a lasting friendship with Monet, Sisley, and Bazille. He painted with them in the Barbizon district and became a leading member of the group of Impressionists who met at the Cafe Guerbois. His relationship with Monet was particularly close at this time, and their paintings of the beauty spot called La Grenouillere done in 1869 (an example by Renoir is in the National museum, Stockholm) are regarded as the classic early statements of the Impressionist style.

Processing is divided into three modules: knowledge extraction, information management, and narrative generation. Each of these three modules is described below:

1. Knowledge extraction — This module strips factual information from documents returned by a search engine (with an artist's name as the search engine query). Instead of using a cascade of Finite State Automata — the standard technique in information extraction approaches (Cowie and Lehnert, 1996) — ARTEQUAKT uses syntactic and semantic analysis to carry the extraction load. The system designers wished to develop a general purpose knowledge extraction method that did not require either human designed extraction rules, or the extensive corpus annotation required for the Machine Learning of extraction rules (Cardie, 1997). The designers did consider using newer, more adaptive Machine Learning techniques (Ciravegna, 2001; Wilks and Catizone, 1999; Yangarber and Grishman, 2000) but the need to identify a suitable corpus — clearly non-trivial in the case of unpredictable search engine results — meant that a deeper approach was adopted.
The output from the knowledge extraction stage of processing is an XML formatted document representing extracted facts, the source sentences for the extracted facts, and the original document text and URL. 2. Information management — The second stage of processing involves adding the extracted facts to a knowledge base of extracted facts. It is important to emphasise that in contrast to other systems considered, A RTEQUAKT stores its knowledge and generates biographies from this stored knowledge. This knowledge is built in two stages. First, by populating a special purpose ontology from the extracted facts. Second, by running a set of error checking routines over the newly constructed entities, checking for obvious duplications. In the context of the system, “ontology” is used to refer to “conceptualisation of a domain into machine readable format” (K IM ET AL ., 2002). The ontology used is based on the Conceptual Reference Model (CRM) – an ontology for representing the world of cultural artefacts (their location, owners, and so on)16 – extended to cover the domain of artists and their lives. The output of the extraction phase of processing — an XML file with tags mapping to classes in the ontology — is parsed and used to populate the knowledge base. The populated knowledge base is then checked for duplicate information and merging opportunities. 3. Narrative generation — The final stage of the process is narrative generation, where a biographical story is generated using facts and relations stored in the knowledge base and user tailored templates are used to organise the output. The system designers use human written biographical templates. For example, a template geared toward producing biographies that focus especially on artistic influences would include an “influences” entry. The decision to use this method is — as reported by the system designers – grounded in narrative theory (B AL, 1985), where a narrative is based on a sequence of events (a story), which in turn is based on a collection of facts and events (a fabula). It could however, equally well derive from traditional reliance of Artificial Intelligence on 16 http://cidoc.ics.forth.gr/index.html Accessed 80 on 02-01-07 C HAPTER 3: R EVIEW OF R ECENT C OMPUTATIONAL W ORK script-like representations (S CHANK and A BELSON, 1977).17 The biographical templates used are written in XML, using the Fundamental Open Hypertext Model (FOHM) (M ILLARD ET AL ., 2000), a standard for representing hyper-media. The biographical templates are constructed from sub-structures, called “sequences”. These sequences are made up of queries (expressed in terms of the knowledge base categories) that must be inserted in the generated biographies, respecting the sequence order. Queries are put to the knowledge base and their responses (often in the shape of URL’s to complete source sentences) are inserted in the summary. While the A RTEQUAKT system is not fully implemented, some simple example output is available and the system seems successful in meeting its limited aims (although no evaluative work has been attempted). There are however, several problems with extending the A RTEQUAKT approach to different applications: 1. The limited domain of the system is potentially a problem; its reliance on a handmade ontology and biographical templates (specific to artists) has serious implications for portability. The system could be extended to manage other groups, but only after an intensive knowledge engineering effort. 
It is possible that biographical templates could be developed for politicians, business leaders, and so on, but it is not clear that the overall quality of biographies produced by such a system would justify the extra knowledge engineering cost. 2. The rigid separation of the knowledge base from the information retrieval stage could result in a failed user query — say for an obscure artist — when highly relevant information existed, but had not found its way into the knowledge base. 3. The use of fixed biographical templates, while providing a structured narrative (providing, that is, that information is present in the knowledge base), does not cater for important biographical information that the system designers had not anticipated. For instance, Da Vinci’s engineering and medical achievements rival his achievements as an artist, yet a fixed template — which lacks “engineering achievement” and “medical achievement” properties — runs the risk of excluding these biographically important facts. It is often the extent to which a person deviates from the template that makes them biographically interesting. 17 Biographies are suitable for this kind of representation. All lives follow a similar script if viewed from sufficient distance. 81 C HAPTER 3: R EVIEW OF R ECENT C OMPUTATIONAL W ORK 3.2.7 Other Relevant Work Considerable research effort has been expended on the construction of biographical summaries in recent years under the auspices of the United States government sponsored Document Understanding Conference18 (DUC) and the Text Retrieval Conference19 (TREC). Competing groups are given several tasks and scored on their system’s performance in the expectation that this kind of focused competition and standardised evaluation of results will quicken the pace of research. Task 5 in DUC-200420 required that the competing systems generated answers to the question “Who is X?” from a collection of documents, where X can refer to an individual or, additionally a group of people. An example system from DUC-2004 is B LAIR -G OLDENSOHN ET AL . (2004). The system, developed at Columbia University, relies heavily on definitional predicates to identify relevant sentences (for example, “X is a Y”). Clauses containing these definitional predicates are identified by two methods: 1. Text categorisation based on machine learning techniques. 2. Finite State Automata based on corpus derived patterns. Under the TREC umbrella, there are several different tracks, one of which is question answering (QA). One of the QA task is the answering of definitional questions. Definitional questions are of the form “What is X?” or “Who is X?” Examples of definitional questions are given by V OORHEES (2001) in an overview of the QA track of TREC-2003. They include “Who is Colin Powell?” and “What is mould?” An example system from TREC-2003 is G AIZAUSKAS ET AL . (2003), which uses a collection of fifty patterns (for example, “X is a Y”). This section has reviewed some recent biographically orientated systems, from the rich ontology driven A RTEQUAKT system (K IM ET AL ., 2002), through information retrieval based MDS approaches (C OWIE ET AL ., 2001) and linguistic and corpus based MDS approaches (S CHIFFMAN ET AL ., 2001; Z HOU ET AL ., 2004) to the pattern matching approach favoured by DUC and TREC conference entrants focussing on definitional questions. There are a number of different tasks subsumed under the biographical summarisation heading. 
The definitional questions of the DUC and TREC competitions are designed to “pick out” general purpose descriptions of (“X is a Y”), unlike, say the A RTEQUAKT system which uses an ontology to populate a database of particularly significant facts relevant to an artist’s life (K IM ET AL ., 2002). The A RTEQUAKT system is domain specific and an extensive knowledge engineering effort would be required to transport the system to another domain, where different factors are important (for example, a politician’s career is likely to require a radically different ontology from that of an artist). These 18 http://www-nlpir.nist.gov/projects/duc Accessed on 02-01-07. on 02-01-2007. 20 http://duc.nist.gov/duc2004/tasks.html Accessed on 02-01-2007. 19 http://trec.nist.gov/ Accessed 82 C HAPTER 3: R EVIEW OF R ECENT C OMPUTATIONAL W ORK differences in the aims of the different systems make direct comparison between them difficult. One major difference between the systems considered and the current work is that the focus of the current work is on the identification of biographical sentences rather than the production of biographical summaries as issues of selecting and ordering sentences for coherence are less relevant. 3.3 Conclusion This chapter has analysed recent work in automatic genre classification and described functioning systems that produce biographies of named individuals from multiple texts. The next chapter outlines the methodology and resources used in the work, and the remaining chapters describe the research conducted as part of this project. 83 C HAPTER 4 Methodology and Resources This chapter describes the overall research methodology employed and the resources used in the thesis. The first section describes the methodology used. The second and third sections describe the resources used (software and corpora, respectively). 4.1 Methodology This section is designed to provide a bridge between the background chapters (1, 2, 3 and 4) and the research chapters (5, 6, 7, 8, 9 and 10). Here we describe how the human study and experimental work on automatic genre classification support the main hypothesis that biographical writing can reliably be identified at the sentence level using automatic methods. This claim has two sub-hypotheses: 1. Humans can reliably identify biographical sentences without the contextual support provided by a discourse or document structure. 2. “Bag of words” style sentence representations augmented by syntactic features provides a more effective representation for biographical sentence recognition than bag of words representations alone. We address hypothesis 1 by determining the agreement level achieved when a group of human participants classify a set of sentences as biographical or non-biographical. If the agreement level is high, this would indicate that humans are able to distinguish between biographical and non-biographical sentences without the aid of a supporting discourse structure. Hypothesis 2 is approached by comparing the performance of syntactic and “bag-of-words” features on a gold standard corpus of biographical and non-biographical sentences using the 10 x 10 cross-validation technique described in Section 2.5.2 on page 47. 84 C HAPTER 4: M ETHODOLOGY AND R ESOURCES Note that it is possible for hypothesis 1 to be true, and hypothesis 2 false (and vice versa). Additionally, it is possible that both sub-hypotheses are false, while the main hypothesis is true. 
For example, it could be shown that people are not able to reliably identify biographical sentences and also that "bag-of-words" style features perform better than syntactic features. The success of the main hypothesis is not dependent on the kind of sentence representation used ("bag-of-words" or syntactic), but rather on the existence of some sentence representation that provides good results in conjunction with a learning algorithm. Hypothesis 1 is designed to indicate whether biographical sentence classification is a task that can be reliably performed by humans. It remains possible that while humans may struggle to perform the biographical sentence classification task, there exists an automatic method that performs the task reliably. Hypothesis 2 claims that a particular type of sentence representation — a syntactic representation — is likely to be especially useful for biographical sentence classification. This claim is associated with the contention that biographical sentence classification is a genre classification task, and that features which capture the non-propositional content of sentences will be especially useful and enhance classification accuracy. Syntactic features are contrasted with "bag-of-words" style features, which have traditionally been used with some success in topical text classification. However, in examining representations for automatic sentence classification, this research is not confined by hypothesis 2, but also explores other sentence representations that may be useful (for example, the use of function words and "key-keywords"). The main hypothesis (and its two sub-hypotheses) provides a framework for the thesis, but other research questions are addressed within that framework (for example, the utility of the key-keywords methodology (see page 21) for identifying biographically relevant features).

The first hypothesis is addressed in Chapters 5 and 6. Chapter 5 describes the creation of a biographical annotation scheme and corpus. Chapter 6 — a human study — establishes that human beings can reliably identify biographical sentences (according to the scheme identified in Chapter 5). This is an important first step, as it shows that it is possible that biographical sentences can be identified without supporting discourse or document structure. This study also provides gold standard data for subsequent automatic classification experiments. The second hypothesis is broadly addressed in Chapters 7, 8, 9 and 10. Chapter 7 tests several different learning algorithms using the gold standard data and a standard feature set, in order to identify the learning algorithm which provides the best accuracy for the biographical classification task. Chapter 8 does not describe learning experiments, but rather outlines the feature sets used in Chapters 9 and 10. Chapter 9 compares the performance of different feature sets using the gold standard data, particularly focusing on hypothesis 2 (that is, whether syntactic features improve classification accuracy). Chapter 10 tests the biographically orientated feature set identified by Zhou et al. (2004) using the gold standard data, in order to discover whether their feature identification mechanism is exportable to other biographical data. The thesis concludes with a chapter detailing contributions made and possible directions for future research.
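As a concrete illustration of how the agreement required by hypothesis 1 might be quantified, the following Perl sketch computes Cohen's kappa for two annotators assigning biographical and non-biographical labels. It is a generic two-annotator measure with invented label sequences; the agreement statistic actually used, and the results obtained, are reported in Chapter 6.

#!/usr/bin/perl
# Generic two-annotator Cohen's kappa over biographical/non-biographical
# labels. Sketch only; the label sequences below are invented.
use strict;
use warnings;

sub cohen_kappa {
    my ($a, $b) = @_;    # two array refs of equal length
    my $n = @$a;
    my (%agree, %marg_a, %marg_b);
    for my $i (0 .. $n - 1) {
        $agree{ $a->[$i] }++ if $a->[$i] eq $b->[$i];
        $marg_a{ $a->[$i] }++;
        $marg_b{ $b->[$i] }++;
    }
    my %cats = map { $_ => 1 } keys %marg_a, keys %marg_b;
    my ($po, $pe) = (0, 0);
    for my $c (keys %cats) {
        $po += ($agree{$c}  || 0) / $n;                          # observed agreement
        $pe += (($marg_a{$c} || 0) / $n) * (($marg_b{$c} || 0) / $n);  # chance agreement
    }
    return ($po - $pe) / (1 - $pe);
}

my @annotator_1 = qw(bio bio non bio non non bio non bio bio);
my @annotator_2 = qw(bio bio non non non non bio non bio non);
printf "kappa = %.3f\n", cohen_kappa(\@annotator_1, \@annotator_2);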
4.2 Software The human study (described in Chapter 6 on page 124) did not require extensive software resources. The main work was in producing the content and marking it up in HTML for online use. The relevant web page was uploaded to a publicly accessible web server with limited scripting capabilities, from which participant responses were harvested. Several scripts were written in Perl to post-process the data into a form suitable for the R statistical programming environment.1 A script written in Perl (bio features)2 was developed as part of the project, and used to convert input sentences into various sentence representations (sentence vectors) in a format appropriate for the Weka machine learning environment.3 The program utilises several existing NLP tools, including standard implementations of sentence splitting algorithms, and a Perl part-of-speech tagging library based on the Penn Treebank tagging scheme. The Weka machine learning and evaluation environment was used extensively to evaluate the classification success of the various sentence representations (W ITTEN and F RANK, 2005). Weka provides a suite of machine learning algorithms in both GUI driven and command line (scriptable) environments. Weka also provides implementations of standard machine learning evaluation techniques. The system is written in Java, and is hence portable across all major platforms. Perl scripts were used to organise the computationally expensive multiple runs of the bio features script and Weka machine learning algorithms on a UNIX machine with enhanced RAM. Further scripts were written for data analysis purposes and to perform statistical tests. See Appendix G on page 273 for details of the Perl implementation of the corrected re-sampled t-test. 1 http://www.r-project.org Accessed on 02-01-07. 2 This code is available by contacting [email protected]. 3 Weka input files have the suffix “.arff” and are colloquially referred to as “ARFF files”. 4.3 Corpora In order to isolate the distinctive qualities of biographical text, it was necessary to identify two types of corpora: biographical corpora and multi-genre corpora. The biographical text used in the development of this work came from a variety of different sources. All the corpora used were published in English (British, American and New Zealand English) and are roughly contemporary (that is, all the corpora consist of modern English). The biographical corpora described in this section are: The Oxford Dictionary of National Biography, Chambers Biographical Dictionary, Who’s Who, The Dictionary of New Zealand Biography, Wikipedia biographies and a biographical corpus developed at the University of Southern California. The multi-genre corpora used in the experiments were the B ROWN corpus and the S TOP corpus. The TREC newstext corpus was also used, and is described in this chapter. Descriptive statistics for all the biographical corpora are presented in Table 4.1 on page 96. 4.3.1 Dictionary of National Biography The Oxford Dictionary of National Biography, of which a substantially re-written new edition was published in 2004 (OUP, 2004), is designed as a historical reference of notable British people: The Oxford DNB aims to provide full, accurate, concise and readable articles on noteworthy people in all walks of life, which present current scholarship in a form accessible to all. No living person is included; the Dictionary’s articles are confined to people who died before 31 December 2000. 
It covers people who were born and lived in the British Isles, people from the British Isles who achieved recognition in other countries, people who lived in territories formerly connected to the British Isles at a time when they were in contact with British rule, and people born elsewhere who settled in the British Isles for significant periods or whose visits enabled them to leave a mark on British life. OUP (2003) The original Dictionary of National Biography was published in instalments of sixty three volumes between 1885 and 1900, under the editorship of Leslie Stephen (and Sidney Lee from 1891). The complete dictionary was reissued in 1908-09, and supplements were published periodically between 1911 and 1996, when work on the new dictionary began in earnest (FABER and H ARRI SON , 2002). An XML encoded version of the original Dictionary of National Biography — the old DNB — was provided by the Oxford University Press for this study. The old electronic DNB contains the 1908-09 edition text, and the supplementary 87 C HAPTER 4: M ETHODOLOGY AND R ESOURCES data produced up to 1996. The entries follow a clear set of editorial guidelines. Each entry begins with the subject’s name, birth and death dates, and a brief description of their occupation. The remaining part of the biography concentrates on the subject’s achievements and importance. The length of the article is a reflection of the importance attached to the subject and is thus an editorial decision (FABER and H ARRISON, 2002). There is considerable variation in the lengths of biographical entries in the old DNB. See Table 4.1 on page 96 for descriptive statistics.4 A short example entry from the old DNB concerning Charles Babbage is given below: Babbage, Charles 1792-1871, mathematician and scientific mechanician, was the son of Mr. Benjamin Babbage, of the banking firm of Praed, Mackworth, and Babbage, and was born near Teignmouth in Devonshire on 26 Dec. 1792. Being a sickly child he received a somewhat desultory education at private schools, first at Alphington near Exeter, and later at Enfield. He was, however, his own instructor in algebra, of which he was passionately fond, and, previous to his entry at Trinity College, Cambridge, in 1811, he had read Ditton’s Fluxions, Woodhouse’s Principles of Analytical Calculation, and other similar works. He thus found himself far in advance of his tutors’ mathematical attainments, and becoming with further study more and more impressed with the advantages of the Leibnitzian notation, he joined with Herschel, Peacock (afterwards Dean of Ely), and some others, to found in 1812 the Analytical Society for promoting (as Babbage humorously expressed it) “the principles of pure D-ism in opposition to the Dot-age of the university.” The translation, by the three friends conjointly (in pursuance of the same design), of Lacroix’s Elementary Treatise on the Differential and Integral Calculus (Cambridge, 1816), and their publication in 1820 of two volumes of Examples with their solutions, gave the first impulse to a mathematical revival in England, by the introduction of the refined analytical methods and the more perfect notation in use on the continent. Babbage graduated from Peterhouse in 1814 and took an M.A. degree in 1817. He did not compete for honours, believing Herschel sure of the first place, and not caring to come out second. In 1815 he became possessed of a house in London at No. 5 Devonshire Street, Portland Place, in which he resided until 1827. 
His scientific activity was henceforth untiring and conspicuous. In 1815-17 he contributed to the Philosophical Transactions three essays on the calculus of functions, which helped to found a new, and even yet little explored, branch of analysis. He was elected a fellow of the Royal Society in 1816. He took a prominent part in the foundation of the Astronomical Society in 1820, and acted as one of its secretaries until 1824, subsequently filling the offices, successively, of vice-president, foreign secretary, and member of council. . . OUP (2004) 4 Queen Victoria’s entry in the Dictionary of National Biography is by far the lengthiest at one hundred thousand words. We can see from this (truncated) example how the entry adheres to the prescribed pattern: name, birth and death dates, occupation, and family background, before elaborating on the achievements of the subject (in this case, his numerous important publications and memberships). In addition to the bulk of the biography, which is descriptive, the entry has an evaluative element (for example, the description of his early education as “desultory”). The biography also adds source materials and references at the end of the article. 4.3.2 Chambers Biographical Dictionary The Chambers Biographical Dictionary (C HAMBERS, 2004) is a single-volume biographical dictionary from a British publisher. The dictionary aims to give single paragraph descriptions of people of historical or contemporary importance from a British perspective. However, biographical subjects are not exclusively British but drawn from a wider, international pool. Examples here include Jeff Bridges (US actor), Julian Barnes (British novelist) and Gu Ban (Chinese historian). The biographies are designed to provide a brief introduction rather than critical evaluation, and contain the stereotypically significant biographical information (that is, birth date, death date, occupation, location of birth). For this study, Chambers provided an XML encoded electronic copy of a subset of the dictionary (those names beginning with “B”). The example below shows the form of a typical entry: Babbage, Charles 1791-1871 English mathematician Born in Teignmouth, Devon, and educated at Trinity and Peterhouse colleges, Cambridge, he spent most of his life attempting to build two calculating machines. The first, the difference engine, was designed to calculate tables of logarithms and similar functions by repeated addition performed by trains of gear wheels. A small prototype model described to the Astronomical Society in 1822 won the Society’s first gold medal, and Babbage received government funding to build a full-sized machine. However, by 1842 he had spent large amounts of money without any substantial result, and government support was withdrawn. Meanwhile he had conceived the plan for a much more ambitious machine, the analytical engine, which could be programmed by punched cards to perform many different computations. The cards were to store not only the numbers but also the sequence of operations to be performed, an idea too ambitious to be realized by the mechanical devices available at the time. The idea can now be seen to be the essential germ of today’s electronic computer, with Babbage regarded as the pioneer of modern computers. He held the Lucasian chair of mathematics at Cambridge from 1828 to 1839. 
C HAMBERS (2004) 4.3.3 Who’s Who Who’s Who (B LACK, 2004) is a comprehensive, single volume collection of biographical sketches whose subjects are substantially connected to the United Kingdom. In contrast to the DNB, the biographical subjects in Who’s Who are all living (there is a companion volume — Who was Who — to cater for the deceased). The first volume was published in 1849, and has been reissued regularly (in recent times, every year) since that date. In contrast to the other biographical dictionaries considered here, Who’s Who is autobiographical (that is, the entries are written by the subjects). The biographical subject completes a form containing pertinent information, which is the primary source for the generation of the final biography. Certain groups of people are included by default. For example, Members of the British Parliament, High Court Judges, and certain subgroups of the British aristocracy. Other professional groups are — for example, sportspeople, artists, journalists, and so on — are selected for inclusion in the dictionary by a selection committee. The form of a Who’s Who entry is designed to provide information rather than evaluation; the format can seem rather schematic compared to the discursive style of the multi-volume DNB. An example entry is presented below: FRY, Stephen John, writer, actor, comedian. b. 24 Aug. 1957 of Alan John Fry and Marianne Eve (ne Newman). Education Uppingham Sch.; Queens Coll., Cambridge (MA). Career TV series: Blackadder, 1987-89; A Bit of Fry and Laurie, 1989-95; Jeeves in Jeeves and Wooster, 1990-92; Gormenghast, 2000; presenter, QI, 2003; Theatre: Forty Years On, Queen’s, 1984; The Common Pursuit, Phoenix, 1988; films: Peter’s Friends, 1992; I.Q., 1995; Wilde, 1997; Cold Comfort Farm, 1997; The Tichborne Claimant, 1998; Whatever Happened to Harold Smith?, 2000; Relative Values, 2000; Gosford Park, 2002; (dir) Bright Young Things, 2003. Columnist: The Listener, 1988-89; Daily Telegraph, 1990 Publications: Me and My Girl, 1984 (musical performed in West End and on Broadway); A Bit of Fry and Laurie: collected scripts, 1990; Moab is My Washpot (autobiog.), 1997; novels: The Liar, 1991; The Hippopotamus, 1994; Making History, 1996; The Stars and Tennis Balls, 2000 Recreations: smoking, drinking, swearing, pressing wild flowers. Address: c/o Hamilton, Ground Floor, 24 Hanway Street, W1P 9DD. Clubs: Savile, Oxford and Cambridge, Groucho, Chelsea Arts. B LACK (2004) 90 C HAPTER 4: M ETHODOLOGY AND R ESOURCES It can be seen from the example above that the biography consists of important dates, occupational information, career highlights (publications, television and film appearances), contact details, and a list of recreations. The presentation is obviously schematic, with the information set forth in list format under suitable headings. 4.3.4 Dictionary of New Zealand Biography The Dictionary of New Zealand Biography (U NIVERSITY OF A UCKLAND P RESS, 1998) is a collaborative enterprise between Auckland University Press and the New Zealand Department of Internal Affairs.5 There are two criteria for inclusion in the dictionary. First, the subject must have made a contribution to (or had an impact on) New Zealand’s development. Second, the subject must be dead. Rather like the Oxford DNB, the Dictionary of New Zealand Biography aims at a discursive, evaluative function. The three thousand entries have multiple paragraphs and are written in continuous prose. 
They are designed to supply the biographical basics, and also to provide context and evaluation. Below is an extract from an entry on one-time New Zealand resident Karl Popper: Karl Raimund Popper was born in Vienna, Austria, on 28 July 1902, the son of Simon Siegmund Carl Popper, a lawyer, and his wife, Jenny Schiff. He turned his inquisitive and enterprising mind to a variety of activities. He was apprenticed to a cabinet-maker, joined a youth organisation where he worked with delinquent adolescents, tramped in the Austrian mountains, taught himself mathematics and physics, and became active in political movements during the First World War as a socialist. Most significantly, he studied philosophy at the University of Vienna and was awarded a PhD in 1928. On 11 April 1930 he married Josefine Anna Henninger in Vienna; there were no children of the marriage. By that time he was an accomplished musician with a decided preference for classical music and had qualified as a schoolteacher. It was Popper’s practical and political interests that first directed him to the philosophy of science, because he realised that it was vital to be able to tell genuine knowledge from pseudo-knowledge and superstition. Having become acquainted with the dominant philosophical school in Vienna, known as the Vienna Circle, he concluded that its philosophy of science was deficient. It was based on the old Baconian inductionist theory that science involves making observations and generalising them into universal laws. This, Popper argued, explained neither how scientific thought actually proceeds, nor why we consider that its findings correspond to reality. He therefore formulated a counter-proposal, which he published in 1934 under the title Logik der Forschung; its English translation, The Logic of Scientific Discovery, was not published until 1959. On Popper’s account, science proceeds by formulating hypotheses. The criterion of a scientific hypothesis is that it generates predictions that are capable of being falsified by reference to empirical data. These hypotheses can never be conclusively proved true, so that scientific knowledge is necessarily provisional. This idea revolutionised the understanding of the nature and value of scientific knowledge. Albert Einstein read Logik in manuscript and applauded it vigorously. Popper’s radical revision of induction at once brought the philosophy of science into line with actual practice and provided an unprecedentedly convincing account of how science succeeds in arriving at knowledge about nature. With the advance of Nazism in Austria and the growth of anti-Semitism, Popper, who was of Jewish origin though not a practising Jew, decided to emigrate. In 1936 he learnt of an advertisement for a lectureship in philosophy at Canterbury University College, Christchurch, New Zealand. He applied and took up the position in early 1937. . . U NIVERSITY OF A UCKLAND P RESS (1998) 5 The dictionary can be accessed online at: http://www.dnzb.govt.nz Accessed on 01-02-07. The extract above shows the discursive nature of the New Zealand biographies. The style can be distinguished from the other dictionaries in that it is chronological in form, and does not conform to the inverted pyramid form characteristic of the Dictionary of National Biography. For instance, in the first sentence, although a birth date and location are provided, a death date is not. 
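Several of the dictionary corpora described above (the Dictionary of National Biography and Chambers in particular) open each entry with the subject's name, dates and occupation in a highly predictable pattern. The following is a minimal, hypothetical Perl sketch of how such an opening might be parsed into fields; the regular expression and field names are illustrative assumptions rather than the extraction code actually used in this work.

#!/usr/bin/perl
# Hypothetical sketch: extract name, dates and occupation from the
# schematic opening of a dictionary-style biographical entry.
use strict;
use warnings;

my $opening = 'Babbage, Charles 1792-1871, mathematician and scientific mechanician, was the son of ...';

if ($opening =~ /^(\w[\w\s-]*),\s+([\w\s.-]+?)\s+(\d{4})-(\d{4}),\s+([^,]+),/) {
    my ($surname, $forename, $born, $died, $occupation) = ($1, $2, $3, $4, $5);
    print "surname:    $surname\n";
    print "forename:   $forename\n";
    print "born:       $born\n";
    print "died:       $died\n";
    print "occupation: $occupation\n";
}

Applied to the old DNB entry quoted earlier, a pattern of this kind would recover the surname, forename, the years 1792 and 1871, and the occupation phrase; it is offered only to illustrate how schematic these openings are.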
4.3.5 Wikipedia Biographies A wiki is a user-editable web application.6 The traditional model of information provision on the web involves a service provider serving content to a community. In the case of wikis, however, the content is provided by the community itself, in the form of user-editable web pages. Any user can edit information on a wiki if they judge it to be incorrect or misleading, and any corrections made may in turn be corrected by another user. Wikipedia is an attempt to create a new, free, user-editable encyclopedia using methods loosely based on open source software development techniques. It was launched in 2001, and currently has 1,690,892 articles in English.7 Wikipedia classifies and organises biographies in a number of ways, including by death year and alphabetically by family name, along with numerous more whimsical categories.8 For this work, the biographies of those who died in the first three years of the twenty-first century were used, yielding 959 biographies.9 6 http://en.wikipedia.org/wiki/Wiki Accessed on 02-01-07. 7 http://en.wikipedia.org/wiki/Wikipedia%3AAbout As of 01-03-07. In line with the previous examples, an example biography of the nineteenth-century scientist and mathematician Charles Babbage is reproduced below: Charles Babbage (26 December 1791 – 18 October 1871) was an English mathematician, analytical philosopher, mechanical engineer and (proto-) computer scientist who originated the idea of a programmable computer. Parts of his uncompleted mechanisms are on display in the London Science Museum. In 1991, working from Babbage’s original plans, a difference engine was completed, and functioned perfectly. It was built to tolerances achievable in the 19th century, indicating that Babbage’s machine would have worked. Nine years later, the Science Museum completed the printer Babbage had designed for the difference engine; it featured astonishing complexity for a 19th century device. Charles Babbage was born in England, most likely at 44 Crosby Row, Walworth Road, London. A blue plaque on the junction of Larcom Street and Walworth Road commemorates the event. There was a discrepancy regarding the date of Babbage’s birth, which was published in The Times obituary as 26 December 1792. However, days later a nephew of Babbage wrote to say that Babbage was born precisely one year earlier, in 1791. The parish register of St. Mary’s Newington, London, shows that Babbage was baptised on 6 January 1792. Babbage’s father, Benjamin Babbage, was a banking partner of the Praeds who owned the Bitton Estate in Teignmouth. His mother was Betsy Plumleigh Babbage. In 1808, the Babbage family moved into the old Rowdens house in East Teignmouth, and Benjamin Babbage became a warden of the nearby St. Michael’s Church. His father’s money allowed Charles to receive instruction from several schools and tutors during the course of his elementary education. Around age eight he was sent to a country school to recover from a life-threatening fever. His parents ordered that his “brain was not to be taxed too much” and Babbage felt that “this great idleness may have led to some of my childish reasonings.” He was sent to King Edward VI Grammar School in Totnes, South Devon, a thriving comprehensive school still extant today, but his health forced him back to private tutors for a time. He then joined a 30-student academy under Reverend Stephen Freeman. 
The academy had a well-stocked library that prompted Babbage’s love of mathematics. He studied with two more private tutors after leaving the academy. Of the first, a clergyman near Cambridge, Babbage said, “I fear I did not derive from it all the advantages that I might have done.” The second was an Oxford tutor from whom Babbage learned enough of the Classics to be accepted to Cambridge. . . http://en.wikipedia.org 8 Categories include “professional cyclists who died during a race”, “famous left handed people” and “people known as The Great”. 9 http://en.wikipedia.org/wiki/Lists of people Accessed on 02-01-07. Unlike the previously described biographical corpora, Wikipedia entries are not subject to a strict editorial policy; nevertheless, most adhere to the intuitively plausible “biographical pyramid” idea (see Figure 3.2 on page 57), with name, location of birth, and birth and death dates in the first sentences. They are also homogeneous in the length of the opening summary paragraph, indicating that although no externally imposed editorial policy is in force, the expectations of the Wikipedia community push writers towards “model” biographies. 4.3.6 University of Southern California Corpus The University of Southern California Corpus (USC Corpus)10 is a small biographical corpus consisting of 130 multi-paragraph biographies (the average number of words being 1339 per biography). Unlike the other corpora considered here, the USC Corpus focuses on only ten named biographical subjects (Curie, Edison, Einstein, Gandhi, Hitler, King, Mandela, Monroe, Picasso, and a group of people, the Beatles). The biographies are harvested from various online biographical websites and contain much repetitive, redundant information. Another distinctive feature of the corpus is that it has been tagged for biographically relevant clauses throughout. Details of the annotation scheme used are provided in Section 5.1.2 on page 103. An extract from one of the ten biographies of Martin Luther King included in the USC corpus is reproduced below: Martin Luther King, Jr., bio (January 15,1929-April 4, 1968) /bio bio was born /bio Michael Luther King, Jr., but later had his name changed to Martin. His grandfather began the family’s long tenure as pastors of the Ebenezer Baptist Church in Atlanta, serving from 1914 to 1931; his father has served from then until the present, and from 1960 until his death Martin Luther acted as copastor. edu Martin Luther attended segregated public schools in Georgia, graduating from high school at the age of fifteen /edu ; edu he received the B. A. degree in 1948 from Morehouse College /edu , a distinguished Negro institution of Atlanta from which both his father and grandfather had been graduated. After three years of theological study at Crozer Theological Seminary in Pennsylvania where he was elected president of a predominantly white senior class, edu he was awarded the B.D. in 1951 /edu . With a fellowship won at Crozer, edu he enrolled in graduate studies at Boston University, completing his residence for the doctorate in 1953 and receiving the degree in 1955 /edu . In Boston personal he met and married Coretta Scott /personal , a young woman of uncommon intellectual and artistic attainments. personal Two sons and two daughters bio were born /bio into the family /personal . In 1954, Martin Luther work King accepted the pastorale of the Dexter Avenue Baptist Church in Montgomery, Alabama /work . USC corpus (Z HOU ET AL ., 2004) 10 The USC corpus was kindly provided by Ling Zhou & Eduard Hovy at the Information Sciences Institute, University of Southern California. Figure 4.1: Discrepancies in Annotation Styles in the USC (Curie Biographies). Tagged biographical clause: education Successfully passed /education the examinations in medicine. Tagged biographical words: education inventor /education ; personal sister /personal ; personal father /personal ; fame celebrity /fame ; personal married /personal . (Note that the corpus is reported as annotated at the clause level only.) The mark-up scheme used in the USC corpus is not applied consistently throughout the corpus. Sometimes clauses are identified, and sometimes biographical words. Some examples of the discrepancies in annotation are shown in Figure 4.1. 4.3.7 The TREC News Corpus A one hundred megabyte extract from the 1998 APW (Associated Press Wire) TREC (Text Retrieval Conference) corpus was used in this work.11 The corpus was used as a source of non-biographical data. Traditionally, local news providers have subscribed to a news wire service, which provides a skeletal text for breaking news stories; the local news provider then augments the story according to its house style. The news stories, although they frequently mention persons, are not biographical in purpose. That is, biographical information is sometimes presented in the context of a wider news story. In order to assess the proportion of biographical sentences in the corpus, a sample of one thousand sentences was taken. These sentences were then manually classified according to the criteria specified in Section 5.2.2 on page 112. It was found that 11% of sentences were classified as biographical, and 89% were classified as non-biographical. Of the 11% of sentences classified as biographical, the vast majority were examples of apposition. Topic coverage is wide, ranging from sports results to the activities of US politicians and international political entities (for example, the World Health Organisation). An example entry is reproduced below: LONDON (AP) The European Union will hold a two-day trade-boosting meeting with 12 Mediterranean countries in Parlermo, Italy, this week, Foreign Secretary Robin Cook said Monday. Britain, which currently hold the presidency of the EU, will chair the meeting with foreign ministers from Algeria, Cyprus, Egypt, Israel, Jordan, Lebanon, Malta, Morocco, Palestinian Authority, Syria, Tunisia and Turkey. The meeting, starting Wednesday, is part of the so-called EuroMed process started at a conference in Barcelona, Spain, in 1995 to boost political, economic and cultural links between the EU and neighbors on the southern and eastern Mediterranean rim. http://trec.nist.gov/ 11 http://trec.nist.gov/ Accessed on 02-01-07. Table 4.1: Descriptive Statistics for Biographical Corpora.
                              Chambers        NZ         DNB          WW        Wiki        USC
No. Entries                        921       2977       36466       36687        959        130
Total No. Words                  96707    2878412    33853718     5280196     152481     174141
Mean Words Per Entry            105.39    1002.01      928.42      143.97     159.00    1339.55
Stan. Dev. Words Per Entry       51.32     480.44     1364.20       98.95     215.34    1185.76
No. Chars Per Word                5.18       6.01        6.55        5.35       4.57       6.09
Mean Sent Per Entry               5.58      32.12       34.89         -        17.58      79.40
Stan. Dev. Sent Per Entry         2.19      16.90       50.90         -        24.31      63.77
4.3.8 The B ROWN Corpus The B ROWN corpus12 was developed by Francis and Kucera at Brown University and made available digitally in 1964. 
It is thus one of the earliest electronic corpora, and remains widely used in the twenty-first century. The corpus consists of around one million words from 500 articles (approximately 2000 words per article). The corpus is balanced in that it consists of texts from various sources in an attempt to provide a snapshot of written American English in the 1960s (see Figure 4.2 on the following page for the text types presented in the corpus). The version of the corpus used in this work was originally part-of-speech tagged, but these tags were removed for the purposes of the current research. 12 A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. The B ROWN corpus is supplied with the Python Natural Language Toolkit (lite): http://nltk.sourceforge.net Accessed on 02-01-07. An online manual for the B ROWN corpus is available at http://icame.uib.no/brown/bcm.html Accessed on 02-01-07. Below is a brief extract from an article classified under “Popular Lore” in the B ROWN corpus: Yet, in spite of this, intensive study of the taped interviews by teams of psychotherapists and linguists laid bare the surprising fact that, in the first five minutes of an initial interview, the patient often reveals as many as a dozen times just what’s wrong with him; to spot these giveaways the therapist must know either intuitively or scientifically how to listen. Naturally, the patient does not say, “I hate my father”, or “Sibling rivalry is what bugs me”. What he does do is give himself away by communicating information over and above the words involved. Some of the classic indicators, as described by Drs. Pittenger, Hockett, and Danehy in The First Five Minutes, are these: ambiguity of pronouns, stammering or repetition of I, you, he, she, et cetera signal ambiguity or uncertainty. The B ROWN Corpus 4.3.9 The STOP Corpus The STOP (Lancaster Speech, Thought and Writing Presentation) Corpus (S EMINO and S HORT, 2004)13 was developed at Lancaster University in the 1990s. The corpus consists of around 250,000 words, made up of 120 documents, each around 2000 words long. The corpus is classified approximately equally into three narrative genres (fiction, newspaper news, and (auto)biography) (see Figure 4.4 on page 100; note that the leaves of the tree depicted represent example texts used in the corpus) and is heavily annotated according to the theory of speech and thought presentation described in L EECH and S HORT (1981) (see Section 2.2.1 on page 23 for more on this approach to stylistics). The annotation used in the corpus was not utilised in the current work — all annotation was stripped before processing — and will not be discussed here. However, a truncated example entry (with markup) is provided in Figure 4.3 on page 99. 13 The corpus manual is available at: http://www.comps.lancs.ac.uk/computing/users/eiamjw/stop/handbook/ba.html (Accessed on 02-01-07) and the corpus itself is available from the Oxford Text Archive. 
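As noted above, the STOP annotation was stripped before the corpus was processed. The following is a minimal Perl sketch of how such tag stripping might be done, assuming SGML/XML-style tags of the kind shown in Figure 4.3; it is illustrative only and is not the exact preprocessing script used in this work.

#!/usr/bin/perl
# Minimal sketch: strip SGML/XML-style annotation from a corpus file,
# leaving plain text for sentence splitting and feature extraction.
use strict;
use warnings;

while (my $line = <STDIN>) {
    $line =~ s/<[^>]+>//g;      # remove any <...> tag, e.g. a speech/thought presentation tag
    $line =~ s/\s+/ /g;         # collapse the whitespace left behind
    $line =~ s/^\s+|\s+$//g;    # trim leading and trailing space
    print "$line\n" if length $line;
}

Run as, for example, perl strip_tags.pl < annotated.txt > plain.txt (file names hypothetical); the output is plain text ready for sentence splitting.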
Figure 4.2: B ROWN Corpus Hierarchy of Text Types. [The figure shows the 500 texts of the B ROWN corpus divided into informative prose (374 texts: reportage, editorial, reviews, religion, skills and hobbies, popular lore, belles lettres, miscellaneous and learned) and imaginative prose (126 texts: general fiction, mystery, science fiction, adventure, romance and humour), with the number of texts in each category.] Figure 4.3: Truncated Example Entry from the STOP Corpus (S EMINO and S HORT, 2004): Michael Caine’s Autobiography. header Author: Michael Caine Title: What’s it all about? Date: 1992 Publisher: Century B Michael Caine (first person narrator) C Rene Clement G Harry Salzman J “the powers that be” (in filmmaking) K Johnny Morris L Spanish policeman M Brigitte Bardot O Eric Sykes U Sean Connery X unknown Y “the British” [troops in the film ’Play Dirty’] /header body head Bardot tries it on /head pb n=242 sptag cat=N next=NRSAP whonext=X s=1 w=19 I have been in over seventy-three films in thirty years and by the time you read this it will probably be seventy-six. sptag cat=NRSAP who=X next=N s=1 w=15 People often criticise me for not being discriminating enough and even for working so hard. sptag cat=N next=NI whonext=B s=17+0.82 w=454 Why bother? As far as discrimination is concerned I have a definite standard by which I choose films: I choose the best one available at the time I need one. Of course this has often led me down dubious artistic paths, but even they are not without their advantages. It is much more difficult to act well in a bad film with a bad director than in any other type of movie and it gives you great experience in taking care of yourself. It also means that when a good script does turn up you’re ready for it. It’s not unlike athletes in training who will practise running on sand so they find it easy to run on a solid track in competition. Plus of course there’s the money. You get paid the same for a bad film as you do for a good one- because no one knows for sure if the bad film is going to be bad or the good film is going to be good until the premiere. You can wind up, as I do when a good role comes along, absolutely prepared, having worked right up to date, or you can sit there waiting for it for five years, scared /body Figure 4.4: Hierarchy of Texts Included in the STOP Corpus (S EMINO and S HORT, 2004). [The figure shows the corpus divided into three genres (news, fiction and (auto)biography), each split into a serious and a popular strand, with example texts at the leaves: for news, the Guardian and Independent (serious) and the Mirror and Star (popular); for fiction, Greene’s Brighton Rock and Woolf’s Night and Day (serious) and Lewis’s Get Carter and Peters’s The Holy Thief (popular); for (auto)biography, a C. S. Lewis biography and Laurie Lee’s autobiography (serious) and the Michael Caine and Doris Stokes autobiographies (popular).] C HAPTER 5 Developing a Biographical Annotation Scheme In order to produce a set of gold standard biographical sentences for machine learning experiments, an annotation scheme specifying a procedure for identifying biographical sentences was developed, and a small corpus then created based on this annotation scheme. This chapter describes the development of a biographical annotation scheme and corpus. First, three existing annotation schemes for tagging biographical texts are outlined. Second, a new scheme is described, informed by existing schemes, but aimed at ease of use in annotating biographical texts. 
This scheme is then tested against biographical data, in order to test whether the scheme can be used to comprehensively annotate short biographical texts. Finally, a small biographical corpus, based on the annotation scheme is described. 5.1 Existing Annotation Schemes There are numerous annotation schemes in existence that have some relevance to describing significant life events for individuals. A review of specialist annotation schemes relevant to biography has been produced by the Text Encoding Initiative1 , these include schemes designed to represent genealogical data, inter-family relationships and archaeological artifacts. This document reviews in detail annotation schemes specified by the Text Encoding Initiative, the University of Southern California (Z HOU ET AL ., 2004) and the scheme used as a guideline to contributors to the Dictionary of National Biography (OUP, 2003). 1 Report on XML Markup of Biographical and Prosopographical data, http:://www.tei-c.org. Accessed on 01-08-06. Prosopography is a research method in history which examines the relationships between historical figures in order to identify common experiences (among other things). 101 C HAPTER 5: D EVELOPING A B IOGRAPHICAL A NNOTATION S CHEME 5.1.1 Text Encoding Initiative Scheme The Text Encoding Initiative (TEI) 2 publishes SGML and XML standards for the description of textual data for the humanities and allied areas. As part of their standard, the TEI publishes a special purpose module for “the encoding of proper names and other phrases descriptive of persons, places or organisations, and also of dates and times.”3 Although the TEI standard tag set allows for the identification of proper names, it does not allow for the tagging of the strings constituent parts (for example, forename, family name, and so on). The TEI Names and Dates module does however allow for this more detailed level of analysis.4 The most directly relevant specifically biographical part of the Names and Dates TEI module, is section 20.4 — Biographical and Prosopographical Data. The authors of the scheme envisage three possible usage situations: 1. The conversion of existing biographical records (for example, the Dictionary of National Biography). 2. The creation of structured biographical data from a document collection or corpus. 3. The creation of biographical (or curriculum vitae like) data structures in business contexts (for example, human resources). The scheme is built around three “basic principles”: 1. Personal characteristics or traits are the qualities of an individual not under that individual’s control. These include sex, ethnicity, eye colour. 2. Personal states are (among others) marital state, occupation, and place of residence. These states are temporally extended, normally having a clear beginning and a clear end (for example, marriage, living in a certain location and so on) and normally reflect the choice of the individual. 3. Events are changes in personal states associated with a specific date (or narrow range of dates). The TEI scheme divides its biographical tags into three groups, reflecting the division between personal characteristics, personal states and events. These three tag groups are described in detail below: Personal Characteristics – – faith : refers to an individual’s religious beliefs. langKnowledge : describes a person’s language knowledge (for example, languages spoken). 2 For the many activities of the Text Encoding Initiative Consortium see http://www.tei-c.org. 
Accessed on: 01-08-06 3 The full guidelines for the TEI annotation scheme are available at www.tei-c.org. Accessed 01-08-06. 4 Section 20.1, Personal Names of the TEI guidelines describes this facility. 102 C HAPTER 5: D EVELOPING A B IOGRAPHICAL A NNOTATION S CHEME – langKnown : describes the person’s knowledge of a given language of interest. – nationality : describes nationality (or previous nationality). – sex describes the sex of the person. – socecStatus : socio-economic status describes a person’s social or economic status. – persTrait : This is a general tag that describes any personality trait of interest (for example, “She was known for her generosity”). Personal States – persName : contains the persons name or part thereof (including titles, honourifics, and so on). – relation : describes relationships (family, professional, social). – occupation : describes a person’s job, occupation or career. – residence : details a person’s place of residence (or past place of residence). – affiliation : describes a person’s relationship with some organisation. – – education : describes a person’s educational experiences. floruit : describes a person’s “flourishing” period (that is, the period in their life that they were productive). Personal Events – birth : details information about a person’s birth (for example, location, date) – death : details information about a person’s death (for example, location, date). – persEvent : is a general tag that describes any event (excluding birth and death) of significant or importance in the life of that person. 5.1.2 University of Southern California Scheme A small annotated biographical corpus already exists (Z HOU ET AL ., 2004) at the University of Southern California and has been used in this work (see page 94 for a description of this corpus). The annotation scheme and corpus are unsuitable for use as gold standard data however, as: 103 C HAPTER 5: D EVELOPING A B IOGRAPHICAL A NNOTATION S CHEME 1. The corpus is created entirely from explicitly biographical texts (that is, biographical articles from the web) rather than general texts that contain biographical information. Texts are harvested from a single type of source. There may be differences between web based and published biographies. For example, published biographies may adopt a more formal style. 2. The annotation scheme developed by Z HOU ET AL . (2004) is under specified and hence applied inconsistently to the corpus. Z HOU ET AL . (2004) uses nine factors, identified from biographical texts. Broadbrush categories are described, but the fine points of how categorisation decisions are made in difficult cases are not supplied. It can be noted that annotation styles differ considerably within the Zhou (2004) corpus. For example, biographical clauses are sometimes tagged (this is the stated aim of the corpus) and sometimes, biographical words are tagged. For example: work King was ordained in 1947 and became (1954) minister of a Baptist church in Montgomery, Alabama /work Marilyn work appeared /work in the work production /work of George Cukov’s Let’s Make Love. See Figure 4.1 on page 95 for more examples of inconsistent tagging in the USC corpus. Z HOU ET AL . (2004) used used nine annotation categories ( bio , fame , personality , personal , social , edu , nation , scandal and work ). See page 94 for more details of these categories. XML style tags were used for each of the nine categories (although the documents were not validated using XML technology) (W YNNE, 2004). 
It is assumed that if a sentence or clause is not tagged, then it is non-biographical. The nine categories focus on essential facts, suitable for inclusion in a short summary about a person’s life. Further, biographical facts may be embedded in a document that is primarily concerned with another individual (for example, there may well be biographical sentences referring to Tony Blair in an article about George Bush). Additionally, biographical facts may be included as incidental information in general texts (for example, a general news story may include biographical information about “British Deputy Prime Minister, John Prescott”). The nine categories are detailed below, along with examples taken from the USC corpus. The annotation aims to pick out biographical information. The subject of the biographical information — who it is about — is not relevant. BIO Information on birth and death. Clause may also contain information about location: – “? was born in Warsaw on March 14th 1879” 104 C HAPTER 5: D EVELOPING A B IOGRAPHICAL A NNOTATION S CHEME – “was born on Jan 16th 1879” – “Thomas Alva Edison was born to Sam and Nancy on Feb 11th 1947” – “He died in Princeton” – “Einstein died in Princeton on April 18th 1955” – “Marie Curie died of leukemia in 1934” – “Marie Curie died at the age of 67” – “Hitler’s mother died in 1907” FAME : What a subject is famous for. This kind of information is broadly positive (for example, awards, honours, achievements). More negative notable events (notoriety) come under the scandal heading. – “was awarded the 1964 Nobel Peace Prize for his efforts” – “Hitler received the Iron Cross, 2nd class” – “King was the youngest man to receive the Nobel Peace Prize” – “Ghandi is that rare great man held in universal esteem” – “Ghandi became the international symbol of India” – “He asked the whole nation to strike for one day.” CHARACTER : Attitudes, qualities, character traits, political or religious attitudes. – “Edison was conservative” – “Einstein, though not religious, was a believer” – “She was always exceedingly modest about her achievements” – “His mental abilities and powers of concentration were extraordinary” – “The young Hitler was a resentful, discontented child” PERSONAL : Information concerning relationships with intimate partners, parents, siblings, children, friends. Also, non-fatal illnesses. – “She was also her mother’s faithful companion” – “She had recently got married” – “He had six children, three by each wife” – “They were married in 1953 and would have four children” – “His parents rejected her because of her family’s impoverished financial situation” 105 C HAPTER 5: D EVELOPING A B IOGRAPHICAL A NNOTATION S CHEME SOCIAL : Introduction to friends, partners, collaborators and colleagues. Changes in location and social milieu. – “During 1923, he visited Palestine.” – “Edison met Eadweard Muybridge [sic]5 at West Orange” EDUCATIONAL : Institutions attended, dates, evaluative judgements on time in education, educational choices. Types of education. – “Marie attended science classes” – “In 1896 he entered the Swiss Federal Polytechnic School” – “He majored in sociology and in his junior year decided to enter the ministry” NATIONALITY : References to a person’s or persons’ nationality. 
– “Einstein renounced German citizenship” – “He became a citizen of the United States” – “He was unkind to his first wife, Serbian physicist, Mileva Marick” SCANDAL : Understood as reasons for fame that are negative: – “Marie’s critics have charged that she neglected her children while younger.” – “Bobby Kennedy was also reported to have had an affair with Marilyn.” – “Paul Langevin challenged the editor of the newspaper to a duel with pistols” WORK : This includes references to position, job titles (including apposition) – “In addition to teaching, Curie also began to spend time in the laboratory.” – “During 1958, he published his first book.” – “Edison’s company produced over 1700 movies.” – “British Chancellor, Gordon Brown.” 5 Eadweard Muybridge (1830-1904) was a pioneer photographer who was born in England, but spent most of his adult life in the United States. A recent biography is entitled, The Man Who Stopped Time: The Illuminating Story of Eadweard Muybridge: Pioneer Photographer, Father Of The Motion Picture, Murderer, (C LEGG , 2007). 106 C HAPTER 5: D EVELOPING A B IOGRAPHICAL A NNOTATION S CHEME 5.1.3 Dictionary of National Biography Scheme The Dictionary of National Biography (DNB) biographical scheme — which is not strictly speaking an annotation scheme, but rather a set of guidelines for biography writers (OUP, 2003) — stipulates that biographies should contain “standard factual components” when these are available. These “standard factual components” fall into four categories: personal data, family data, career and sources of information.6 Each category contains a number of “standard factual components”, some of which are obligatory (required) and some optional: Personal Data Required: – Name: full names, alternative names, nicknames, short forms. – Full dates of birth (or as second best, baptism), death and burial. – Titles: aristocratic titles, knighthoods, baronetcies, high ecclesiastical titles, and so on. – Places of birth (or, as second best, baptism), death and burial: addresses should be given if possible, and places identified by country, modern place name, or other means. – Places of settled residence: addresses should be given if possible and places identified by county, modern place name or other means if necessary. – Cause of death: disease, condition or other cause; where possible a contemporary report should be supplied, with subsequent interpretation, if any. Optional: – Physical appearance. – Character traits. Family Data Required: – Father: full names, alternative names, titles, vital dates (years only), occupation. – Mother: maiden name, alternative names, titles, vital dates (years only), occupation (when other than “wifely”). – Subject’s spouse(s) or partner(s) other than spouse (common-law spouse, mistress, established lover): full names, for women maiden 6 The “sources of information” section has been ignored as it deals primarily with bibliographical data and archived material. 107 C HAPTER 5: D EVELOPING A B IOGRAPHICAL A NNOTATION S CHEME name and former name if previously married, titles, vital dates (years only), occupation, date of marriage or start of the relationship, date of its dissolution. Optional: – Subject’s place in the family: number of sisters and brothers, seniority in relation to the subject. – Children: number, name(s) of parent(s) where the subject had more than one spouse or partner, more information if relevant. 
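Because the USC mark-up is applied inconsistently (sometimes to clauses, sometimes to single words), it is useful when inspecting the corpus to be able to list every tagged span together with its category. The following is a hypothetical Perl sketch, assuming XML-style tags of the form <work> ... </work> as described above; it is offered only as an illustration and is not part of the software actually used in this work.

#!/usr/bin/perl
# Hypothetical sketch: list every tagged span and its category in a
# USC-style annotated biography, to inspect how the tags were applied.
use strict;
use warnings;

my @categories = qw(bio fame personality personal social edu nation scandal work);
my $cat_re = join '|', @categories;

local $/;                 # slurp the whole file
my $text = <STDIN>;

my %count;
while ($text =~ m{<($cat_re)>\s*(.*?)\s*</\1>}gs) {
    my ($cat, $span) = ($1, $2);
    $count{$cat}++;
    printf "%-12s %s\n", $cat, $span;
}

print "\n";
printf "%-12s %d spans\n", $_, $count{$_} for sort keys %count;

Listing the spans in this way makes it immediately visible whether a tag has been applied to a whole clause or to a single word, which is the discrepancy illustrated in Figure 4.1.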
Career Required: – Religious affiliation(s): faith and sect, degree of adherence, evidence of lack of religious affiliation. – Geographical/ethnic interest: countries, regions, and cultures with which the subject was associated, and which had an impact on his/her life and career. – Place(s) of education: school, college, university, Inn of Court, apprenticeship, and so on, with dates of attendance; degrees or other awards and qualifications with dates. Optional: – Occupation(s). – Offices and ranks held (with dates): precise dates (day, month, year) of appointment to major offices should be given. – Honours conferred (with dates): the number listed should be determined by importance relative to other information in the text. – Works by the subject: major works, with summary of minor works. – Historiographical context: comment on significance and changing histiographical reputation (depending on the importance of the subject). The opening of an entry follows a rigidly defined format, whereas other factual components (obligatory and non-obligatory) can be integrated in the prose structure as required. The New Dictionary of National Biography: Notes for Contributors Handbook (OUP, 2003), uses the example of Gladstone’s biography: Gladstone, William Ewart (1808-1898), statesman and author, was born on 29 Dec 1809 at 62 Rodney Street, Liverpool, the fifth of six children of John Gladstone (1764-1861), merchant and MP, and 108 C HAPTER 5: D EVELOPING A B IOGRAPHICAL A NNOTATION S CHEME Figure 5.1: Dictionary of National Biography Opening Schema. subject Familyname, subject forename, (year of birth - year of death), job roles/s, date of birth year of birth, location of birth, birth order father name (year of birth - year-of-death), father job role/s, mother name (year of birth - year of death), mothers father name, mothers fathers location, mothers name. his second wife, Anne (1773-1835), daughter of Provost Andrew Robertson of Dingwall and his wife, Annie. The rigidly defined format is given schematically in Figure 5.1. 5.2 Synthesis Annotation Scheme This section describes the issues and considerations involved in developing a biographical annotation scheme for the purposes of this work. After describing the scheme, and how it differs from the existing biographical schemes described in Sections 5.1.1, 5.1.2 and 5.1.3, the new scheme will be tested on short biographies in order to assess how well it accounts for biographical data. 5.2.1 Developing a New Biographical Scheme The new scheme must be simple enough for people to annotate quickly, confidently and effectively without a prolonged training period (that is, it must not be too “deep” in structure). The Dictionary of National Biography’s three top level categories subsume six of the University of Southern California’s schemes categories (that is, bio , nation , character , work , education and personal ), but do not account well for the three remaining categories in the University of Southern California scheme ( fame , social and scandal ). For instance, it is conceivable that a person could be famous due to his or her family background (for example, someone related to the British Queen), for their career (for example, the British Prime Minister) and for some personal fact (for example, an aristocratic title, which would be counted as a personal fact under the Dictionary of National Biography scheme). 
The relationship between the Dictionary of National Biography categories and the University of Southern California categories is depicted in Figure 5.2 on the following page. Figure 5.2: Relationship Between the Dictionary of National Biography and University of Southern California Biographical Schemes. [The figure shows the three DNB categories (Personal, Career and Family) alongside the nine USC tags ( <bio>, <personality>, <nation>, <education>, <work>, <social>, <personal>, <scandal> and <fame> ), indicating which USC tags are subsumed by the DNB categories and which ( <fame>, <social> and <scandal> ) are not.] The TEI scheme covers similar ground to the University of Southern California scheme, but is more explicitly detailed. It also includes the categories persTrait and persEvent . These are general purpose tags that can be used to describe any character trait or event in a person’s life that seems significant to the annotator. This catch-all approach is less appropriate in a situation where we are assessing inter-annotator agreement and wish to produce a set of clear guidelines for making annotation decisions. Elements of all three schemes were used to construct a biographical annotation scheme designed for both ease and consistency of annotation (see Figure 5.3 on the next page for a comparison between the new scheme and the USC and TEI schemes7 ). Some differences between the USC scheme and the new scheme are summarised below: – In the new scheme, the (USC) nation tag is discarded, and nationality information is contained in the key tag (Key Life Facts). – In the new scheme, cause of death is included in the key tag. – In the new scheme, place of settled residence is included in the key tag. – The USC tag social is dropped, as this kind of information is included in relationships in the new scheme. – The USC tag scandal has been dropped from the new scheme, as scandal can be understood as “negative” fame (that is, infamy, notoriety). 7 The Dictionary of National Biography scheme is a set of authorial guidelines rather than an annotation scheme proper, and therefore is not included in Figure 5.3. Some differences between the Text Encoding Initiative scheme and the new scheme are given below: – Several TEI biographical tags relating to key life facts (that is, the TEI tags nation , persName , residence , birth and death ) are subsumed in the new key category. – Some tags in the TEI scheme — like floruit , the period in which a person “flourishes” — have not been retained in the new scheme. – The TEI tags occupation and affiliation are subsumed in the (new scheme) tag work . Figure 5.3: Relationship Between New Synthesis Scheme, Text Encoding Initiative Scheme, and University of Southern California Scheme. [The figure maps the USC tags ( <bio>, <nation>, <fame>, <personality>, <social>, <personal>, <edu>, <scandal> and <work> ) and the TEI tags ( <persName>, <nation>, <birth>, <death>, <residence>, <persEvent>, <persTrait>, <relation>, <education>, <occupation> and <affiliation> ) onto the six tags of the new scheme ( <key>, <fame>, <character>, <relationships>, <education> and <work> ).] 5.2.2 A Synthesis Biographical Annotation Scheme The synthesis annotation scheme, unlike the USC annotation scheme discussed above, is designed to identify biographical text at the sentence level. This strategy has the advantage that, unlike clauses, sentences can be straightforwardly identified both by people and automatically. A disadvantage of this sentence-level approach, however, is that biographical information may form only part of a multi-clause sentence. 
For instance the sentence “ The Economist Intelligence Unit compiled the index on behalf of the Australian IT entrepreneur and philanthropist, Steve Killelea, who said he hoped it would encourage nations to address the issue of peace” 8 in addition to the highlighted biographical appositive phrase, contains non-biographical information about the Economist Intelligence Unit and “the issue of peace”. According to the synthesis biographical scheme, the entire sentence would be tagged as biographical ( work ). The six tag biographical scheme is presented below: key : Key information about a person’s life course: – Information about date of birth, or date of death, or age at death. – Names and alternate names (for example, nicknames). – Place of birth: “Orr was born in Ann Arbor, Michigan but was raised in Evansville, Indiana”. – Place of death: “He died of a heart attack while holidaying in the resort town of Sochi on the Black Sea coast”. – Nationality: “He became a naturalized citizen of the United States in 1941”. – Cause of death: “He died of a heart attack in Bandra, Mumbai”. – Longstanding illnesses or medical conditions: “He stepped down from the position on grounds of poor health in February 2004”. – Place of residence: “Sontag lived in Sarajevo for many months of the Sarajevo siege”. – Physical appearance: “With his movie star good looks he was a crowd favourite”. – Major threats to health and wellbeing (for example, assassination attempts, car crashes). fame : What a person is famous for. This kind of information can be broadly positive (for example, rewards, prizes, honours, and so on) or negative (for example, scandal, jail terms, and so on). Examples of fame tags include: 8 http://www.guardian.co.uk Accessed on 01-05-07. 112 C HAPTER 5: D EVELOPING A B IOGRAPHICAL A NNOTATION S CHEME – “His study of Dalton won him the Whitbread prize” – “In 1976 heroin landed him in Los Angeles County Jail, where he spent two months for possession of narcotics” character : Attitudes, qualities, character traits and political or religious views. For example: – “He was raised Catholic, the faith of his mother” – “Jones is recalled as a gentle and unassuming man” relationships : Information concerning relationships with intimate partners and sexual orientation. Relationship with parents, siblings, children and friends. – “Her mother died when she was eleven” – “Nine people testified against him at his trial, including another wife he tried to set on fire” education : Institutions attended, dates, educational choices, qualifications awarded (with dates if available). General comments on educational experiences. For example: – “Corman studied for his master’s degree at the University of Michigan, but dropped out when two credits short of completion” work : References to positions, job titles, affiliations, (for example, employers), lists of publications, films or other work orientated achievements. General areas of interest (for example, industries, sectors, geographical regions). – “He returned to England in 1967 to work for the offshore pirate radio station Wonderful Radio, London” 5.2.3 Assessing the Synthesis Annotation Scheme One obvious method of exploring the utility of this scheme is to assess whether it successfully accounts for “real world” biographical data available in short, information-packed biographical summaries (for instance the short biographical entries found in Wikipedia biographies). 
That is, if all (or most) sentences in short biographies can be tagged using the new scheme, then on the face of it, this seems to indicate that the scheme is worth developing, testing more rigorously, and using as a standard for the development of gold standard data. In other words, if the new annotation scheme covers or accounts for the sentences in short biographical texts, then that is a first step to showing that sentences tagged using the scheme are biographical.

As a first step towards assessing the scheme, four self contained biographies were obtained from several different sources. Two of the biographies were multiple paragraph texts (Philip Larkin and Alan Turing, both from the Dictionary of National Biography) and two were single paragraph biographies (Paul Foot and Ambrose Bierce, from Wikipedia and Chambers Dictionary of Biography respectively). A truncated example biography of Alan Turing (annotated using the new scheme) is shown in Figure 5.4 on the following page, and all four marked up biographies are reproduced in Appendix F on page 265. Note that only two sentences out of twenty-three are unaccounted for using the new six tag annotation scheme for the Alan Turing text (these are shown in Examples 5.1 and 5.2).

(5.1) "He tackled the problems arising out of the use of this machine with a combination of powerful mathematical analysis and intuitive short cuts which showed him at heart more of an applied than a pure mathematician"

(5.2) "He suggested that machines can learn and may eventually 'compete with men in all purely intellectual fields'"

Table 5.1 on page 116 summarises the data from the initial analyses of annotation scheme coverage on different data sources. Note that for both the single paragraph biographies — Bierce and Foot — total coverage was achieved (that is, all the sentences could be tagged using the six tags from the new scheme). For the multi-paragraph Dictionary of National Biography biographies — Larkin and Turing — some sentences were not accounted for by the scheme. In the case of Larkin, 30.8% of sentences in the entry could not be accounted for. The figure for the Turing entry was 8.7%. Note that the Larkin biography was by far the longest biography considered (almost twice the length of the Turing entry). On the basis of this data, the scheme accounts less well for longer biographical essays than for short, punchy biographies. As outlined in Chapter 2, short biographies contain more focused biographical text, centering on key facts about an individual, rather than the discursive patterns obvious in essay or book length biographical texts.

As a second step towards assessing the scheme — after having achieved indicative results that the annotation scheme accounts for short biographies — the scheme was tested on four Wikipedia biographies: Jack Anderson, Kerry Packer, Richard Pryor and Stanley Williams.9 Table 5.2 presents the results of this annotation exercise. It can be seen that for three of the four biographies analysed, coverage was total. One sentence in one of the entries was unaccounted for (see Example 5.3) as it referred to a posthumous event. For an example of a Wikipedia biography annotated according to the new scheme, see Figure 5.5. Note that all four annotated Wikipedia biographies are reproduced in Appendix F.
(5.3) "A few months after his death, the FBI attempted to gain access to his files as part of the AIPAC case on the grounds that the information could hurt U.S. government interests"

9 All biographical subjects are classified as December 2005 deaths in the Wikipedia categorisation system: www.wikipedia.org. Accessed 01/08/06.

Figure 5.4: Entry for Alan Turing in the Dictionary of National Biography Annotated Using New Six Way Scheme.

<relationships><work><key>Turing, Alan Mathison, 1912-1954, mathematician, was born in London 23 June 1912, the younger son of Julius Mathison Turing, of the Indian Civil Service, and his wife, Ethel Sara, daughter of Edward Waller Stoney, chief engineer of the Madras and Southern Mahratta Railway.</key></work></relationships> <relationships>G. J. and G. G. Stoney were collateral relations.</relationships> <character><education>He was educated at Sherborne School where he was able to fit in despite his independent unconventionality and was recognized as a boy of marked ability and character.</education></character> <education>He went as a mathematical scholar to King's College, Cambridge, where he obtained a second class in part i and a first in part ii of the mathematical tripos (1932-4).</education> <education>He was elected into a fellowship in 1935 with a thesis "On the Gaussian Error Function" which in 1936 obtained for him a Smith's prize.</education> <fame>In the following year there appeared his best-known contribution to mathematics, a paper for the London Mathematical Society "On Computable Numbers, with an Application to the Entscheidungsproblem", a proof that there are classes of mathematical problems which cannot be solved by any fixed and definite process, that is, by an automatic machine.</fame> <fame>His theoretical description of a "universal" computing machine aroused much interest.</fame> <work>After two years (1936-8) at Princeton, Turing returned to King's where his fellowship was renewed.</work> <fame><work>But his research was interrupted by the war during which he worked for the communications department of the Foreign Office; in 1946 he was appointed O.B.E. for his services.</work></fame> <work>The war over, he declined a lectureship at Cambridge, preferring to concentrate on computing machinery, and in the autumn of 1945 he became a senior principal scientific officer in the mathematics division of the National Physical Laboratory at Teddington.</work> <work>With a team of engineers and electronic experts he worked on his "logical design" for the Automatic Computing Engine (ACE) of which a working pilot model was demonstrated in 1950 (it went eventually to the Science Museum).</work> <work>In the meantime Turing had resigned and in 1948 he accepted a readership at Manchester where he was assistant director of the Manchester Automatic Digital Machine (MADAM).</work>

As a third step towards assessing the scheme, one thousand sentences were sampled from each of five data sources (summarised in Table 5.3 on page 118). These sentences were then classified by the researcher using the annotation scheme. The proportion of biographical sentences for each sample is presented in Table 5.3 on page 118. The data sources used, and the reasons for using them, will be described in turn.
Table 5.1: Coverage of New Annotation Scheme Using Different Sources.

source     name    words  sentences  key  fame  char  relation  edu  work  unclas
Chambers   Bierce    129      7       3    0     0       0       0    5      0
DNB        Larkin   1213     32       2    4     3       2       3   10     10
DNB        Turing    635     23       3    4     4       3       3    8      2
Wikipedia  Foot      552     19       2    3     0       2       1   12      0

Table 5.2: Coverage of New Annotation Scheme on Short Wikipedia Biographies (Deaths in December 2005).

name      words  sentences  key  fame  char  relation  edu  work  unclas
Anderson   224     10        3    4     2       1       0    4      1
Packer      81      4        1    3     1       0       0    1      0
Pryor      218      9        1    4     1       0       0    3      0
Williams   197      8        4    7     1       0       0    0      0

Figure 5.5: Wikipedia biography for Richard Pryor Annotated Using New Six Way Scheme.

<work>His catalog includes such concert movies and recordings as Richard Pryor: Live Smokin' (1971), That Nigger's Crazy (1974), Bicentennial Nigger (1976), Richard Pryor: Wanted Live In Concert (1979) and Richard Pryor: Live on the Sunset Strip (1982).</work> <work>He also starred in numerous films as an actor, usually in comedies such as the classic Silver Streak, but occasionally in the noteworthy dramatic role, such as Paul Schrader's film Blue Collar.</work> <work>He also collaborated on many projects with actor Gene Wilder.</work> <fame>He won an Emmy Award in 1973, and five Grammy Awards in 1974, 1975, 1976, 1981, and 1982.</fame> <fame>In 1974 he also won two American Academy of Humor awards and the Writers Guild of America Award.</fame>

DNB-5: On the basis of an analysis of the Dictionary of National Biography (see Section 4.3.1 on page 87) it was observed that the first five sentences of each biographical entry consistently contain facts characteristic of biographies (key data, according to our biographical scheme; birthdates, marital status, and so on). Intuitively, it seemed that this subsection of the Dictionary of National Biography would provide a good source for testing whether the annotation scheme developed captures exemplary biographical writing.

DNB-R: Sentences from the Dictionary of National Biography that are not one of the first five sentences of an entry were selected as it was hypothesised that a lower proportion of these sentences would be biographical according to the scheme presented here. Entries in the Dictionary of National Biography typically become more discursive after the initial few sentences, sometimes dwelling on historical background or context.

TREC-U: Sentences from the TREC corpus (a corpus of news text, see Section 4.3.7 on page 95) were sampled as it was hypothesised that news text would contain relatively few sentences containing biographical information (according to the scheme used here), and that the biographical information present would mainly be in the form of apposition (for instance, "job title, name").

TREC-F: Sentences from the TREC corpus that do not contain person names or personal pronouns. That is, the TREC corpus was first filtered to remove sentences containing person names and personal pronouns, and 1000 sentences were then sampled from those remaining. It was hypothesised that this sample would contain a very small proportion of biographical sentences (close to zero). A minimal sketch of this filtering step is given below.

CHA-A: Sentences from the Chambers Biographical Dictionary (see page 89) were used on the intuition that the short, information packed entries would contain a high proportion of biographical sentences according to the biographical scheme.
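The thesis does not give the implementation used for the TREC-F filtering step; the sketch below is a minimal Python illustration, assuming that person mentions can be approximated by a small list of person-referring pronouns and a supplied set of known person names. The pronoun list, the example names and sentences, and the simple string-matching strategy are all assumptions made for the sake of the example; the actual filtering may well have used a name gazetteer or a named entity recogniser.

```python
import random
import re

# Assumed list of person-referring pronouns (not taken from the thesis).
PERSONAL_PRONOUNS = {
    "i", "me", "my", "mine", "you", "your", "yours",
    "he", "him", "his", "she", "her", "hers",
    "we", "us", "our", "ours", "they", "them", "their", "theirs",
}

def mentions_a_person(sentence, known_names):
    """Return True if the sentence contains a personal pronoun or a known person name."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    if any(tok.lower() in PERSONAL_PRONOUNS for tok in tokens):
        return True
    return any(name in sentence for name in known_names)

def sample_trec_f(sentences, known_names, n=1000, seed=0):
    """Drop person-mentioning sentences, then sample n sentences from the remainder."""
    remaining = [s for s in sentences if not mentions_a_person(s, known_names)]
    random.Random(seed).shuffle(remaining)
    return remaining[:n]

if __name__ == "__main__":
    # Illustrative sentences only; not drawn from the TREC corpus.
    corpus = [
        "He was born in Widnes and educated at the University of Liverpool.",
        "The index was compiled on behalf of several research institutes.",
        "Quanta produced 1.3 million sets of notebooks last year.",
    ]
    print(sample_trec_f(corpus, known_names={"Sontag", "Turing"}, n=2))
```

In this toy run the first sentence is removed because of the pronoun "He", and the remaining two person-free sentences form the sampled pool.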
The results of the analysis (see Table 5.3 on the next page) show that those sentences harvested from sources where we would reasonably expect a high density of biographical sentences (that is, CHA-A and DNB-5) score very highly (93.5% for CHA-A and 84.9% for DNB-5). Those sentences taken from data sources where we would expect a lower proportion of biographical sentences (that is, DNB-R and especially TREC-U) have a much lower proportion of biographical sentences (29.1% for DNB-R, and 11.1% for TREC-U). Note that DNB-R contains a higher proportion of biographical sentences than TREC-U. This is perhaps because, although DNB-R contains a lower proportion of biographical sentences than DNB-5, the sentences are from a biographical dictionary, and hence likely to contain a higher proportion of biographical text than newstext (that is, TREC-U). The data source that includes no person names or personal pronouns, TREC-F, contains only 0.6% biographical sentences. This result is explained by the difficulty of expressing biographical information without the use of names or pronouns.

This section has indicated, through three different approaches, that the annotation scheme developed is adequate to act as a provisional annotation scheme. First, the annotation scheme was tested on biographical texts from different sources in order to establish the "coverage" of the scheme. Second, coverage of the scheme on a small set of short Wikipedia biographies was assessed. Finally, the scheme was used to classify sets of 1000 sentences from disparate sources (for example, published biographies and newswire text), in order to test whether a higher proportion of sentences from explicitly biographical text (like Chambers) were accounted for by the biographical scheme, compared to newswire text.

Table 5.3: Percentage of Biographical Sentences Based on 1000 Sentence Sample.

Data Source  Description                                                                               Proportion Bio
CHA-A        Entries from the Chambers Dictionary of Biography                                             93.5%
DNB-5        The first five sentences of entries in the Dictionary of National Biography                   84.9%
DNB-R        All but the first five sentences of entries from the Dictionary of National Biography         29.1%
TREC-F       Entries from a subset of the TREC corpus containing no person names or personal pronouns       0.6%
TREC-U       Entries from a subset of the TREC corpus that includes person names and personal pronouns     11.1%

Cumulatively then, the success of these three approaches to validating the synthesis biographical scheme suggests that the scheme is appropriate for the annotation of short biographies. The work conducted in Chapter 6 on page 124 (Human Study) on assessing the agreement of several annotators also provides support for the adequacy of the scheme.

5.3 Developing a Small Biographical Corpus

This section describes the creation of a small corpus of texts annotated using the six tag biographical scheme described in Section 5.2.2 on page 112.10 The corpus consists of 84,305 word tokens from 80 different documents.11 First, the four sources of texts are described and an example of annotated text given for each source. Second, some issues involved in creating the corpus and descriptive statistics are presented.

10 To access the corpus, email [email protected].
11 The texts were annotated using the SGML aware EMACS text editor.

5.3.1 Text Sources

Four text sources were used: news text from The Guardian12 newspaper, text from BBC obituaries, obituaries from The Guardian newspaper, and finally literary texts selected from the multi-genre STOP Corpus.13
12 http://www.guardian.co.uk Accessed on 02-01-07.
13 The Lancaster Speech, Thought and Writing Presentation Corpus, available from the Oxford Text Archive: http://ota.ox.ac.uk. Accessed on 02-01-07.

Guardian News Text

Texts were sampled from The Guardian newspaper online edition on three days (11-08-06 (13 documents), 12-09-06 (12 documents), and 24-09-06 (12 documents)). The Guardian is a "serious" British newspaper known for its moderate left wing bias. News items only were chosen, though theme or subject was not restricted. A short extract from an article describing events in the British Home Office is reproduced below:

<work>John Reid, the home secretary, called for solidarity "across all sections of the community" today in the face of the "immense" terrorist threat facing Britain.</work> Mr Reid used a press briefing to announce that the "critical" terrorist alert would remain as a "precautionary measure" until further notice. <work>Both he and Douglas Alexander, the transport secretary, will be meeting with national aviation security representatives later today.</work>
(http://www.guardian.co.uk)

Guardian Obituaries

Seventeen obituaries were sampled from The Guardian newspaper from the first half of 2006. The obituaries include those of prominent lawyers, civil servants, diplomats and journalists (see Section 2.3 on page 30 for more on obituaries). The extract below is from the obituary of the musician Gene Simmons:

<fame><key>Among the others who created memorable rockabilly-style recordings was Gene Simmons, who has died aged 69 and who achieved success in 1964 with Haunted House, a schlock-horror number previously recorded by the R&B artist Johnny Fuller.</key></fame> <work><key>Born in Elvis's home town of Tupelo, Mississippi, Simmons took up the guitar as a child after his two sisters brought an instrument home.</key></work> <work>He began his professional musical career at 15, playing with his brother Carl at local dances and on radio as the Simmons Brothers band.</work>
(http://www.guardian.co.uk)

BBC Obituaries

The eleven BBC obituaries used in the corpus were downloaded from the BBC website in July 2006.14 They include writers, actors, politicians and princes. The extract below is from the biography of the novelist Saul Bellow:

<fame>With an awareness of death and the miracle of life at the foundation of his work, Saul Bellow's novels brought him huge success, and both the Nobel and Pulitzer Prize.</fame> He is cited by many contemporary authors as a critical creative influence. Bellow's message was one of hope and affirmation. He said, "In the greatest confusion, there is still an open channel to the soul." <key>Many of his novels were set in Chicago where his poor Russian-Jewish parents moved when he was a child.</key> He later reported, "I saw mayhem all around me. By the age of eight, I knew what sickness and death were."
(http://news.bbc.co.uk/obituaries)

STOP Corpus

Fifteen texts were included from the STOP corpus (see Section 4.3.9 on page 97). Although the STOP corpus includes texts from newspaper sources, only texts from the (auto)biography and literary categories were included (see Figure 5.6 for a list of the literary texts used). Each text used is around two thousand words in length. The extract below is from a biography of the former British Prime Minister, Margaret Thatcher.
<character>Her eyes, according to Alan Watkins of the Observer, took on a manic quality when talking about Europe, while her teeth were such as 'to gobble you up'.</character> More sinister still, she slipped into the habit of using the royal "we" in public. ("We are a grandmother").
(The STOP Corpus)

14 http://news.bbc.co.uk/obituaries Accessed on 08-02-07.

Descriptive statistics for all the text sources that constitute the new biographical corpus are presented in Table 5.4 and Table 5.5. Note that, as single sentences can have multiple tags, there are fewer biographical sentences than biographical tags for each data source (see Table 5.4, rows 4 and 6). Note also that Table 5.5 shows the "Guardian News" category broken down into three subcategories according to the date on which the information was gathered. The proportion of documents of each type (STOP corpus, obituary and newstext) is depicted in Figure 5.7. Note however that documents from the STOP corpus are considerably lengthier than obituary or newstext documents, and therefore the proportion of text derived from the STOP corpus is higher than 19%.

Figure 5.6: Sources of Documents Used From the Literary Genres of the STOP Corpus.

(AUTO)BIOGRAPHY
- Alan Turing: The Enigma of Intelligence, by Andrew Hodges
- Leonard Cohen: Prophet of the Heart, by L. S. Dorman
- The Benny Hill Story, by John Smith
- A Bag of Boiled Sweets, by Julian Critchley
- Curriculum Vitae, by Murial Sparke
- The Downing Street Years, by Margaret Thatcher
- What's it all About?, by Michael Caine

FICTION
- Possession, by A. S. Byatt
- Peach, by Elizabeth Adler
- Brighton Rock, by Graham Greene
- Money, by Martin Amis
- The Moor's Last Sigh, by Salman Rushdie
- Lace, by Shirley Conran
- Get Carter, by Ted Lewis
- Daughter of Deceit, by Victor Halt

Table 5.4: Descriptive Statistics for Biographical Corpora.

                                         Guardian News  BBC Obits  Guardian Obits  STOP Corpus
No. of Documents in Corpus                     37           11            17            15
Avg. Length of Documents (in Words)           824          643           778          2257
Total Number of Bio Tags                      194          173           327           107
Avg. No. of Bio Tags per Document             6.5         15.7          19.7           7.1
Total Number of Bio Sentences                 170          150           247            90
Avg. No. of Bio Sentences per Document        4.6         13.6          14.5           6.0

Figure 5.7: Types of Document Used in Biographically Tagged Corpus (proportions of STOP corpus, obituary and newstext documents).

Table 5.5: Average Number of Biographical Tag Types per Text.

Source Type              key   fame  char  relation  edu   work
Guardian News 11-08-06   0.38  0.61  0.15  0         0     3.38
Guardian News 12-09-06   3.01  0.5   0.08  0.75      0     3.33
Guardian News 24-09-06   0.41  0.25  1.16  0.33      0.25  3.41
BBC Obituaries           3.82  3.54  2.36  2.18      0.64  5.45
Guardian Obituaries      6.35  1.05  2.82  4.23      1.29  8.94
STOP Corpus              1.47  0.27  1.67  1.8       0.27  3.00

5.3.2 Issues in Developing a Biographical Corpus

In building and annotating a biographical corpus, we are interested in exploring features that distinguish biographical from non-biographical text (according to the annotation scheme adopted). Two major concerns guided our choice of source material. First, the corpus should provide non-trivial results.
For example, if we had chosen two data sources — short entries from a schematic biographical dictionary and software manuals — it is likely that the software manual texts would contain very few biographical sentences, and the biographical texts a high proportion of biographical sentences (to take an example, for the BBC obituaries described above, approximately 75% of the sentences are classified as biographical using this scheme). Genres included should therefore frequently include non-biographical (according to our scheme) person orientated information (that is, information about a person that is not covered by the annotation scheme described in this chapter). Second, data sources should be spread across different genres, in order to identify the different types of biographical construction common in different genres. For example, apposition is a very common stylistic device used by journalists in news texts, but is less common in more literary writing (represented by the STOP corpus). It is important that both these issues are addressed, as the data is to be used as training data in machine learning experiments to identify biographical text (see Chapters 7, 8, 9 and 10).

5.4 Conclusion

This chapter has described the creation of a biographical annotation scheme, and a corpus based on that scheme. The next chapter, Chapter 6, goes on to describe a human study designed to validate the biographical annotation scheme described in this chapter.

CHAPTER 6: Human Study

6.1 Introduction

This chapter reports the results of a web based human study. Participants were invited to categorise a series of sentences as biographical or non-biographical, and agreement between the assessors was calculated. The chapter is divided into three main sections. First, some necessary background on inter-annotator agreement1 issues is presented. Second, an initial pilot study using a ternary classification scheme is described. Third, the main study, which uses data from various biographical and multi-genre sources, along with the categorisation scheme developed in Chapter 5, is set forth. The study has three goals:

1. To validate the biographical annotation scheme developed in Chapter 5. If participants agree on the status of sentences — as biographical or non-biographical — it suggests that the biographical annotation scheme developed can be applied consistently.

2. Given that the biographical annotation scheme developed in Chapter 5 is adequate (that is, point 1), to establish to what extent people are able to reliably distinguish between isolated (that is, context-less) biographical and non-biographical sentences.

3. To provide high quality "gold standard" data for experiments in automatic text classification (see Chapters 7, 9 and 10).

1 Inter-annotator agreement is also described as inter-rater agreement and inter-classifier agreement.

6.2 Agreement

This study involves presenting several participants with a list of sentences, along with instructions for the suitable classification of those sentences. Consider Example 6.1 below: each participant was asked to classify this sentence according to the annotation scheme provided. Assessing agreement has the primary aim of checking that the annotators have a shared understanding of the categories.
High agreement levels indicate that annotators have a good understanding of the concepts (categories) involved, and a clear decision procedure for allocating sentences to those categories. In the context of the current work, high agreement would suggest that annotators have a good understanding of which features are characteristic of biographical and non-biographical sentences (as defined by the annotation guidelines).

(6.1) "He was born in Widnes and educated at the University of Liverpool"

Statistical techniques for measuring inter-classifier reliability are regularly used in areas where human classifiers are required to place previously unseen instances into determined categories, without the benefit of pre-classified "gold standard" examples with which to assess the individual classifier's efforts. Example areas include medical statistics (comparing a group of specialists' diagnoses (KRAEMER, 1992)) and optometry (comparing human and machine methods for gathering optometric measurements (WATKINS, 2003)). Inter-classifier agreement is important in the computational linguistics research tradition due to a lack of gold standard data: data tends to be derived from intuitive judgments, and one way of verifying (or supporting) intuitive judgments is to assess how many people who are proficient in the language of interest make the same judgment in a given situation. There are numerous methods available for calculating agreement. Two common agreement metrics are presented here: percentage based scores and variants of the KAPPA statistic.

6.2.1 Percentage Based Scores

Percentage based scores (which simply report the mean percentage of annotators who agree on the class of each item), while straightforward to understand, are not optimal for assessing agreement, as they do not account for expected agreement. For example, consider the (idealised) data presented in Table 6.1. The table shows the result of two annotators assigning ten sentences to one of two categories. The two participants only agree in 60% of cases (that is, six sentences). Are we then entitled to say that agreement is good, despite the fact that for 40% of the sentences there was no agreement? Note that even if the participants categorised the sentences randomly, it is likely that there would be around 50% agreement.

Table 6.1: Raw Agreement Scores (Idealised Example Data).

Sentence      1    2    3    4    5    6    7    8    9    10
Annotator 1   yes  no   yes  no   no   no   yes  no   yes  no
Annotator 2   yes  yes  yes  yes  no   no   yes  yes  yes  yes

6.2.2 The KAPPA Statistic

CARLETTA (1996) challenged the usefulness of percentage measures for assessing agreement and instead proposed the use of the KAPPA statistic, arguing that KAPPA allows a level of interpretability and insight into agreement data that cannot be provided by raw percentage scores.2 The KAPPA statistic (COHEN, 1960) measures agreement between a pair of classifiers, with variants for measuring agreement where the number of classifiers is greater than two (see page 128). KAPPA (Equation 6.2) measures the raw agreement between classifiers, while discounting expected agreement. A score of 0 indicates that any agreement can be accounted for by chance, and a score of 1 indicates perfect agreement (CARLETTA, 1996).

(6.2)   $\kappa = \frac{P(A) - P(E)}{1 - P(E)}$

Here P(A) is the proportion of times the classifiers agree and P(E) is the proportion of times we would expect the classifiers to agree by chance.
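To make Equation 6.2 concrete, the following short Python sketch applies it to the idealised data in Table 6.1, with the expected agreement P(E) derived from each annotator's own label proportions (the method attributed below to COHEN (1960); a Method 1 statistic would instead pool the proportions across annotators). The sketch and the resulting value are illustrative only; they are not reported in the original study.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Compute Equation 6.2 for two annotators' label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement P(A): proportion of items labelled identically by both annotators.
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement P(E): chance of agreement given each annotator's own label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(labels_a) | set(labels_b))
    return (p_a - p_e) / (1 - p_e)

# Table 6.1 (idealised data): ten sentences, two annotators.
annotator_1 = ["yes", "no", "yes", "no", "no", "no", "yes", "no", "yes", "no"]
annotator_2 = ["yes", "yes", "yes", "yes", "no", "no", "yes", "yes", "yes", "yes"]

print(round(cohen_kappa(annotator_1, annotator_2), 2))  # P(A)=0.60, P(E)=0.44, kappa ~= 0.29
```

For this data P(A) = 0.60 and P(E) = 0.44, giving a KAPPA of roughly 0.29, a considerably less flattering figure than the raw 60% agreement score.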
There are several different methods of calculating expected agreement (discussed below). An important limitation of the KAPPA statistic is the lack of an accepted significance level for agreement. While we know that 1 is total agreement and 0 is no agreement beyond chance, where are the thresholds for average, good and excellent agreement? Traditionally, KAPPA greater than 0.8 has been regarded as "good reliability" and KAPPA greater than 0.67 but less than 0.8 as fair reliability "allowing tentative conclusions to be drawn" (CARLETTA, 1996). This scale has been the default in computational linguistics research since CARLETTA (1996), despite the fact that it is widely acknowledged as arbitrary (CRAGGS and MCGEE WOOD, 2005; DI EUGENIO, 2000; DI EUGENIO and GLASS, 2004). CRAGGS and MCGEE WOOD (2005) suggest that researchers should acknowledge "that there is no magic threshold that, once crossed, entitle us to claim that a coding scheme is reliable" (CRAGGS and MCGEE WOOD, 2005), and — this is implicit rather than directly stated — that only indicative claims can be supported with agreement statistics alone. Other researchers in psychology and the medical sciences consider KAPPA greater than 0.75 as "almost perfect" (LANDIS and KOCH, 1977) and "excellent" (EMAM, 1999). Acceptability levels dip as low as 0.5 in psychiatric diagnosis (GROVE ET AL. (1981), referenced by DI EUGENIO (2000)).

2 Note that DI EUGENIO and GLASS (2004) suggest that percentage scores should be used as one of several methods for analysing agreement levels. CRAGGS and MCGEE WOOD (2005) however reject this as unnecessary, arguing that a single variant of the KAPPA statistic is sufficient.

Types of KAPPA statistic

In recent years, there has been a debate about the most appropriate variant of the KAPPA statistic to use in computational linguistics research. The variants fall into two main groups, according to their method for calculating the expected probability P(E) (see Equation 6.2). Method 1 assumes that the distribution of proportions over the categories is the same for all annotators (that is, that annotators make classification decisions in the same proportion). Method 2 does not assume that the distribution of proportions over the categories is the same for all annotators. Instead, expected probability is calculated on the basis that each annotator has a distinct distribution of proportions over the categories (that is, that some annotators may systematically favour one classification over another). Table 6.2 gives examples of KAPPA variants from each of these groups.3

Table 6.2: Types of KAPPA; Methods for Calculating Expected Probability.

Method 1: SCOTT (1955), FLEISS (1971), KRIPPENDORFF (1980)
Method 2: COHEN (1960)

DI EUGENIO and GLASS (2004) suggest that when reporting agreement in computational linguistics research, two KAPPA statistics should be reported, one each from Methods 1 and 2, as well as a percentage measure, as this will allow a more balanced view of the data. On the other hand, CRAGGS and MCGEE WOOD (2005), in response to DI EUGENIO and GLASS (2004), suggest that only one KAPPA statistic is worth reporting, a Method 1 KAPPA.
CRAGGS and MCGEE WOOD (2005) suggest that the use of multiple agreement measures shows a lack of confidence in the statistics, and go on to explicitly reject Method 2 KAPPA techniques, since the "purpose of assessing the reliability of coding schemes is not to judge the performance of the small number of individuals participating in the trial, but rather to predict the performance of the scheme in general" (CRAGGS and MCGEE WOOD, 2005). They go on to suggest that any "bias" exhibited by individual annotators (that is, marked differences in the proportion of classification decisions between annotators) is best minimised by increasing the number of annotators, a hypothesis confirmed by ARTSTEIN and POESIO (2005), who compared agreement scores from Method 1 and Method 2 KAPPA types (SCOTT (1955) and COHEN (1960), respectively) and showed that as the number of annotators grows, bias decreases.

3 Note that some of these agreement statistics are variously referred to as PI and ALPHA rather than KAPPA.

KAPPA for more than one annotator

FLEISS (1971) describes a frequently used agreement statistic for more than two annotators that satisfies the requirements set out by CRAGGS and MCGEE WOOD (2005) (that is, a Method 1 agreement statistic). The statistic also allows for multiple sets of annotators; that is, the annotators classifying one sentence may be different to the annotators classifying another sentence. Fleiss's KAPPA statistic is also often implemented in standard statistical software.4 In order to demonstrate how the statistic is calculated, a worked example is included,5 using the idealised data presented in Table 6.3, which shows four sentences, two categories (biographical and non-biographical) and ten annotators.

4 Fleiss's KAPPA for two or more annotators is implemented in the IRR package of the open source R statistical programming language.
5 A fuller example based on psychiatric diagnoses is presented in FLEISS (1971) (the original paper).

Table 6.3: Idealised Data for KAPPA Example.

                            Sentence 1  Sentence 2  Sentence 3  Sentence 4  Total  Proportion
Biographical Category            0           5           4           6       15      0.375
Non-Biographical Category       10           5           6           4       25      0.625

Before describing the method for calculating Fleiss's KAPPA, it is necessary to introduce some notation: $N$ is the total number of sentences, $n$ is the number of annotators per sentence, the subscript $i$ is the sentence number, the subscript $j$ is the category number, $k$ is the total number of categories, and $n_{ij}$ is the number of annotators who assign the $i$th sentence to the $j$th category.

The first step is calculating the total level of agreement (uncorrected for expected agreement). The proportion of agreeing pairs for each sentence is calculated using Equation 6.3 (Equation 6.4 shows the calculation for Sentence 1 of Table 6.3). Note that $P_i$ is the proportion of agreeing pairs for sentence $i$.

(6.3)   $P_i = \frac{1}{n(n-1)} \left( \sum_{j=1}^{k} n_{ij}^{2} - n \right)$

(6.4)   $P_1 = \frac{1}{10 \times 9} \left( 0^{2} + 10^{2} - 10 \right) = 1.0$

The total agreement for all the data ($\bar{P}$) is the mean of the proportions $P_i$ over all $N$ sentences (approximately 0.59 for the data in Table 6.3); see Equation 6.5 for the formula.

(6.5)   $\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i$

We have now calculated agreement for the data presented in Table 6.3, but in order to calculate KAPPA we need to identify and discount expected agreement. Equation 6.6 shows the equation used to calculate expected agreement, and Equation 6.7 shows the calculation used to identify expected agreement for the data shown in Table 6.3. Note that $p_j$ (lowercase $p$) is the proportion of all assignments belonging to category $j$.
(6.6)   $\bar{P}_e = \sum_{j=1}^{k} p_j^{2}$

(6.7)   $\bar{P}_e = 0.375^{2} + 0.625^{2} \approx 0.53$

We now have all the data required to perform the final KAPPA calculation; see Equation 6.8 and Equation 6.9.

(6.8)   $\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$

(6.9)   $\kappa = \frac{0.594 - 0.531}{1 - 0.531} \approx 0.13$

Using the accepted scale, a KAPPA of approximately 0.13 is a very poor agreement score, indicating that there is very little agreement above chance in the idealised data presented in Table 6.3. This section has explored some difficulties in quantifying agreement and provided a description of, and justification for, the use of Fleiss's KAPPA. A worked example has also been included. The next two sections describe how Fleiss's KAPPA has been applied to assess inter-annotator agreement in two related studies.
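As a cross-check on the arithmetic above, the short Python sketch below re-implements Equations 6.3 to 6.9 (following FLEISS (1971)) for the data in Table 6.3. It is illustrative only and is not part of the tooling used in this research; a standard implementation is available in the IRR package for R, mentioned in footnote 4.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of annotators assigning sentence i to category j."""
    N = len(counts)                 # number of sentences
    n = sum(counts[0])              # annotators per sentence (assumed constant)
    total = N * n                   # total number of assignments
    # Per-sentence agreement P_i (Equation 6.3) and its mean (Equation 6.5).
    p_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    p_bar = sum(p_i) / N
    # Expected agreement from the pooled category proportions (Equation 6.6).
    k = len(counts[0])
    p_j = [sum(row[j] for row in counts) / total for j in range(k)]
    p_e = sum(p * p for p in p_j)
    # Kappa (Equation 6.8).
    return (p_bar - p_e) / (1 - p_e)

# Table 6.3: four sentences, ten annotators, two categories (biographical, non-biographical).
table_6_3 = [
    [0, 10],
    [5, 5],
    [4, 6],
    [6, 4],
]
print(round(fleiss_kappa(table_6_3), 2))  # ~0.13: very little agreement above chance
```

Running the sketch reproduces the figures used in the worked example ($\bar{P} \approx 0.59$, $\bar{P}_e \approx 0.53$, $\kappa \approx 0.13$).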
6.3 Pilot Study

This section describes a pilot study designed to explore the ability of people to distinguish between biographical and non-biographical sentences using a provisional biographical classification scheme. Results of the study indicated shortcomings in the biographical scheme used, and led to the development of a new scheme (see Chapter 5 on page 101) and a more extensive study based on that scheme (see Section 6.4 on page 132 in this chapter).

The pilot study was conducted using a web form questionnaire format.6 Seventeen potential participants were contacted, of whom fifteen completed the web questionnaire.7 All participants were between the ages of twenty and seventy, and all were university educated native English speakers (British English, Hiberno-English and New Zealand English) who were personally known to the researcher as family, friends or colleagues. Each participant was asked to categorise one hundred sentences of varying lengths.8 There were three possible categories:

1. Core Biographical — Relevant information here could include details about birth and death dates, education history, nationality, employment, achievements, marital status, number of children, and so on. If the central purpose of a sentence is to convey information about an individual, then that sentence can be classified as core biographical. Note that the sentence may contain anaphors rather than person names. For example, the sentence "He was jailed for a year in 1959 but, given an unconditional pardon, became Minister of National Resources (1961), then Prime Minister (1963), President of the Malawi (formerly Nyasaland) Republic (1966), and Life President (1971)" about President Banda of Malawi is, according to this scheme, core biographical.

2. Extended Biographical — Contains information about a person, but ancillary to the main thrust of the sentence. The distinction between core and extended biographical sentences is that extended biographical sentences, while they may contain information about an individual, are not directly about that individual. For example, in the sentence ""This new consumer is a pretty empowered person," said Wendy Everett, director of a study commissioned by the Robert Wood Johnson Foundation", Wendy Everett is not who the sentence is about — although we do learn that she is the director of a study — the real focus of the sentence is the "new consumer".

3. Non Biographical — Contains no person names or titles or pronouns. For example, in the sentence "Of the 6 million notebooks Taiwan turned out last year, Quanta produced 1.3 million sets, accounting for about 8 percent of the world output", there is clearly a reference to a company (Quanta) but no reference to a person, either directly or anaphorically.

6 See http://www.dcs.shef.ac.uk/ mac/bioexp.html
7 An email containing the URL of the study was sent to potential participants.
8 The guidelines for completing the test and a list of the one hundred sentences used are reproduced in Appendix A on page 191.

The test sentences were selected from five sources, twenty sentences from each. Table 5.3 on page 118 provides details (and the shorthand code) for each of the five sources used; further details of these corpora can be found in Chapter 5. On the basis of an analysis of the Dictionary of National Biography, it was observed that the first five sentences of each entry consistently contain facts characteristic of biographies (birth dates, career, marital status, and so on). Intuitively, it seemed that this subsection of the Dictionary of National Biography would provide the best available source of core biographical sentences. Sentences from the Chambers Dictionary of Biography were chosen for the same reason. These intuitions are supported by the results reported on page 115, where 89.9% of DNB-5 sentences and 93.5% of CHA-A sentences were classified as biographical in a sample of 1000 sentences taken from each group, using the classification scheme described in Chapter 5, rather than the three-way classification scheme described in this section.

Sentences from the Dictionary of National Biography that are not one of the first five sentences of an entry were selected as a first attempt at approximating an extended biographical category. Entries in the Dictionary of National Biography typically become more discursive after the initial few sentences, sometimes dwelling on historical background to an extent that is not obviously biographical in the limited sense of the word, although information about the individual is often given. Sentences from the subset of the TREC corpus containing person names and personal pronouns (that is, TREC-U) were used for similar reasons. This intuition is supported by the results reported on page 115, where 29.1% of the DNB-R sample and 11.1% of the TREC-U sample were found to be biographical.

A subset of the TREC corpus that contained no person names or personal pronouns (that is, TREC-F) was used on the intuition that sentences containing no references to persons, either directly or indirectly, could not be classified as biographical. It was expected that sentences from this source would reliably be classed as non biographical. Note that only 0.6% of the TREC-F sample discussed on page 115 was classified as biographical using the binary annotation scheme described in Chapter 5.

While three categories were used in the experiment (core biographical, extended biographical and non biographical), it was decided to subsume the two biographical categories in the light of feedback from participants; that is, sentences classified as core biographical and extended biographical were regarded as belonging to a single biographical category. Note that no data was discarded in this transition from a ternary to a binary classification. Participants had little difficulty distinguishing a biographical sentence from a non biographical sentence, yet they reported confusion over the distinction between the core and extended categories, suggesting that the distinction between the two biographical categories was under specified.
This reported difficulty is empirically validated in the results (see Table 6.4), where the inter-classifier KAPPA using three categories was poor, and the score using two categories (biographical and non-biographical, with the biographical category consisting of those sentences classified as core and extended biographical) was very good at around 0.75.

Table 6.4: Inter-classifier Agreement Results.

Categories    KAPPA
2 Categories  0.752
3 Categories  0.431

The relative success of the binary classification suggested that a new study be conducted, employing a more understandable biographical categorisation scheme. This clearer, binary annotation scheme is described in Chapter 5 on page 101 and empirically assessed in the next section.

6.4 Main Study

This section describes a study that involved twenty five participants classifying sentences as biographical or non-biographical, using the binary annotation scheme developed in Section 5.2 on page 109. It is important to emphasise that the participants were confronted with a binary classification task. In this main study, the seven way biographical scheme (key, fame, character, relationships, education, work and unclassified) developed in Section 5.2 is reduced to two classes: the biographical class subsumes the first six biographical classes of the synthesis scheme, and the non-biographical class corresponds to "unclassified" in the synthesis scheme.

6.4.1 Motivation

The three way classification scheme (extended, core and non-biographical) used in the pilot study caused confusion among participants. The distinction between the extended and core biographical categories, where participants were asked to decide whether the sentence was about a person or merely contained biographical information, was particularly problematic. This distinction is dropped in the binary classification scheme developed in Chapter 5 on page 101. Instead of asking participants what the sentence is about — whether a sentence is about a person (core) or incidentally contains information about a person (extended) — the new scheme focuses entirely on the information content of the sentence: that is, whether the sentence contains biographically relevant facts according to the six biographical categories identified in Chapter 5 on page 112 (key, fame, character, relationships, education and work). This confusion is reflected in the KAPPA scores for the pilot study, which are poor for three categories (extended biographical, core biographical and non-biographical), but good (0.75) if we subsume the core and extended categories so that the categorisation task becomes binary.

In order to assess the claim that a binary annotation scheme with well developed informational criteria for classifying sentences would achieve high agreement scores, a new study was designed, using the annotation scheme and corpus developed as part of this research. Although the agreement results achieved by subsuming the two biographical categories suggested that a single binary classification scheme was most appropriate for describing the distinction between biographical and non-biographical sentences, it was felt that a new study, which presented participants with a new binary classification scheme based on previous biographical annotation schemes (for example, the Text Encoding Initiative biographical scheme, described on page 102), was a more robust method of testing the hypothesis.
A second aim of this study is to provide a corpus of "gold standard" attested sentences for the automatic sentence classification experiments described in Chapters 7, 9 and 10.

6.4.2 Study Description

Twenty five participants used a web interface to classify five hundred sentences, guided by the biographical annotation scheme described in Chapter 5 on page 101. The participants were also provided with annotation instructions in the form of a PDF file (reproduced in Appendix B on page 202) which they were advised to print and consult while answering questions.9 Each participant did not classify all five hundred sentences. Instead, sentences were divided into stratified sets of one hundred sentences, and five annotators classified each set of sentences. All sentences were derived from the biographical corpus described in Section 5.3 on page 118, and were therefore representative of a number of genres (for example, newspaper text, web news reports, short published obituaries, published fiction). For each set of sentences, forty-eight biographical sentences were selected (eight from each of the six biographical sub categories listed on page 112) and fifty-two sentences were randomly selected from the untagged10 sentences in the biographical corpus.11 In other words, each stratified set of sentences consisted of approximately 50% biographical sentences and 50% non-biographical sentences.

Of the twenty-five participants, thirteen were anonymous and twelve provided personal information.12 Of the twelve who provided personal information, eleven were native English speakers (eight British English, two Hiberno-English, and one American English); one was a native speaker of Finnish with a near native standard of English. All those who provided information were aged between twenty and sixty and were university educated. An email containing the URL of the study was sent to possible participants. All participants approached to take part in the study were personally known to the researcher as family, friends or colleagues.

9 As a minimum, participants were asked to leave the PDF file containing the instructions open in a different window in order to consult it while classifying sentences.
10 Note that untagged sentences are those sentences that are not assigned a biographical tag and hence are considered non-biographical.
11 As some biographical subcategories were less well represented in the biographical corpus (for example, education), two of the sentence sets use disproportionately more sentences from the key subcategory.
12 Four fields were provided for participant information: forename, family name, email address and age.

6.4.3 Results

The agreement scores for each set of sentences, calculated using Fleiss's KAPPA statistic, are shown in Table 6.5. Note that the KAPPA scores for each sentence set are at or above the 0.67 level regarded as "good". The mean KAPPA score for all five sentence sets is 0.72, well above the 0.67 threshold.

Table 6.5: KAPPA Scores for Each Sentence Set.

Sentence Set  KAPPA Score
Set 1         0.75
Set 2         0.80
Set 3         0.71
Set 4         0.68
Set 5         0.67

6.4.4 Discussion

These results show that good agreement can be obtained between multiple classifiers over a range of sentences using the binary classification scheme developed in Chapter 5 on page 101.
The overall agreement score of 0.72 is slightly lower than that obtained in the "binary" version of the pilot study (that is, with the extended and core biographical categories subsumed). This can perhaps be explained by the nature of the data used in the pilot study, particularly the use of sentences that had been filtered of pronouns and personal names. The difficult classification decision for participants in the pilot study was deciding between the extended and core biographical categories, rather than between the biographical categories and the non-biographical category. In contrast to the pilot study, the main study uses randomly selected non-biographical sentences from the biographical corpus described in Chapter 5, which can be more challenging for participants than those used in the pilot study. For example, "Mr Blair was also snubbed by radical politicians linked to Hizbullah", although it references Tony Blair, is not biographical according to the annotation scheme used in the main study.

The mean KAPPA score (0.72) exceeds the minimum conventional threshold of 0.67, but does not reach the 0.8 required for excellent agreement. Consider the data presented in Table 6.5. It can be seen that KAPPA varies considerably across the five sentence sets, from the lower threshold level of 0.67 to the "excellent" level of 0.8. No score falls below CARLETTA (1996)'s 0.67 threshold, however, suggesting that even with challenging sentences the biographical annotation scheme developed in Section 5.2 on page 109 yields good agreement. It is important to qualify this judgment with the observation that commonly used agreement thresholds — unlike the significance levels of inferential statistics — are essentially arbitrary (CRAGGS and MCGEE WOOD, 2005). Therefore, even a KAPPA score of 0.9 would remain indicative (albeit very strongly indicative).

6.5 Conclusion

This chapter has shown (in the main study) that an information orientated binary annotation scheme consistently yields high agreement over a wide range of sentences. This result supports the central hypothesis of this thesis, that people are able to reliably identify biographical sentences (where "reliable" means with a good standard of agreement on challenging data). Note that the classes of sentences in the gold standard data are determined by the annotated corpus, rather than by the judgements of the experimental participants. That is, decisions about the biographical status of individual sentences were made by the researcher, using the biographical annotation scheme. Agreement between the researcher and the participants was, however, very high (93%).13

The remainder of this thesis uses the data gathered in the main study (that is, five hundred sentences with high agreement14) in a series of machine learning experiments in order to assess the accuracy of automatic sentence classification using a variety of sentence representations. It is important to stress that the gold standard data used in the machine learning experiments described in later chapters is derived from the researcher's annotation efforts rather than those of the twenty-five participants involved in the main study.
However, agreement between the researcher's annotation and the participants' annotation is very high (94%).15

13 Note that, as there were five participants judging each sentence, it was straightforward to accept the participants' majority decision as the sentence class, and compare this with the researcher's decision.
14 Note that these five hundred sentences are reproduced in Appendix B.
15 As there were five participants for each set of 100 sentences in the main study, the majority decision for each sentence was recorded and compared to the annotation decision made by the researcher.

CHAPTER 7: Learning Algorithms for Biographical Classification

This chapter compares six different learning algorithms using the "gold standard" data described in Chapter 5 and utilising a feature set consisting of the 500 most frequent unigrams in the Dictionary of National Biography. This feature set was used as it contains a wide range of function words, as well as words that could intuitively be regarded as being especially characteristic of the biographical genre ("born", "married" and so on). Deriving features from the "gold standard" data set was avoided, as it was suspected that using the gold standard data to derive features, to train a classifier, and to test that classifier would artificially inflate classification accuracy. The chapter serves as a "first pass" of the data, allowing indicative results to be drawn about the usefulness of different machine learning algorithms for the biographical sentence classification task. Later chapters concentrate on varying the feature sets used. The chapter is divided into five sections: Motivation, Procedure, Presentation of Results, Discussion and Conclusion.

7.1 Motivation

In recent years there has been a steady trickle of published work comparing feature sets for genre classification (see Section 3.1.3 on page 59). However, there has been little work directed at the comparison of learning algorithms: previously published research has focused on the comparison of feature sets using one or two algorithms. For example, FINN and KUSHMERICK (2003) (see page 63) use the C4.5 decision tree algorithm (described on page 39) in conjunction with various feature sets to assess whether news articles were subjective or objective, and whether reviews were positive or negative. FINN and KUSHMERICK (2003) varied the feature set but not the learning algorithm. ZHOU ET AL. (2004) use three learning algorithms — SVM, C4.5, and Naive Bayes (see Section 2.5.1 on page 38 for descriptions of these algorithms) — for biographical sentence classification. The main focus of this work was the identification of optimal features for biographical classification, rather than learning algorithms. ZHOU ET AL. (2004) compared the performance of the three algorithms on a "biographical corpus" annotated using the scheme described on page 103,1 and a feature set consisting of all the unigrams present in their biographical corpus. ZHOU ET AL. (2004) identify Naive Bayes as the best performing algorithm (82.42%), followed by the C4.5 algorithm (75.75%) and finally the SVM algorithm (74.47%). The current work builds on that presented in ZHOU ET AL. (2004), but differs in that it explores the performance of six algorithms on a gold standard data set (described in Chapter 5) using a feature set composed of the five hundred most frequent unigrams in the Dictionary of National Biography.
Most importantly, in addition to raw accuracy scores, a statistical test — the corrected re-sampled t-test (see page 49) — is used to compare classifiers.

7.2 Experimental Procedure

The bio features Perl script was used to create a WEKA ARFF file (see Section 4.2 on page 86) from the 500 gold standard sentences described in Section 5.2.1 and validated in Chapter 6. The feature representation chosen (and used in all experiments in this chapter) was a binary representation based on the most frequent five hundred unigrams in the Dictionary of National Biography.2 Further feature selection was not used, as the purpose of this experiment is to compare the success of different learning algorithms with a constant, basic feature set. Six learning algorithms were used:

ZeroR — A baseline classifier that simply assumes all test instances belong to the most common class (see page 38).
OneR — The attribute with most predictive power is used to classify all instances (see page 38).
C4.5 — A decision tree algorithm (see page 39).
Ripper — A rule based algorithm (see page 43).
Naive Bayes — The commonly used variant of Bayes that assumes all attributes are independent (see page 43).
SVM — A Support Vector Machine based algorithm (see page 46).

1 While the annotation scheme describes ten biographical categories, the comparison of algorithms was based on a binary classification scheme (biographical and non-biographical) where the original biographical categories (work, fame, etc.) are subsumed in one single biographical category.
2 The most frequent one hundred words in the Dictionary of National Biography are reproduced on page 145. A complete list of all five hundred words is available at http://www.dcs.shef.ac.uk/ mac/frequency lists/dnb 500frequent.txt.gz.

All experiments were run in the WEKA machine learning environment (see Section 4.2 on page 86), using the WEKA Experimenter interface.3 Each algorithm was assessed on the biographical data using 10 x 10 fold cross validation (see Section 2.5.2 on page 47) in order to allow for reliable comparisons between algorithms. A comparison between the results produced by 100 x 10 fold cross validation and 10 x 10 fold cross validation is also reported, in order to informally test the adequacy of the 10 x 10 fold cross validation methodology.

7.3 Results

Results of the 10 x 10 fold cross validation run for each algorithm are presented in Table 7.1 on the next page. Note that the Naive Bayes algorithm scores the highest accuracy at 80.66%, followed by the SVM algorithm at 77.47% and the C4.5 decision tree algorithm at 75.87%. All algorithms performed better than the baseline algorithm, ZeroR, at a highly significant level when subjected to the corrected re-sampled t-test.4 The difference between the Naive Bayes algorithm — the most accurate algorithm — and the SVM algorithm — the second most accurate — fails to reach the required significance threshold, although the p-value is low at 0.1. The Naive Bayes algorithm does perform better than all the other algorithms studied (apart from the SVM algorithm) at a highly significant level. In other words, the Naive Bayes algorithm performs better than the ZeroR, OneR, C4.5 and Ripper algorithms at a highly significant level, and it also consistently performs better than the SVM algorithm, although this difference fails to meet the significance threshold.
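The corrected re-sampled t-test itself is provided by the WEKA Experimenter used in these experiments; the sketch below is only an illustrative Python re-implementation of the underlying statistic (the variance-corrected paired t statistic commonly attributed to Nadeau and Bengio, and discussed by BOUCKAERT and FRANK (2004)). The per-fold accuracy lists in the usage example are invented purely for illustration and are not the results reported in Table 7.1.

```python
import math
from statistics import mean, variance

def corrected_resampled_t(scores_a, scores_b, test_train_ratio=1.0 / 9.0):
    """Corrected t statistic for k paired accuracy differences from repeated CV.

    test_train_ratio is n_test / n_train, i.e. 1/9 for 10-fold cross validation.
    The statistic is compared against Student's t with k-1 degrees of freedom.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    d_bar = mean(diffs)
    var_d = variance(diffs)  # sample variance of the per-fold differences
    if var_d == 0:
        return float("inf") if d_bar != 0 else 0.0
    return d_bar / math.sqrt((1.0 / k + test_train_ratio) * var_d)

# Hypothetical per-fold accuracies for two classifiers over 10 x 10 fold CV (k = 100);
# these numbers are made up for illustration only.
naive_bayes = [0.80 + 0.001 * (i % 7) for i in range(100)]
svm = [0.77 + 0.001 * (i % 5) for i in range(100)]
print(round(corrected_resampled_t(naive_bayes, svm), 2))
```

The correction term (1/k + n_test/n_train) inflates the variance estimate to account for the overlap between training sets in repeated cross validation, which is why it gives more conservative significance judgements than a naive paired t-test.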
Accuracy for the 10 x 10 fold cross validation is also depicted in Figure 7.1 on page 141. In order to test whether 10 x 10 fold cross validation was sufficient to gain a reliable result, the experiment was repeated using 100 x 10 fold cross validation. These results are presented in Table 7.2 on the next page, where it can be seen that the difference between the mean score for each algorithm is less than 0.5% for 10 x 10 fold cross validation and 100 x 10 fold cross validation. For example, the score for the Naive Bayes algorithm under 10 x 10 fold cross validation is 80.66%, and for 100 x 10 fold cross validation is 80.70%; a difference of 0.04%. This finding is in line with B OUCKAERT and F RANK (2004)’s suggestion that 10 x 10 fold cross validation, in conjunction with the corrected re-sampled t-test, is sufficient for making reliable inferences concerning classifier performance (see Section 2.5.2 on page 47 for more on evaluating classification algorithms). 3 The W EKA E XPERIMENTER is a component of the W EKA machine learning toolkit, which facilitates the comparison of the performance of algorithms. 4 The term “significant” is used when p < 0.05 and the term “highly significant” when p < 0.01.
Table 7.1: Six Learning Algorithms Compared using “Gold Standard” Data and a Feature Representation Based on the 500 Most Frequent Unigrams in the DNB: 10 x 10 Fold Cross Validation
Algorithm      Mean (%)   Standard Deviation
ZeroR          53.09      0.95
OneR           59.94      4.85
C4.5           75.87      5.85
Ripper         70.18      6.33
Naive Bayes    80.66      5.14
SVM            77.47      5.62
Table 7.2: Six Learning Algorithms Compared using “Gold Standard” Data and a Feature Representation Based on the 500 Most Frequent Unigrams in the DNB: 100 x 10 Fold Cross Validation
Algorithm      Mean (%)   Standard Deviation
ZeroR          53.09      0.95
OneR           59.88      5.51
C4.5           76.02      5.45
Ripper         69.82      6.30
Naive Bayes    80.70      5.01
SVM            77.51      5.25
7.4 Discussion The results gained confirm Z HOU ET AL . (2004)’s finding that, compared to the SVM and C4.5 algorithms, the Naive Bayes algorithm performs better on the biographical sentence classification task when using unigrams as features.5 Note however that Z HOU ET AL . (2004) used different implementations of these algorithms (that is, not the WEKA implementations used in this work). Z HOU ET AL . (2004) also used a different feature set (based on all unigrams from their biographical corpus) and different training data. The results gained in this research and in Z HOU ET AL . (2004) are remarkably similar, with Naive Bayes providing the most accurate results (82.46% and 80.66% for Z HOU ET AL . (2004) and the current research, respectively). For Z HOU ET AL . (2004), however, the C4.5 algorithm performed better than the SVM, whereas, in the current work, the reverse was true (that is, the C4.5 algorithm scored 75.87% and the SVM 77.47%; a difference of 1.6%). It can also be noted that the performance of the C4.5 algorithm in the current research is within 0.12% of that reported by Z HOU ET AL . (2004). 5 Z HOU ET AL . (2004) used a greater number of unigrams than the 500 used here. Figure 7.1: Mean Performance of Learning Algorithms with 10 x 10 Cross-Validation on “Gold Standard” Data using a Unigram Based Feature Representation.
The success of Naive Bayes in this task — with its assumption that all features are independent and equally important — compared to more sophisticated algorithms (like C4.5) would be surprising if Naive Bayes had not been shown to be successful in other text classification domains (L EWIS, 1992b; M ANNING and S CH ÜTZE, 1999). The assumption of the Naive Bayes classifier that all features are independent of one another allows the algorithm to “ignore” irrelevant features (that is, features that occur randomly with respect to the category of interest). The independence assumption of Naive Bayes contrasts with, for example, the C4.5 algorithm, where irrelevant features damage classification accuracy, as the decision tree is likely to “split” on an irrelevant feature, leading to the provision of sub-optimal data for subsequent decisions (W ITTEN and F RANK, 2005). The OneR algorithm was used to identify the single rule that provided maximum accuracy. On examination, this single rule was found to be the presence or absence of the “in” feature. Of the 501 “gold standard” sentences, 219 contain “in” and 282 do not contain “in”. In the case of the 219 sentences that do contain “in”, 140 are biographical, and 79 are non-biographical. Example 7.1 shows a sentence from the training corpus that was correctly classified by the OneR algorithm, and Example 7.2 shows a sentence that was incorrectly classified. (7.1) And in 2000, aged 80 Doohan boldly went into fatherhood for the seventh time when his then 43-year-old wife gave birth to a daughter, Sarah (7.2) Linex liked the way I was thinking but he said that you’d never get the punters in and out quickly enough The success of the “in” rule can be attributed to the common use of “in” as a temporal and geographical locator for biographically salient life events and states in biographical texts, for example “He was born in 1780” or “He lived in London for most of his adult life”. The almost half-and-half split between biographical and non-biographical texts in the training data may well have contributed to the selection of the “in” rule. If the data had been constituted from 10% biographical sentences and 90% non-biographical sentences, the biographically salient instances of “in” may well have been “drowned out” by the non-biographical instances of “in”. This question cannot be resolved using the current data, however. The feature set used in this experiment was derived from the Dictionary of National Biography, a (predominantly) nineteenth-century British cultural product (see Section 4.3.1 on page 87). The five hundred most frequent unigrams in the DNB were used. These included function words (“in”, “the” and so on), as well as words we would intuitively consider to indicate biographical content (“born”, “died”, “married” and so on).
Although the difference between Naive Bayes and the next best algorithm, a Support Vector Machine, failed to reach a satisfactory significance level (using the two-tailed corrected re-sampled t-test), the difference was close to the significance threshold, which provides some tentative support for the claim that Naive Bayes is the best performing classification algorithm for the biographical sentence classification task. It is also notable that on some feature sets the SVM algorithm outperforms Naive Bayes (see Table 9.1 on page 163 for an example). Figure 7.2: Root Section of a C4.5 Decision Tree Derived From the Gold Standard Training Data. [Figure: the root of the tree splits successively on the presence of the unigrams “school”, “wife”, “university”, “won”, “married” and “york”.] C HAPTER 8 Feature Sets This chapter describes the different feature sets used in the empirical work in Chapter 9. The feature sets are divided into four groups: standard features, biographical features, empirically derived syntactic features and keyword-based features. Feature sets are collections of similar features. For example, the most frequent five hundred unigrams in the Dictionary of National Biography constitute a feature set. 8.1 Standard Features The standard features used are listed below:
2000 most frequent unigrams derived from the Dictionary of National Biography.
2000 most frequent unigrams (with function words removed) derived from the Dictionary of National Biography.
2000 most frequent unigrams derived from the Dictionary of National Biography stemmed using the Porter Stemmer.
2000 most frequent bigrams derived from the Dictionary of National Biography.
2000 most frequent trigrams derived from the Dictionary of National Biography.
319 function words.1
1 The list of English function words is available from the University of Glasgow, Department of Computer Science: http://www.dcs.gla.ac.uk/idom/ir resources/linguistic utils/stop word Accessed on 02-01-07. Table 8.1: 100 Most Frequent Unigrams in the Dictionary of National Biography. Unigrams not Present in the 100 Most Frequent Unigrams in the British National Corpus are Italicised. the of in and to he a was his s on at by for with as that which from had an him but it is were this london who first sir be not her i one p been after john when or have years died ii made two son life time king became lord some also she college year st all c william daughter their published there under work may where england no great other house royal into himself new are appointed church they death born its d second english more married general henry before society 2 many thomas took These features are referred to as standard as they are based on feature identification methodologies commonly used in the text classification literature (S EBASTIANI, 2002). All frequencies were derived from the Dictionary of National Biography (DNB) based on the intuition that these words were likely to be especially characteristic of the biography genre. It can be seen from Table 8.1 that there are marked differences between frequencies derived from the DNB biographical text and those from more general English text (in this case, frequency lists from the British National Corpus2 ). The frequencies obtained from biographical text include person names (“william”, “john”), place nouns (“london”, “england”), titles (“lord”, “king”) and life events (“born”, “died”, “appointed”, “married”).
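The frequency lists described in this section were produced by straightforward corpus counting. A minimal Perl sketch of this kind of extraction is given below; it is illustrative only (the file name dnb.txt, the crude lower-case tokenisation and the cut-off of 500 are assumptions made for the example, not a reproduction of the scripts actually used for this research).

#!/usr/bin/perl
# Count unigram, bigram and trigram frequencies in a corpus file and
# print the most frequent of each. Illustrative sketch only: the input
# file name and the simple tokenisation are assumptions.
use strict;
use warnings;

my $corpus = shift(@ARGV) || 'dnb.txt';   # illustrative file name
my $top_n  = 500;

my ( %uni, %bi, %tri );
open my $fh, '<', $corpus or die "Cannot open $corpus: $!";
while ( my $line = <$fh> ) {
    my @tokens = ( lc($line) =~ /[a-z]+/g );   # crude tokenisation
    for my $i ( 0 .. $#tokens ) {
        $uni{ $tokens[$i] }++;
        $bi{ "$tokens[$i] $tokens[$i+1]" }++                if $i + 1 <= $#tokens;
        $tri{ "$tokens[$i] $tokens[$i+1] $tokens[$i+2]" }++ if $i + 2 <= $#tokens;
    }
}
close $fh;

for my $pair ( [ 'unigrams', \%uni ], [ 'bigrams', \%bi ], [ 'trigrams', \%tri ] ) {
    my ( $label, $counts ) = @$pair;
    my @ranked = sort { $counts->{$b} <=> $counts->{$a} } keys %$counts;
    my $limit  = $#ranked < $top_n - 1 ? $#ranked : $top_n - 1;
    print "== $top_n most frequent $label ==\n";
    print "$_\t$counts->{$_}\n" for @ranked[ 0 .. $limit ];
}

Run over the Dictionary of National Biography text, a script of this kind yields ranked lists of the sort summarised in Tables 8.1 to 8.3.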
Stemming — the reduction of inflected word forms to a single canonical stem (for example, “marry” and “married” may become “marri”) — was achieved using a standard implementation of the Porter algorithm (P ORTER, 1980). Stemming is usually performed in order that the various inflections of a base word are not regarded as separate words. 2 Frequency lists are available from Adam Kilgarriff’s web site: http://www.itri.brighton.ac.uk/ Adam.Kilgarriff/bnc-readme.html Accessed on 02-01-07. The most frequent bigrams and trigrams from the Dictionary of National Biography were used as features, based on two intuitions. The first is that frequent n-grams in biographical text are likely to have special discriminatory power compared to n-grams frequent in standard English text collections. The second is that n-grams provide a computationally inexpensive method of capturing syntactic information (S ANTINI, 2004a). Examples of the most common bigrams and trigrams from the Dictionary of National Biography are shown in Tables 8.2 on the following page and 8.3 on the next page respectively. It can be seen that while many of these bigrams seem to be specific to the particular domain and subject matter of the Dictionary of National Biography (for example, “duke of”, “the english”, “the british”), many also seem to be plausibly characteristic of biographical sentences more generally (“his death”, “daughter of”, “died in”, “he married”, and so on). There are also a number of bigrams composed entirely of function words that could be expected to occur in a general corpus of English (for example, “but the”, “to a”, “of the”, and so on). The most frequent trigrams (Table 8.3 on the following page) are rarely constituted entirely of function words, and neither are they made up of general biographical phrases. Instead, many frequent trigrams seem highly specific to the Dictionary of National Biography, with its emphasis on British history and culture. Examples include “the royal society”, “the royal academy”, “house of commons”, “the earl of”, and so on. Table 8.3 on the next page also highlights the gender bias in the Dictionary of National Biography. Of the 100 trigrams presented, forty have exclusively male referents. Only two female referents occur in the most frequent one hundred trigrams, and both of these refer to “his wife” (“by his wife”, “and his wife”). 8.2 Biographical Features Four specifically biographical feature groups were used as part of the work. These features are not empirically derived, but rather based on intuitions regarding the likely characteristics of biographical sentences: 1. Pronoun: This is a boolean feature. If the sentence contains a pronoun (he, she, him, her, his, hers) then the feature is positive. 2. Name: Six boolean features are identified here, using a combination of gazetteers and FSAs: Title (for example, Mr, Ms, Captain). Company (for example, IBM, International Business Machines). Non Commercial Organisation (for example, Army, Parliament, Senate). Table 8.2: 100 Most Frequent Bigrams from Dictionary of National Biography.
of the on the with the as a to be to his the first the royal with a his father to a him to which was for his the house his death house of after the duke of the church in the at the he had in his one of that he to have was appointed where he his wife of which of st in which in london king s was not was buried and of but the who was he was and the and in which he and his daughter of and a son of that the earl of his own as the it is under the history of life of and to the english died at but he to the for the and was from the it was in a was the had been he died have been for a of sir with his the british when he died in the university was in returned to he married of his by the of a was a the king by his the same and he was born and on he became on his during the by a member of was elected the following to him published in the most Table 8.3: 100 Most Frequent Trigrams from Dictionary of National Biography. Trigrams with Male Referents are Italicised. one of the to have been a member of the same year the end of was educated at in the following in which he history of the the church of part in the of which he the age of president of the he had been he was made in the house at the age s life of of the first he was one at the same of the house at the end of his own he was appointed the house of and in the and was buried the duke of was born in was buried in which he was of his life seems to have the royal society s hist of which he had account of the the british museum of the church was appointed to where he was and on the he died in of the english whom he had the history of the royal academy is in the he was a member of the in the same the death of was born at the earl of said to have the university of as well as house of commons is said to he was educated he was also of the british of the king he became a he had a the son of the battle of he died on at the time one of his on the death a man of he was in 147 of the royal the king s was one of he was elected that he was he died at by his wife and he was the following year his father s and his wife that he had he was the when he was he returned to the same time part of the to the king he went to a fellow of and of the buried in the cal state papers the author of eldest son of C HAPTER 8: F EATURE S ETS Forename (for example, David, Dave). Surname (for example, Smith, Jones, Brown). Family relationship (for example, father, son, daughter). 3. Year: This boolean feature is triggered if the sentence contains a year (for example, 2005, 2005-06, or 2005-2010). 4. Date: This boolean feature is triggered if there is a date in the sentence. Dates for these purposes include any month name (for example, January, Jan) and also numerical dates of various kinds (for example, 09/09/2005, 9.9.2005, and so on3 ). 8.3 Syntactic Features Ten syntactic features, were identified as particularly appropriate for representing biographical texts, based on data published as part of the research project described by B IBER (1988) (see Section 2.1.2 on page 11), who made available comprehensive frequency counts of syntactic features by genre in a corpus constructed largely of the Lancaster-Oslo-Bergen (LOB) corpus (J OHANSSON ET AL ., 1978) (see page 16 for a list of the genre used). From this data, it is straightforward to calculate those features most and least prevalent in the biographical genre. 
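A sketch of this calculation, which the following paragraphs describe in detail, is given below. The biography frequencies in the sketch are taken from Table 8.5; the per-genre frequencies for the non-biography genres are invented placeholders rather than Biber’s published figures, and only three of the 67 features are included.

#!/usr/bin/perl
# Rank syntactic features by the distance between their frequency (per
# 1,000 words) in the biography genre and their mean frequency across
# the other genres. Non-biography figures below are illustrative only.
use strict;
use warnings;

# feature => [ biography frequency, [ frequencies in the other genres ] ]
my %freq = (
    'past tense'    => [ 68.4,  [ 40.1, 45.2, 46.0 ] ],
    'present tense' => [ 35.9,  [ 75.0, 80.2, 78.1 ] ],
    'prepositions'  => [ 122.6, [ 101.0, 108.3, 109.2 ] ],
);

my %distance;
for my $feature ( keys %freq ) {
    my ( $bio, $others ) = @{ $freq{$feature} };
    my $mean = 0;
    $mean += $_ for @$others;
    $mean /= scalar @$others;
    $distance{$feature} = $bio - $mean;    # signed distance from the non-biography mean
}

# Rank by absolute distance, as in Table 8.5; the sign shows whether a
# feature is more (+) or less (-) frequent in biography than elsewhere.
for my $feature ( sort { abs( $distance{$b} ) <=> abs( $distance{$a} ) } keys %distance ) {
    printf "%-15s %+7.1f\n", $feature, $distance{$feature};
}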
For each syntactic feature (see Table 8.4 on the next page and Appendix C) the mean frequency across genres (excluding the biographical genre) was calculated for each syntactic feature. Then the syntactic features with the maximum distance from the mean for the biographical genre were identified. For example, if the mean frequency per thousand words for the feature “past tense” is 43.7, across all genres excluding biography, and the biography frequency is 68.4 per thousand, then the biography genre has 24.6 past tense features above the mean. All 67 features can be analysed in this way to produce a ranking of the features most distinctive of the biographical genre. Table 8.5 on page 150 shows the top twenty syntactic feature in rank order according to that feature’s distance from the mean (whether positive or negative). Another method for identifying biographically relevant features was tried, which used standard deviations (that is, z-scores) from the mean instead of raw distance (see Appendix C). The initial method of using features identified by their raw distance from the mean was favoured however, as those features identified using standard deviations from the mean included features that can sensibly be used only at the document level (for example, type/token ratios). Table 8.5 on page 150 shows the twenty most discriminatory features ranked by distance from the mean. Tables 8.6 on page 150 3 The regular expressions used to identify dates are reproduced at: http://www.dcs.shef.ac.uk/ mac/date regexp.txt 148 C HAPTER 8: F EATURE S ETS Table 8.4: Syntactic Features Used by B IBER (1988). past tense present tense time adverbials second person pronouns pronoun IT indefinite pronouns WH questions gerunds agentless passives BE as main verb THAT verb complements WH clauses present participle clauses past prt. WHIZ deletions THAT relatives: subj. position WH relatives: subj. position WH relatives: pied pipes adv. subordinator - cause adv. sub. - condition prepositions predictive adjectives type/token ratio conjuncts hedges emphatics demonstratives necessity modals public verbs suasive verbs contractions stranded prepositions split auxillaries non phrasal coordination analytic negation perfect aspect verbs place adverbials first person pronouns third person pronouns demonstrative pronouns DO as pro-verb nominalisations nouns BY passives existential THERE THAT adj complements infinitives past participle clauses present prt. WHIZ deletions THAT relatives: obj. position WH relatives: obj. position sentence relatives adv. sub. - concession adv. sub. - other attributive adjectives adverbs wordlength downturners amplifiers discourse particles possibility modals predictive modals private verbs SEEM/APPEAR that deletion split infinitives phrasal coordination synthetic negation and 8.7 on page 151 show the twenty most characteristic features of the biographical genre, and the twenty least characteristic syntactic features respectively. Additionally, Appendix C shows results for all the syntactic features identified by B IBER (1988). Biber’s original work was conducted in the late 1980s, at a time when natural language processing tools were less well developed. Highly accurate (97% +) part-of-speech tagging was not available for the original work, instead, in order to identify the linguistic features of interest, Biber relied heavily on a gazetteer based approach, augmented with simple pattern matching. 
For instance, when identifying past tense verbs, Biber’s work relied on the use of a 149 C HAPTER 8: F EATURE S ETS Table 8.5: Twenty Syntactic Features Most Characteristic of Biography Ranked by Maximum Distance from Mean. Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Distance 41.9 29.7 24.6 16.3 13.0 12.5 10.1 8.0 7.1 4.4 4.2 3.6 3.3 2.7 2.7 2.5 2.3 2.2 2.2 2.0 Feature Name present tense adverbs past tense prepositions nouns contractions second person pronouns first person pronouns attributive adjectives private verbs BE as main verb type/token ratio demonstrative pronouns pronoun IT predictive modals nominalisations analytic negation emphatics non phrasal coordination that deletion Non-bio mean 77.8 95.6 43.7 106.2 179.3 13.4 10.7 30.1 59.2 18.0 28.4 51.5 4.2 10.3 6.0 18.0 8.5 6.4 4.6 3.2 Biographical Mean 35.9 65.9 68.4 122.6 192.4 0.9 0.6 22.1 66.4 13.6 24.2 55.2 0.9 7.6 3.3 20.6 6.2 4.2 2.4 1.2 Table 8.6: Twenty Syntactic Features Characteristic of Biography Ranked by Positive Association with Biographical Genre. Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Distance 24.6 16.3 13.0 7.1 3.6 2.5 1.4 1.4 1.3 1.3 1.2 0.9 0.7 0.4 0.4 0.35 0.3 0.3 0.2 0.1 Feature Name past tense prepositions nouns attributive adjectives type/token ratio nominalisations agentless passives perfect aspect verbs phrasal coordination split auxiliaries infinitives demonstratives synthetic negation WH relatives: obj. position suasive verbs WH relatives: subj. position WH relatives: pied pipes third person pronouns BY passives adv. sub. - other 150 Non-bio Mean 43.7 106.2 179.3 59.2 51.5 18.0 8.4 9.1 3.5 5.2 15.6 9.7 1.8 1.4 2.7 2.05 0.6 33.9 0.6 0.9 Biographical Mean 68.4 122.6 192.4 66.4 55.2 20.6 9.9 10.6 4.9 6.6 16.9 10.7 2.6 1.9 3.2 2.4 1.0 34.3 0.9 1.1 C HAPTER 8: F EATURE S ETS Table 8.7: Twenty Syntactic Features Characteristic of Biography Ranked by Negative Association with Biographical Genre. Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Distance -41.9 -29.7 -12.5 -10.1 -8.0 -4.4 -4.2 -3.3 -2.7 -2.7 -2.3 -2.2 -2.2 -2.0 -1.9 -1.8 -1.7 -1.6 -1.3 -1.2 Feature Name present tense adverbs contractions second person pronouns first person pronouns private verbs BE as main verb demonstrative pronouns pronoun IT predictive modals analytic negation emphatics non phrasal coordination that deletion possibility modals predictive adjectives DO as pro-verb adv. sub. - condition stranded prepositions place adverbials Non-bio Mean 77.8 95.6 13.4 10.7 30.1 18.0 28.4 4.2 10.3 6.0 8.5 6.4 4.6 3.2 5.9 4.9 2.9 2.5 1.9 3.2 Biographical Mean 35.9 65.9 0.9 0.6 22.1 13.6 24.2 0.9 7.6 3.3 6.2 4.2 2.4 1.2 4.0 3.1 1.2 0.9 0.6 2.0 stored dictionary, and assumed that any word ending in –ed and longer than six letters was a past tense verb. The current work relied heavily on a standard Hidden Markov model based part-of-speech tagger, using a subset of the Penn Treebank tagset.4 Ten features were chosen, the five features most characteristic of biography, and the five features least characteristic. The five most characteristic features were: 1. Past Tense: Identified by the part-of-speech tagger. 2. Preposition: Identified by the part-of-speech tagger. 3. Noun: Identified by the part-of-speech tagger. 4. Attributive Adjective: These are adjectives that fit into the pattern ADJ + ADJ/N. For example, “big cat”, or “big scary cat”. 5. Nominalisation: These were Nouns identified by the part-of-speech tagger, ending in –tion, –ment, –ness, or –ity. 4 The Perl module Lingua-EN-Tagger available from CPAN. 
http://www.cpan.org Accessed on 02-01-07. 151 C HAPTER 8: F EATURE S ETS The five least characteristic features were: 1. Present Tense: Identified by the part-of-speech tagger. 2. Adverb: Identified by the part-of-speech tagger. 3. Contraction: Identified using a gazetteer of common contractions. 4. Second Person Pronouns: Identified using a gazetteer. 5. First Person Pronouns: Identified using a gazetteer. Note that the biographical texts used by B IBER (1988) excluded autobiography, hence the low frequency of first person pronouns. 8.4 Key-keyword Features A further alternative for selecting biographical features involves using keykeywords. T RIBBLE (1998) adopted a “keyword” methodology for genre analysis that is much more straightforward to execute than the multi-dimensional method (see Section 2.1.2 on page 21 for more on the motivation for using the key-keyword method for genre analysis). As the key-keyword methodology is designed to select those features especially distinctive of a given genre, it can also (we hypothesise) be employed as a feature selection method for genre classification purposes. First, two corpora were constructed, a biographical corpus consisting of 383 short biographical documents from Wikipedia and Chambers Dictionary of Biography (see Section 4.3 on page 87 for a description of these corpora) and a reference corpus (the B ROWN corpus, see Section 4.3 on page 87). 5 It is important that attempts are made to make the reference corpus balanced (that is, containing text from various different sources), hence the use of the B ROWN corpus, which, despite its roots in the 1960s, does cover a large number of text types (again, see Section 4.3 on page 87). Note that T RIBBLE (1998) found that the size of the reference corpus used is not of vital importance, a result also gained by X IAO and M C E NERY (2005), who discovered that the one million word FLOB corpus6 and the 100 million word British National Corpus,7 yielded a similar keyword list. This suggests that the one million word BROWN corpus is a suitable choice for the task in terms of its size and balance. However, one important difference between the biographical and reference corpora is that the reference corpus is entirely composed of American English, whereas the biographical corpus is composed of British English (Chambers) along with other English variants, including American and British English (Wikipedia biographies). 5 The biographical corpus consisted of 47,967 words taken from 383 documents. These documents were randomly selected from Wikipedia Biographies (194 documents used) and Chambers Biographies (189 documents used). Both these sources of biographical data are described in Section 4.3 on page 87. 6 Freiburg-LOB corpus http://khnt.hit.uib.no/icame/manuals/flob/INDEX.HTM Accessed on 02-01-07 7 British National Corpus http://www.natcorp.ox.ac.uk Accessed on 02-01-07 152 C HAPTER 8: F EATURE S ETS Two related methods for extracting key-keywords were used in this work. First, the naive key-keywords method. Second, the WordSmith key-keywords method. Note that the WordSmith method was used by T RIBBLE (1998). 8.4.1 Naive Key-keywords Method The process of identifying “naive” key-keywords can usefully be divided into two stages: 1. 
The most discriminatory one thousand biographical keywords were identified by comparing the biographical corpus (the Chambers and Wikipedia biographical documents) with a reference corpus (the BROWN corpus) using the feature selection method (described in Section 2.5.3 on page 50). These 1000 most discriminatory unigrams as identified by the method are referred to as “keywords” for the biographical genre. 2. The 1000 most discriminatory keywords as identified by the method were re-ranked according to the number of biographical documents in which they occur, remembering that there are 383 biographical documents in total. The resulting ranking is the naive key-keyword8 ranking. For example, if the unigram “born” occurs in 320 biographical documents, and the unigram “married” occurs in 205 biographical documents, then the unigram “born” will be ranked above the unigram “married” in the key-keywords list. That is, the unigram “born” will have a higher key-keyword ranking than “married”. The intuition here is that while a high ranked keyword may occur in only one or two biographical document, a high ranked naive key-keyword is likely to appear in many biographical documents. Table 8.8 on the following page presents the twenty most frequent unigrams in the biographical corpus, together with information about the number of biographical documents (that is, Chambers or Wikipedia biographies) in which the unigram occurs. Table 8.9 on page 155 shows the 20 unigrams with the highest naive key-keyword value (that is, of the most discriminatory 1000 keywords as identified by the algorithm, those 20 that appear in the most biographical documents). Note that column three of Table 8.8 and Table 8.9 refers to the proportion of biographical texts in which the keyword occurs, and column four gives the number of texts in which the keyword occurs (of which there were 383 in total). Note also that ordinary function words appear high on both lists (for example, “in”, and “and”). The word “in” is used disproportionately frequently in the biographical texts to indicate the time of a biographically significant event (for example, “He died in 1964”) or the location of an event (“He was born in London”). Of the 26,339 instances of “[iI]n”(that is “in” or “In”) in the BROWN corpus, only 607 (3%) were followed by a four digit date. When the 8 We have named this method the “naive” method as it is less computationally intensive than the WordSmith method. 153 C HAPTER 8: F EATURE S ETS Table 8.8: Unigrams in the Biographical Corpus Ranked by Frequency (with Additional Information about the Number of Biographical Documents in which the Unigram Occurs). Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Unigram the in of and he a was to his as for at on s with by she that from an % of Bio Docs in which Unigram is Present 97 94 87 89 78 81 83 79 63 53 49 50 42 37 44 38 17 26 39 40 No. of Bio Docs in which Unigram is Present 372 361 336 344 302 312 321 270 244 204 189 193 163 145 170 148 067 102 153 154 biographical texts were analysed, the proportion of instances of “in” followed by four digits was 26% (504 out of 1983 instances).9 Additionally, “in” occurs more than twice as often in the biographical texts as in the reference (B ROWN) corpus (4.24% and 2.09% respectively). It is possible that the large discrepancy in the frequency of “in” is likely to arise — at least partially — from the increased use of the word “in” to associate an event with a year in biographical text. 
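The document-frequency re-ranking at the heart of the naive key-keyword method can be sketched in Perl as follows. The keyword file bio_keywords.txt (assumed already to contain the keywords selected by the feature selection method described above, one per line) and the directory of biographical documents are illustrative assumptions, not the scripts actually used.

#!/usr/bin/perl
# Naive key-keywords: re-rank an existing keyword list by the number of
# biographical documents in which each keyword occurs. The keyword file
# and document directory below are illustrative assumptions.
use strict;
use warnings;

my $keyword_file = 'bio_keywords.txt';   # one keyword per line
my $doc_dir      = 'bio_docs';           # the Wikipedia/Chambers biographies

open my $kf, '<', $keyword_file or die "Cannot open $keyword_file: $!";
chomp( my @keywords = <$kf> );
close $kf;

my %doc_freq;
$doc_freq{$_} = 0 for @keywords;

my @docs = glob "$doc_dir/*.txt";
die "No documents found in $doc_dir\n" unless @docs;

for my $doc (@docs) {
    open my $fh, '<', $doc or die "Cannot open $doc: $!";
    my $text = do { local $/; <$fh> };   # slurp the whole document
    close $fh;
    my %types;
    $types{$_} = 1 for ( lc($text) =~ /[a-z]+/g );
    $doc_freq{$_}++ for grep { $types{$_} } @keywords;
}

# A keyword that occurs in more biographical documents receives a higher
# naive key-keyword rank.
for my $kw ( sort { $doc_freq{$b} <=> $doc_freq{$a} } @keywords ) {
    printf "%-15s %3d of %d documents\n", $kw, $doc_freq{$kw}, scalar @docs;
}

The resulting ordering corresponds to the ranking illustrated in Table 8.9.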
8.4.2 WordSmith Key-keywords Method The process for identifying WordSmith key-keywords falls into two stages: 1. For each of the 383 biographical documents a keyword list was produced (using the feature selection method described above and the B ROWN corpus as a reference corpus).10 9 The regular expressions used to identify “in”, and “in” followed by a four digit year, were “\s[Ii]n(\s|,|\.)” and “\s[Ii]n\s\d\d\d\d(\s|,|\.)”, respectively. 10 Keywords were selected from each biographical document by calculating the keyness value for each word type in that document against the reference corpus. The average keyness value of all the word types in the biographical document was then calculated, and those word types that had a value greater than the average were selected as keywords for that biographical document. This operation was performed using the AntConc concordancing software.
Table 8.9: Unigrams in the Biographical Corpus Ranked by Naive Key-keyness (with Additional Information about the Number of Biographical Documents in which the Unigrams Occur).
Rank  Unigram   % of Bio Docs in which Unigram is Present   No. of Bio Docs in which Unigram is Present
1     in        94    361
2     and       89    344
3     was       83    321
4     he        78    302
5     his       63    244
6     born      57    222
7     as        53    204
8     at        50    193
9     an        40    154
10    became    29    112
11    after     28    110
12    first     22    86
13    also      22    86
14    died      20    78
15    she       17    67
16    later     17    66
17    known     16    65
18    years     15    61
19    her       15    59
20    work      14    57
2. A key-keyword list was produced by identifying those words that appeared as keywords in the greatest number of biographical documents: “A ‘key key-word’ is one which is ‘key’ in more than one of a number of related texts. The more texts it is ‘key’ in, the more ‘key key’ it is.”11 This method can be contrasted with the naive key-keywords method. Instead of re-ranking the keywords according to the number of biographical documents in which they occur, the WordSmith method simply ranks words according to the number of documents in which they are key. For example, if the unigram “born” is a keyword in 100 biographical documents and the unigram “married” is a keyword in 40 biographical documents, then the keyword “born” will have a higher WordSmith key-keyword ranking than “married”.12 11 WordSmith documentation: http://www.lexically.net/downloads/version4/ Accessed on 01-05-07. 12 In his original work on genre analysis, T RIBBLE (1998) used the WordSmith suite of programs to identify key-keywords (this approach was also adopted by X IAO and M C E NERY (2005)). The WordSmith program was not available for this work, but similar functionality was achieved using AntConc, a text analysis and concordancing tool developed by Laurence Anthony at Waseda University, Tokyo. For the naive key-keywords method, AntConc was used to identify biographical keywords against a reference corpus using the feature selection method; the identified keywords were then post-processed using Perl scripts in order to rank them by the proportion of biographical texts in which they occurred. For the WordSmith key-keywords, AntConc was used to generate keyword lists for each of the 383 biographical documents (again using the same method and Table 8.10: Unigrams in the Biographical Corpus Ranked by WordSmith Key-keyness (with Additional Information about the Number of Biographical Documents in which the Unigrams Occur).
Rank Unigram 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ii usa actress stub honour centre iii edinburgh video bbc barry albums yorkshire vols uk medal lionel iraq honour eng % of Bio Docs in which Unigram is Key 6.8 5.0 3.4 2.9 2.1 2.1 1.8 1.8 1.6 1.6 1.6 1.6 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 No. of Bio Docs in which Unigram is Key 26 19 19 11 08 08 07 07 06 06 06 06 05 05 05 05 05 05 05 05 Table 8.10 shows twenty unigrams from the biographical corpus ranked by WordSmith key-keyness. It is noticeable that the unigrams identified using the WordSmith key-keyword method differ from those identified by the naive key-keyword method. The lack of function words among the highest ranking WordSmith key-keywords is striking, as is the appearance of unigrams that are perhaps tied to the particular biographical corpora used, rather than the biographical genre in general. For example, the unigrams “ii” and “iii” appear very high in the key-keyword list because of the convention of referring to monarchs and emperors using Roman numerals (for instance, “Selim II” or “Mahmud II” in Chambers biographies). Similarly the unigram “stub” — a word used by Wikipedia to indicate that an entry is a short summary — appears as a keyword in 2.9% of biographical documents. Place names (“edinburgh”, “uk” and “yorkshire”) also appear on the list, as well as a single personal name (“lionel”). It is noticeable that intuitively biographical words like, for instance, “born”, “married” or “died”, do not occur in the list. the B ROWN corpus as a reference corpus). A Perl script was then used to identify those unigrams that were key in the greatest number of documents. 156 C HAPTER 8: F EATURE S ETS 8.5 Conclusion This chapter has described various feature sets developed for this research work. It is important to understand the feature sets as they are referenced extensively in Chapters 9 and 10, where a series of experiments compares the performance of different feature sets on the biographical sentence classification task. 157 C HAPTER 9 Automatic Classification of Biographical Sentences This chapter presents a series of experiments based on the gold standard data described in Chapter 5 and validated in Chapter 6 on page 124. The gold standard data, together with the bio features feature extraction program and the WEKA machine learning environment (see Section 4.2 on page 86) provides a test-bed for assessing the classification power of different feature sets for the biographical sentence classification task. This chapter is divided into four sections. The first section describes the common procedure for all experiments, and the remaining three sections each reflects a different research theme: Syntactic Features — comparing the performance of syntactic and “bagof-words” based feature representations for the biographical categorisation task (see page 160). This section addresses Hypothesis 2 (““Bagof-words” style sentence representations augmented with syntactic features provide a more effective sentence representation for biographical sentence recognition than “bag-of-words” style representations alone.”) Lexical Features — analysing the performance of lexically based alternatives to the “bag-of-words” approach for the biographical categorisation task (see page 165). Exploring Keyness — Assessing methods for identifying optimal lexically based features, especially the concept of “keyness” and “key-keyness” (see page 170). 
9.1 Procedure The bio features (see Section 4.2 on page 86) program was used to construct a feature matrix from the gold standard data of five hundred and one sentences (see Chapter 5) for all the sentence representations used. The resulting feature matrices were fed to the WEKA machine learning environment. As it has been shown that the Naive Bayes and Support Vector Machine learning algorithms provide the best results on the gold standard biographical data using the 500 most frequent unigrams derived from the Dictionary of National Biography (see Chapter 7 on page 137), these two algorithms were both used in all experiments, with a decision tree algorithm — the WEKA implementation of the C4.5 algorithm — also used for data exploration purposes. While this chapter describes work using features derived from the DNB (and other corpora) with the “gold standard” sentences used as training and test data, other approaches are possible. We can usefully distinguish between: (A) Corpus used to select features. (B) Corpus used to train classifiers. (C) Corpus used to test trained classifiers. The current work primarily uses DNB data as (A) and the gold standard data as (B) and (C). One alternative to this strategy is the use of the “gold standard” data for all three categories (that is, using the gold standard corpus as a source of unigram features, as training data and as test data). However, this strategy has been avoided as it was suspected that using unigrams derived from the gold standard corpus would artificially inflate accuracy. This intuition was found to be well grounded when the Naive Bayes algorithm was used to classify the gold standard data using a set of one hundred unigram features derived from all the unigrams1 in the gold standard data set using the feature selection method described in Section 2.5.3. The gold standard data itself was used as a source of biographical and non-biographical instances. The result was, as expected, a classification accuracy higher than in all other experiments, at 83.90% (using 10 x 10 fold cross validation). This theme is explored further in Chapter 10, where we examine the portability of features identified by Z HOU ET AL . (2004). A further alternative involves a separation between training and test data; for example, using different sources for (A), (B) and (C). That is, derive features from one corpus, train a classifier on another corpus, and finally test the trained classifier on a third corpus.2 This approach has been shown to yield promising results (although it must be stressed that this result is only provisional and requires further work). 1 There are 3504 unigram types, and 11,245 tokens, in the “gold standard” data. 2 Note that when this kind of approach is used, cross-validation cannot be conducted. The 500 most frequent unigrams in the DNB (A) were used as a feature set, in conjunction with a training set consisting of the 1000 sentence sample from the filtered TREC corpus (TREC-F) described on page 115 and the 1000 sentence sample of the Chambers Biographical Dictionary (again described on page 115). The TREC-F sample sentences and the CHA-A sample sentences serve as non-biographical and biographical training data, respectively, and functioned as data source (B) (that is, training data).
The model trained using the TREC-F/CHA-A data (using the 500 most frequent unigrams in the DNB as features) was then tested on all the gold standard data, achieving a classification accuracy of 75.64%. Although this result is interesting, this thesis is focused on the identification of appropriate features for biographical sentence classification; the topic of exportable models may be a fruitful area for future research. In line with good practice, the current chapter employs a 10 x 10 fold cross validation evaluation methodology (see Section 2.5.2). The corrected re-sampled t-test was used to compare algorithms for statistical significance (see Section 2.5.2 on page 48 for a discussion of issues in, and methods for, assessing classification). The Dictionary of National Biography was chosen as the main source for deriving features (although others were used) as it is the largest corpus of biographical text available for this work. 9.2 Syntactic Features The text classification literature has consistently shown that the use of syntactic features fails to improve classification accuracy (see Section 3.1.2 on page 57). Indeed, S COTT and M ATWIN (1999) states that “it is probably not worth pursuing simple phrase based representations further”. Contrary to this trend in the topic-based text classification field, there is some evidence to show that syntactic features are appropriate for genre classification, as syntactic features, rather than topical words — it is suggested — can capture the non-topical style of a text. S ANTINI (2004a) gained encouraging results from the use of part-of-speech trigrams (that is, data was first part-of-speech tagged and then the sequences of three tags most characteristic of each genre were used as features). Also, S TAMATATOS ET AL . (2000a) found that syntactic features (noun phrases, verb phrases, and so on) could improve accuracy for genre classification of modern Greek texts (Section 3.1.3 on page 59 describes this work more fully). It is important to emphasise that both S TAMATATOS ET AL . (2000a) and S ANTINI (2004a) are concerned with classifying at the document rather than the sentence level. Indeed, S TAMATATOS ET AL . (2000a) states that a lower bound of one thousand words is desirable in order to increase accuracy. In this research we are concerned entirely with sentence classification; a different but related task. This section — in line with the general hypothesis that “bag-of-words” style sentence representations augmented by syntactic features provide a more effective representation for biographical sentence classification than “bag-of-words” style representations alone — explores different syntactic and pseudo-syntactic feature sets, and compares them to the standard “bag-of-words” approach. In this context, pseudo-syntactic features are word n-grams where n > 1. They are referred to as pseudo-syntactic features because it is hypothesised that bigrams, trigrams and so on provide a computationally inexpensive method for capturing syntactic information that does not require complex processing (for example, part-of-speech tagging, chunking and so on). The feature sets used in this experiment are described in Chapter 8, but briefly summarised below:
The 2000 most frequent unigrams from the Dictionary of National Biography. This provided the baseline.
The 2000 most frequent bigrams in the Dictionary of National Biography.
The 2000 most frequent trigrams in the Dictionary of National Biography.
Syntactic features (that is, features identified from a statistical analysis of the data presented in B IBER (1988)).
Syntactic features and the 2000 most frequent Dictionary of National Biography bigrams.
Syntactic features and the 2000 most frequent Dictionary of National Biography trigrams.
The 2000 most frequent unigrams in the Dictionary of National Biography augmented with the fifty most frequent bigrams in the Dictionary of National Biography.
Syntactic features and the 2000 most frequent Dictionary of National Biography unigrams.
The last two — unigrams and syntactic features, and unigrams and fifty bigrams — are included to facilitate the testing of the central hypothesis, that “bag-of-words” style representations augmented by syntactic features (or in the case of bigrams, pseudo-syntactic features) are better representations for the biographical classification task than “bag-of-words” representations alone. Note that Chapter 8 on page 144 comprehensively describes the feature sets used. 9.2.1 Results It can be seen in Table 9.1 on page 163 that unigrams alone perform well at 78.78%,3 with the performance of n-grams (where n > 1) declining sharply (see Figure 9.1 on the next page for a comparison of the performance of unigram, bigram and trigram representations). It is notable that the classification accuracy achieved using the most frequent 500 unigrams in the DNB — reported in Chapter 7 — yielded a result of 80.66%, almost 2% higher than that achieved using four times as many unigram features. It is also notable that the two feature representations that augment “bag-of-words” style representations with some syntactic representation — or in the case of bigrams, pseudo-syntactic representations — fare better than the “bag-of-words” baseline (that is, unigrams). The resulting accuracy scores were 80.68% for unigrams augmented with syntactic features, and 79.18% for unigrams augmented with pseudo-syntactic features. Neither of these accuracy scores, however, reaches a significance level that would allow strong conclusions to be drawn (using the two-tailed corrected re-sampled t-test) against the baseline unigram performance of 78.78%. 3 Note that all percentages quoted were gained using the Naive Bayes classification algorithm. Figure 9.1: Comparison of the Performance of Unigrams, Bigrams and Trigrams. Figure 9.2: Comparison of the Performance of Syntactic and Pseudo-Syntactic Features.
Table 9.1: Performance of Syntactic and Pseudo-syntactic Features.
Feature Set                              Naive Bayes (%)   SVM (%)
2000 DNB Unigrams                        78.78             78.18
2000 DNB Bigrams                         69.08             71.28
2000 DNB Trigrams                        57.98             61.45
Biber Features                           69.84             66.61
Biber Features and DNB Unigrams          80.68             77.72
Biber Features and DNB Bigrams           74.07             72.42
Biber Features and DNB Trigrams          64.54             69.30
2000 DNB Unigrams and 50 DNB Bigrams     79.18             77.30
Figure 9.3: Experimental and Null Hypotheses — Syntactic Features. Experimental Hypothesis: There is a difference between “bag-of-words” style feature representations augmented with syntactic features and “bag-of-words” style representations alone for the biographical categorisation task. Null Hypothesis: There is no difference between the performance of “bag-of-words” style representations and “bag-of-words” style representations augmented by syntactic features.
In other words, if we have the experimental hypothesis and null hypothesis presented in Figure 9.3 (regarding syntactic features) and Figure 9.4 on the next page (regarding pseudo-syntactic features), then we are not entitled to reject the null hypothesis in either case on the results presented here. Figure 9.4: Experimental and Null Hypotheses — Pseudo-Syntactic Features. Experimental Hypothesis: There is a difference between “bag-of-words” style feature representations augmented with pseudo-syntactic features (in this case, bigrams) and “bag-of-words” style representations alone for the biographical categorisation task. Null Hypothesis: There is no difference between the performance of “bag-of-words” style representations and “bag-of-words” style representations augmented by pseudo-syntactic features (in this case bigrams). 9.2.2 Discussion While the difference between the performance of the two feature sets augmented by syntactic features was not statistically significant compared to the unigram baseline, the performance of the unigram and syntactic features representation was almost 2% better than the unigram representation alone. Recall that the syntactic features consist of only ten features (including past tense and attributive adjectives; see Section 8.3 on page 148 for a complete list). These results are consistent with S ANTINI (2004a) and S TAMATATOS ET AL . (2000a) in that they suggest that there is a small accuracy gain in using syntactic features (although S ANTINI (2004a) and S TAMATATOS ET AL . (2000a) did not report whether the differences were statistically significant). Note that the syntactic features (that is, Biber features based on an analysis of B IBER (1988)’s data) performed better than the pseudo-syntactic features (80.68% and 79.18% respectively). Unlike F ÜRNKRANZ (1998), we found that classification accuracy for n-grams markedly decreased when n > 1 (see Figure 9.1 on page 162). F ÜRNKRANZ (1998) saw trigrams as the optimal n-gram representation, with sequences longer than three resulting in a decrease in classification accuracy. The lack of success of trigrams in the current work can perhaps be attributed to the nature of the corpus from which the trigrams were derived. The Dictionary of National Biography contains much information that is specific to the culture in which it was produced. For instance, several of the most frequent trigrams refer to the British monarchy and particular British institutions (for example, “of the king” and “the british museum”) (see Section 8.3 on page 147). This experiment has shown that, unlike the case of topic orientated text classification, in biographical text classification “bag-of-words” style representations augmented with syntactic features perform somewhat better than “bag-of-words” representations alone, although the difference was not statistically significant using the corrected re-sampled t-test. It remains an open question whether this kind of small increase in accuracy can be gained for genre classification more generally, or whether it is confined to the special case of biographical text classification. Also, other approaches to genre classification have focused on document classification, whereas sentence classification has been at the centre of the current research.
This work does indicate however that the claim that syntactic features are unhelpful for text categorisation (made by M OSCHITTI and B ASILI (2004), and S COTT and M ATWIN (1999)) may apply only to topical categorisation and not to tasks (like genre orientated classification) where the stylistic elements of a text are important. 9.3 Lexical Methods This section explores whether the choice of frequent lexical items from a biographical corpus (in this case the 2000 most frequent unigrams in the Dictionary of National Biography) produces better accuracy for the biographical classification task than other lexeme based methods. Three alternative lexeme based methods are compared to a baseline — used in the previous section — of the 2000 most frequent unigrams in the Dictionary of National Biography. The first alternative representation is based on the intuition that function words can capture the non-topical content of text. Function words have been shown to be suboptimal in the authorship attribution research tradition (see Section 2.2.3 on page 27) compared to the use of synonym pairs (for example “while”/”whilst”), and it has been suggested that this is because function words are characteristic of genre rather than individual authorial style within a genre (H OLMES and F ORSYTH, 1995). Three hundred and nineteen function words were used as feature representations.4 The second alternative representation requires the use of stemming; the reduction of inflected word forms to their stem (root) form (see Section 8.1 on page 144). Stemming is a commonly used technique in the computational linguistics and information retrieval research traditions (W ITTEN ET AL ., 1999), and the Porter algorithm is a widely used stemming algorithm (P ORTER, 1980). Stemming allows inflected variants of the same stem (root) word to be represented by one feature. For example, instead of the two separate features “married” and “marry”, one feature will represent both unigrams (using the Porter stemmer, this single feature is “marri”). This reduction of inflected variants to a canonical form (it is suggested) will provide better classification accuracy for the biographical categorisation task, as key biographical words (for example, 4 The list of English function words is available from the University of Glasgow, Department of Computer Science: http://www.dcs.gla.ac.uk/idom/ir resources/linguistic utils/stop word Accessed on 02-01-07. 165 C HAPTER 9: A UTOMATIC C LASSIFICATION OF B IOGRAPHICAL S ENTENCES Table 9.2: Performance of Alternative Lexical Methods. Feature Set 2000 DNB Unigrams (baseline) 319 Function Words 2000 DNB Unigrams (stemmed) 1713 DNB Unigrams (no function words) Naive Bayes (%) 78.78 75.43 79.93 72.37 SVM (%) 78.18 73.59 78.92 76.94 “work/worked”, “son/sons”, “live/lived/living”, and so on) will be represented by a single feature, and not “diffused” throughout the feature matrix. Recent work in topic orientated text classification has shown that stemming produces no advantages when compared to non-stemmed representations (for example, T OMAN ET AL . (2006)). This result may not hold, however for genre classification task, or the special case of biographical classification. A third approach involves — in contrast to the first approach — the removal of non-topical function words. 
This approach is commonly used in the topic orientated text classification community, where it is referred to as stopwording (see Section 8.1), based on the intuition that topic neutral function words are unlikely to contribute to classification accuracy. In the case of biographical classification, however, where the genre of the text is the target of the feature representation, classification accuracy (it is hypothesised) is likely to reduce with the removal of stopwords, compared to a baseline which includes those functional stopwords. Four feature sets were used in this experiment. They are summarised below, and described more extensively in Section 8.1:
The 2000 most frequent unigrams from the Dictionary of National Biography. This provided the baseline.
319 function words.
The 2000 most frequent unigrams from the Dictionary of National Biography in stemmed form.
The 1713 most frequent unigrams from the Dictionary of National Biography with function words removed (that is, stopworded).
9.3.1 Results It can be seen from the data presented in Table 9.2 and Figure 9.5 on the following page that the stemmed DNB unigrams provided the best performance: 79.93% compared to the baseline DNB unigram representation of 78.78%. This accuracy improvement, however, is not statistically significant with respect to the corrected re-sampled t-test, hence the null hypothesis (presented in Figure 9.6 on the next page) cannot be rejected. Figure 9.5: Comparison of the Performance of Differing Lexical Representations. The use of function words alone does not improve classification accuracy compared to the baseline for the biographical classification task. Rather, accuracy actually decreases from 78.78% in the case of Dictionary of National Biography unigrams, to 75.43% for function words (see Table 9.2 on the preceding page). The null hypothesis presented in Figure 9.7 on the next page is rejected, but only because the use of function words alone decreases accuracy at a statistically significant level compared to the unigram baseline. The absence of function words in a feature representation identical in other respects to the DNB unigram representation (that is, the DNB unigram feature set with the function words removed) was shown to reduce categorisation accuracy (compared to the original 2000 feature DNB representation). This difference was shown to be statistically highly significant using the corrected re-sampled t-test. Therefore, it is acceptable to reject the null hypothesis presented in Figure 9.8 on page 169. Figure 9.6: Experimental and Null Hypotheses — Stemming. Experimental Hypothesis: There is an accuracy difference between the performance of stemmed and plain unigrams (derived from a biographical corpus) for the biographical sentence classification task. Null Hypothesis: There is no accuracy difference between the performance of stemmed and plain unigrams (derived from a biographical corpus) for the biographical sentence classification task. Figure 9.7: Experimental and Null Hypotheses — Function Words. Experimental Hypothesis: There is an accuracy difference between function word based features and frequent unigrams (derived from a biographical corpus) for the biographical sentence categorisation task.
Null Hypothesis: There is no difference between the performance of function word based feature representations and frequent unigram based representations (derived from a biographical corpus) for the biographical sentence categorisation task.

9.3.2 Discussion

These results show that, for the biographical classification task, the use of content neutral function words produces less accurate results than the use of unigrams derived from a biographical corpus (75.43% and 78.78%, respectively). This could be for a number of reasons. Perhaps the presence of a few archetypal biographical words (for example, "born", "died", "married", and so on) is more strongly associated with biographical text than the use of a particular biographical style that can be identified using function words. In other words, while function words may do some of the work in biographical classification (particularly prepositions for identifying place and time; see Section 2.1.2 on page 11), archetypal biographical words are — it is suggested — helpful for identifying difficult cases. It is notable that the difference between the two accuracy scores is only 3.35%, a small difference when we consider that the function word feature set consists of only 319 features, while the frequent unigram feature set consists of 2000 features.

Figure 9.8: Experimental and Null Hypotheses — Stopwords.
Experimental Hypothesis: There is a difference between the performance of a feature set based on the 2000 most frequent unigrams in the Dictionary of National Biography with all function words removed, and the original, unmodified 2000 unigram representation, for the biographical categorisation task.
Null Hypothesis: There is no difference between the performance of a feature set based on the 2000 most frequent unigrams in the Dictionary of National Biography with all function words removed and the 2000 most frequent unigrams in the Dictionary of National Biography, for the biographical categorisation task.

Further evidence for the view that function words are important for the biographical categorisation task is provided by the performance of the "stopworded" feature set (that is, the feature set that contains the most frequent 2000 unigrams in the Dictionary of National Biography minus function words). The "stopworded" feature set had the worst performance compared to the baseline (72.37% and 78.78%, respectively — see Figure 9.5). The "stopworded" feature set also performed worse than the function word feature set (72.37% and 75.43%, respectively), despite the fact that the "stopworded" feature set contained 1713 features and the function word feature set only 319.

Cumulatively, this work would tend to support the suggestion made by Holmes and Forsyth (1995) that function words are important for genre classification. It is possible that function words are important for capturing the stylistic content of text; this result supports Hypothesis 2, that syntactic features — broadly understood — are important for genre classification. This claim would, however, require further investigation, as the scope of this research is confined to biographical sentence classification.

The best result was gained through stemming (79.93%). This accuracy was not, however, significantly better than the baseline (78.78%, gained using the two thousand most frequent DNB unigrams).
One possible reason for the slight increase in accuracy achieved by the stemming algorithm is that the baseline feature set consists of many inflected forms of the same base word (for example, "act", "acted", "acting") which are reduced in the stemmed feature set, making for a more compact and efficient representation in which concepts are less "diffused" through the feature matrix.

9.4 Keywords

Tribble (1998) identified a methodology for selecting genre specific key-keywords. An overview of the basic feature selection technique is provided in Section 2.1.2 on page 21, and a description of the feature sets used is given in Section 8.4 on page 152. Two related methods for identifying key-keywords are used in this work: first, the naive key-keywords method; second, the WordSmith key-keywords method (note that this is the method used by Tribble (1998) and Xiao and McEnery (2005)). The important difference between the naive and WordSmith key-keyword methods is that the naive method ranks keywords according to the number of biographical documents in which the keyword occurs, whereas the WordSmith method ranks keywords according to the number of biographical documents in which the word is key.5 For the naive key-keyword method, if the keyword "born" occurs in forty-five biographical documents, it will be ranked above "marry", which occurs in twenty-five biographical documents. For the WordSmith key-keyword method, if "lived" is a keyword in fifteen biographical documents, it will be ranked above "educated" if "educated" is a keyword in only four biographical documents.6 For each key-keyword identification method, the five hundred top key-keywords (ranked by key-keyness) are retained.

Feature selection is a commonly used technique in machine learning (Witten and Frank, 2005), and it has been shown that aggressive feature selection increases classification accuracy for some kinds of text classification tasks (Yang and Pedersen, 1997). It is hypothesised that key-keyword based methods will provide more genre representative features than the use of either frequent unigrams, or derived keywords, alone. Note that feature selection was not performed on the gold standard data. Rather, features were identified using a corpus constructed from Wikipedia and Chambers data, in order that unigram features characteristic of the biographical genre in general could be identified.

The key-keyword methodology was utilised by Tribble (1998) (and subsequently validated and explored by Xiao and McEnery (2005)) as a method of genre analysis that avoids the statistical and computational overheads of multi-dimensional analysis (see Section 2.1.2 on page 11).7 However, the method can easily be applied to feature identification for text classification, as the aim of using the method is the same: identifying those features most representative of a given genre.

5 The naive key-keyword method requires significantly less processing than the WordSmith method, as for the WordSmith method a distinct keyword list must be generated for each biographical document.
6 These examples are for explanatory purposes only and do not describe actual frequencies.
7 Note that Tribble (1998) applied the WordSmith key-keywords function to the genre problem. The software has existed since 1996 (see http://www.lexically.net/publications/publications.htm Accessed on 01-05-07.)
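The difference between the two ranking strategies can be made concrete with a short sketch. The keyness statistic used here is the standard log-likelihood measure with a 1% critical value, which stands in for whichever keyness statistic the actual feature selection used; the function names, the threshold and the simplified candidate selection are illustrative assumptions only:

import math
from collections import Counter

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Standard log-likelihood keyness of a word in a study corpus against a reference corpus."""
    expected_study = size_study * (freq_study + freq_ref) / (size_study + size_ref)
    expected_ref = size_ref * (freq_study + freq_ref) / (size_study + size_ref)
    ll = 0.0
    if freq_study:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2.0 * ll

def keywords(counts, size, ref_counts, ref_size, threshold=6.63):  # 6.63 is roughly the 1% critical value
    """Words that are 'key' in a (sub)corpus relative to the reference corpus."""
    return {w for w, f in counts.items()
            if log_likelihood(f, size, ref_counts.get(w, 0), ref_size) >= threshold}

def key_keywords(bio_docs, ref_counts, ref_size, n=500, naive=True):
    """bio_docs is a list of token lists. Naive ranking counts documents that contain
    each candidate keyword; WordSmith-style ranking counts documents in which it is key."""
    pooled = Counter(tok for doc in bio_docs for tok in doc)
    candidates = keywords(pooled, sum(pooled.values()), ref_counts, ref_size)
    if naive:
        score = lambda w: sum(w in doc for doc in bio_docs)
    else:
        per_doc_keys = [keywords(Counter(d), len(d), ref_counts, ref_size) for d in bio_docs]
        score = lambda w: sum(w in keys for keys in per_doc_keys)
    return sorted(candidates, key=score, reverse=True)[:n]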
In the case of Tribble (1998), this analysis is performed to cast light on genre differences; in the current context, the analysis is performed to identify those unigram features most characteristic of the biographical genre, in order to facilitate machine learning.

Note that all the feature sets used in this experiment were derived from biographical documents from Wikipedia and The Chambers Biographical Dictionary — the biographical corpus — described in Section 8.4 on page 152. The Brown corpus was used as a reference corpus. The feature sets used are fully described in Chapter 8 on page 144, but summarised briefly below:

- The 500 most frequent unigrams from the biographical corpus.
- The 500 most discriminatory keywords identified by the keyword selection algorithm.
- The 500 most discriminatory key-keywords identified using the naive key-keywords method.
- The 500 most discriminatory key-keywords identified using the WordSmith key-keywords method.

The 500 most frequent unigrams are included as a baseline against which to test the performance of the keyword and key-keyword representations. The Dictionary of National Biography was not used as a biographical corpus because of the need for biographical documents of similar lengths; Dictionary of National Biography entries vary considerably in length.

9.4.1 Results

Table 9.3: Performance of Keyword and Key-Keyword Features Relative to a Baseline.

Feature Set                     Naive Bayes (%)    SVM (%)
500 Frequent Unigrams           81.25              76.07
500 Keywords                    76.86              76.90
500 Naive Key-Keywords          78.92              78.32
500 WordSmith Key-Keywords      68.34              63.11

Figure 9.9: Comparison of the Performance of Keywords, Key-Keywords, and Frequencies.

Table 9.3 and Figure 9.9 show that the 500 most frequent unigrams feature set performed at 81.25%. The keyword feature set achieved 76.86%. The WordSmith key-keywords and naive key-keywords achieved 68.34% and 78.92%, respectively. The difference between the 500 frequent unigram and the 500 naive key-keyword feature sets was not statistically significant. The difference between the 500 frequent unigrams and the WordSmith method was significant, however, with the WordSmith key-keywords feature set performing significantly worse than the frequent unigram feature set. For the WordSmith key-keywords, the null hypothesis presented in Figure 9.10 can be rejected.

This was a surprising result, as it was expected that both the keyword feature set (that is, straightforward feature selection) and the two key-keyword feature sets would achieve better results than the simple frequent unigram based representation. Indeed, the frequent unigram representation outperforms both the keyword feature set and the WordSmith key-keywords feature set at a statistically significant level (using the corrected re-sampled t-test).8

8 Note that the two-tailed test was used despite the expectation that the keyword and key-keyword features would perform better than the baseline.

Figure 9.10: Experimental and Null Hypotheses — Key-Keywords.
Experimental Hypothesis: There is a difference between the performance of key-keyword based features and frequent unigrams (derived from a biographical corpus) for the biographical categorisation task.
Null Hypothesis: There is no difference between the performance of key-keyword based features and frequent unigrams (derived from a biographical corpus) for the biographical categorisation task.
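The significance judgements in this and the following chapters rest on the corrected re-sampled t-test (the variance correction for repeated train/test splits proposed by Nadeau and Bengio, and available in the WEKA experimenter). A minimal sketch of the statistic, given paired accuracy scores from a 10 x 10 cross validation, is shown below; it is an illustrative reconstruction rather than the implementation actually used:

import math
from statistics import mean, variance

def corrected_resampled_t(acc_a, acc_b, test_fraction=0.1):
    """Corrected re-sampled t statistic for paired accuracies from k train/test
    splits (k = 100 for a 10 x 10 cross validation). The variance of the paired
    differences is inflated by (1/k + n_test/n_train)."""
    differences = [a - b for a, b in zip(acc_a, acc_b)]
    k = len(differences)
    correction = 1.0 / k + test_fraction / (1.0 - test_fraction)
    return mean(differences) / math.sqrt(correction * variance(differences))

# The resulting |t| is compared against the t distribution with k - 1 degrees
# of freedom to decide whether the null hypothesis can be rejected.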
Figure 9.11: Comparison of Partial Decision Trees for Each Feature Set. [Partial C4.5 decision trees for the frequent words, keywords, naive key-keywords and WordSmith key-keywords feature sets; the tree diagrams are not reproduced here.]

9.4.2 Discussion

These results show that, for the biographical categorisation task at least, the use of key-keywords reduces classification accuracy. It is important to note that feature selection was performed using external data (Wikipedia and Chambers as a biographical corpus, and the Brown corpus as a reference corpus), in order to avoid artificially inflating classification accuracy.

In order to gain insight into the differing performance of the four feature sets, and the surprising success of the frequency feature set, the C4.5 decision tree algorithm (see Section 2.5.1 on page 39 for more on decision trees generally) was used to explore decision points in the four feature sets (although it is important to emphasise that the Naive Bayes algorithm does not depend on these decision points). Figure 9.11 shows that — for the top levels, and with the exception of the WordSmith key-keywords representation — the trees are similar, with the major difference between the top performing frequency feature set and the keywords and naive key-keyword feature sets being that "school" is used as the root node of the frequency tree, and does not occur in the top part of the other trees. The WordSmith key-keyword tree is very different from the other three trees, as there is little overlap between the features selected by the WordSmith method and those selected by the alternatives.

A partial explanation for these surprising results is that the "school" feature is a key discriminator in the gold standard biographical data, and while frequent enough in the Wikipedia and Chambers data to warrant inclusion in the most frequent 500 unigrams, it did not occur sufficiently frequently compared to the reference corpus to occur in the keywords list or in either of the key-keyword lists. The differences between American and British English may be crucial here. The term "school" is often used in American English to describe what, in British English, would be described as "university". Additionally, the word "school" occurs frequently in compounds like "high school" and "elementary school". The Brown corpus — the reference corpus used in this work — is a general corpus of American English and hence contains a higher proportion of this extended sense of "school" (for example, "high school", "elementary school") than a corpus of British English.
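A minimal sketch of the kind of tree inspection described above, using scikit-learn's CART-style decision tree as a stand-in for the WEKA C4.5 (J48) implementation actually used; the sentences, labels and feature words below are placeholders rather than the experimental data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder data: sentence strings and bio / non-bio labels.
sentences = ["He was born in 1941 and educated at Oxford.",
             "The committee approved the budget on Tuesday."]
labels = ["bio", "non-bio"]

# feature_words would be one of the 500-word feature sets under comparison.
feature_words = ["school", "university", "born", "married", "son", "career"]

vectoriser = CountVectorizer(vocabulary=feature_words, binary=True)
X = vectoriser.fit_transform(sentences)

tree = DecisionTreeClassifier(max_depth=3)  # shallow tree: only the top decision points
tree.fit(X, labels)
print(export_text(tree, feature_names=feature_words))  # inspect the root and first splits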
It is notable that there is a substantial difference between the frequency based and naive key-keywords feature sets. Three hundred and fourteen words appear in the frequency list that do not appear in the naive key-keywords list (see Section 8.4 for a list of features). While some biographically important function words appear in the naive key-keywords list — for example, the preposition "in", the connective "and" and the pronoun "he" — many are absent. For example, "the", "of" and "to" appear in the frequency list but not in the naive key-keywords list. Similarly, words that we would intuitively regard as biographical appear in the frequency list — words like "lived" and "children" — but do not appear in the naive key-keywords list.

The difference between the frequency based feature set and the WordSmith key-keyword feature set is even more marked than the difference between the frequency based feature set and the naive key-keywords feature set. 437 words occur in the frequency based feature set that do not occur in the WordSmith key-keywords feature set. Biographically relevant function words (like "the" and "of") are missing from the WordSmith feature set, as are more obviously biographical words like "born" and "children".

There are a number of possible reasons why both key-keyword feature sets failed to provide a better representation (in terms of classification accuracy) than the simple frequency list:

- The differences between British and American English (discussed above). The gold standard corpus is drawn from sources of British English, as was the set of documents from which the frequency list was derived.9 Yet the reference corpus used by the feature selection algorithm consisted of American English. This may have affected the keywords selected by the algorithm, from which the key-keywords were in turn selected.

- The biographical corpus, consisting of the Wikipedia and Chambers data, while large enough to provide a "biographical" frequency list, was not large or varied enough to counter the inclusion of ostensibly non-biographical unigrams (for example, "neoclassical" or "lanarkshire") that occurred towards the top of both key-keyword lists.

- It is possible that the number of features used was too low for the benefits of the key-keyword approaches to be clear. Perhaps if more features were used in each case, key-keywords might outperform the simple frequency list approach.

- The frequency and keyword lists were derived from biographical documents rather than biographical sentences, whereas the classification task involved the classification of biographical sentences (more specifically, biographical sentences identified using the annotation scheme outlined in Chapter 5). The non-biographical sentences in the biographical documents counted equally with the biographical sentences in the frequency calculations. It is possible that the key-keywords method discarded many features that are characteristic of biographical sentences. This possibility is weakened, however, if we consider the high proportion of biographical sentences in Wikipedia and Chambers (85%+).

- It is possible that the key-keyword methods are capturing corpus specific features rather than genre specific features, and that frequency lists derived from corpora of a given genre provide a better insight into that genre. In other words, a frequency list derived from a corpus of a given genre may reflect that genre's characteristics better than key-keywords, which are too specific to the topic orientated idiosyncrasies of the corpus.

- It is possible that, for the WordSmith key-keyword feature sets, the biographical texts may have been too short to generate expected frequencies greater than or equal to five for each feature (necessary to ensure the reliability of the feature selection; a brief sketch of this check is given after this list). This problem is addressed in the "further work" section of the concluding chapter.

9 Wikipedia contains a variety of national types of English. See: http://en.wikipedia.org/wiki/Wikipedia%3AManual of Style Accessed on 02-01-07.
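The expected-frequency condition mentioned in the final point above can be checked directly from the two-by-two contingency table that underlies keyness statistics. A brief sketch, with illustrative function names and the conventional minimum of five (not part of the original experiments):

def expected_frequencies(freq_doc, doc_size, freq_ref, ref_size):
    """Expected frequencies of a word in a document and in the reference corpus,
    under the null hypothesis that the word is equally likely in both."""
    total = doc_size + ref_size
    joint = freq_doc + freq_ref
    return doc_size * joint / total, ref_size * joint / total

def keyness_is_reliable(freq_doc, doc_size, freq_ref, ref_size, minimum=5):
    """Very short documents can yield expected frequencies below the minimum,
    making the keyness statistic unreliable for that word."""
    e_doc, e_ref = expected_frequencies(freq_doc, doc_size, freq_ref, ref_size)
    return e_doc >= minimum and e_ref >= minimum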
Another interpretation of the results is that the nature of the current task — sentence classification rather than document classification — may not be suitable for a key-keyword approach, as genre analysis techniques are only appropriate at the document level. Indeed, the Systemic Functional Linguistics tradition (see Section 2.1.1 on page 8) holds that single sentences cannot be described as having a genre. Rather (according to Systemic Functional Linguistics theory), genre is a phenomenon of the discourse level.

On the basis of the work presented in this section, key-keyword methodologies are not suitable techniques for the identification of unigram features for biographical sentence classification, as simple frequency counts provide better performance. This result is surprising, however, as the opposite result — that key-keywords would prove to be a better feature set than frequent unigrams — was expected. The topic needs further work before definitive conclusions can be drawn.

9.5 Conclusion

This chapter has reported on investigations into feature sets for biographical sentence classification. The investigation has centred around three themes: the utility of syntactic representations, the utility of non-standard lexical representations, and the utility of "keyword" based methods for the biographical sentence classification task. The most important findings are:

- "Bag-of-words" style features augmented by syntactic features increase classification accuracy for the biographical sentence classification task, compared to the use of "bag-of-words" features alone (although not at a statistically significant level).
- Stemming increases classification accuracy compared to the use of plain frequencies (although not at a statistically significant level).
- The use of key-keyword based methods provides lower classification accuracy than the use of frequent unigrams alone.

The next chapter examines the portability of features for the biographical classification task.

Chapter 10
Portability of Feature Sets

This chapter explores the portability of the biographical features identified by Zhou et al. (2004), who identified 5062 unigram features from the University of Southern California biographical corpus (USC) (see Section 4.3.6 on page 94), and assessed these features using the USC corpus as a test/training set. Classification accuracy for the USC derived features on the USC corpus of biographical sentences was very high at 82.45%. This chapter explores whether those 5062 unigrams are portable for use in classifying other biographical sentence corpora, in this case the "gold standard" data constructed as part of this work (see Chapter 5 on page 101). The issue of whether unigram features ought to be derived from the same corpus that is used for testing and training data is also addressed. The chapter is divided into five sections: motivation, experimental procedure, results, discussion and a brief conclusion.

10.1 Motivation

The identification of a feature set that performs well in a variety of different biographical sentence classification situations is important if we are to have confidence applying that feature set to the biographical sentence classification task generally.
Zhou et al. (2004) tested various feature sets (using the USC corpus as a test/training set) for the binary biographical sentence classification task using the Naive Bayes classification algorithm.1 The feature sets used included bigrams and trigrams (all derived from the USC corpus). The best performing feature set consisted of all those unigrams that occurred within the biographically tagged clauses of the USC corpus. These unigrams were used on the intuition that they would provide exemplary biographical unigrams, and that limiting unigrams to those which occur in biographical clauses would reduce the number of features in the feature set that do not contribute to classification accuracy.

1 Note that the USC ten class biographical annotation scheme can be reduced to a binary scheme simply by regarding each sentence which contains a tagged biographical clause as biographical, and a sentence that contains no such clause as non-biographical.

One danger in such an approach, however, is the possibility that the unigrams harvested from the USC biographical clauses are too specific to that corpus, and will not "port" well for classifying other biographical data. The USC corpus, while it contains biographical text and was designed as a biographical corpus, is limited to short web biographies of only a few major historical persons (for example, Marilyn Monroe, Martin Luther King, and so on). It is possible that web biographies are a sub-genre of short biographies that does not represent the entire range of short biographies adequately. Additionally, the use of only a few biographical subjects could result in derived features that are too specific to those individuals. For example, "Monroe" is included in the 5062 unigram features, yet it is not obvious how the inclusion of a "Monroe" unigram feature would aid a general purpose biographical sentence classifier.

In this chapter, in order to test the portability of the 5062 USC derived features identified by Zhou et al. (2004), we use these 5062 unigram features in conjunction with the Naive Bayes classification algorithm to classify the gold standard biographical sentences developed as part of the current research project and described in Chapter 5 on page 101. In order to provide a point of contrast against which we can judge the performance of the USC derived features, we use a frequency list of 5062 unigrams derived from biographical dictionaries.

10.2 Experimental Procedure

Our first step required identifying all those unigrams found within tagged biographical clauses in the USC corpus, which was achieved using UNIX text processing utilities. See Figure 10.1 for an illustration of the biographical unigram extraction process; a brief code sketch of this extraction step is also given after the list of sources below. The second step involved the identification of a baseline against which the USC features could be assessed. The baseline feature set consisted of the 5062 most frequent unigrams from a set of texts constructed from two biographical dictionaries. The text collection consisted of 320,000 word tokens and was collected from the following two sources:

- 100,000 word tokens from the Chambers Biographical Dictionary (see Section 4.3.2 on page 89).
- 220,000 word tokens from the Dictionary of National Biography. Only entries of fewer than six hundred words were used, on the intuition that these entries would contain less historical and political background information, and more explicitly biographical material. See page 87 for a description of the Dictionary of National Biography.
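A minimal sketch of the clause-level unigram extraction step. The thesis used UNIX text processing utilities; the regular-expression approach and the exact tag syntax assumed here (angle-bracketed bio and edu tags) are illustrative only, and the real USC markup (shown in Figure 10.1) may differ:

import re

# Assumed tag syntax for illustration: biographical and education clauses wrapped
# in <bio> ... </bio> and <edu> ... </edu>; the actual USC markup may differ.
TAGGED_CLAUSE = re.compile(r"<(bio|edu)>(.*?)</\1>", re.DOTALL | re.IGNORECASE)

def usc_clause_unigrams(text):
    """Collect the set of lower-cased unigrams occurring inside tagged clauses."""
    unigrams = set()
    for _tag, clause in TAGGED_CLAUSE.findall(text):
        unigrams.update(re.findall(r"[a-z0-9]+", clause.lower()))
    return unigrams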
Figure 10.1: Biographical Unigram Extraction from the USC Corpus.

Extract from the USC corpus (biographical and education clause tags appear as bio ... /bio and edu ... /edu):

Martin Luther King, Jr., bio (January 15, 1929 - April 4, 1968) /bio bio was born /bio Michael Luther King, Jr., but later had his name changed to Martin. His grandfather began the family's long tenure as pastors of the Ebenezer Baptist Church in Atlanta, serving from 1914 to 1931; his father has served from then until the present, and from 1960 until his death Martin Luther acted as co-pastor. edu Martin Luther attended segregated public schools in Georgia, graduating from high school at the age of fifteen /edu ; edu he received the B. A. degree in 1948 from Morehouse College /edu , a distinguished Negro institution of Atlanta from which both his father and grandfather had been graduated.

Extracted unigram features:

january, 15, 1929, 1968, was, born, attended, segregated, public, georgia, graduating, from, at, the, age, he, received, his, 1948, morehouse, April, martin, schools, high, of, BA, college, 4, luther, in, school, fifteen, degree

A much larger text collection could have been used as the basis for the frequency counts (the Dictionary of National Biography consists of almost thirty-four million word tokens), but only a small subset of the Dictionary of National Biography was used in order that a high proportion of non-Dictionary of National Biography biographical text — text from the Chambers Biographical Dictionary — could be included in the corpus. This decision was made to prevent the resulting frequency list from reflecting the idiosyncrasies of the Dictionary of National Biography, rather than being representative of short biographies more generally.

A frequency list from the two biographical corpora was obtained and the most frequent 5062 unigrams retained, to create an equal number of features to those derived from the USC corpus. Each feature set was then used in a 10 x 10 stratified cross validation using the Naive Bayes learning algorithm2 on the gold standard biographical data developed as part of this research project (see Chapter 5 on page 101). The WEKA implementation of the Naive Bayes learning algorithm was used. Note that in this chapter we are interested in the portability of feature sets for the biographical sentence classification task, rather than the portability of trained classifiers (models).

2 The Support Vector Machine (SVM) algorithm was also used.
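A minimal sketch of the 10 x 10 stratified cross validation protocol, assuming scikit-learn as a stand-in for the WEKA implementation actually used; the feature matrix and labels below are random placeholders rather than the gold standard data:

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# X: sentence-by-feature count matrix for one feature set; y: bio / non-bio labels.
# Placeholder random data stands in for the gold standard corpus.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5062))
y = np.array([0, 1] * 100)

# 10 repetitions of stratified 10-fold cross validation give 100 accuracy scores,
# which can then be compared across feature sets with the corrected re-sampled t-test.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
scores = cross_val_score(MultinomialNB(), X, y, cv=cv, scoring="accuracy")
print(scores.mean())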
10.3 Results

The mean results of the 10 x 10 fold stratified cross validation are presented in Table 10.1. Note that the means for the two feature sets are almost identical for both the Naive Bayes and SVM algorithms. The accuracy score for each run of the 10-fold stratified cross validation is shown in Figure 10.2, where it can be seen that the mean figures reported do not mask highly deviated results.

Table 10.1: Classification Accuracies of the USC and DNB/Chambers Derived Features on Gold Standard Data.

Feature Set     Mean Accuracy, Naive Bayes (%)    Mean Accuracy, SVM (%)
USC Features    76.61                             79.33
DNB/Chambers    76.58                             79.32

Figure 10.2: Comparison of the Performance of Unigrams Derived from USC Annotated Clauses and Biographical Dictionary Unigram Frequency Counts.

Table 10.1 shows that the classification accuracies for each feature set are almost identical (only 0.03% separates them). The difference between the two feature sets was subjected to statistical testing, using the corrected re-sampled t-test, and it was found that there was no statistically significant difference between the two feature sets (see Figure 10.3 for the experimental and null hypotheses).

Figure 10.3: Experimental and Null Hypotheses — USC and Biographical Dictionary Derived Features.
Experimental Hypothesis: There is a difference in classification accuracy between a feature set based on 5062 unigrams derived from biographical clauses in the USC corpus and the most frequent 5062 unigram features in a sample of the DNB and Chambers biographical dictionaries.
Null Hypothesis: There is no difference in classification accuracy between the 5062 unigram feature set derived from biographical clauses in the USC corpus and the most frequent 5062 unigram features in a sample of the DNB and Chambers biographical dictionaries.

10.4 Discussion

The results gained in this chapter show that the feature identification strategy adopted by Zhou et al. (2004) — using only those unigrams that appear in biographical clauses in the USC corpus as features — provides a similar level of portability to the use of frequent unigrams derived from the Dictionary of National Biography. In other words, the features identified by Zhou et al. (2004), when ported for use on other biographical data, perform at an almost identical accuracy level to "plain" frequent unigrams, which can be identified automatically from biographical corpora and require no intensive annotation effort. There are at least four possible reasons why these "hand identified" unigram features do not provide superior results:

1. The USC corpus consists of numerous biographies of the same small set of individuals (for example, Marilyn Monroe, Einstein and so on). This means that many person specific unigrams (for example, names, birthplaces and so on) would appear repeatedly in the biographical clauses, reducing the variability of the resulting frequency list. That is, if we have one hundred different individuals, we could conceivably have one hundred different birthplaces. In contrast, if we have one hundred biographies of one individual, we are likely to have only one birthplace, thus reducing the number and variety of features.

2. All biographies used in the USC corpus are harvested from the web. It is possible that the particular constraints imposed by web publishing fail to reflect the qualities of short biographies more generally.

3. The USC corpus is too small (at approximately 170,000 word tokens), and the number of annotated biographical clauses too few, to provide a list of representative biographical unigram features. It is possible that in order to gain better features, and thus improve classification accuracy, it would be necessary to increase the size of the corpus, which in turn would require more annotation effort. It is also notable that the text sample taken from DNB/Chambers was relatively small (although approximately twice the size of the USC corpus). It is possible that increasing the size of both the USC corpus and the sample from DNB/Chambers may affect classification accuracy.

4. It is possible that the inconsistent biographical tagging employed in the USC corpus reduced the quality of the derived feature set for the purposes of biographical sentence classification.
That is, in some biographies only biographical words are tagged, rather than clauses, resulting in a unigram feature set that perhaps excludes biographically important unigrams (see Figure 4.1 on page 95 for some examples of this inconsistent annotation, taken from the Curie section of the USC corpus).

The results obtained were surprising, as it was initially thought that there was likely to be some increase in classification accuracy using Zhou et al. (2004)'s labour intensive feature identification process, compared to a simple unigram frequency list derived from biographical dictionaries. It was also thought that the very high accuracy score — 82.42% — achieved by Zhou et al. (2004) on the USC corpus might be reduced when the feature set was tested on the gold standard data. That is, it was expected that when the 5062 features derived from USC biographical clauses were applied to another corpus of biographical data, classification performance would drop, but it was not expected that it would drop to the point of being almost identical to the "baseline" biographical dictionary frequencies feature set.

On the face of it, the almost equal performance of the two feature sets — the USC derived features and the biographical dictionary frequencies — could be seen to indicate that there is some upper ceiling on the performance of unigram features for biographical sentence classification. This theory is belied, however, when we consider that previous chapters have shown that we can achieve classification accuracy above 76.2% on the gold standard data with unigram based methods. For example, a feature set consisting of only five hundred frequent unigrams from a corpus of Wikipedia/Chambers biographies achieved an accuracy of 81.25% on the gold standard data using the Naive Bayes learning algorithm (see Section 9.3 on page 172).

10.5 Conclusion

This chapter has compared the best performing feature set identified by Zhou et al. (2004) to an equally sized feature set consisting of frequent unigrams derived from a sample of the Dictionary of National Biography and the Chambers Biographical Dictionary. When the two feature sets were compared using 10 x 10 fold cross validation, using the gold standard corpus developed in Chapter 5 on page 101 and the Naive Bayes algorithm, the performance of the feature sets was almost identical (with only 0.03% difference). This suggests that the strategy adopted by Zhou et al. (2004) for the identification of appropriate biographical features, while it delivers high classification accuracy for the USC biographical corpus, does not confer any additional benefit above and beyond the use of straightforward unigram frequencies derived from biographical dictionaries when applied to alternative biographical data.

Chapter 11
Conclusion

This thesis presented and explored the general hypothesis that biographical sentences can be reliably identified using automatic methods. This concluding chapter summarises the thesis in terms of the contributions made, before outlining areas for possible future work.

11.1 Contributions

The general claim that biographical writing can be identified at the sentence level using automatic methods is broken down into two sub-hypotheses:

Hypothesis 1 Humans can reliably identify biographical sentences without the contextual support provided by a discourse or document structure.
Hypothesis 2 "Bag-of-words" style sentence representations augmented with syntactic features provide a more effective sentence representation for biographical sentence recognition than "bag-of-words" style representations alone.

Hypothesis 1 is addressed in Chapters 5 and 6, while the machine learning chapters — Chapters 7, 8, 9 and 10 — are concerned with Hypothesis 2 and with the general hypothesis that biographical sentences can be reliably identified using automatic methods.

The contributions made by the thesis can usefully be divided into two main groups, reflecting the two sub-hypotheses. The main hypothesis (and its two sub-hypotheses) provides a framework for the thesis, but other research questions are addressed within that framework (for example, the utility of the key-keywords methodology; see page 21).

11.1.1 Hypothesis 1, Annotation Scheme and Human Study

An annotation scheme for biographical sentences was developed (Chapter 5). The scheme was heavily influenced by existing schemes (like the Text Encoding Initiative biographical scheme, and the biographical guidelines used to construct the Dictionary of National Biography). The new scheme was specifically designed to identify the kind of biographical sentences that occur in short biographical summaries (like Wikipedia biographical entries). It is demonstrated with numerous examples that the scheme delivers excellent coverage for the texts of interest. The annotation scheme is also validated by the human study reported in Chapter 6, where it is shown that there is a good level of agreement between annotators asked to classify sentences according to the scheme.

A biographical corpus was developed as part of this work (Chapter 5), based on the new annotation scheme. The corpus, although not large, is constructed from various sources, including news text from the Guardian newspaper and extracts from the STOP corpus. As with the annotation scheme, the biographical corpus is annotated at the sentence level.

A human study (Chapter 6) was conducted which involved an online experiment with twenty-five participants. The study demonstrated that human classifiers can agree on whether a sentence is biographical or non-biographical (given the annotation guidelines developed in Chapter 5) with good reliability. That is, agreement between participants in the study on the status of sentences as biographical or non-biographical was good, using an appropriate variant of the kappa agreement statistic.

The cumulative force of Chapters 5 and 6 is to provide strong evidence in support of Hypothesis 1 (humans can reliably identify biographical sentences without the contextual support provided by a discourse or document structure). Chapter 5 describes a set of clear guidelines for identifying biographical sentences (that is, the annotation scheme), and Chapter 6 validates that decision procedure, showing that people are able to identify biographical sentences with good reliability.
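The agreement figures summarised above rest on the kappa statistic, which corrects raw agreement for the agreement expected by chance. As a concrete illustration (not the study's actual data, nor its multi-annotator variant of kappa), the two-annotator case can be computed as follows, assuming scikit-learn:

from sklearn.metrics import cohen_kappa_score

# Invented example: two annotators labelling the same ten sentences.
annotator_a = ["bio", "bio", "non", "bio", "non", "non", "bio", "bio", "non", "bio"]
annotator_b = ["bio", "non", "non", "bio", "non", "non", "bio", "bio", "non", "non"]

# Kappa corrects the observed agreement for chance agreement.
print(cohen_kappa_score(annotator_a, annotator_b))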
11.1.2 Hypothesis 2, Automatic Biographical Sentence Classification

Chapter 7 explores the performance of six learning algorithms using a 10 x 10 cross validation methodology (see Section 2.5.2), employing "gold standard" data derived from the biographically annotated corpus described in Chapter 5 (and validated in the human study reported in Chapter 6). The six algorithms were Naive Bayes, a Support Vector Machine classifier, the C4.5 decision tree algorithm, the Ripper rule learning algorithm, the "One Rule" algorithm, and a baseline that classified all test data as belonging to the most frequent class in the training data (the "Zero Rule" algorithm). On the basis of the experimental work in that chapter, Naive Bayes was the best performing algorithm, although not at a statistically significant level compared to the second most accurate algorithm, the Support Vector Machine classifier. It should be noted, however, that the Naive Bayes algorithm performed significantly better than the other learning algorithms tested, apart from the Support Vector Machine classifier. Additionally, the two most successful algorithms (SVM and Naive Bayes) are used in all the machine learning experiments, and while Naive Bayes performs better in most instances, there are some feature sets for which the Support Vector Machine classifier performs better (for example, trigrams derived from the Dictionary of National Biography — see Table 9.1 on page 163).

Chapter 9 explores a core theme of the thesis: that topic neutral syntactic features are useful for biographical sentence classification. The thesis has characterised biographical sentence classification as a genre classification problem, where topic neutral features (in this case syntactic features) are useful. The work reported in Chapter 9 shows that syntactic features (identified empirically from the data produced by Biber (1988)) increase classification accuracy, albeit not at a statistically significant level.

Chapter 9 provides some limited support for the contention that n-grams (where n > 1) increase classification accuracy for genre classification tasks, as n-grams provide a low effort strategy for encoding syntactic data (hence the description "pseudo-syntactic features"). This support is weak, however. The difference between the baseline unigram representation and the same representation augmented by bigrams was less than 1%; far too small to be judged significant using the corrected re-sampled t-test. Additionally, this chapter suggests that Scott and Matwin (1999)'s contention that it is "probably not worth pursuing simple phrase based representations further", while it may apply to topical text categorisation, does not apply to biographical sentence classification.

Chapter 9 provides strong support for the view that non-topical features are important for the biographical classification task. A baseline feature set of 2000 frequent unigrams was tested against the same feature set with all function words removed. The difference between the performance of the two feature sets was highly significant. This result is in line with the view of biographical sentence classification as a genre classification task, where topic neutral stylistic features (like function words) are very important. If function words were irrelevant and only topic related words important, then there would be no substantial difference between the classification performance of the two feature sets.

Chapter 9 shows that, for the biographical sentence classification task, stemming (that is, reducing morphologically complex words to a canonical form) slightly improves classification accuracy compared to the use of "plain" unigrams. The difference was not statistically significant, however.
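A minimal sketch of how a combined "bag-of-words plus syntactic features" representation of the kind referred to in Hypothesis 2 can be assembled, assuming scikit-learn and the NLTK part-of-speech tagger as stand-ins for the feature extraction actually used; part-of-speech tag bigrams serve here as a crude, illustrative proxy for the syntactic features:

import nltk  # requires the NLTK tokenizer and tagger models to be available
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def pos_tags(sentence):
    """Represent a sentence by its part-of-speech tag sequence (a crude syntactic view)."""
    return " ".join(tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence)))

combined_features = FeatureUnion([
    ("unigrams", CountVectorizer()),                          # bag-of-words features
    ("pos_bigrams", CountVectorizer(preprocessor=pos_tags,    # syntactic features from
                                    ngram_range=(2, 2))),     # POS tag bigrams
])

classifier = make_pipeline(combined_features, MultinomialNB())
# classifier.fit(train_sentences, train_labels) would then train on labelled sentences.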
Chapter 9 suggests that the key-keyword based methods (described in Section 8.4 on page 152) do not provide the optimal feature selection method for the biographical classification task, as unigram frequencies derived from a biographical corpus performed better. The method was originally developed as a genre analysis tool, yet the paucity of topic neutral features (that is, function words) generated by the key-keywords method suggests that its usefulness as a genre analysis tool may be overstated.

Chapter 10 shows that the lexical feature selection method adopted by Zhou et al. (2004) performs at a near identical level to a feature set consisting of frequent unigrams automatically derived from biographical dictionaries. Zhou et al. (2004)'s feature set was derived from the USC corpus, which, as the reader will recall, is annotated at the clause level for biographical information. Zhou et al. (2004) derived unigram features from biographical clauses alone; that is, the only unigrams used were those that occurred within biographically tagged clauses. Zhou et al. (2004) achieved very high classification accuracy with this method using USC test data. Chapter 10 shows, however, that when the features identified by Zhou et al. (2004) were used to classify the gold standard data created for this work, classification accuracy is almost identical to that of simpler, frequency list based unigram feature sets, which can be derived automatically and do not require an extensive annotation effort.

11.2 Future Work

Taking the work reported in this thesis as a starting point, this section suggests areas for possible future research. These suggestions fall into five broad areas:

1. Implementation of a biographical sentence classifier as a module within a wider system.
2. Improving binary biographical sentence classification.
3. Extension of the biographical sentence classification techniques described in this thesis both to other genres and to whole document classification.
4. Extending the use of empirically identified syntactic features to other text classification problems (for example, gender and age based classification).
5. Investigating methods of genre analysis in the light of issues raised by the current work.

Each of these five areas for future work is now examined in turn.

11.2.1 Biographical Sentence Classifier Module

A biographical sentence classifier based on the best performing feature set/classification algorithm combination (that is, unigrams and syntactic features), trained using the data created as part of this project (and perhaps additional data), could be used as part of a biographically orientated Multiple Document Summarisation system. Of course, a biographical sentence classifier would be only a small component of an effective Multiple Document Summarisation system, as redundancy removal, temporal ordering of output and so on would still be required (see Chapter 3 on page 53).

A biographical sentence classifier could also serve as a useful tool in the context of journalistic research, where it is often important to identify biographical sentences in vast amounts of text. A biographical sentence classifier could be incorporated into a system that highlights sentences of biographical interest in electronic texts, allowing a journalist or researcher to identify sentences of interest without reading an entire — perhaps lengthy — article or document.
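A minimal sketch of how such a highlighting tool might wrap a trained sentence classifier; the function, the "bio" label and the sentence splitter are hypothetical, and the classifier is assumed to expose a scikit-learn style predict method (for example, the pipeline sketched in the previous section):

import nltk

def highlight_biographical(document_text, classifier, marker=">>"):
    """Split a document into sentences and mark those the classifier labels biographical."""
    sentences = nltk.sent_tokenize(document_text)
    labels = classifier.predict(sentences)
    return "\n".join((marker + " " + s) if label == "bio" else s
                     for s, label in zip(sentences, labels))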
11.2.2 Improving Biographical Sentence Classification

It is possible that a larger corpus of training data may improve results. Of course, this would require a significant annotation effort. It would be interesting to see what kind of improvements (if any) were gained by trebling or quadrupling the volume of training data. It is possible that increasing or decreasing the number of features used may increase classification performance. Note that a feature set of 500 unigrams produced the greatest accuracy in this study (see Table 9.3 on page 172). It is also possible that performance could be improved using a mixed model feature set consisting of unigrams and syntactic features.

It is possible that the ratio of biographical to non-biographical sentences in the training data may need to be changed in order to construct an effective model for situations in which biographical sentences are very sparse. The current training data consists of 50% biographical sentences and 50% non-biographical sentences. It would be of interest to assess the performance of different ratios of biographical to non-biographical training data on varying sources of text.

11.2.3 Extensions to the Biographical Sentence Classification Task

Instead of biographical sentence classification, similar methods (that is, the use of unigrams augmented by empirically identified syntactic features) could be used for biographical document classification. As genre is primarily a discourse level, rather than sentence level, phenomenon according to Systemic Functional Grammar theory, and to a lesser extent Multi-dimensional Analysis (see Chapter 2), it is hypothesised that the use of syntactic features should increase classification accuracy at the document level beyond that achieved for sentence classification. Documents could be given a biographical "score" reflecting their biographical content, which could be useful in the context of search engine results.

Instead of the binary classification analysed in the current work, the fine grained biographical annotation scheme and tagged corpus developed during this work and described in Chapter 5 could be used to assess the feasibility of classifying sentences according to biographical type (that is, into one of the six biographical categories: key information, fame, character, relationship, education and work).

11.2.4 Other Text Classification Tasks

Although not directly applicable to biographical or genre classification, the use of syntactic features as indicative of non-propositional (stylistic) content has been important in this thesis. There are several directions in which stylistic classification could be developed; the traditional case is stylometry (authorship attribution), outlined in Chapter 2. Two interesting applications are gender classification (already explored to a certain extent by Argamon et al. (2003)) and author age text classification (that is, the attempt to discern stylistic features that discriminate between younger and older people). Both these applications would require the construction of an appropriate corpus.

11.2.5 Genre Analysis

Section 2.1.2 on page 21 suggests that the key-keyword method, used by Tribble (1998) as an alternative to Biber style multi-dimensional analysis, does not capture the distinctive features of a given genre, as function words are underrepresented.
It would be worthwhile to explore the usefulness of the key-keyword approach in other genre classification tasks (for example, the identification of news text), in order to test whether similar results are obtained across different genres. Additionally, different statistical methods (that is, feature selection methods) could be used in conjunction with a larger biographical corpus consisting of lengthier biographical documents, in order to explore more fully the limits of key-keyword based techniques. It would also be useful to repeat the key-keyword experiments with a multi-genre reference corpus consisting of British rather than American English. This work could utilise the WordSmith software.

11.3 Conclusion

In summary, then, this thesis has addressed the issue of whether biographical writing can be reliably identified at the sentence level using automatic methods, using first a human study, which established that people could perform the task with good agreement, before going on to consider whether the task could be performed automatically using machine learning algorithms and an appropriate feature representation. The later chapters of the thesis focused on exploring possible feature representations for the task, and weak evidence was found that "bag-of-words" style unigram features, augmented by syntactic features, perform better than "bag-of-words" style features alone. In other words, syntactic (stylistic) features may well be useful for biographical sentence classification.

Appendix A
Human Study: Pilot Study

A.1 Introduction

This appendix describes in detail the pilot web based human study experiment reported in Section 6.3 on page 130. Recall that the experiment was designed to determine human ability in distinguishing between biographical and non-biographical sentences from a number of different data sources (including the Dictionary of National Biography and the TREC corpus). Fifteen participants were involved in the study. After reading the provided instructions, the participants were invited to categorise 100 sentences as either core biographical, extended biographical, or non-biographical. The first section reproduces the task instructions, the second section presents the 100 sentences used in the study, and the final section sets forth the classification data collected.

A.2 Task Description

The online questionnaire should take less than 20 minutes to complete; that includes time for reading the instructions and answering 100 short, simple questions. The task is to classify each of the 100 sentences below into one of three categories: 1) core biographical, 2) extended biographical, 3) non biographical. Each question consists of a sentence, which the participant classifies as belonging to one of the three categories. The three categories are described in turn below.

A.2.1 Core Biographical Category

If a person is mentioned, is that person the subject of the sentence? Is the sentence about that person? Is the purpose of the sentence to inform us about that specific individual? Is the sentence designed to give information about who the person is? Relevant information here could be details of birth and death dates, education history, nationality, employment, achievements, marital status, number of children, etc. If the central purpose of a sentence is to convey information about an individual, then that sentence can be classified as core biographical.
Note that the person doesn’t have to be mentioned by name, “he” or “she” is adequate, as long as it’s clear from the context that all the “he”-s, “she”-s, “him”-s and “her”-s refer to the same person. Here are three examples of sentences that have been classified as core biographical: He was jailed for a year in 1959 but, given an unconditional pardon, became Minister of National Resources (1961), then Prime Minister (1963), President of the Malawi (formerly Nyasaland) Republic (1966), and Life President (1971). His intellect, wit and love of France are reflected in his third novel, Flaubert’s Parrot (1984), in which a retired doctor discovers the stuffed parrot which was said to have stood upon Gustave Flaubert’s desk. Ann West was born at New Scone, Perthshire, Scotland, on 17 May 1825, the daughter of Mary Brough and her husband, John West, a cotton handloom weaver. A.2.2 Extended Biographical Category If a person is mentioned, is that person incidental to the meaning of the sentence? Is the sentence about something else (say, an event or organisation) and the person just mentioned in passing. The distinction between extended biographical sentences and core biographical sentences is that in extended sentences, while a person is mentioned (either by name or by “he”, “she”, “him” or “her”), the sentence isn’t about them. Here are two examples of sentences that have been classified as extended biographical. “This new consumer is a pretty empowered person,” said Wendy Everett, director of a study commissioned by the Robert Wood Johnson Foundation. At last year’s Conference on Retroviruses and Opportunistic Infections, Dr. David Ho and others from the Aaron Diamond AIDS Research Center at Rockefeller University presented evidence that the virus probably first infected humans in the 1940s or early ’50s. A.2.3 Non Biographical Category. Non biographical sentences are easy to identify because they don’t contain the names of people or references to people (“he”, “she”, “him”, “her”). Here are two examples of sentences that have been classified as non biographical: Of the 6 million notebooks Taiwan turned out last year, Quanta produced 1.3 million sets, accounting for about 8 percent of the world output. 192 A PPENDIX A: H UMAN S TUDY: P ILOT S TUDY A.3 Task Questions 1. He lies buried in an obscure corner of the Little Neck burial-ground at Bullock’s Cove, Swansey, Rhode Island. 2. The Long Parliament tried to have him restored in 1641–2, but without effect, and from 1644 onwards Sir Edmond Prideaux [q.v.], later attorney-general under the Commonwealth, was somewhat precariously ensconced as postmastergeneral. 3. Catastrophic coverage paying all costs would kick in after $4,000 in annual out-of-pocket spending by a beneficiary. 4. Born in Widnes, Lancashire, he studied at the universities of Liverpool and Cambridge 5. He shot dead six people and wounded another seven. 6. Daoud Mohammed, a 28-year-old soldier, was resting, surrounded by dozens of Kalashnikov rifles, rocket launchers and boxes of ammunition. 7. He was taken prisoner by the Japanese when Singapore fell and died in a prison camp in Formosa. 8. He was consular chaplain to the British residents at Monte Video from 6 May 1854 to 31 December 1858. 9. He must have survived his father, if at all, only a short time, as his widow married Robert de Ros in 1191, and the date of his father’s death being uncertain it may be doubted whether he succeeded to Annandale. 10. 
Born in Karlsruhe, he developed a two-stroke engine from 1877 to 1879,and founded a factory for its manufacture, leaving in 1883 when his backers refused to finance a mobile engine. 11. On 3 May 1823 he was admitted commoner of St. Edmund Hall, Oxford. 12. His mother, who came of a Yorkshire family, was a foundation member of the Independent Labour Party and the British Communist Party, a Cooperator, and a member of the Ashton and District Weavers’ Association until she died. 13. Young walked to force Buford home, and Sosa added two insurance runs with a double to right. Rick Aguilera got his 17th save, while Daniel Garibay (2-3) was the winner. 14. His family moved to England during the Franco-Prussian War, and settled there in 1872. 15. He moved to Paris in 1829. 16. In 1679 he brought an accusation against the Duchess of Richmond, which on investigation proved to be false, and he was forbidden to attend the court. 193 A PPENDIX A: H UMAN S TUDY: P ILOT S TUDY 17. It would eliminate some 240 miles of levees and canals as well as construct above ground reservoirs, underground aquifers and develop new wetlands. 18. He entered the corps of Royal Military Surveyors and Draughtsmen as cadet on 20 Aug. 1808, and became a favourite pupil of John Bonnycastle, the mathematician. 19. Johnson attributed this in large part to President Clinton’s silence on the matter until recently. 20. After the Revolution he became curator of paintings at the Hermitage Museum, but in 1928 settled in Paris. 21. Bove said he would appeal any sentence and vowed to continue his battle internationally 22. In 1649 and 1651 he was charged with conveying money, letters, and intelligence to the Royalists overseas, and acquitted on both occasions. 23. Of his sons, Henry St. Clair is noticed separately; another son, J. Murray Wilkins, was the last rector of Southwell collegiate church before it became a cathedral. 24. Born in Philadelphia, Pennsylvania, he worked as a journalist and magazine editor before turning to fiction. 25. Clinton had appointed Ward to a judgeship in 1989, and Ward also was a Democratic state representative when Clinton was Arkansas governor. 26. He took part in local religious and philanthropic work, edited a controversial magazine, the Watchman’s Lantern and in 1849 entered the Liverpool town council. 27. He argues strenuously against the mass, and inveighs against the medieval practice of regarding the mass as a vicarious and solitary sacrifice, at each celebration, of the one atoning death, but always holds that Christ is present with all His benefits in the sacrament, that the elements of bread and wine are not bare and naked signs of the body and blood of Christ. 28. Here he played the title-part in Cyrano de Bergerac; but his excursions into romance were not appreciated by the public. 29. The Post, saying it had obtained a copy of the report, said in Sunday editions that the 200-page document makes a direct correlation between the vulnerability of things like the Lincoln Memorial and Washington Monument and funding for the U.S. Park Police, the law enforcement arm of the park service. 30. The profits from this enterprise enabled him to set up his own small ironworks at St Pancras in London 31. He was the adopted son of the astrologer, William Lilly, who constantly makes reference in his works to Coley’s merit as a man and as a professor of mathematics and occult science. 194 A PPENDIX A: H UMAN S TUDY: P ILOT S TUDY 32. 
At times he speaks as an eye-witness, especially in his account of the foreign expeditions in which he took part. He quotes at some length the speeches of the king, the petitions or remonstrances of the parliament, and other original documents. 33. He devised equations which enabled both the thermal energy and that due to baroclinicity to be calculated for a developing cyclone. 34. Two people have been killed and at least another 80 injured after a terrace at an island winery in Lake Erie collapsed this afternoon. 35. It also could serve as a motto for the Tour, still trying to recover from a doping scandal that nearly did in the 1998 competition and sullied the image of a beloved summer ritual. 36. Had Curry been found guilty of the sexual assault charge, he would have faced a possible 20-year-term. 37. Dubthach Maccu Lugir, 5th cent termed in later documents mac hui Lugair, was chief poet and brehon of Laogaire, king of Ireland, at the time of St. Patrick’s mission 38. He was born in Örebro, a place he frequently satirized in later life, taking revenge for the humiliation he had suffered as a stout and painfully shy youth. 39. Clergy living in concubinage within his diocese were to be deprived of their benefices; all candidates for ordination were to take a vow of chastity; the unworthy were to be excluded from ordination; charity and hospitality were enjoined on rectors; tithes were to be paid regularly; detainers of tithes were to be severely punished (cf. Ann. Tewkesbury, pp. 148, 149); vicars were to be priests and hold only one cure; non-residence was condemned; deacons were forbidden to hear confessions, impose penances, or baptise, save in emergencies; confirmation was to follow one year after baptism. 40. President Clinton was sued Friday by an Arkansas Supreme Court committee seeking to strip him of his law license. 41. This is probably the better likeness, bearing witness to his son-in-law’s description of him he was of a fair, fresh, ruddy complexion, temperate in his diet, fasting often. 42. The information about the Horman case was contained in a release of 505 previously classified documents, most from State Department files. 43. In early studies of the North Sea plaice population he noted its remarkable constancy, despite the high natural mortality rates of the early stages of fish. 44. The draft program calls for a minimum of 5 percent growth annually, which would lead to a 150 percent increase in the gross domestic product by 2010. 45. Fidel Castro’s government launched a new series of demonstrations Saturday in the wake of Elian Gonzalez’s return, calling out more than 300,000 195 A PPENDIX A: H UMAN S TUDY: P ILOT S TUDY people from across eastern Cuba to protest U.S. policies that it says harm this island’s citizens. 46. 1. A Synoptical Table of British Organic Remains, 1830, 8vo and 4to, in which, for the first time, all the known British fossils were enumerated. 47. Children are washed infrequently in basins. 48. As a blindfold player he was not surpassed even by Blackburne, and as an analyst he probably had no equal. 49. Pelling was a stout defender of the Anglican church against both Roman catholics and dissenters. 50. Superseded in papal favour by the sculptor Alessandro Algardi, Bernini concentrated on private commissions, the most famous of which is the Cornaro Chapel in the Church of Santa Maria della Vittoria 51. Besides his wife and niece, survivors include a brother-in-law, two sistersin-law, and 17 other nieces and nephews 52. 
An interest in gunshot wounds led him to treat the wounded from the Battle of Corunna (1809), and after Waterloo he organized a hospital in Brussels. 53. Airlines last year staved off legislative action by promising to treat customers better and to be more forthright with passengers all the way through their travel experience. 54. Whitehead’s last imprisonment was at the Poultry Compter, London, whither the lord mayor, Sir Robert Jefferies, sent him on 11 Feb. 1685, for preaching at Devonshire House.was given to the world in an anonymous tract, Thoughts on General Gravitation, and Views thence arising as to the State of the Universe. 55. While Clinton has been campaigning across New York for a year, Lazio didn’t formally join the race until May 20, the day after Republican Mayor Rudolph Giuliani quit the contest because of prostate cancer. 56. Grants with these objects in view were made by the commission. 57. He was buried at Brompton cemetery on 26 June, when most of the prominent British chess players were represented at his graveside. 58. Wills was an unusually brilliant conversationist, and some of his more ambitious poems show much of the dramatic power which descended to his son, William Gorman Wills. 59. Those with incomes between 135 percent and 150 percent of poverty – about $12,600 for an individual and $16,900 for a couple – would have their monthly premiums subsidized on a sliding scale. 60. That was among the recommendations included in the four-member panel’s report aimed at improving the agency’s personal search procedures. 61. Government budget cuts were partly to blame for the high numbers, the report said. 196 A PPENDIX A: H UMAN S TUDY: P ILOT S TUDY 62. Beneath the debate over policy differences, though, lie competing political calculations. 63. He was educated at a school in Ayr and at the university of Edinburgh. 64. She is now receiving more attention from feminist critics, in the light of her continual artistic struggle with the question of female experience. 65. He founded the Congress of Roman Frontier Studies in 1949, and was Professor of Roman-British History and Archaeology at Durham (1956-71), and became founder-chairman of the Vindolanda Trust in 1970. 66. The state Legislature in 1995 allowed for the creation of charter schools, which are outside the control of local boards of education and are free of many state mandates and regulations. 67. Her conventional education at home was relieved by holidays with relations in Germany, during one of which visits she met Prince Aribert of Anhalt. 68. He was, however, prevented from proceeding further than Tirwill (probably Turovli on the Dwina), where he was imprisoned in irons for thirty-six days, probably at the instigation of rival traders and ambassadors from Danzig, Lubeck, and Hamburg, who, moreover, prevailed upon the king of Poland to stop all traffic through his dominions of the English trading to Muscovy. 69. Witnesses reported seeing Carolyn Waldron reading a book and standing in the middle of the platform moments before she fell onto the tracks about 2 a.m., police spokesman Alan Krawitz said. 70. Of several essays read by him before the Royal Irish Academy, one on the Spontaneous Association of Ideas was said by Archbishop Richard Whately to overturn Dugald Stewart’s theory on the same subject. 71. The villagers said they feared meeting Herrero inside of Eloxochitlan, a PRI stronghold 72. 
His practical training started at his father’s mill, where he was given a lathe and built small working steam engines. 73. He left a widow, Elizabeth, and three children, all under age. 74. His reputation as a preacher grew rapidly. 75. Twelve minutes earlier, Shui had won a penalty kick when she was hooked by Simone Gomes. 76. His grandfather, William Blackman Ellis, artist, naturalist, and taxidermist, who took him as a child for walks in Arundel Park, taught him much about the flora and fauna of the area, and this background of a love of nature and of skill in craftsmanship doubtless sowed the seed in him of a passion to perfect such love and skills in himself and, through teaching, to develop them in others. 77. She first went to New York City in 1936, refused the offer of a staff position on Life magazine, and thereafter saw her work included in important exhibitions in the USA. 197 A PPENDIX A: H UMAN S TUDY: P ILOT S TUDY 78. The word was the Supremes were getting back together. 79. Economists cautioned that the durable-goods data tend to be volatile, but worried that the report signified that the manufacturing sector has not cooled off as much as many analysts had believed. 80. During his six-year term, Zedillo has overseen a series of democratic reforms, the most important of which was his decision to abandon the longstanding practice of having the outgoing president handpick his successor. 81. After seven months of bitter emotions and plenty of political heat, the case of Elian Gonzalez finally was resolved under long-standing rules on parents’ rights and immigration law – rules that some say need to change. 82. Pete Harnisch, just off the disabled list, got his first victory of the season and drove in the go-ahead runs with a bases-loaded single Friday night as the Cincinnati Reds beat the Arizona Diamondbacks 5-4. 83. Nothing is known of his education except that he did not lay claim to any degree. 84. The central element of the design, the sculpture depicting The Ecstasy of Saint Theresa, is one of the great works of the Baroque period. 85. Born in Jagtvejen, Copenhagen, Denmark, she went to Tasmania with her parents in 1891. 86. In reality however, his stature was tall. 87. The legislation also gives the government the power to force other local councils to accept their share of refugee claimants to alleviate the pressure on port towns. 88. Yet Harry Potter is the type of fad a mother can love. 89. The report said changing people’s behavior saves more lives than spending money on expensive institutions and equipment. 90. Born in Grantchester, Cambridgeshire, England, the son of geneticist William Bateson, he studied physical anthropology at Cambridge, but made his career in the USA. 91. Medicare would operate a standard prescription drug benefit, the same for everyone, with some help from benefit management companies that many private health plans use. 92. With respect to his great work it has been pointed out that in his specific definitions he was loose and unsystematic, but that passages in his prefaces and descriptions are fine, and at the same time simple and natural. 93. Democrats also complained that such a broad bill was probably unconstitutional and doomed for repeal. The GOP version included year-round nonprofit activity, whereas Democrats wanted to limit it to activity within a month or two of an election. 94. Attack dogs, though, tend to be favored by neo-Nazis and other young toughs, usually in low-income areas where the dogs are brandished like weapons. 
95. But British bands Oasis and the Pet Shop Boys pulled out of their scheduled Saturday night appearances
96. The Assembly, meanwhile, accepted a complex sex crimes bill containing some provisions that its liberal Democratic members have had problems with philosophically in the past.
97. Gaunt, Elizabeth, 1685, executed for treason, was the wife of William Gaunt, a yeoman of the parish of St. Mary’s, Whitechapel. She was an anabaptist, and, according to Burnet, spent her life doing good, visiting gaols, and looking after the poor of every persuasion.
98. It is said, but on no very certain authority, that he learnt engraving in Denmark from Simon van den Passe, and in Holland from Hendrik Hondius, and that he followed Hondius’s two sons to England.
99. Born in Sheffield, he started his career as a goalkeeper with Chesterfield and Leicester City but was transferred to Stoke City because Peter Shilton (1949–) was also on the Leicester staff.
100. When the Parliament was dissolved by military force, Allen was one of the opponents bitterly attacked by Cromwell, and he was arrested by the army for a short time.

A.4 Participant Responses

This section presents the responses of the fifteen participants. Rows represent questions (that is, the one hundred sentences listed in the previous section) and columns represent participants’ responses, with c corresponding to the core biographical category, e to the extended category, and n referring to the non-biographical category.

Table A.1: Pilot Study Data. Rows are Questions and Columns are Participants.

[Table A.1, which runs over several pages in the original, is a 100-row by 15-column matrix recording each participant’s judgement (c, e or n) for each of the one hundred questions; the full matrix is omitted here.]
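Since the electronic form of this data is a 100-by-15 matrix of category codes, simple summaries can be computed directly from it. The minimal sketch below is an illustration only, not the agreement analysis reported in the thesis; it assumes a hypothetical in-memory representation in which each question is held as a string of fifteen codes, and the variable and function names, as well as the two example rows, are invented for the illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical in-memory form of Table A.1: one string per question, one
# character per participant, using the study's category codes
# (c = core biographical, e = extended biographical, n = non-biographical).
# The two rows below are invented placeholders, not actual Table A.1 values.
responses = [
    "ceccccccccccccc",  # question 1 (placeholder values)
    "eenenncneennnnn",  # question 2 (placeholder values)
]

def modal_category(row):
    """Return the most frequent category code for one question and its count."""
    code, count = Counter(row).most_common(1)[0]
    return code, count

def pairwise_agreement(rows):
    """Observed agreement: the proportion of participant pairs assigning the
    same category, pooled over all questions (no correction for chance)."""
    agreeing = total = 0
    for row in rows:
        for a, b in combinations(row, 2):
            total += 1
            agreeing += (a == b)
    return agreeing / total

if __name__ == "__main__":
    for number, row in enumerate(responses, start=1):
        code, count = modal_category(row)
        print(f"Question {number}: majority category {code!r} ({count}/{len(row)})")
    print(f"Observed pairwise agreement: {pairwise_agreement(responses):.3f}")
```

For reporting, a chance-corrected coefficient (for example, Fleiss’ kappa) would normally be preferred to raw observed agreement, since with only three categories a skewed label distribution inflates the raw figure.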
Appendix B
Human Study: Main Study

This appendix provides further information about the human study described in Section 6.4 on page 132. The first section reproduces the instructions provided to annotators. These instructions were provided as an HTML page and a printable PDF file. The second section lists the sentences used, divided into five sets of approximately one hundred sentences. The third section presents tables of agreement data for the sentences.¹

¹ This data is available electronically by emailing [email protected]

B.1 Instructions to Participants

B.1.1 Introduction

This study involves assessing human ability to judge whether a sentence is biographical or non-biographical when that sentence is shown out of context. You will be presented with 100 sentences, some of which are biographical, and some of which are non-biographical, and asked to make a judgment about each one. It is important that you read this document carefully, and perhaps refer to it while making your decisions. If you think that a sentence does not fall into any of the categories given, then please mark it as non-biographical.

A sentence is biographical if it contains information from one (or more) of the six biographical categories mentioned on page 203. If it does not contain information from one of these six categories then it is to be marked non-biographical. For example, the following sentence contains information about place of residence and education: “Born in England, he studied at Cambridge, before becoming a naturalized American citizen and living in New York City for most of his adult life.” A sentence is biographical for these purposes if and only if it contains biographical information according to the guidelines given on page 203 (that is, the six biographical categories).
Just because a sentence contains information about someone doesn’t mean that it is biographical (according to this scheme).

A sentence is biographical if it contains biographical information. For instance, “Gordon Brown, the Chancellor of the Exchequer, attended a meeting of European Finance Ministers today” is biographical simply on the basis that it provides job information about Gordon Brown (that is, that he is Chancellor of the Exchequer). Similarly, a sentence is biographical even if the biographical information (often embedded in a clause) is only a very small part of the sentence (for example, “Former Daily Mirror journalist James Hipwell says voicemail hacking has long been widespread in tabloid newspapers, and is lifting the lid on the dubious journalistic practices he observed during his time at the paper”).

The reference of the biographical information does not have to be explicitly named in the sentence; “he” or “she” is enough (for example, “She attended Manchester University in the late 1960’s”).

Remember that a sentence can belong to one or more biographical categories. For instance, “Steven Irwin, noted Australian naturalist and television presenter has been killed in a tragic accident off the Australian coast” gives information about nationality, death and job role (that is, key information and work information).

Remember that a sentence must refer to an individual to be biographical. “He lost his life in a road traffic accident” is biographical. “12 people were seriously hurt” is non-biographical for these purposes.

Events that happen to a person after that person is dead are non-biographical (with the exception of honours awarded after death, like, for instance, the Victoria Cross).

Remember, the sentence may contain information about a person, but unless that information falls into the six categories mentioned (that is, key life facts, fame, character, relationships, education and work) then it is not biographical. Whimsical or anecdotal information about a person, unless it falls into one or more of the six biographical categories, is to be classed as non-biographical. For example, “He then saw a tank, which was carrying his substantial winnings blown to pieces before his eyes” would count as non-biographical as it does not fall under any of the six biographical categories.

Major surgery or abiding health concerns are to be classed as biographical.

Remember, biographically relevant information (for example, job titles) may be contained in (or buried in) very long sentences.

The day-to-day activities of politicians (meetings attended, conferences addressed, etc.) are not to be considered biographical unless the sentences contain information that is biographical according to the six classes identified (for example, the sentence mentions a job title or award).

B.1.2 Six Biographical Categories

Key Life Facts

These are central key facts about a person’s life, common to all people.

Information about date of birth, or date of death, or age at death.
Names and alternate names (for example, nicknames).
Place of birth: “Orr was born in Ann Arbor, Michigan but was raised in Evansville, Indiana”.
Place of death: “He died of a heart attack while holidaying in the resort town of Sochi on the Black Sea coast”.
Nationality: “He became a naturalized citizen of the United States in 1941”.
Cause of death: “He died of a heart attack in Bandra, Mumbai”.
Longstanding illnesses or medical conditions: “He stepped down from the position on grounds of poor health in February 2004”.
Place of residence: “Sontag lived in Sarajevo for many months of the Sarajevo siege”.
Physical appearance: “With his movie star good looks he was a crowd favourite”.
Major threats to health and wellbeing (for example, assassination attempts, car crashes).

Fame

What a person is famous for. This kind of information can be broadly positive (for example, awards, prizes, honours) or negative (for example, scandal, jail terms, and so on). For example: “His study of Dalton won him the Whitbread prize” or “In 1976, heroin landed him in Los Angeles county jail”.

Character

Attitudes, qualities, character traits, and political or religious views. For example, “He was raised Catholic, the faith of his mother” or “Jones is recalled as a gentle and unassuming man.”

Relationships

Information concerning relationships with intimate partners or sexual orientation. Relationships with parents, siblings, children, social acquaintances or friends. For example: “His mother died when he was eleven” or “Nine people testified against him at the trial, including another wife he tried to set on fire”.

Education

Institutions attended, dates, evaluative judgements on time in education, educational choices, qualifications awarded. For example: “Corman studied for his master’s degree at the University of Michigan, but dropped out when two credits short of completion.”

Work

References to positions, starting jobs, resigning from jobs, job titles, affiliations (for example, employers or organizations), personal wealth, areas of interest, lists of publications, films, and so on. For example: “He returned to England in 1967 to work for the offshore pirate radio station Wonderful Radio London”, or “Gordon Brown, British Chancellor of the Exchequer”.
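To make the decision rule concrete: a sentence counts as biographical exactly when at least one of the six categories applies. The minimal sketch below is an illustration only, not the processing used elsewhere in the thesis; it shows how the category tags attached to the sentences listed in Section B.2 below (for example, work ... /work) might be read off mechanically and collapsed to that binary judgement. The function names and the example string are hypothetical.

```python
import re

# The six biographical categories from Section B.1.2, written as the short tag
# names used in the Section B.2 listings ('key' abbreviates Key Life Facts).
CATEGORIES = {"key", "fame", "character", "relationships", "education", "work"}

def categories_of(tagged_sentence):
    """Collect the category tags wrapped around a sentence in the Section B.2
    listings. Only the closing forms (/key, /fame, ...) are matched, because a
    bare category word such as 'work' can also occur as ordinary sentence text."""
    closing_tags = re.findall(r"/(\w+)", tagged_sentence)
    return {name for name in closing_tags if name in CATEGORIES}

def is_biographical(tagged_sentence):
    """A sentence counts as biographical iff at least one category applies."""
    return bool(categories_of(tagged_sentence))

# Hypothetical usage, with a sentence in the style of the Set One listing:
example = ("education key Born in Bedford in 1929, Barker went to school in "
           "Oxford and became an architecture student. /key /education")
print(categories_of(example))    # {'education', 'key'} (set order may vary)
print(is_biographical(example))  # True
```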
B.1.3 Example Sentences

Here is a list of pre-classified sentences (with a brief statement for each sentence explaining why it has been classified this way). There are 10 example sentences.

1. A year later, on the evening of 7 June 1954, he killed himself.
biographical: non-biographical
Explanation: This sentence refers to the date of a person’s death, a key life event. It doesn’t matter that “he” is used instead of a proper name.

2. With an awareness of death and the miracle of life at the foundation of his work, Saul Bellow’s novels brought him huge success, and both the Nobel and Pulitzer Prize.
biographical: non-biographical
Explanation: This sentence is biographical because it makes clear that a person (Saul Bellow) is the recipient of a prize. This counts as a fame event.

3. His ability to exude an almost violent enthusiasm, talk extremely loudly and, seemingly, live a charmed life grabbing some of the world’s most poisonous creatures out of the bush, spawned a growing cult for “red in tooth and claw” wildlife television.
biographical: non-biographical
Explanation: This sentence describes the qualities and abilities (“violent enthusiasm”, “talk extremely loudly”) of a person (“His”) and is thus a character sentence, and so is characterised as biographical.

4. During spring 1953 he was also being invited to the Greenbaums’ house from time to time, for Franz Greenbaum, whom the Manchester intellectual establishment did not consider a very respectable figure, was not bound by the strict Freudian view of relations between therapist and client.
biographical: non-biographical
Explanation: This sentence describes the relationship between “he” and Franz Greenbaum and is hence biographical. Also, more importantly, “Franz Greenbaum” is described as someone not considered to be a respectable figure; a character sentence.

5. “I’ve come clean ’cause I need help.”
biographical: non-biographical
Explanation: This sentence does not contain biographical information according to our definition. Although it does tell you something about the individual (“I need help”), this information falls under none of the six categories (key, fame, character, relationships, education and work).

6. The probation period ended in April 1953.
biographical: non-biographical
Explanation: This example sentence does not contain information about an individual; although it mentions a “probation period”, this could easily apply to a company or organization rather than a person.

7. It contained samples from a mysterious growth on the wall.
biographical: non-biographical
Explanation: This example sentence does not contain information about an individual and is not biographical.

8. While away in Canada, John had a letter from Alan.
biographical: non-biographical
Explanation: This sentence is not biographical according to the six biographical categories. While it does refer to an event (that is, John receiving a letter), this is not in itself noteworthy with respect to the six biographical categories.

9. In his acceptance lecture, Bellow criticised modern writers for presenting a limited and distracted picture of mankind.
biographical: non-biographical
Explanation: This sentence is not biographical. Although it mentions Saul Bellow, it does not fall within the six categories outlined.

10. McClaren has also chosen to omit senior internationals David James and Sol Campbell, while Theo Walcott and Scott Carson have been sent on Under-21 duty.
biographical: non-biographical
Explanation: This sentence is biographical as it contains work information (“senior internationals, David James and Sol Campbell”). This kind of construction (that is, “job title, name” or “name, job title”) is particularly important, especially in news text.

B.2 Sentences

Sentences are divided into five sets of 100 (see Section 6.4 on page 132 for a description of the methodology used). Those sentences are derived from the small biographical corpus described in Chapter 5 and retain their biographical tags. Note that tags were removed when the sentences were presented to study participants, and also that the absence of tags indicates that the sentence was not classified as biographical using the scheme developed in Chapter 5.

B.2.1 Set One

1. He became pointed, intimate.
2. If you look at the names in Norfolk, there’s a lot that are the same.
3. I saw what he meant.
4. work I was invited by Sean but sat deliberately out of the limelight with my friend the English comedian Eric Sykes, who was also in the picture. /work
5. Former president Amin Gemayel, a sharp critic of Hizbullah, described parts of the speech as dangerous.
6. relationships work I was particularly delighted when I came across the following piece of dialogue in a book of dialogue-criticism (Invitation to Learning, New York, 1942) mainly by the American scholars and writers, Huntington Cairns, Allen Tate (soon to be one of my closest friends) and Mark van Doren. /work /relationships
7.
It’s been superb, for myself and my family. 8. This could be Lisa’s chance. 9. key . Home was Croydon, where she lived with her divorced mother in a council flat, supported by social security, supplemented occasionally by haphazard maintenance payments from her father, who was in the Merchant Navy and had not been seen since Val was five. /key 10. To a hole in one. 11. Earlier, I made a lot of what I thought were beautiful shots with much backlighting and many effects, absolutely none of which were motivated by anything in the film at all. 12. relationships Recently I asked Evelyn Waugh’s eldest son, Auberon (Bron) if he remembers his father’s reaction on getting those proofs of The Comforters while in the middle of writing his Pinfold . /relationships 13. The curtain was about to go up. 14. character She was disappointed. /character 15. fame For more than 20 years Ronnie Barker was one of the leading figures of British television comedy. /fame 16. character In this poem we see their shared Jewishness, and the “irreverence” (as some would see it) they each had for the Tradition — at least for that view of it which some espoused; we also see a shared disdain for rabbinic (and priestly) logic, to them both a form of mental death. /character 17. “Trust Peach to exaggerate them.” 18. education And so he had acquired an old-fashioned classical education, with gaps where teachers had been made redundant or classroom chaos had reigned. /education 19. fame With an awareness of death and the miracle of life at the foundation of his work, Saul Bellow’s novels brought him huge success, and both the Nobel and Pulitzer Prize. /fame 20. Fatherly men patted her head admiringly and older ladies frowned at the sight of her. 21. character key Reports claimed that the elfin figured star’s weight plunged terrifyingly until she tipped the scales at a mere five stones. /key /character 22. But that was just not there. 23. It’s true what they say about that. 208 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 24. Under the new regime we would be the first province to disenfranchise them. 25. Margaret had failed by four votes to win outright. 26. key I took refuge first at Aylesford in Kent at the Carmelite monastery, and next at nearby Allington Castle, near Maidstone, a Carmelite stronghold of tertiary nuns. /key 27. She’s going out of her wits. 28. I worked 26 years in the same job and had to quit. 29. If he is happy, the boy thought, then I’m glad he went. 30. This, of course, made me feel very cheerful. 31. key In the middle of 1955, before I had finished my first novel, I moved back to London, fully restored and brimming with plans. /key 32. character work Alan Maclean, who was the best-liked editor in London, asked me to write a novel for his firm; they would commission it (a thing unheard of, for first novels, in those days). /work /character 33. education In 1986 he was twenty-nine, a graduate of Prince Albert College, London (1978) and a PhD of the same university (1985). /education 34. work It was to be directed by the great French director Ren Clement. /work 35. In the past, Mr Blair has proclaimed himself as the change-maker. 36. fame In the two short years that followed her first record Kylie became one of the entertainment phenomena of the 1980s. /fame 37. 
relationships We were very attached to each other, there in the office at 50 Old Brompton Road, with one light bulb, bare boards on the floor, a long table which was the packing department, and Peter always retreating to his own tiny office to take phone calls from his uncles; one of them worked at Zwemmer the booksellers and gave us intellectual advice, and the other was a psychiatrist. /relationships 38. “Working on his conference speech,” came the reply. 39. Peach felt a little awed by its grand rooms and suspicious of those fat babies that Lais called cherubs peeking down at her from the ceilings. 40. It is a cabinet ’we’ . . . 41. I don’t care about the Superdome. 42. key A year later, on the evening of 7 June 1954, he killed himself. /key 43. Increasingly, Kate felt depressed by Toby’s sexual behaviour, which disgusted and bewildered her. 44. It’s just something I’ve eaten. 209 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 45. relationships By spring 1953 he was also being invited to the Greenbaums’ house from time to time, for Franz Greenbaum, whom the Manchester intellectual establishment did not consider a very respectable figure, was not bound by the strict Freudian view of relations between therapist and client. /relationships 46. You’re asking me if I’m an accessory after the fact to first-degree murder. 47. education She attended the Actors Studio in New York, famous for teaching an intense style known as the Method, beloved of actors such as Marlon Brando and James Dean. /education 48. key Something strange was not surprising, because, foolishly, I had been taking dexedrine as an appetite suppressant, so that I would feel less hungry. /key 49. work relationships I found a friend in Father Frank O’Malley, a kind of lay-psychologist and Jungian. /relationships /work 50. work The examination of Robin’s PhD thesis, on the logical foundations of physics, had to be postponed since Stephen Toulmin, the philosopher of science, had decided he could not undertake it after all. /work 51. education The wrath of her disappointment had been the instrument of his education, which had taken place in a perpetual rush from site to site of a hastily amalgamated three-school comprehensive, the Aneurin Bevan school, combining Glasdale Old Grammar School, St Thomas a Beckett’s C of E Secondary School and the Clothiers’ Guild Technical Modern School. /education 52. work Minutes of the University Council show that this had been decided by January or February 1953. to appoint him to a specially created Readership in the Theory of Computing when the five years of the old position ran out on 29 September. /work 53. fame From 1944 and the publication of Bellow’s first novel Dangling Man, the writer and teacher produced a body of work that ensured his position as one of America’s most powerful voices. /fame 54. All she wore was 11 beads, and eight of them were perspiration. 55. character He was a small man, with very soft, startling black hair and small regular features. /character 56. education Star Trek’s impact became apparent when he was awarded an honorary doctorate in Engineering from the Milwaukee School of Engineering, after half the students there said that Scotty had inspired them to take up the subject. /education 57. Several other motorists who had refused to pay “bail” were also given their keys back. 58. work Franco was the Fascist Dictator of Spain at the time. /work 59. 
He lags it up well, but the US pair are made to pay moments later when Donald takes advantage of Garcia’s accurate iron by rolling in a 12-footer. 210 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 60. Shame on you, Tony Blair. 61. relationships Father O’Malley and his cousin Teresa Walshe had found the place for me. /relationships 62. And it will be good for us all. 63. character She thought him “obsessed by sex” /character 64. “I take one 10mg tablet each night and I feel about 60% better.” 65. work I have been in over seventy-three films in thirty years and by the time you read this it will probably be seventy-six. /work 66. So Douglas and John should be released from their obligation to me and allowed to stand, since either had a better chance than I did. 67. This is the way I work. 68. character A keen and powerful debater, he was not amused at the dreariness of the Executive’s meetings, the small talk and the administration (never his strong point). /character 69. work In 1953 on my return from Edinburgh, feeling desperately weak, I wrote a review, in the Church of England Newspaper, of T.S. Eliot’s play The Confidential Clerk which was first performed at the Edinburgh Festival. /work 70. I looked up and Bardot was grinning as she dusted breadcrumbs from her hands. 71. key Blackadder, a Scot, believed British writings should stay in Britain and be studied by the British. /key 72. Kate said, “Oh, do take it off, Toby!” 73. fame In 1988 alone she sold a remarkable £25 million worth of records around the world, earning herself around £5 million. /fame 74. He suggested that the raids could even have been timed to distract attention from criticisms of the government’s stance on the Lebanon crisis. 75. education After the war, Doohan spent two years studying acting at New York City’s Neighborhood Playhouse, where he later taught. /education 76. The two men had spoken “almost daily” in August as Mr Blair supported the push for a UN resolution to end the Israeli offensive. 77. Panic ensued as such brands as Watney’s Red Barrel, Worthington E and Whitbread Tankard rapidly dominated the market. 78. fame She was garnering awards from Japan to Israel and Ireland, embarking on a movie career that seemed certain to lead to Hollywood stardom — she had even achieved the final confirmation of her status as a member of the elite band entitled to call themselves superstars, a wax image of herself at Madame Tussaud’s /fame 211 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 79. There is no doubt that a majority of her cabinet colleagues were glad to see the back of her, but they gave their advice freely and frankly. 80. education key Born in Bedford in 1929, Barker went to school in Oxford, became an architecture student and even toyed with the idea of becoming a bank manager, the archetypal middle-class profession he would later parody so effectively in his comic sketches. /key /education 81. relationships About my grandmother Flory Zogoiby, Epifania da Gama’s opposite number, her equal in years although closer to me by a generation: a decade before the century’s turn Fearless Flory would haunt the boys’ school playground, teasing adolescent males with swishings of skirts and sing-song sneers, and with a twig would scratch challenges into the earth- step across this line . /relationships 82. “And we are fighting for the possibility that good and decent people across the Middle East can raise up societies based on freedom, and tolerance, and personal dignity.” 83. 
fame He received the National Book Award, his first of three, in 1954 for The Adventures of Augie March and, 10 years later, his international reputation was assured with Herzog. /fame 84. fame This brought him a Pulitzer Prize in 1975, and a year later, Bellow became the seventh American writer to receive the Nobel Prize for Literature. /fame 85. “But this also will have to be checked into the hold.” 86. key character Wilmot was a dissolute courtier at the Restoration court of Charles II, a lecher and a drunk, but also a poet who could treat himself and his world with satiric coolness and who helped to establish the tradition of English satiric verse and assisted Dryden in the writing of Marriage-a-la-Mode. /character /key 87. Throughout the 1950s, Hilda worked tirelessly to better the condition of African women, despite being banned from 28 organisations. 88. relationships One person who encouraged this development was Lyn Newman, who became another of the small group of human beings whom Alan could trust. /relationships 89. fame He was much loved and admired for his appearances in the long-running series The Two Ronnies, with Ronnie Corbett, as prison inmate Fletcher, in the series Porridge, and as Arkwright, the bumbling, stuttering, sex-obsessed shopkeeper in Open All Hours. /fame 90. fame It so happened that in 1954, in the crucial months of my illness, my name was beginning to flourish in the literary world. /fame 91. “I have great respect for David; he was a fantastic captain, a great player and still is.” 92. That was a Saturday. 212 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 93. key Through the psychedelic years he was a schoolboy in a depressed Lancashire cotton town, untouched alike by Liverpool noise and London turmoil. /key 94. education He had done what was hoped of him, always, had four A’s at A Level, a First, a PhD. /education 95. “No,” George smiles, and his family burst into tears. 96. relationships The house was owned by Mrs Lazzari (Tiny), a wonderful Irish widow who had been married to an Italian cellist (“so I understand the Artist”) /relationships 97. “Are you saying my sister is going to die?” 98. And at every turn is the ubiquitous Fred Scuttle, constantly at our service with his peaked cap crazily askew, eager eyes blinking madly through wire-rimmed glasses and fingers enthusiastically splayed in a ragged salute. 99. work The last three Best Actor Academy Awards have been won by British stars: Daniel Day Lewis as a horribly deformed man in My Left Foot; Jeremy Irons as a man on trial for the alleged attempted murder of his wife in Reversal of Fortune, and Anthony Hopkins, in 1992, who played Hannibal Lecter the homicidal cannibal in The Silence of the Lambs /work 100. “If you lit a match in our kitchen, it’d go up with a roar.” 101. character education work He worked in menial jobs to pay his way through college where he studied journalism and became a radical. /work /education /character B.2.2 Set Two 1. Kendra and Maliyah were joined at mid-torso, with some shared organs and just two legs. 2. Just a teensy little kiss, he said. 3. Along with using their laptops on board, business travellers have become used to taking all their luggage on board with them, so they can get to their meetings as quickly as possible, without having to wait to collect their bags. 4. work I worked at Peter Owen’s three days a week, and at home wrote stories and my second novel, a kind of adventure story, Robinson . /work 5. 
He has a terrific sense of humour and he carries on running jokes from the day before. 6. character Her eyes, according to Alan Watkins of the Observer, took on a manic quality when talking about Europe, while her teeth were such as ’to gobble you up’. /character 213 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 7. Allardyce is aware that he and Redknapp will be under intense scrutiny and he called a meeting of his senior players yesterday to discuss the fall-out from the Panorama programme. 8. Ms Hewitt will not present it as a cut in investment, but that may be the consequence of blocking uneconomic schemes. 9. fame In that year, he picked up the English and European footballer of the year awards. /fame 10. work I was secretary, proof-reader, editor, publicity girl; Mrs Bool was secretary, office manager and filing clerk; and Erna Horne, a rather myopic thick-lensed German refugee, was the book keeper. /work 11. work Alan had been urged to look for new young talent, and got my address from Tony Strachan, who was then working at Macmillan. /work 12. relationships key They had seen one some months earlier, a puppy of fourteen weeks with a beautiful smoky fur, belonging to Raymond’s wife, Charlotte (the Raymond Greenes were then living in Oxford where Raymond had a medical practice), and this led Greene to buy one /key . /relationships 13. But they all miss the point. 14. relationships He lived with Val, whom he had met at a Freshers’ tea party in the Student Union when he was eighteen. /relationships 15. character While Benny’s deliberately low key arrival to start making a new series may seem somewhat incongruous for a millionaire entertainer whose programmes are shown all over the world, it is typical of the workaday beginnings from which a Benny Hill Show is produced. /character 16. John said that he now doubted whether I could get the support of the Cabinet. 17. “It has obviously caused a lot of offence and for that I unreservedly apologise,” he said but added: “Words like inbreeding and outbreeding are very professional, genetic terms.” 18. Writing in The Times, Dunkley commented: “When I was about four my mother managed to reduce me to an almost hysterical fit of giggling by promising to show me her new water otter and then producing a kettle.” 19. work It came from Alan Maclean, the fiction editor of Macmillan, London, a much larger publisher than any I had so far dealt with. /work 20. character key Blackadder, a Scot, believed British writings should stay in Britain and be studied by the British. /key /character 21. Fielding caught the unspectacular tab, leaving a twenty on the plate. 22. key In 1930 he had begun writing the biography of John Wilmot, the second Earl of Rochester, who was born on to April 1647 and died at the age of thirty-three. /key 214 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 23. education key Nicknamed Robin at school, he attended Aberdeen Grammar School before studying English Literature at Edinburgh University. /key /education 24. “It is difficult to describe how it feels to get someone back who you were told you had lost for ever.” 25. “To talk again.” 26. We were there for six days and only got out on medical grounds because of the baby. 27. Here was a steadfast friend but, as I quickly saw, one in the deepest distress. 28. “Especially Maman and Gerard.” 29. I thought it was a reference to Through the Looking Glass, where Humpty Dumpty says Inpenetrability. 30. 
fame In 2004 Barker was honoured with a Bafta tribute award and celebration evening for his contribution to comedy. /fame 31. Woods finally gets a birdie putt to drop, before Harrington sparks wild celebrations among the crowd by following him straight in. 32. fame Together with Denis Law and Bobby Charlton, Best formed a triumvirate that inspired Manchester United to League Championships in 1965 and 1967 and the European Cup in 1968. /fame 33. work While waiting for my novel to appear, I worked part time at Peter Owen the publisher. /work 34. fame He is widely regarded as one of the greatest players to have graced the British game. /fame 35. Now how do we market this. 36. There is an enormous amount of pressure on me. 37. character He was to turn to Catholicism and make a death-bed repentance. /character 38. Shall I let you in?’ 39. But don’t despair, my friends. 40. work relationships I found a friend in Father Frank O’Malley, a kind of lay-psychologist and Jungian. /relationships /work 41. relationships character ’Benny’s not the kind to sweep up to the studio in a huge limousine like some showbiz superstar,’ explained Dennis Kirkland, a former floor manager who has been producing the show for seven years and is one of Benny’s’ few close friends. /character /relationships 42. Nothing else of this correspondence has survived; my suspicion is that it probably held the most revealing and sophisticated psychological comment that he ever put into letters. 215 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 43. fame At 18, he won the first of 37 international caps for Northern Ireland and was being hailed as the new Stanley Matthews. /fame 44. But she fervently hopes that you will give Lisa a chance and that you will not be disappointed. 45. fame 1984: Jailed for drink-driving offence /fame 46. What is she hoping for? 47. Donald finds the green in two to leave Garcia with a 25-footer for birdie. 2.50pm Betting update: The US, who drifted out as long as 11-2 during the morning’s play, have come back in to 7-2, with Europe drifting out to 4-11 from 1-6 at one point earlier today. 48. “You must have had a very bad mother, if you do something like this,” she told Mr Diop before stomping off. 49. Others, including Professor Kemp, are sure that the Mona Lisa was not cut down. 50. I feel dizzy as well as sick. Why don’t you lie down? I’ve been lying down. 51. fame In the late 1970s he was three times the British Academy’s best light entertainment performer, and in 1975 he took the Royal Television Society’s award for outstanding creative achievement. /fame 52. He added: There has been a lot more intelligence. 53. 54. character He was, as always, charming, thoughtful and loyal. /character fame Although he rarely missed a game in his early career, he started causing problems at Old Trafford, and in 1971 was suspended for a fortnight for failing to catch a train for a game at Chelsea. /fame 55. Once more Kate hit the bathroom floor. 56. Oh, how sad! Toby disappeared into the bathroom and emerged about ten minutes later. 57. She was trying to hide the elation, but I could see it there. 58. “This is a common threat to all of us and we should respond with a common purpose and a common solidarity and common cause,” he said. 59. Play Dirty was due to be shot in a town called Almeria in southern Spain, and as there was no airport there I had to fly to Madrid and then take a train. 60. education Charles himself was educated at St Albans school and read natural sciences and law at Trinity College, Cambridge. 
/education 61. “I crushed it up and gave it to him in a bottle with a soft drink,” Sienie recalls. 62. work On that occasion, while I was seeing my agent, Tiny wandered off by herself; she came back bringing with her for lunch my friend, Joe 216 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY McCrindle, owner and editor of Transatlantic Review, who had visited at Baldwin Crescent. /work 63. It seems odd. 64. I said, “George”, and he said, “What?” 65. fame Manchester United football legend George Best will be remembered for his dazzling skill on the pitch, and for his champagne lifestyle away from it. /fame 66. education relationships Hawthorne, one of identical twins, was educated in Belfast, at the Methodist college and Queen’s University. /relationships /education 67. character Poor Geoffrey had just been unlucky in his seating for Robin Maxwell-Hyslop had always been a man to avoid. /character 68. education But John’s first love was sailing: he was educated at the Nautical College in Pangbourne. /education 69. relationships This might appear to indicate that Blackadder and Cropper worked harmoniously together on behalf of Ash. /relationships 70. work In the course of that year the proofs went round among literary people, one of whom was Gabriel Fielding, a very good novelist; his real name was Alan Barnsley, a medical doctor, practising in Maidstone. /work 71. work I was particularly delighted when I came across the following piece of dialogue in a book of dialogue-criticism (Invitation to Learning, New York, 1942) mainly by the American scholars and writers, Huntington Cairns, Allen Tate (soon to be one of my closest friends) and Mark van Doren. /work 72. However, those caught up in the Superdome misery, many from the city’s mostly black and poorer areas, appear largely ambivalent about its reopening. 73. education relationships Before long, her mother moved to Ilfracombe, Devon, where Patricia went to school aged three, having already taught herself to read from newspapers. /relationships /education 74. The probation period ended in April 1953. 75. The economy is very bad. 76. education Either side of the second world war, John went to Corpus Christi College, Cambridge, where he graduated with an honours degree in economics. /education 77. relationships key He visited Hinchingbrook House, home of the Earls of Sandwich, one of whom had married one of Rochester’s daughters. /key /relationships 78. character I stressed his stamina, his integrity and his ministerial experience. /character 217 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 79. education key She went on to secretarial college in London, defying her father, who did not want her to leave home, and in 1959 got a job at JWT, soon becoming a copywriter. /key /education 80. It was reported that she disturbed the prowler when she arrived back unexpectedly at her family’s Melbourne home. 81. He don’t half talk a lot of . . . nonsense. 82. And he said, “Yes.” 83. However, many - including Mr Olmert - have questioned whether such a policy would work in the long term. 84. Sometimes it gets to me, I give so much time and energy to everyone else, that there is nothing left for me. That is when I think What about me? 85. Over the next two hours or so, each Cabinet minister came in, sat down on the sofa in front of me and gave me his views. 86. During her days on Neighbours, she recalled how people were only too willing to vent their jealousies publicly. 87. 
key character Wilmot was a dissolute courtier at the Restoration court of Charles II, a lecher and a drunk, but also a poet who could treat himself and his world with satiric coolness and who helped to establish the tradition of English satiric verse and assisted Dryden in the /key writing of Marriage-a-la-Mode. /character 88. work relationships key Val left him for the first time since they had set up house, and went briefly home. /key /relationships /work 89. He took Stilnox in 1999 and reported an improvement in balance, coordination, speech and hearing. 90. key Alas not now, since Sir Hugh died in 1987. /key 91. education work Nykvist studied photography, and spent a year at Cinecitta in Rome, before joining the Swedish production company Sandrews in 1941 as assistant director of photography. /work /education 92. character Producer Kirkland well remembers that the origins of one Benny Hill Show lay in the arrival on his desk of a dog-eared piece of cardboard covered with what looked like Egyptian hieroglyphics. /character 93. work Matter-of-fact Hugo Manning, a night-journalist who worked on Reuters, and also a poet and amateur philosopher, was a great source of moral support. /work 94. relationships But his favourite walking companion was always Hugh, though in later years it was mostly for visits to secondhand bookshops. /relationships 95. fame Best was a footballing genius. 218 /fame A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 96. Since silence was also Roland’s only form of aggression they would continue in this way for days, or, one terrible time when Roland directly criticised Male Ventriloquism, for weeks. 97. work education After studying at Columbia University he served in the US army in Germany during the war and, returning to Columbia, took another degree, in journalism. /education /work 98. On 10 May Alan sent a letter to Maria Greenbaum, describing a complete solution to a solitaire puzzle, and ending: I hope you all have a very nice holiday in Italian Switzerland. 99. At least 41 people were killed when a concrete building collapsed in Jinxiang, an industrial town in Cangnan county close to where the typhoon hit land. 100. I said to Robert: It’s going to be all right, isn’t it? She is so like he said. B.2.3 Set Three 1. work She acquired an IBM golfball typewriter and did academic typing at home in the evenings and various well-paid temping jobs during the day. /work 2. relationships I now had a short talk with Alan Clark, Minister of State at the Ministry of Defence, and a gallant friend, who came round to lift my spirits with the encouraging advice that I should fight on at all costs. /relationships 3. key But the accompanying champagne and playboy lifestyle degenerated into alcoholism, bankruptcy, a prison sentence and, eventually, a liver transplant. /key 4. key Many of his novels were set in Chicago where his poor RussianJewish parents moved when he was a child. /key 5. By now I was in hysterics and Bardot noticed this and probably thought I was laughing at her. 6. character Well I have my problems too, sister, but I don’t have yours, I’m not allergic to the twentieth century. /character 7. He smiled in innocent self-reproach, then swung sternly and made the reverse V-sign at the watchful waiter. 8. A second report published yesterday by the London Resilience Forum, representing the emergency authorities, concluded that not a single life was lost because of poor planning. 9. Ibid., 12 June 1931. 10. 
The story was eventually made into a movie starring Johnny Depp and Benicio Del Toro. 11. fame Arthur Miller was America’s foremost post-war playwright. 219 /fame A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 12. We made our way to Green Park and as we were sitting there Roderick suddenly said: What are we going to do, Noelle? Do? I mean, how much longer are we going on meeting like this? 13. work relationships character Benny’s not the kind to sweep up to the studio in a huge limousine like some showbiz superstar,’ explained Dennis Kirkland, a former floor manager who has been producing the show for seven years and is one of Benny’s’ few close friends. /character /relationships /work 14. Steinberg, introducing Klein’s brilliant novel The Second Scroll, draws attention to, the obsessive theme of the discovered poetry (of New Israel) is the miraculous, and the key image necessary to explain the remarkable vitality, the rebirth evidenced in every aspect of life, is the miracle. 15. Many Blairites would agree with that assessment. 16. Val did very badly. 17. There’s no economic management in this country. 18. work Significantly, he adduced the work of the American poet Wallace Stevens at this point, a man torn between the profession of law and the poetic muse, whose view of lost faith and a disconnected tradition imbued his poetry with a wistfulness and a challenge that was taken very seriously by Leonard and Layton; or, perhaps, viewed by them as a satisfactory replacement. /work 19. character He had a way with words, and perhaps this had too easily convinced me that he and I always put the same construction upon them. /character 20. work key In 1930 he had begun writing the biography of John Wilmot, the second Earl of Rochester, who was born on to April 1647 and died at the age of thirty-three. /key /work 21. That was for him to continue to fight for a place. 22. “The study suggests that with increasing sea surface temperatures, we can expect more intense hurricanes,” Dr Gillett added. 23. fame She was one of the most celebrated actresses of the 1960s and 1970s, winning five Academy Award nominations and an Oscar itself for her role in The Miracle Worker. /fame 24. fame Then, in 1984, he was convicted of drink-driving and assaulting a policeman, and was jailed for 12 weeks. /fame 25. work This was the year in which Leonard was elected president of McGill’s Debating Society. /work 26. relationships Peter was the Thatcherite brother of the ’wet’ Charlie Morrison, and son of the John Morrison who had been the chairman of the ’22 when I was first elected in 1959. /relationships 220 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 27. work He was now essentially unemployed, scraping a living on parttime tutoring, dogsbodying for Blackadder and some restaurant dishwashing. /work 28. “Is there a lot of fist-pumping, and yeah-shouting?” asks Ravi Motha. 29. relationships key Although divorced four times, Saul Bellow’s own much-stated belief in the miracle of life was reinforced when his fifth wife made him a father again at the age of 84. /key /relationships 30. He paused. Antonio Pisello, he said, Tony Cazzo — from Staten Island. 31. education As king, he was instrumental in promoting the University of the South Pacific, and from 1970 was its first chancellor. /education 32. fame A year later, he was dropped from the team again for failing to attend training, and was ordered to leave the house he had built in Cheshire and move into lodgings near Old Trafford. /fame 33. 
education He was then sent to Australia to study: first at Newington College, in Stanmore, New South Wales, then at Sydney University (1938-42), where he read arts and law - and became the first Tongan in history to graduate. /education 34. But as a former Chief Whip - and how often in recent days had I wished that he still held that office - he knew that support for me in the Cabinet had collapsed. 35. “I was born and raised in New Orleans but I don’t want to go back, not to the city and definitely not to the Superdome.” 36. fame Yet, ask anyone their memory of Anne Bancroft and it’s the image of the bored housewife in The Graduate listening to Dustin Hoffman asking the question “Mrs Robinson, you’re trying to seduce me, aren’t you?” /fame 37. work key As well as citing a decline in health for his reason for retiring, Barker said he always felt he should quit while he was ahead, and he had no further ambitions. /key /work 38. Next, I seemed to realize that this word-game went through other books by other authors. 39. Have a good night. 40. On one occasion he refused to shake her hand, and on another lost his temper and swore to “rope her” - the choice of verb was not lost on female voters - “like a heifer”. 41. Since the 7/7 bombings in London last year, ministers in the Home Office had been “very actively engaged” in discussing with members of Muslim communities the threat facing “all of us” and had already acted on nine of the 12 points outlined in an anti-terrorism plan drawn up after the London bombings, Mr Reid said. 42. education work She kept in touch with Cornwall and strongly supported projects for the county’s regeneration; was a vice president 221 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY of the History of Advertising Trust, a member of the Monopolies and Mergers Commission, and on the council of Brunel University (they gave her an honorary doctorate in 1996, the year she was made an OBE); was involved with National Trust Enterprises, the English-Speaking Union and the then Administrative Staff College, Henley ... the list goes on and /education on. /work 43. character But he was a man of the Left. /character 44. Television critic Chris Dunkley groaned with displeasure when Benny insisted on reviving one particular old chestnut in one of his early shows for Thames Television. 45. character His formative years were also steeped in his Jewish heritage, but he turned from this “suffocating orthodoxy” to enjoy the works of such writers as Mark Twain and Edgar Allan Poe. /character 46. work It seemed incongruous to them that the sweet teenager they knew was not only surviving life in the toughest of all trades, but that she was also winning something of a reputation as a tough cookie, a determined career girl refusing to be deflected from her dreams. /work 47. relationships work Francis Maude, Angus’s son and Minister of State at the Foreign Office, whom I regarded as a reliable ally, told me that he passionately supported the things I believed in, that he would back me as long as I went on, but that he did not believe I could win. /work /relationships 48. They had the lights on us and it was so hot we were pouring water on to the heads of the elderly to try to keep them cool, but they were passing out. 49. “The money used has nothing to do with housing or individual allocations, so we’re not competing.” 50. The time came when nobody would cross the lines she went on drawing, with fearsome precision, across the gullies and open spaces of her childhood years. 51. 
fame Beyond the glare of stardom and the Pulitzer Prize which he won for Salesman, he sought to provoke his audience into questioning society and authority. /fame 52. character I am addicted to the twentieth century. /character 53. fame Arthur Miller became famous overnight /fame 54. Airey Neave and Margaret Thatcher have come to see me and we’re absolutely agreed that there should be no increase in your licence fee unless you put things right... 55. work The Ash Factory was funded by a small grant from London University and a much larger one from the Newsome Foundation in Albuquerque, a charitable Trust of which Mortimer Cropper was a Trustee. /work 56. character Able to deliver the great tongue-twisting speeches required of his characters, Barker pronounced himself “completely boring” without a script. /character 57. fame His work continued to be loved and admired in the UK and in 1995 his Broken Glass won the prestigious Olivier Award for best play. /fame 58. McGinley decides to have a crack at the 16th green in two from the first cut of the right rough, but comes up short to find a watery landing spot. 59. key relationships His personal life became increasingly more difficult, with bouts of alcoholism, bankruptcy and the failure of his first marriage. /relationships /key 60. Fifty-six people, including the bombers, died in the attacks, with more than 700 injured. 61. As they saw it, she was unlikely to defeat Michael in a second ballot. 62. work relationships key Home was Croydon, where she lived with her divorced mother in a council flat, supported by social security, supplemented occasionally by haphazard maintenance payments from her father, who was in the Merchant Navy and had not been seen /relationships /work since Val was five. /key 63. “I will stay with her,” said Leonie, walking to the severe white door behind which her granddaughter lay. 64. Perhaps I should have done. 65. At that time, the southern states’ rigid segregation laws, which had been in force since the end of the Civil War in 1865, demanded separation of the races on buses, in restaurants and other public areas. 66. By the time the proofs came to him in mid-August 1931 he still had not found a satisfactory title and, in some desperation, chose one previously put forward by his publisher for his preceding novel, a suggestion he had not then taken up - Rumour at Nightfall. 67. She swept Peach off and plunged her into a bath of cool water, gradually adding ice until the coolness penetrated Peach’s very bones. 68. On the surface it is a good action story, based on fact, with a moral to it and some controversy. 69. character You know, the thing I want more than anything else — you could call it my dream in life — is to make lots of money. /character 70. One cabinet minister involved in bridge building between Mr Brown and Tony Blair during the past fortnight put the challenge for Mr Brown like this: What will matter is the language in which he speaks about Tony ... 71. But I was glad to have someone unambiguously on my side even in defeat. 72. education After graduation she spent a year at the University of Texas, Austin, to acquire a teaching certificate and taught history and social studies for one year in state schools. /education 73. character Barker was a man of contradictions. /character 74. fame Indeed, it is claimed that he once lost more than $6m in one night at the Monte Carlo casino. /fame 75.
He is linking giving up Hizbullah’s weapons to regime change in Lebanon and ... to drastic changes on the level of the Lebanese government, Mr Gemayel said. 76. education key relationships She was born the only child of hard-working, blue-collar parents, Ona and Cecil Willis, in the small town of Lakeview in east central Texas, and attended high school in Waco. /relationships /key /education 77. education key Born in Bedford in 1929, Barker went to school in Oxford, became an architecture student and even toyed with the idea of becoming a bank manager, the archetypal middle-class profession he would later parody so effectively in his comic sketches. /key /education 78. work After Solomon’s desertion, Flory took over as caretaker of blue ceramic tiles and Joseph Rabban’s copper plates, claiming the post with a gleaming ferocity that silenced all rumbles of opposition to her appointment. /work 79. education key Evans spent his infancy in Aberkenfig, Glamorgan, went to school in Suffolk, did his national service and, in 1953, when he was 23, emigrated to New Zealand as a labourer. /key /education 80. Last month Mr Bruce warned the campaign had stagnated and said the country needed a coalition of angry people committed to slaying the dragon immediately. 81. character He was modest about his writing skills and often submitted his scripts under pseudonyms, in order for them to be judged on their own merits. /character 82. Your last letter arrived in the middle of a crisis about ’Den Norske Gutt’, so I have not been able to give my attention yet to the really vital part about theory of perception.... 83. education work She left school at the age of 14, determined to make a career as a dancer. /work /education 84. The materials used to make the overhead bins in airlines have been strengthened and so heavier items can be stored without harming the passengers, said Mr Bowden. 85. work relationships key They had seen one some months earlier, a puppy of fourteen weeks with a beautiful smoky fur, belonging to Raymond’s wife, Charlotte (the Raymond Greenes were then living in Oxford where Raymond had a medical practice), and this led Greene to /work buy one /key . /relationships 86. work relationships His father was a minor official in the County Council. /relationships /work 87. education His education as crown prince had begun at a school run by the Free Wesleyan Church and continued at Tupou College, where, as an academically bright teenager, he obtained his leaving certificate at the age of 14. /education 88. The Ordeal of Gilbert Pinfold was the result, published in the summer of 1957. 89. relationships work Ronnie Barker first worked with Ronnie Corbett in The Frost Report and Frost on Sunday, programmes for which /relationships he also wrote scripts. /work 90. He told his people, these forces are participating in joint exercises with Saudi Arabia. 91. She recorded in her diary: “The fish shop sells china on one side and flies on the other.” 92. I suppose reading must come in quite handy at times like these. 93. Fifteen of the hijackers were thought to have been Saudi nationals. 94. fame character He claimed that the experience made him turn over a new leaf, but in 1990 millions watched his infamous drunken performance on the Wogan television chat show /character . /fame 95. Don’t you think that is fascinating? Roderick said he did. Mr Claverham has something very ancient in his own home, I told Lisa.
They have found remains of a Roman settlement on the land. How wonderful! cried Lisa. 96. education character Her speaking ability revealed itself early on and she entered Baylor University on a debating scholarship /character . /education 97. He was also chatting with family and friends. 98. The draw is 12s. 2.48pm - Montgomerie/Westwood v Campbell/Taylor a/s (4) The American pairing hit back nicely with a birdie at the par-five. 99. She would hold the beautiful earrings for her or slide sparkling rings on to Lais’s white fingers, touching the long lacquered nails wonderingly, her mouth copying Lais’s pout as she applied the lovely shiny red lipstick. 100. Linex liked the way I was thinking, but he said that you’d never get the punters in and out quickly enough. 101. work She worked in the City and in teaching hospitals, in shipping firms and art galleries. /work B.2.4 Set Four 1. fame In 1999 Rosa Parks was awarded the Congressional Medal of Freedom. /fame 2. For sheer incident it almost rivals the Arnold story. 3. character A slight figure, 5ft 8in tall and weighing 10 stone, he dazzled the crowds with his skill. /character 4. work He anticipated Swift in his “Satyr Against Mankind” with its scathing denunciation of rationalism and optimism, contrasting human perfidy and the instinctive wisdom of the animal world. /work 5. They formed anagrams and crosswords. 6. ’And he never wastes anything. 7. key James Montgomery Doohan (he shared a name with his most famous character) was not, in fact, a Scot but a Canadian. /key 8. fame He is best-known for his 1972 account of a drug-addled Nevada trip, Fear and Loathing in Las Vegas. /fame 9. fame Rosa Parks was arrested for her refusal to give up her bus seat /fame 10. Last August, he said Mr Blair should learn the lessons of Iraq and make a pledge to the party conference that he would not launch anymore preemptive strikes. 11. work relationships character “Benny’s not the kind to sweep up to the studio in a huge limousine like some showbiz superstar,” explained Dennis Kirkland, a former floor manager who has been producing the show for seven years and is one of Benny’s’ few close friends. /character /relationships /work 12. work He had taken the risk of giving up a secure and promising career with The Times; the risk of accepting a salary from his publishers on the understanding that he would produce saleable novels; the risk, financially forced on him, of removing himself from the London literary scene and into the country. /work 13. “She’s trying to fight it- she’s ignored me since the first day of shooting.” 14. No evidence. 15. What about my teeth? she asked, thinking of her mother. 16. There are also complaints that taxpayers’ money allegedly being sent to Hizbullah in Lebanon would be better spent at home. 17. character He also had a reputation as something of a diplomat, who understood the intricacies of foreign policy, especially the importance of the Saudi dynasty’s relationship with the United States. /character 18. relationships key George Best was born in Belfast, the son of a shipyard worker. He was spotted by a Manchester United scout while still at school. /key /relationships 19. key As a captain in the Royal Canadian Artillery Regiment, he lost a finger on the first morning of the D-Day landings in Normandy. /key 20. Another granny has been mob-raped in her sock by black boys and skinheads. 21.
character education work He worked in menial jobs to pay his way through college where he studied journalism and became a radical. /work /education /character 22. relationships And in 2000, aged 80, Doohan boldly went into fatherhood for the seventh time when his then 43-year-old wife gave birth to a daughter, Sarah. /relationships 23. character But soon the shy, unworldly boy from Belfast was caught up in the trappings of fame. /character 24. While fellow professionals may be understandingly tolerant of a comic who can give old material a new shine, those less personally concerned with the difficulties of creating new scripts are not always so charitable. 25. Five traffic policemen were taken for questioning, while one was reported to have run away. 26. Several days previously I had counseled caution - Michael was doing a trawl of those who were committed to him. 27. The audience applauded. 28. Nick Ridley, no longer in the Cabinet but a figure of more than equivalent weight, also assured me of his complete support. 29. His eyes glittered strangely in his masklike makeup: Kate thought he looked like a novelette villain. You’ve been reading too much Barbara Cartland, she told herself. 30. He was following the route taken by the Parliamentary Army during the Civil War - “over the final ridge of the Cotswolds, to Chipping Norton”, and his personal experience of this journey appears in the opening of his biography — the level wash of fields. . . divided by grey walls, lapping round the small church and rising to the height of the gravestones in a foam of nettles before dwindling out against the black rise of Wychwood. 31. Snuff movies — now this is evidence. And then his manner, the force field he gave off, it changed, not for long. 32. It’s as simple as that. 33. Being in Northern Ireland, he was not closely in touch with parliamentary opinion and could not himself offer an authoritative view of my prospects. 34. There is also the matter of Mr Brown’s personal style. 35. relationships work King Fahd, who ascended the Saudi throne in 1982, was one of seven sons of the founder of Saudi Arabia, King Abdel-Aziz, and his favourite wife, Hassa. /work /relationships 227 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 36. Only mine drinks. 37. Maliyah was to begin kidney dialysis in preparation for receiving a kidney from her mother in three to six months. 38. Was I driving like a twat? asked Hammond, before walking gingerly to the toilet. 39. relationships He admired her courage, he said, but it did not last. /relationships 40. Re: 4.44pm. 41. He was real shy. 42. fame Her Oscar nominations were for The Pumpkin Eater (1964), The Graduate (1967), The Turning Point (1977) and Agnes of God (1985). /fame 43. One particular example of this, the American novel Finistre Fritz Peters, Finistre (Gollancz, 1951). which had appeared in 1951, was much admired by Alan. 44. Ten per centum. 45. key In March 2000 he spent several weeks in hospital with a liver problem, almost certainly a result of his drinking. /key 46. Toms, sitting 12 feet to the right of the hole in three, misses what he expected to have for a win, and Europe are dormy two. 5.04pm - Casey/Howell 4up (10) Apologies for the lack of coverage on this match, but with Europe well in control, Sky have deemed it too boring to spend any broadcast time in the past 45 minutes or so. 47. character The young Fahd was known as a technocrat and political wheeler-dealer. He knew about internal security and about how the kingdom needed to be defended from within. 
/character 48. Bellow always said that of all his heroes, Henderson most resembled himself, and the book remained one of his favourites. 49. At least I hope we are. I hope so, too. 50. character In due course his liberal views caught him in the McCarthy anti-communist witch-hunt. /character 51. Such ideas are buzzing through Benny’s brain months before a new show goes into production. 52. I called on Janet Dare. 53. Remember, once you are here as a French citizen you may find it impossible to leave. 54. work Max Bygraves, another entertainer who has watched Benny develop over more than 40 years, is similarly impressed, “he’s a very fine comedian,” says Max. /work 228 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 55. I have to learn things or sign cards for fans. If I lived alone I would nearly go crazy. 56. fame Found guilty of breaking the law which required black people to give up their bus seats to whites, Rosa Parks was fined $14. /fame 57. I felt, too, that the novel as an art form was essentially a variation of a poem. 58. Kylie ran sobbing out of the studios and did what was still the most natural thing in the world for a 20-year-old girl — she ran home to mummy. 59. relationships His personal life became increasingly more difficult, with bouts of alcoholism, bankruptcy and the failure of his first marriage. /relationships 60. The foreigners in Ottawa constitute an ominous threat to the integrity and autonomy of our province. 61. work The bearer of the ill-tidings was her newly appointed PPS Sir Peter Morrison. /work 62. Typically, he had made a game out of the difficult process of breaking the ice, so that when among friends, notably Robin and his friend Christopher Bennett, they would share what Alan chose to call ’sagas’ or ’sagaettes’. 63. There was chaos in several of the committee rooms which were packed with Tory MPs, enraged that the press had been the first to know. 64. She has told colleagues that financial stringency would become even more necessary under a Gordon Brown premiership, after five years of rapid growth in the health service budget reaches an end in March 2008. 65. Mason and Ford also reckoned without Hawthorne’s note-taking. 66. key relationships Though he had married again in 1995 and had gained regular employment on television and as an after-dinner speaker, his alcoholism continued to plague his mind and body. /relationships /key 67. She’d be still protecting the people of the city she loved, defending the nation she loved, keeping it from harm. 68. key He was born in New York in 1915. His father owned a garment factory but faced financial ruin after the Great Crash of 1929. /key 69. key fame She was born Rosa Louise McCauley on 4 February, 1913, in Tuskegee, Alabama, family illness interrupted her high school education, but she graduated from the all-African American Booker T Washington High School in 1928, and attended Alabama State College in Montgomery for a short time. /fame /key 70. Carefully she sniffed the fruit, dug her nails in the skin, peeled it in one long length; then she took a whole day to eat it, sucking each segment carefully, savouring the fragrant juice that spurted into her mouth. 229 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 71. James Father Jim Deshotels, 50, is a nurse and Jesuit priest who tended to the Superdome’s injured and sick refugees for five days. 72. I am certain I did not convert any colleague who might have been watching the box; but Michael did remain the public’s favourite throughout both elections. 73. 
Using a card keyboard, she spells out answers to questions I have for her. 74. character He claimed that the experience made him turn over a new leaf, but in 1990 millions watched his infamous drunken performance on the Wogan television chat show /character 75. work relationships key They had seen one some months earlier, a puppy of fourteen weeks with a beautiful smoky fur, belonging to Raymond’s wife, Charlotte (the Raymond Greenes were then living in Oxford where Raymond had a medical practice), and this led Greene to /relationships /work buy one /key 76. fame Bancroft won an Oscar for her role in The Miracle Worker /fame 77. These were amazing admissions: practically all western interviews with Chinese leaders before and since have been bland and dull, but Fallaci got Deng to speak extraordinarily frankly by Chinese standards. 78. work relationships Peter was the Thatcherite brother of the ’wet’ Charlie Morrison, and son of the John Morrison who had been the chairman of the ’22 when I was first elected in 1959. /relationships /work 79. Israel has been carving out a five mile deep security zone north of the Lebanese border over the past fortnight, but Wednesday’s security cabinet decision authorised the armed forces to extend the zone as far as the Litani River, 18 miles north of the border, and beyond. 80. work ’Along with Buckingham Palace and the Tower of London, our Teddington studios appear to be well and truly on the American tourist route,’ laughs Dennis Kirkland, producer and director of the Benny Hill Show. /work 81. This will put all NHS trusts on the same footing as foundation hospitals, whose investments are rigorously supervised by Monitor, their regulator. 82. (He was also a stringent scholar.) 83. fame Bancroft secured parts on Broadway, and in 1958, won her first Tony opposite Henry Fonda in Two for the Seesaw. /fame 84. key Born in Vancouver, British Columbia, in 1920, his early life, like that of his contemporaries, was dominated by World War II. /key 85. I was almost sure that he would be, all the same. 86. He tried to convince me that the Cabinet were misreading the situation, that I was being misled and that with a vigorous campaign it would still be possible to turn things round. 87. It has been a very moving experience. 230 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 88. fame relationships His split and eventual divorce from his wife - with Mr Cook revealing an affair with his secretary to his wife Margaret as they prepared to head off on holiday after a phone call from Downing Street - caused a welter of embarrassing headlines. /relationships /fame 89. work Comedian Bob Monkhouse, who has no mean memory for jokes himself, recalls writing for the radio show Calling All Forces back in 1951 when Benny, then an up-and-coming comic, appeared with film star Diana Dors and funnyman Arthur Askey. /work 90. It appeared that Cranley Onslow, the admirable chairman of the ’22, had read out the results in the wrong committee room. 91. key His liver was said to be functioning at only 20%. /key 92. Research has shown that some 345 children in Norfolk suffer from type 1 diabetes - more than double the 160 predicted cases for the county. 93. No. 94. relationships Following their breakup in 1961, Miller married the renowned photographer Inge Morath, whom he met on the set of the film The Misfits, which he wrote and which starred Monroe. /relationships 95. relationships Many were astonished when he married Marilyn Monroe in 1956. /relationships 96. 
I had a longing to do so, and a burning curiosity to see Lady Constance even more than the Roman remains. 97. work key In 1930 he had begun writing the biography of John Wilmot, the second Earl of Rochester, who was born on to April 1647 and died at the age of thirty-three. /key /work 98. character His heyday occurred during the swinging sixties, and, with his good looks, he brought a pop star image to the game for the first time. /character 99. character He had a reputation as a playboy in his youth, with allegations of womanising, drinking and gambling to excess. /character 100. You need earnestness and all you’ve got to succeed in this profession, I can assure you. B.2.5 Set Five 1. fame On previous occasions, Irwin, known worldwide for his Discovery Channel programmes, was allegedly killed by a black mamba and a komodo dragon. /fame 2. work Mrs Thatcher’s supporters, and especially her ’Court’, have found it hard to come to terms with her resignation. /work 3. She’s not bad actually. 231 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 4. character A natural political loner, he had built up a base of support on the backbenches that he had, perhaps, never enjoyed as a cabinet minister. /character 5. Their first visitor, Hugh Greene, must have had impressed upon him the primitiveness and isolation of the young couple’s living conditions: “We haven’t too much room.” 6. relationships His split and eventual divorce from his wife - with Mr Cook revealing an affair with his secretary to his wife Margaret as they prepared to head off on holiday after a phone call from Downing Street - caused a welter of embarrassing headlines. /relationships 7. HSBC said the changes to its overdraft rules were designed to bring greater clarity about what an overdraft service is, how customers apply for an overdraft and how fees are charged, though it conceded they were also in part about helping to reduce its bad debts. 8. work He appeared in several more plays, and also broke into radio. He was in 300 editions of The Navy Lark as A B Johnson. /work 9. He raced through the alleys of the Jewish quarter down to the waterfront where cantilevered Chinese fishing nets were spread out against the sky; but the fish he sought did not leap out of the waves. 10. work The Thatcherites listened to their lost leader and voted for “dear John” in ignorance of the fact that his political hero was Iain Macleod. /work 11. work However, they only really discovered each other in 1960 after Bergman had become one of the world’s leading directors, though Nykvist had co-shot the expressionistic Sawdust and Tinsel seven years previously. /work 12. They argue that litigation could follow if schools became too involved in other areas. 13. fame Ms Dworkin sparked international debate by arguing that pornography was a violation of women’s rights and a precursor to rape. /fame 14. In all 15 bodies had been recovered, with others thought to be trapped in the wreckage of carriages left dangling in midair. 15. key Mr Cook was born Robert Finlayson Cook on 28 February 1946 at Bellshill, Lanarkshire. /key 16. I feel the same as I ever did, he said at the time, which is that I don’t believe that a man has to become an informer in order to practice his profession freely in the United States. 17. I will say that for her. 18. William Waldegrave, my most recent Cabinet appointment, arrived next. 19. To me, the Dome is a place I’ve been to twice to help people. 20. 
Being adequately provided for, he was able to book himself into a downtown hotel which cost him three dollars per night, though he often failed 232 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY to make it back to the hotel, finding the cosmopolitan and nocturnal life of the town there entirely to his liking: consecration dismantled! 21. It was not an evasion, nor a disguised threat, nor a way of abandoning my cause without admitting the fact. 22. character Some in the media liked to picture her as tough and hard and difficult, but she was soft and with a lovely voice and a good sense of humour /character 23. Pyle remained bemused by obsessive adulation and his unwitting appropriation into the Canterbury scene: his lyrics to Richard Sinclair’s What’s Rattlin quote One question we all dread/ What’s doing Mike Ratledge (a reference to fans asking him about the Softs keyboard player, with whom he never collaborated). 24. relationships work Francis Maude, Angus’s son and Minister of State at the Foreign Office, whom I regarded as a reliable ally, told me that he passionately supported the things I believed in, that he would back me as long as I went on, but that he did not believe I could win. /work /relationships 25. I understood. Is she very bad? No. 26. In them, Riaan responds to questioning, nods and shakes his head, drinks through a straw, often laughs and says, ’Hello.’ 27. fame But it was with Cries and Whispers (1972), for which Nykvist won an Academy Award, that the real breakthrough came. /fame 28. fame character key John Young, who has died aged 85, will have a prominent place in the Brewers’ Hall of Fame, revered as the father of the real ale revolution, an iconoclast who believed in good traditional beer drunk in good traditional pubs. /key /character /fame 29. Really quite a treat, in many ways. 30. Roland did want this. 31. fame work He first gained attention for his work on Barabbas (1953) and Karin Mansdotter (1954) with Alf Sjoberg, Sweden’s most important post-war director before the advent of Bergman. /work /fame 32. Val’s papers were bland and minimal, in large confident handwriting, well laid-out. Male Ventriloquism was judged to be good work and discounted by the examiners as probably largely by Roland, which was doubly unjust, since he had refused to look at it, and did not agree with its central proposition, which was that Randolph Henry Ash neither liked nor understood women, that his female speakers were constructs of his own fear and aggression, that even the poem-cycle, Ask to Embla, was the work, not of love but of narcissism, the poet addressing his Anima. 33. key Officially, after a relatively short break at the time, he resumed many of his duties using a wheelchair and stick. /key 34. God was generous to us and granted us this victory against our enemy. 233 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 35. We sat up for two hours asking him questions and he answered all of them. 36. work Barker himself, however, was among many viewers who regarded his portrayal of Fletcher in Porridge as the best work he ever did. /work 37. I slipped out near Times Square. 38. Once asked to reveal his favourite joke, he trotted out the well-worn tale of the Member of Parliament who was visiting a mental home and was amazed to discover that a beautifully designed flower bed was the work of an inmate. 39. Robin’s interests were more uniformly distributed. 40. 
work Chris and I had worked together for many years from the time when he was Director of the Conservative Research Department until I brought him into the Cabinet in 1989. /work 41. But Lais. 42. relationships After an unhappy three-year marriage to builder Martin May, Anne Bancroft married the comedian-director Mel Brooks in 1964. /relationships 43. Yesterday’s announcement comes weeks after two of Britain’s biggest credit card providers revealed they were tightening up their borrowing rules. 44. relationships key In a statement to the Aspen Daily News, Thompson’s son, Juan, said: On February 20, Dr Hunter S Thompson took his life with a gunshot to the head at his fortified compound in Woody Creek, Colorado. /key /relationships 45. That’s collective responsibility. 46. Yes, I do have fundamental convictions . . . but we do have very lively discussions because that is the way I operate. 47. He sat with his hands on his chin in a sage velvet wing chair on one side of the fireplace, while she sat on the other, reporting to him twice a week. 48. HSBC is the latest big lender to amend its borrowing terms amid continuing concern about soaring consumer debt. 49. And I asked him, These movies — they exist? Sure. 50. fame In 1996, she received the Presidential Medal of Freedom before being awarded the United States’ highest civilian honour, the Congressional Gold Medal, in 1999. /fame 51. If no more a masterwork than the spin-off novel, it did - like Roots - alert many people to something of which they were ignorant, especially when eventually shown in Germany, where the statute of limitation on Nazi criminals was then lifted. 52. fame In 2001, Ms Dworkin won the American book award for writing Scapegoat: the Jews, Israel and Women’s Liberation. /fame 53. fame Irwin was criticised for holding his infant son near a crocodile pool while feeding chickens to a four-metre long crocodile. /fame 54. character As he said later: Mrs Parks was a married woman, she was morally clean, and she had fairly good academic training. . . /character 55. relationships key Thompson’s son, Juan, found his body. /key /relationships 56. key His chosen successor, his half-brother Abdullah, is the head of the National Guard, the tribal army largely responsible for the kingdom’s internal security. /key 57. ’I ’ave a car outside and we will all go to my ’otel fur a drink.’ 58. Her gifts of war came down to her from some unknown ancestor; and though her adversaries grabbed her hair and called her Jewess they never vanquished her. 59. I have sat there dumbstruck with admiration at the switches he has pulled. 60. All the copies I’ve seen are quite thinly painted; that is, the surface imitates what they think Leonardo did. 61. He said the government needed an overarching commitment to social justice - not a leadership soap opera. 62. relationships However, after eventually marrying his mistress, Gaynor Regan, in a secret ceremony, many of Mr Cook’s troubles seemed behind him as Labour approached the 2001 general election. /relationships 63. But Toby sat up, pouted and said in an odd, little-girl voice. Why can’t Toby have nice things like you do? He pulled her on to the bed beside him and murmured, “Toby loves looking pretty, Toby loves dressing up like this, but promise it’s a secret between us, between two girl friends?” 64. key He was diabetic, for many years a heavy smoker and suffered a stroke in 1995. /key 65.
Then he violently shoved her down the small flight of stairs that led off their bedroom to the bathroom. 66. fame In 1975 John Young was made a CBE to mark his work in brewing and for charity: he was chairman of the National Hospital for Nervous Diseases in Bloomsbury and raised millions of pounds to build new wards and install modern equipment. /fame 67. key She was born Anna Maria Louise Italiano in New York’s Bronx in 1931, and began acting as Anne Marno. But it was felt this name sounded too ethnic, so she opted instead for Bancroft. /key 68. I was not surprised. 235 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 69. character His ex-wife wrote a book in which she said of her former husband; his self-regard was easily punctured and his reaction was protracted and troublesome. /character 70. key Nicknamed Robin at school, he attended Aberdeen Grammar School before studying English Literature at Edinburgh University. /key 71. fame Later in the year he was attacked for allegedly filming too close to penguins, seals and humpback whales in the Antarctic /fame 72. So they do that in the morning, then head here. 73. He walked unaided and didn’t need a wheelchair, the source said. 74. key work Former Foreign Secretary Robin Cook, 59, has died after collapsing while out hill walking in Scotland. /work /key 75. Haunting melodies are counterbalanced elsewhere by hard funk rhythms, while lyrically the themes are often stark or angry, as Pyle alludes to the political and cultural attitudes that led him to emigrate to France in the early 1980s. 76. relationships work Widowed in 1977, she founded the Rosa and Raymond Parks Institute for Self Development a decade later, to develop leadership among Detroit’s young people. /work /relationships 77. But I had written off my next visitor, Malcolm Rifkind, in advance. 78. By the time the sale finishes on September 21, it is expected to have smashed the UK’s eBay transaction record, the 103,000 raised by Margaret Thatcher’s handbag. 79. character key On 1 December 1955, the 42-year-old seamstress, and member of the Montgomery chapter of the National Association for the Advancement of Colored People (NAACP), was sitting on a bus when a white man demanded to take her seat. /key /character 80. relationships key He was born in New York in 1915; his father owned a garment factory but faced financial ruin after the Great Crash of 1929. /key /relationships 81. She didn’t mention it the next day, but that evening Toby, having had rather a lot of brandy after the quiche aux pinards, said sarcastically, I don’t think spinach tart is one of your stronger points, darling, and proceeded upstairs. 82. ’No jokes,’ warned the massive figure of Sir Peter Tapsell, another of Michael’s inner circle. 83. work However, he joined Aylesbury Repertory Company in 1948, while still in his teens, before taking to the West End stage at the invitation of Sir Peter Hall, where he appeared in Mourning Becomes Her in 1955. /work 84. character His stance on the Iraq war - and his resignation speech - only enhanced his reputation as a man of principle and a great Parliamentarian. /character 236 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY 85. At the bottom of Mud Lane was a pump, said to be haunted by the ghost of a dancing bear. 86. key The Labour MP for Livingston was considered one of the Commons’ most intelligent MPs and one of its most skilled debaters. /key 87. Marks and Spencer was rated the most ethical retailer, scoring 3.27 on a scale from one to five. 88. 
No one yet knows exactly how a sleeping pill could wake up the seemingly dead brain cells, but Nel and Clauss have a hypothesis. 89. Professor Pacey referred to Layton as a poet of revolutionary individualism, and there can be no doubt that that individualism was a common tie, and not merely religiously but in every way. 90. relationships work Ronnie Barker first worked with Ronnie Corbett in The Frost Report and Frost on Sunday, programmes for which he also wrote scripts. /work /relationships 91. key My digs in London were now 13 Baldwin Crescent, Camberwell, in a less fashionable part than in my old Kensington haunts. /key 92. character An austere and respected figure, Crown Prince Abdullah is untainted by corruption, while being regarded by many as less enthusiastically pro-American than King Fahd. /character 93. It was like that shot in the arm they’d given her in the hospital, it made her feel that she could do anything — and made her want to do something . 94. character relationships After marrying Raymond Parks in 1932, she became involved in the NAACP, where she gained a reputation as a militant and a feminist and was the driving force in campaigns to encourage black voter registration. /relationships /character 95. relationships He was the fourth of his siblings to be king. Two of his brothers lost power violently - one was deposed in a coup; the other was assassinated. /relationships 96. character Her agent of 30 years, Elaine Markson, said: some in the media liked to picture her as tough and hard and difficult, but she was soft and with a lovely voice and a good sense of humour. /character 97. From the aspect of method, I could see that to create a character who suffered from verbal illusions on the printed page would be clumsy. 98. As we walked Brigitte introduced us to her two friends. B.3 Agreement Data Note that agreement data presented in the following tables does not map directly to the questions presented in Section B.2. 
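Each of the tables below records one bio/non-bio judgement per rater per sentence. As an aid to reading them only (an illustrative formula, not the agreement statistic reported by the study itself), the raw pairwise agreement over n sentences judged by r raters may be written

\bar{P} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\binom{r}{2}} \sum_{1 \leq j < k \leq r} \mathbf{1}[\, l_{ij} = l_{ik} \,]

where l_{ij} is rater j's label (bio or non-bio) for sentence i and \mathbf{1}[\cdot] is the indicator function. Chance-corrected measures such as the kappa family rescale \bar{P} against the agreement expected by chance; \bar{P} itself is only the uncorrected proportion of agreeing rater pairs.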
237 A PPENDIX B: H UMAN S TUDY: M AIN S TUDY B.3.1 Set 1 Agreement Data Table B.1: Agreement Data for Set 1 Sentence 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 Rater 1 non-bio non-bio non-bio bio bio bio non-bio non-bio bio non-bio non-bio bio non-bio non-bio bio non-bio non-bio bio bio non-bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio non-bio non-bio bio bio bio bio non-bio bio bio non-bio non-bio non-bio non-bio bio Rater 2 bio non-bio non-bio bio bio bio non-bio non-bio bio non-bio non-bio bio non-bio non-bio bio bio bio bio bio non-bio bio non-bio non-bio non-bio bio non-bio non-bio non-bio bio non-bio bio bio bio bio non-bio bio bio non-bio non-bio non-bio non-bio bio 238 Rater 3 Rater 4 Rater 5 non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio bio bio bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio bio bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio bio non-bio bio bio bio bio bio bio bio bio bio bio non-bio non-bio non-bio non-bio non-bio bio bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio bio bio continued on next page A PPENDIX B: H UMAN S TUDY: M AIN S TUDY Table B.1: continued from previous page Sentence 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 Rater 1 non-bio non-bio bio non-bio bio bio bio bio bio bio bio non-bio bio bio non-bio bio non-bio non-bio bio non-bio non-bio bio bio non-bio non-bio bio bio non-bio bio non-bio bio non-bio bio non-bio non-bio bio non-bio bio bio bio bio bio bio bio Rater 2 non-bio non-bio bio non-bio bio non-bio bio bio bio bio bio non-bio non-bio bio non-bio bio bio non-bio bio non-bio non-bio non-bio bio non-bio non-bio non-bio bio non-bio bio non-bio bio non-bio bio non-bio non-bio bio non-bio bio bio non-bio bio bio non-bio bio 239 Rater 3 Rater 4 Rater 5 non-bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio non-bio non-bio non-bio bio bio bio bio bio non-bio bio non-bio bio bio bio bio bio bio bio bio bio bio bio bio bio non-bio non-bio non-bio bio bio bio bio bio bio non-bio bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio bio non-bio bio non-bio non-bio non-bio bio bio bio bio bio bio non-bio non-bio non-bio bio bio bio continued on next page A PPENDIX B: H UMAN S TUDY: M AIN S TUDY Table B.1: continued from previous page Sentence 87 88 89 90 91 92 93 94 95 96 97 98 99 100 B.3.2 Rater 1 bio non-bio bio bio bio non-bio bio bio non-bio bio non-bio non-bio bio non-bio Rater 2 bio non-bio bio bio non-bio non-bio bio bio non-bio bio non-bio non-bio bio non-bio 
Rater 3 bio bio bio bio bio bio bio bio non-bio non-bio non-bio non-bio bio non-bio Rater 4 bio non-bio bio non-bio bio non-bio bio bio non-bio non-bio non-bio non-bio bio non-bio Rater 5 bio non-bio bio non-bio non-bio non-bio bio bio non-bio non-bio non-bio non-bio bio non-bio Set 2 Agreement Data Table B.2: Agreement Data for Set 1 Sentence 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Rater 1 bio non-bio non-bio bio bio bio non-bio non-bio bio bio bio bio non-bio bio non-bio non-bio non-bio non-bio bio bio non-bio bio bio Rater 2 bio non-bio non-bio bio bio bio non-bio non-bio bio bio bio bio non-bio bio bio non-bio bio non-bio bio bio non-bio bio bio 240 Rater 3 Rater 4 Rater 5 bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio non-bio bio bio bio bio bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio bio bio bio bio bio bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio bio bio bio bio bio bio non-bio non-bio bio bio bio bio bio bio continued on next page A PPENDIX B: H UMAN S TUDY: M AIN S TUDY Table B.2: continued from previous page Sentence 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 Rater 1 non-bio non-bio non-bio bio non-bio non-bio bio bio bio bio bio non-bio non-bio bio non-bio non-bio bio bio non-bio bio non-bio bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio non-bio non-bio bio bio non-bio Rater 2 non-bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio bio bio non-bio non-bio bio non-bio non-bio bio bio non-bio bio non-bio bio non-bio non-bio non-bio non-bio non-bio bio non-bio non-bio bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio non-bio non-bio bio bio non-bio 241 Rater 3 Rater 4 Rater 5 non-bio non-bio non-bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio bio bio bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio bio bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio bio non-bio non-bio bio bio bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio bio bio bio non-bio non-bio bio continued on next page A PPENDIX B: H UMAN S TUDY: M AIN S TUDY Table B.2: continued from previous page Sentence 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 B.3.3 Rater 1 bio non-bio bio bio non-bio bio non-bio non-bio bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio bio bio non-bio bio bio bio non-bio bio non-bio non-bio non-bio Rater 2 bio non-bio bio non-bio non-bio bio non-bio non-bio bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio bio bio non-bio bio non-bio bio non-bio bio non-bio non-bio non-bio Rater 3 bio non-bio non-bio bio non-bio bio non-bio non-bio bio bio bio bio non-bio non-bio non-bio non-bio 
non-bio non-bio non-bio bio non-bio bio bio bio bio bio bio bio non-bio bio non-bio non-bio non-bio Rater 4 bio non-bio bio bio non-bio bio non-bio non-bio bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio bio bio bio bio bio bio non-bio bio non-bio non-bio bio Rater 5 bio non-bio bio bio non-bio bio non-bio non-bio bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio bio bio bio bio bio bio non-bio bio non-bio non-bio non-bio Set 3 Agreement Data Table B.3: Agreement Data for Set 3 Sentence 1 2 3 4 Rater 1 bio bio bio bio Rater 2 bio bio bio bio 242 Rater 3 Rater 4 Rater 5 bio bio bio bio bio bio bio bio bio non-bio non-bio bio continued on next page A PPENDIX B: H UMAN S TUDY: M AIN S TUDY Table B.3: continued from previous page Sentence 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 Rater 1 non-bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio bio non-bio non-bio non-bio non-bio non-bio bio non-bio non-bio bio bio bio bio bio non-bio bio non-bio bio bio bio bio bio non-bio bio non-bio non-bio non-bio non-bio bio bio bio bio bio bio non-bio Rater 2 non-bio non-bio non-bio non-bio non-bio bio bio non-bio bio non-bio non-bio non-bio non-bio bio bio bio non-bio bio bio bio bio bio bio non-bio bio bio bio bio bio bio bio bio bio non-bio non-bio bio non-bio bio bio bio bio bio bio non-bio 243 Rater 3 Rater 4 Rater 5 non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio bio non-bio non-bio non-bio non-bio non-bio bio bio bio bio bio bio bio bio bio bio bio bio bio bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio bio bio bio bio bio bio bio bio non-bio bio bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio bio bio bio non-bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio continued on next page A PPENDIX B: H UMAN S TUDY: M AIN S TUDY Table B.3: continued from previous page Sentence 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 Rater 1 non-bio non-bio bio non-bio bio non-bio bio non-bio bio non-bio bio non-bio non-bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio non-bio bio bio bio bio non-bio bio non-bio bio non-bio non-bio bio bio non-bio bio non-bio non-bio non-bio Rater 2 non-bio bio bio non-bio bio non-bio bio bio bio non-bio bio non-bio non-bio bio bio non-bio non-bio bio non-bio non-bio bio non-bio non-bio bio bio non-bio non-bio bio bio bio bio non-bio bio non-bio bio non-bio bio bio bio non-bio bio non-bio non-bio non-bio 244 Rater 3 Rater 4 Rater 5 non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio bio bio bio non-bio non-bio non-bio bio bio bio non-bio bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio 
non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio bio bio bio bio non-bio non-bio non-bio non-bio bio bio bio bio bio bio bio bio bio bio bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio bio bio bio bio bio bio bio bio bio bio bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio continued on next page A PPENDIX B: H UMAN S TUDY: M AIN S TUDY Table B.3: continued from previous page Sentence 93 94 95 96 97 98 99 100 101 B.3.4 Rater 1 non-bio bio bio bio non-bio non-bio non-bio non-bio bio Rater 2 non-bio bio non-bio bio non-bio non-bio non-bio non-bio bio Rater 3 non-bio non-bio non-bio bio non-bio non-bio non-bio non-bio bio Rater 4 non-bio bio non-bio bio non-bio non-bio non-bio non-bio bio Rater 5 non-bio bio non-bio bio non-bio non-bio non-bio non-bio bio Set 4 Agreement Data Table B.4: Agreement Data for Set 4 Sentence 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 Rater 1 bio non-bio bio bio non-bio non-bio bio bio bio non-bio non-bio bio non-bio non-bio non-bio non-bio non-bio bio bio non-bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio Rater 2 bio non-bio bio bio non-bio non-bio bio bio bio non-bio bio non-bio non-bio non-bio non-bio bio bio bio bio bio bio bio bio non-bio non-bio non-bio non-bio non-bio 245 Rater 3 Rater 4 Rater 5 bio bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio bio bio bio bio bio bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio bio bio bio bio bio bio non-bio non-bio non-bio bio bio bio bio bio bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio non-bio non-bio continued on next page A PPENDIX B: H UMAN S TUDY: M AIN S TUDY Table B.4: continued from previous page Sentence 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 Rater 1 non-bio bio non-bio non-bio non-bio non-bio bio non-bio bio non-bio non-bio non-bio bio bio bio non-bio bio non-bio bio bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio non-bio bio bio non-bio non-bio bio non-bio non-bio non-bio bio non-bio bio bio non-bio bio non-bio Rater 2 non-bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio non-bio non-bio non-bio bio bio bio non-bio bio non-bio bio non-bio non-bio bio non-bio non-bio non-bio bio non-bio bio non-bio non-bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio bio non-bio bio non-bio 246 Rater 3 Rater 4 Rater 5 bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio bio non-bio non-bio non-bio non-bio bio bio bio bio bio bio bio bio non-bio non-bio non-bio non-bio bio bio bio bio non-bio non-bio bio non-bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio non-bio non-bio non-bio non-bio bio non-bio non-bio non-bio non-bio bio bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio non-bio non-bio non-bio bio bio bio non-bio bio non-bio non-bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio 
bio non-bio non-bio non-bio bio bio bio bio bio bio non-bio non-bio bio bio bio bio non-bio non-bio non-bio continued on next page A PPENDIX B: H UMAN S TUDY: M AIN S TUDY Table B.4: continued from previous page Sentence 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 B.3.5 Rater 1 non-bio bio bio bio non-bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio bio non-bio non-bio bio bio non-bio bio non-bio bio non-bio Rater 2 non-bio bio non-bio bio non-bio bio non-bio bio bio bio bio bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio bio bio bio bio bio bio non-bio Rater 3 non-bio bio non-bio bio non-bio bio non-bio bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio bio non-bio non-bio bio bio non-bio bio bio bio non-bio Rater 4 non-bio non-bio bio bio non-bio bio non-bio bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio bio non-bio non-bio bio bio non-bio bio bio bio non-bio Rater 5 non-bio bio non-bio bio non-bio bio non-bio bio non-bio non-bio bio bio non-bio non-bio non-bio bio bio non-bio bio non-bio non-bio bio bio non-bio bio non-bio bio non-bio Set 5 Agreement Data Table B.5: Agreement Data for Set 4 Sentence 1 2 3 4 5 6 7 8 9 Rater 1 bio non-bio non-bio bio non-bio non-bio non-bio bio non-bio Rater 2 bio bio bio bio bio bio non-bio bio non-bio 247 Rater 3 Rater 4 Rater 5 non-bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio continued on next page A PPENDIX B: H UMAN S TUDY: M AIN S TUDY Table B.5: continued from previous page Sentence 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 Rater 1 bio bio non-bio non-bio non-bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio bio bio non-bio non-bio bio non-bio non-bio non-bio non-bio bio non-bio non-bio non-bio bio non-bio bio non-bio bio non-bio non-bio non-bio non-bio non-bio bio non-bio bio bio Rater 2 bio bio non-bio bio non-bio bio bio non-bio bio non-bio bio non-bio bio bio bio non-bio non-bio bio bio non-bio non-bio bio bio bio non-bio non-bio bio non-bio non-bio non-bio bio non-bio bio non-bio bio non-bio bio non-bio non-bio non-bio bio non-bio bio bio 248 Rater 3 Rater 4 Rater 5 non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio non-bio non-bio bio bio bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio non-bio non-bio non-bio non-bio non-bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio non-bio non-bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio bio bio bio continued on next page A PPENDIX B: H UMAN S TUDY: M AIN S TUDY Table B.5: continued from previous page Sentence 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 Rater 1 bio bio bio 
non-bio non-bio non-bio non-bio non-bio bio non-bio bio bio bio bio non-bio bio bio bio bio non-bio bio non-bio bio non-bio non-bio bio bio non-bio non-bio bio bio non-bio non-bio non-bio non-bio non-bio bio bio bio bio bio bio bio non-bio Rater 2 bio bio bio non-bio non-bio non-bio non-bio bio bio non-bio bio non-bio bio bio non-bio bio bio bio non-bio bio bio bio bio non-bio non-bio bio bio bio bio bio bio non-bio bio non-bio bio bio bio bio bio bio bio bio bio bio 249 Rater 3 Rater 4 Rater 5 bio bio bio bio bio bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio non-bio bio non-bio non-bio bio bio bio bio bio non-bio non-bio non-bio bio bio bio bio bio bio bio bio bio non-bio non-bio non-bio non-bio bio non-bio bio bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio bio non-bio bio bio bio bio bio bio non-bio non-bio non-bio bio bio non-bio bio non-bio bio bio non-bio bio non-bio non-bio non-bio bio bio bio non-bio non-bio non-bio non-bio non-bio non-bio bio bio non-bio bio bio bio bio bio bio bio bio bio non-bio non-bio non-bio bio bio bio bio bio bio bio bio bio non-bio non-bio non-bio continued on next page A PPENDIX B: H UMAN S TUDY: M AIN S TUDY Table B.5: continued from previous page Sentence 98 Rater 1 non-bio Rater 2 non-bio 250 Rater 3 non-bio Rater 4 non-bio Rater 5 non-bio A PPENDIX C Identifying Syntactic Feature This appendix presents data derived from the statistical data presented in B IBER (1988). See Section 8.3 on page 148 for a description of the methodology used. The first section presents features ranked by raw distance from the mean, and the second section sets forth features ranked by standard deviations from the mean. For descriptions of the individual features used by B IBER (1988) are given in Appendix D. C.1 Distance From the Mean This reproduces in full the list of features most prevalent in, and characteristic of, the biographical genre according to the methodology described in Section 8.3 on page 148 of this thesis (based on data derived from B IBER (1988)). A table listing results for all sixty-seven features (ranked by maximum distance from the mean) is reproduced in Table C.1. Table C.2 shows all features ranked numerically with respect to the mean for each feature. Table C.1: Sixty-seven Features Ranked by Distance from the Mean (Irrespective of Whether the Distance is Positive or Negative) Rank 1 2 3 4 5 6 7 8 9 10 Distance 41.922 29.700 24.690 16.340 13.027 12.522 10.136 8.072 7.163 4.486 Feature Name present tense adverbs past tense prepositions nouns contractions second person pronouns first person pronouns attributive adjectives private verbs continued on next page 251 A PPENDIX C: I DENTIFYING S YNTACTIC F EATURE Table C.1: continued from previous page Rank 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 Distance 4.263 3.695 3.354 2.754 2.740 2.586 2.318 2.259 2.213 2.036 1.918 1.895 1.722 1.609 1.427 1.409 1.386 1.354 1.318 1.290 1.250 1.136 1.131 0.968 0.827 0.795 0.686 0.572 0.554 0.490 0.477 0.468 0.427 0.350 0.327 0.309 0.309 0.290 0.240 0.231 0.227 0.222 0.195 0.181 Feature Name BE as main verb type/token ratio demonstrative pronouns pronoun IT predictive modals nominalisations analytic negation emphatics non phrasal coordination that deletion possibility modals predictive adjectives DO as pro-verb adv. 
sub. - condition agentless passives perfect aspect verbs stranded prepositions phrasal coordination split auxiliaries infinitives place adverbials discourse particles THAT verb complements demonstratives adv. subordinator - cause synthetic negation necessity modals public verbs hedges indefinite pronouns time adverbials WH relatives: obj. position suasive verbs WH relatives: subj. position past prt. WHIZ deletions WH relatives: pied pipes third person pronouns BY passives existential THERE WH questions adv. sub. - concession THAT relatives: subj. position downturners adv. sub. - other continued on next page 252 A PPENDIX C: I DENTIFYING S YNTACTIC F EATURE Table C.1: continued from previous page Rank 55 56 57 58 59 60 61 62 63 64 65 66 67 Distance 0.168 0.150 0.104 0.100 0.086 0.054 0.054 0.036 0.036 0.018 0.009 0.004 0 Feature Name THAT relatives: obj. position amplifiers sentence relatives past participle clauses wordlength present prt. WHIZ deletions THAT adj complements conjuncts gerunds WH clauses split infinitives SEEM/APPEAR present participle clauses Table C.2: Sixty-seven Features Ranked by Distance from the Mean Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 Distance 24.69 16.34 13.02 7.16 3.69 2.58 1.42 1.40 1.35 1.31 1.29 0.96 0.79 0.46 0.42 0.35 0.30 0.30 0.29 0.18 0.15 0.08 0.05 0.05 0.03 0.00 Feature Name past tense prepositions nouns attributive adjectives type/token ratio nominalizations agentless passives perfect aspect verbs phrasal coordination split auxiliaries infinitives demonstratives synthetic negation WH relatives: obj. position suasive verbs WH relatives: subj. position WH relatives: pied pipes third person pronouns BY passives adv. sub. - other amplifiers wordlength present prt. WHIZ deletions THAT adj complements conjuncts SEEM/APPEAR continued on next page 253 A PPENDIX C: I DENTIFYING S YNTACTIC F EATURE Table C.2: continued from previous page Rank 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 Distance 0 -0.00 -0.01 -0.03 -0.10 -0.10 -0.16 -0.19 -0.22 -0.22 -0.23 -0.24 -0.32 -0.47 -0.49 -0.55 -0.57 -0.68 -0.82 -1.13 -1.13 -1.25 -1.38 -1.60 -1.72 -1.89 -1.91 -2.03 -2.21 -2.25 -2.31 -2.74 -2.75 -3.35 -4.26 -4.48 -8.07 -10.13 -12.52 -29.70 -41.92 Feature Name present participle clauses split infinitives WH clauses gerunds past participle clauses sentence relatives THAT relatives: obj. position downturners THAT relatives: subj. position adv. sub. - concession WH questions existential THERE past prt. WHIZ deletions time adverbials indefinite pronouns hedges public verbs necessity modals adv. subordinator - cause THAT verb complements discourse particles place adverbials stranded prepositions adv. sub. - condition DO as pro-verb predictive adjectives possibility modals that deletion non phrasal coordination emphatics analytic negation predictive modals pronoun IT demonstrative pronouns BE as main verb private verbs first person pronouns second person pronouns contractions adverbs present tense 254 A PPENDIX C: I DENTIFYING S YNTACTIC F EATURE C.2 Standard Deviations from the Mean An alternative approach to the problem of calculating features characteristic of the biographical genre provided by B IBER (1988) (see Section 8.3 on page 148) involves measuring the distance of each different feature for the biographical genre from the frequency of that feature for all genres in terms of the number of standard deviations. 
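As a sketch, writing $f_{\mathrm{bio}}$ for the frequency of a given feature in the biographical genre, and $\mu$ and $\sigma$ for the mean and standard deviation of that feature's frequency across all genres (these symbols are introduced here for convenience and are not Biber's own notation), the quantity in question is

$$z = \frac{f_{\mathrm{bio}} - \mu}{\sigma}$$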
This figure — the z-score (O AKES, 1998) — can be calculated by first subtracting the mean frequency for each feature across all genres from the frequency of the that feature for the biographical genre, and then dividing the result by the standard deviation for all genre.1 A table listing results for all sixty-seven features (ranked by maximum distance from the mean in terms of standard deviations) is reproduced in Table C.3. Table C.4 shows all features ranked numerically with respect to the mean for each feature. Note that this alternative method produces results very similar to the method described in Chapter 6. Table C.3: Sixty-seven Features Ranked by Number of Standard Deviations from the Mean. Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 1 The Distance 1.60 1.46 1.27 1.19 1.13 1.09 1.05 1.01 1.01 0.93 0.89 0.89 0.86 0.82 0.79 0.79 0.78 0.69 0.69 0.68 0.68 0.63 0.62 0.62 0.61 0.61 Feature Name adv. sub. condition present tense predictive adjectives split auxiliaries predictive modals possibility modals type/token ratio second person pronouns synthetic negation demonstrative pronouns necessity modals past tense emphatics contractions stranded prepositions prepositions DO as proverb WH questions phrasal coordination past participle clauses adv. subordinator cause place adverbials discourse particles BE as main verb pronoun IT hedges continued on next page calculation was performed using a Perl script. 255 A PPENDIX C: I DENTIFYING S YNTACTIC F EATURE Table C.3: continued from previous page Rank 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 Distance 0.59 0.58 0.58 0.57 0.57 0.55 0.53 0.52 0.50 0.50 0.50 0.48 0.48 0.45 0.45 0.44 0.43 0.42 0.40 0.39 0.33 0.31 0.31 0.30 0.29 0.29 0.26 0.22 0.21 0.20 0.20 0.19 0.15 0.08 0.05 0.04 0.04 0.01 0.01 0.01 0.00 Feature Name non phrasal coordination adv. sub. concession WH relatives: pied pipes that deletion private verbs analytic negation sentence relatives perfect aspect verbs adv. sub. other attributive adjectives WH relatives: obj. position downturners nouns indefinite pronouns THAT verb complements demonstratives suasive verbs BY passives infinitives first person pronouns THAT adj complements WH relatives: subj. position THAT relatives: obj. position agentless passives wordlength existential THERE THAT relatives: subj. position nominalizations adverbs split infinitives public verbs time adverbials past prt. WHIZ deletions amplifiers present prt. WHIZ deletions WH clauses conjuncts third person pronouns gerunds SEEM/APPEAR present participle clause 256 A PPENDIX C: I DENTIFYING S YNTACTIC F EATURE Table C.4: Sixty-seven features Ranked by Number of Standard Deviations from the Mean. Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 Distance 1.19 1.05 1.01 0.89 0.79 0.69 0.58 0.52 0.50 0.50 0.50 0.48 0.44 0.43 0.42 0.40 0.33 0.31 0.30 0.29 0.22 0.08 0.05 0.04 0.01 0.01 0.00 -0.01 -0.04 -0.15 -0.19 -0.20 -0.20 -0.21 -0.26 -0.29 -0.31 -0.39 -0.45 -0.45 -0.48 -0.53 -0.55 Feature Name split auxiliaries type/token ratio synthetic negation past tense prepositions phrasal coordination WH relatives: pied pipes perfect aspect verbs adv. sub. - other attributive adjectives WH relatives: obj. position nouns demonstratives suasive verbs BY passives infinitives THAT adj complements WH relatives: subj. position agentless passives wordlength nominalizations amplifiers present prt. 
WHIZ deletions conjuncts third person pronouns SEEM/APPEAR present participle clauses gerunds WH clauses past prt. WHIZ deletions time adverbials public verbs split infinitives adverbs THAT relatives: subj. position existential THERE THAT relatives: obj. position first person pronouns THAT verb complements indefinite pronouns downturners sentence relatives analytic negation continued on next page 257 A PPENDIX C: I DENTIFYING S YNTACTIC F EATURE Table C.4: continued from previous page Rank 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 Distance -0.57 -0.57 -0.58 -0.59 -0.61 -0.61 -0.62 -0.62 -0.63 -0.68 -0.68 -0.69 -0.78 -0.79 -0.82 -0.86 -0.89 -0.93 -1.01 -1.09 -1.13 -1.27 -1.46 -1.60 Feature Name private verbs that deletion adv. sub. - concession non phrasal coordination hedges pronoun IT BE as main verb discourse particles place adverbials adv. subordinator - cause past participle clauses WH questions DO as pro-verb stranded prepositions contractions emphatics necessity modals demonstrative pronouns second person pronouns possibility modals predictive modals predictive adjectives present tense adv. sub. - condition 258 A PPENDIX D Syntactic Features This appendix provides a brief description of each of the sixty-seven features identified by B IBER (1988). The descriptions (and many of the examples) for each of the features given below are based on Appendix II of B IBER (1988) and the reader is referred to the source text for a fuller description. Past Tense Any word identified as a past tense form in an electronic dictionary, or any word longer than six letters which ends in -ed. Perfect Aspect “Have” indicates this feature. Present Tense Any base form of a verb in an electronic dictionary. Place Adverbials Gazetteer approach used (e.g. aboard, above, abroad, across, ahead, etc.). Place adverbials with other major functions (e.g. in, on) are excluded. Time Adverbials Gazetteer approach used (e.g. afterwards, again, earlier, etc.). First Person Pronouns Gazetteer approach used (I, me, we, us, my, our, myself, ourselves). Second Person Pronouns Gazetteer approach used (you, your, yourself, yourselves). Third Person Personal Pronouns she, he, they, her, them, his, their, himself, herself, themselves). Pronoun IT Gazetteer approach used. Demonstrative Pronoun Gazetteer approach used, with the context also taken into account (e.g. “this is silly”). Trigger words are: that, this, these and those. Indefinite Pronoun Gazetteer approach used (anybody, anyone, anything, everybody, everyone, everything, nobody, none, nothing, nowhere, somebody, someone, something). Pro-verbs Do DO when not used as an auxiliary or a question. 259 A PPENDIX D: S YNTACTIC F EATURES Direct WH-question Clause or sentence beginning with a WHO word (what, where, when, when, how, whether, why, whoever, whomever, whichever, wherever, whenever, whatever, however) followed by an auxiliary (e.g. “Who is”). Nominalisations All words ending in -tion, -ness, -ment, -ity. Gerunds Verbal nouns (i.e. verbal forms serving nominal functions). These were identified manually. Total Other Nouns All nouns in the electronic dictionary. Agentless Passives Clause in passive voice (e.g. “the cup was broken”). Partof-speech patterns were used to identify this construction. By-passives Clause in passive voice with agent (e.g. “the cup was broken by Bob”). Part-of-speech patterns were used to identify this construction. Be as Main Verb Gazetteer approach (e.g. am, is, are, etc.) augmented with part-of-speech patterns. 
Existential There Gazetteer approach used (e.g. “there are several possibilities”). That Verb Complements For example, “I said that he went”. That Adjective Complements For example, “I’m glad that you like it”. WH-clauses For example, “I believed what he told me”. Infinitives Identified using pattern matching (“to”) and part-of-speech patterns. Present Participle Clauses For example, “Stuffing his mouth with cookies, Joe ran out the door”. These forms were identified manually. Past Participle Clauses For example, “Built in a single week, the house would stand for fifty years”. These forms were identified manually. Past Participle WHIZ Deletion Relatives WHIZ deletions are defined by B IBER (1988) as “[p]articipal clauses functioning as reduced relatives” (e.g. “The solution produced by this process”). These forms were identified manually. Present Participle WHIZ Deletion Relatives For example, “the event causing this decline is . . . ”. These forms were identified manually. That Relative Clauses on Subject Position For example, “the dog that bit me”. Identified manually. That Relative Clause on Object Position For example, “the dog that I saw”. Identified manually. WH Relative Clause on Subject Position For example, “the man who likes popcorn”. WH Relative Clause on Object Position For example, “the man who Sally likes”. Pied-Piper Relative Clauses Preposition followed by a WH-pronoun (e.g. who, whom, which). 260 A PPENDIX D: S YNTACTIC F EATURES Sentence Relatives For example, “Bob likes fried mangos, which is the most disgusting thing I’ve ever heard of”. Indicated by the occurrence of “which” at the beginning of a clause. Causative Adverbial Subordinators: because Clauses beginning with because. Causative Adverbial Subordinators: although, though Clauses beginning with although or though. Causative Adverbial Subordinators: if, unless Clauses beginning with if or unless. Other Adverbial Subordinators For example, since, while, insofar as, etc. Total Prepositional Phrases Gazetteer approach used (e.g. against, amid, amidst, etc.). Attributive Adjectives For example, “the big horse”. Identified using part-ofspeech patterns. Predicative Adjectives For example, “the horse is big”. Identified using partof-speech patterns. Total Adverbs Any adverb that occurs in the electronic dictionary, or is longer than five letters and ends in -ly. Type/Token Ratio Number of different lexical items in the text expressed as a percentage. Word Length Mean length of words in a text. Conjuncts For example, alternatively, altogether, conversely, furthermore, etc. Downtoners Downtoners diminish the force of a verb. They are identified using a gazetteer (e.g. almost, hardly, slightly). Hedges Informal expressions of probability. They are identified using a gazetteer (e.g. at about, maybe, something like, etc.). Amplifiers Amplifiers enhance the force of a verb. They are identified using a gazetteer (e.g. very, absolutely, enormously, etc.). Discourse Particles For example, well, now, anyhow, anyway. Demonstratives That, this, these, those. Possibility Modals Can, may, might, would. Necessity Modals Ought, should, must. Predictive Modals Will, would, shall. Public Verbs Verbs that refer to external actions, e.g. proclaim, protest, reply, etc Private Verbs Verbs that refer to internal actions, e.g. decide, conclude, understand, etc. -suasive Verbs Verbs that persuade, e.g. ask, beg, propose, etc. Seem/Appear Use of the verbs seem and appear indicates hedging in more formal or academic contexts. 
261 A PPENDIX D: S YNTACTIC F EATURES Contractions All contractions (with possessives excluded). Subordinator-that Deletion For example, “I think that he went to . . . ”. Stranded Preposition For example, “the candidate that I was thinking of.” Split Infinitives Insertion of Adverbs(s) in infinitives. Split Auxiliaries Insertion of Adverb(s) in auxiliaries. Phrasal Coordination (Adverb OR Adjective OR Verb OR Noun) “and” (Adverb OR Adjective OR Verb OR Noun). Independent Clause Coordination Clauses can stand independently. Indicated by use of the pattern “, and”. Synthetic Negation no, neither, nor. Analytic Negation not. 262 A PPENDIX E Ranked Features Table E on the following page lists the 100 hundred features with most discriminating power with respect to the Dictionary of National Biography (DNB) and TREC corpora (according to the feature ranking method described in Section 2.5.3 on page 50). A two megabyte sample of the DNB was used as a biographical corpus, and a 10 megabyte sample of the TREC corpus was used as a reference corpus. The features used consisted of the two thousand most frequent unigrams from the DNB, the two thousand most frequent bigrams from the DNB, and the two thousand most frequent trigrams from the DNB. Additionally, syntactic features derived from B IBER (1988) (described in Section 8.3 on page 148) and general biographical features (for example, family name, pronoun, and so on). Note that in Table E on the next page the presence of an underscore ( ) in the feature name indicates that the feature is an -gram (for example, “was educated south” refers to the trigram “was educated south”). The exception is those features prefixed with “feature”, which refer to syntactic features (for example “feature pronoun”) and the “past tense” and “present tense” features. 263 A PPENDIX E: R ANKED F EATURES Table E.1: 100 Most Discriminating Features with Respect to the DNB and TREC corpora, Calculated using the Feature Selection Method. Features are Presented in Alphabetical Order. and and his and in and was appoint appointed are as a at at the be became born born at born in brother by his cambridge charles college daughter daughter of de educated at edward english family father feature familyname feature familyrelationship feature forename feature month feature pronoun feature title feature year feature yearspan first george govern harry has have he he became he was he was educated henry her him his 264 his father his wife in in london it james john king london not of of his of john of the oxford past tense present tense publish published royal said school she sir son son of st that the government their they thomas university was was a was appointed was born was born at was born in was educated was educated at where where he which he wife william would year years year he A PPENDIX F Coverage of New Annotation Scheme This chapter reproduces in full the annotated documents referenced in Chapter 5. The chapter is made up of two main sections. First, the longer biographical texts from various sources are reproduced. Second, the short Wikipedia biographies are presented. F.1 Four Biographies from Various Sources Four annotated biographies are reproduced in this section, using the annotation scheme described in Chapter 5. The four biographical subjects and their respective sources are: Ambrose Bierce (Chambers Biographical Dictionary. Phillip Larkin (Dictionary of National Biography (old)). Alan Turing (Dictionary of National Biography (old)). Paul Foot (Wikipedia.) 
F.1.1 Ambrose Bierce key Bierce, Ambrose Gwinnett, 1842-1914 /key . work US short-story writer and journalist /work . work Born in Meigs County, Ohio, he grew up in Indiana and fought for the Union in the Civil War. /work work key In the UK from 1872 to 1875, he wrote copy for Fun and other magazines, and in 1887 joined the San Francisco Examiner. /key /work work He wrote Tales of Soldiers and Civilians (1892) and his most celebrated story, An Occurrence at Owl Creek Bridge, which is a haunted, neardeath fantasy of escape, influenced by Edgar Allan Poe and in turn influencing Stephen Crane and Ernest Hemingway. /work work He compiled the much-quoted Cynic’s Word Book (published in book form 1906), now better 265 A PPENDIX F: C OVERAGE OF N EW A NNOTATION S CHEME key He moved to Washington known as The Devil’s Dictionary. /work DC, and in 1913 went to Mexico to report on Pancho Villa’s army and disappeared. /key F.1.2 Phillip Larkin relationships key Larkin, Philip Arthur 1922-1985, poet, was born in Coventry 9 August 1922, the only son and younger child of Sydney Larkin, treasurer of Coventry, who was originally from Lichfield, and his wife, Eva Emily Day, of Epping. /key /relationships education He was educated at King Henry VIII School, Coventry (1930-40), and St John’s College, Oxford, where he obtained a first class degree in English language and literawork Bad eyesight caused him to be rejected ture in 1943. /education for military service, and after leaving Oxford he took up library work, becoming in turn librarian of Wellington, Shropshire (December 1943-July 1946), assistant librarian, University of Leicester (September 1946-September 1950), sub-librarian of the Queen’s University of Belfast (October 1950-March 1955), and finally taking charge of the Brynmor Jones Library, University of Hull, for the rest of his life. /work character Larkin, while always courteous and pleasant to meet, was solitary by nature; he never married and had no objection to his own company; it was said that the character in literature he most resembled was Badger in Kenneth Grahame’s The Wind in the Wilcharacter A bachelor, he found his substitute for family lows. /character life in the devotion of a chosen circle of friends, who appreciated his dry wit and his capacity for deep though undemonstrative affection. /character character His character was stable and his attitude to others considerate, so that having established a friendship he rarely abandoned it. /character relationships Most of the friends he made in his twenties were still attached to him in his sixties, and his long-standing friend and confidante Monica Jones, to whom he dedicated his first major collection The Less Deceived (1956), was with him at the time of his death thirty years later /relationships . work Larkin was a highly professional librarian, notably conscientious in his work, and an active member of the Standing Conference of National and University Libraries. /work work In the limited time this left him he did not undertake lecture tours, very rarely broadcast or gave interviews, and produced (compared with most authors) very little ancillary writing; though his lifelong interest in jazz led him to review jazz records for the Daily Telegraph, 196171. /work work Some of the reviews were collected in All What Jazz (1970) /work . 
work In his forties he discovered a facility for book reviewing, of which he had previously done very little, and a collection of his reviews, Required Writing (1983) reveals him as an excellent critic; though perhaps “reveal” is not the right word, for a decade earlier he had done much to influence contemporary attitudes to poetry with his majestic and in some quarters highly controversial Oxford Book of Twentieth-Century English Verse (1973), prepared with the utmost care during his tenure of a visiting fellowship at All Souls College in 1970-1. /work work He spent much time working on behalf of his fellow writers, as a member of the literature panel of the Arts Council, helping to set up and then guide its National Manuscript Collection of Contemporary Writers in conjunction with the British Museum, and 266 A PPENDIX F: C OVERAGE OF N EW A NNOTATION S CHEME serving as chairman for several years of the Poetry Book Society. /work work He was chairman of the Booker prize judges in 1977. /work To this Dictionary he contributed the notice of Barbara Pym. Larkin’s early ambition was to contribute both to the novel and to poetry. work His first novel Jill (1946), published by a small press (which paid him with only a cup of tea) and not widely reviewed, did little to establish him, though its merits were recognized when it was reprinted in 1964 and 1975; but the second, A Girl in Winter (1947), attracted the attention of discerning readers, and the only reason he did not write more novels was that he found he could not, though he tried for some five years before giving up and working entirely in poetry, an art he loved but did not regard as necessarily “higher” than fiction. /work The poet, he said, made a memorable statement about a thing, the novelist demonstrated that thing as it was in actuality. “The poet tells you that old age is horrible, the novelist shows you a lot of old people in a room”. Why the second became impossible to him, and the first remained strikingly possible, it is useless to speculate. work As a poet, Larkin’s early work, written when he was about twenty, already shows a fine ear and an unmistakable gift; but the breakthrough to an individual, and perfectly achieved, manner came some ten years later, in the poems collected in The Less Deceived. /work work From that point on, his work did not change much in style or subject matter throughout the thirty years still to come, in which he produced two volumes, The Whitsun Weddings (1964) and High Windows (1974), plus a few poems still uncollected at his death. /work There were surprises, but then there had been surprises from the start, for Larkin’s range was much more varied than a brief description of his work could hope to convey. He was restlessly alive to the possibilities of form, and never seemed constricted by tightly organized forms like the sonnet, the couplet, or the closely rhymed stanza, nor flaccid when he moulded his statement into free verse. It is instructive to pick out any one individual poem of Larkin’s and then look through his work for another that seems to be saying much the same thing in much the same manner. As a rule one finds that there is no such animal. Most poets repeat themselves; he did not, and this should qualify the frequently repeated judgement that his output was small.Both in prose and verse, Larkin’s themes were those of quotidian life: work, relationships, the earth and its seasons, routines, holidays, illnesses. 
He worked directly from life and felt no need of historical or mythological references, any more than he needed the cryptic verbal compressions that were mandatory in the modern poetry of his youth. Where modern poetry put its subtleties and complexities on the surface as a kind of protective matting, to keep the reader from getting into the poem too quickly, Larkin always provides a clear surface, one feels confident of knowing what the poem is about at the very first reading, and plants his subtleties deep down, so that the reader becomes gradually aware of them with longer acquaintance. key The poems thus grow in the mind until they become treasured possessions; this would perhaps account for the sudden explosion of feeling in the country at large when Larkin unexpectedly died at the Nuffield Hospital, Hull, 2 December 1985 (he had been known to be ill but thought to be recovering), and the extraordinary number who crowded into Westminster Abbey for his memorial service on St Valentine’s Day 1986. /key education Philip Larkin was an honorary D.Litt. of the universities of Belfast, 1969; Leicester, 1970; Warwick, 1973; St Andrews, 1974; Sussex, 1974; and Oxford, 1984. 267 A PPENDIX F: C OVERAGE OF N EW A NNOTATION S CHEME /education fame He won the Queen’s gold medal for poetry (1965), the Loines award for poetry (1974), the A. C. Benson silver medal, RSL (1975), the Shakespeare prize, FVS Foundation of Hamburg (1976), and the Coventry award of merit (1978) /fame . fame In 1983 Required Writing won the W. H. Smith literary award. /fame fame In 1975 he was appointed CBE and a foreign honorary member of the American Academy of Arts and Sciences. /fame fame education St John’s College made him an honorary fellow in 1973, and in 1985 he was made a Companion of Honour. /education /fame F.1.3 Alan Turing relationships work key Turing , Alan Mathison 1912 - 1954 , mathematician , was born in London 23 June 1912, the younger son of Julius Mathison Turing, of the Indian Civil Service, and his wife, Ethel Sara, daughter of Edward Waller Stoney, chief engineer of the Madras and Southern Mahratta Railway. /key /work /relationships relationships G. J. and G. G. Stoney were collateral relations. /relationships character education He was educated at Sherborne School where he was able to fit in despite his independent unconventionality and was recognized as a boy of marked abil/character education He went as a ity and character. /education mathematical scholar to King’s College, Cambridge , where he obtained a second class in part i and a first in part ii of the mathematical tripos (1932-4). /education education He was elected into a fellowship in 1935 with a thesis “On the Gaussian Error Function” which in 1936 obtained for him a Smith’s prize. /education fame In the following year there appeared his best-known contribution to mathematics, a paper for the London Mathematical Society “On Computable Numbers, with an Application to the Entscheidungsproblem” a proof that there are classes of mathematical problems which cannot be solved by any fixed and definite process, that is, by an automatic machine. /fame fame His theoretical description of a “universal” computing machine aroused much interest. /fame work After two years (19368) at Princeton, Turing returned to King’s where his fellowship was renewed. /work fame work But his research was interrupted by the war during which he worked for the communications department of the Foreign Office; in 1946 he was appointed O.B.E. for his services. 
/work /fame work The war over, he declined a lectureship at Cambridge, preferring to concentrate on computing machinery, and in the autumn of 1945 he became a senior principal scientific officer in the mathematics division of the National Physical Laboratory at Teddington. /work work With a team of engineers and electronic experts he worked on his “logical design” for the Automatic Computing Engine (ACE) of which a working pilot model was demonstrated in 1950 (it went eventually to the Science Museum). /work work In the meantime Turing had resigned and in 1948 he accepted a readership at Manchester where he was assistant director of the Manchester Automatic Digital Machine (MADAM). /work He tackled the problems arising out of the use of this machine with a combination of powerful mathematical analysis and intuitive short cuts which showed him at heart more of an applied than a pure mathematician. work In “Computing Machinery and Intelligence” in Mind 268 A PPENDIX F: C OVERAGE OF N EW A NNOTATION S CHEME (October 1950) he made a brilliant examination of the arguments put forward against the view that machines might be said to think. /work He suggested that machines can learn and may eventually “compete with men in all purely intellectual fields” fame In 1951 he was elected F.R.S., one of his proposers being Bertrand (Earl) Russell. /fame work The central problem of all Turing’s investigations was the extent and limitations of mechanistic explanations of nature and in his last years he was working on a mathematical theory of the chemical basis of organic growth. /work key But he had not fully developed this when he died at his home at Wilmslow 7 June 1954 as the result of taking poison. /key key Although a verdict of suicide was returned it was possibly an accident, for there was always a Heath-Robinson element in the experiments to which he turned for relaxation: everything had to be done with materials available in the house. /key character This self-sufficiency had been apparent from an early age; it was manifested in the freshness and independence of his mathematical work; and in his choice of long-distance running, not only for exercise but as a substitute for public transport. /character character An original to the point of eccentricity, he had a complete disregard for appearances and his extreme shyness made him awkward. /character character But he had an enthusiasm and a humour which made him a generous and lovable personality and won him many friends, not least among children. /character relationships He was unmarried /relationships . F.1.4 Paul Foot work key Paul Mackintosh Foot (November 8, 1937 - July 18, 2004) was a British radical investigative journalist, political campaigner, author, and long-time member of the Socialist Workers Party (SWP). /key /work relationships Paul Foot was the son of Hugh Foot, later Lord Caradon, who was governor of Cyprus during the independence battle with Britain in the 1950s, and later represented the United Kingdom at the United Nations from 1964-1970. /relationships relationships Paul Foot was the nephew of former leader of the Labour Party Michael Foot. /relationships education He was educated at Shrewsbury School and University College, Oxford. /education work He first joined the International Socialists, organisational forerunner of the SWP, when he was a cub reporter in Glasgow in the early 1960s. 
/work work He wrote for Socialist Worker throughout his career and was its editor in the late 1970’s until 1980 when he moved to the Daily Mirror. /work work He left the Mirror in 1993 when the paper refused to print articles critical of their management. /work work Latterly he returned to Private Eye; he also wrote for The Guardian. /work work He fought the Birmingham Ladywood by-election in 1977 for the SWP and was a Socialist Alliance candidate for several offices from 2001 onwards. /work fame In the Hackney mayoral election in 2002 he came third, beating the Liberal Democrat candidate into fourth. /fame work He stood in the London region for the RESPECT coalition at the 2004 European elections /work . fame He was Journalist of the Year in the What The Papers Say Awards in 1972 and 1989, Campaigning Journalist of the Year in the 1980 British Press Awards, won the George Orwell Prize for Journalism in 1994, won the Journal269 A PPENDIX F: C OVERAGE OF N EW A NNOTATION S CHEME ist of the Decade in the What The Papers Say Awards in 2000, and the James /fame fame His best Cameron Special Posthumous Award in 2004. known work was in the form of campaign journalism, including his exposure of corrupt architect John Poulson and, most notably, his prominent role in the campaigns to overturn the convictions of the Birmingham Six and the Bridgewater Four, which succeeded in 1991 and 1997 respectively. /fame work He took a particular interest in the conviction of Abdel Basset Ali al-Megrahi for the Lockerbie bombing, firmly believing Megrahi to have been a victim of a miscarriage of justice. /work work He also worked tirelessly, though without success, to gain a posthumous pardon for James Hanratty, who was hanged in 1962 for the A6 murder. /work work His books are Immigration and Race in British Politics (1965), The Politics of Harold Wilson (1968) The Rise of Enoch Powell (1969) Who Killed Hanratty? (1971) Red Shelley (1981) The Helen Smith Story (1983) Murder on the Farm, Who Killed Carl Bridgewater? (1986) Who Framed Colin Wallace? (1989) Words as Weapons (1990) Articles of Resistance (2000) and The Vote: How It Was Won, and How It Was Undermined (2005). /work key He died of a heart attack while waiting at Stansted Airport to begin a family holiday in Ireland. /key work A special tribute issue of the Socialist Review magazine, of which he was on the editorial board for 19 years, collected together many of his articles. Private Eye issue 1116 included a tribute to Foot from the many people whom he worked work On October 10, 2004 – three months with over the years. /work after Foot’s death – there was a full house at the Hackney Empire in London for an evening’s celebration of the life of this much-admired and respected campaigning journalist. /work F.2 Four Biographies from Wikipedia Four annotated biographies are reproduced in this section, using the annotation scheme described in Chapter 5. All four examples are short biographies of people who died in December 2005 and harvested from Wikipedia. The four biographical subjects are: F.2.1 Jack Anderson fame work key Jackson Northman Anderson (October 19, 1922 December 17, 2005) was an American newspaper columnist and is considered one of the fathers of modern investigative journalism /key . /work /fame work fame Anderson won the 1972 Pulitzer Prize for National Reporting for his investigation on secret American policy decision-making between the United States and Pakistan during the 1971 Indo-Pakistan War of 1971. 
/fame /work fame work Jack Anderson was a key and often controversial figure in reporting on J. Edgar Hoover’s apparent ties to the Mafia, Watergate, the John F. Kennedy assassination, the search for fugitive exNazi Germany officials in South America and the Savings and Loan scandal. /work /fame fame He discovered a CIA plot to assassinate Fidel Castro, and has also been credited for breaking the Iran-Contra affair, though 270 A PPENDIX F: C OVERAGE OF N EW A NNOTATION S CHEME he has said the scoop was “piked” because he had become too close to President Ronald Reagan /fame . character Anderson was a crusader against corruption /character . character Henry Kissinger once described him as “the most dangerous man in America.” /character key Anderson was diagnosed with Parkinson’s disease in 1986. /key work In July 2004, at the age of 81, Anderson retired from his syndicated column, “Washington Merry-Go-Round.” /work key relationships He died of complications from Parkinson’s disease, survived by his wife, Olivia, and nine children. /relationships /key A few months after his death, the FBI attempted to gain access to his files as part of the AIPAC case on the grounds that the information could hurt U.S. government interests. F.2.2 Kerry Packer key Kerry Francis Bullmore Packer AC (17 December 1937 26 December 2005) was an Australian publishing, media and gaming tycoon. /key character work fame He was famous for his outspoken nature, wealth, expansive business empire and clashes with the Australian Taxation Office and the Costigan Commission. /fame /work /character fame At the time of his death, Packer was the richest and one of the most influential men in Australia. /fame fame In 2004 Business Review Weekly magazine estimated Packer’s net worth at AUD 6.5 billion ($6.5 billion; about USD 4.7 billion). /fame F.2.3 Richard Pryor key Richard Franklin Lennox Thomas Pryor III (December 1, 1940 December 10, 2005) was an American comedian, actor, and writer. /key character fame Pryor was a gifted storyteller known for unflinching examinations of race and custom in modern life, and was well-known for his frequent use of colorful language, vulgarities, as well as such racial epithets as “nigger,” “honky,” and “cracker”. /fame /character He reached a broad audience with his trenchant observations, although public opinion of his act was often divided. fame He is commonly regarded as one of the most important stand up comedians of his time: Jerry Seinfeld called Pryor “The Picasso of our profession.” /fame work His catalog includes such concert movies and recordings as Richard Pryor: Live and Smokin’ (1971), That Nigger’s Crazy (1974), Bicentennial Nigger (1976), Richard Pryor: Wanted Live In Concert (1979) and Richard Pryor: Live on the Sunset Strip (1982). /work work He also starred in numerous films as an actor, usually in comedies such as the classic Silver Streak, but occasionally in the noteworthy dramatic role, such as Paul Schrader’s film Blue Collar. /work work He also collaborated on many projects with actor Gene Wilder. /work fame He won an Emmy Award in 1973, and five Grammy Awards in 1974, 1975, 1976, 1981, and 1982. /fame fame In 1974 he also won two American Academy of Humor awards and the Writers Guild of America Award. 
/fame

F.2.4 Stanley Williams

key fame Stanley Tookie Williams III (December 29, 1953 - December 13, 2005), was an early leader of the Crips, a notorious American street gang which had its roots in South Central Los Angeles in 1969. /fame /key key fame In December 2005 he was executed for the 1979 murders of Albert Owens, Yen-Yi Yang, Tsai-Shai Lin, and Yee-Chen Lin. /fame /key key fame While in prison, Williams refused to aid police investigations with any information against his gang, and was implicated in attacks on guards and other inmates as well as multiple escape plots. /fame /key character fame In 1993, Williams began making changes in his behavior, and became an anti-gang activist while on Death Row in California. /fame /character fame Although he continued to refuse to assist police in their gang investigations, he renounced his gang affiliation and apologized for the Crips' founding, while maintaining his innocence of the crimes for which he was convicted. /fame fame He co-wrote children's books and participated in efforts intended to prevent youths from joining gangs. /fame fame A 2004 biographical TV-movie entitled Redemption: The Stan Tookie Williams Story featured Jamie Foxx as Williams. /fame key On December 13, 2005, Williams was executed by lethal injection amidst debate over the death penalty and whether his anti-gang advocacy in prison represented genuine atonement. /key

APPENDIX G
The Corrected Re-sampled t-test

This appendix describes the corrected re-sampled t-test, as presented by Bouckaert and Frank (2004).¹ The test was implemented in the Perl programming language.²

G.1 An Outline of the Corrected Re-sampled t-test

Let $a_i$ be the accuracy of algorithm A measured on run $i$, and let $b_i$ be the accuracy of algorithm B measured on run $i$, where $1 \leq i \leq n$. Let $n$ be the total number of runs, $n_1$ the number of instances used for training and $n_2$ the number of instances used for testing. Let $d_i = a_i - b_i$ (that is, $d_i$ equals the difference between the accuracy of Algorithm A and Algorithm B on run $i$), and let $\hat{\sigma}^2_d$ be the estimate of the variance of the differences (that is, the square of the standard deviation of the differences). Equation G.1 shows the statistic for the corrected re-sampled t-test.³

$$t = \frac{\frac{1}{n}\sum_{i=1}^{n} d_i}{\sqrt{\left(\frac{1}{n} + \frac{n_2}{n_1}\right)\hat{\sigma}^2_d}} \qquad \text{(G.1)}$$

Bouckaert and Frank (2004) point out that the difference between the corrected re-sampled t-test and the "standard" t-test is the substitution of $\frac{1}{n}$ in the "standard" statistic by $\frac{1}{n} + \frac{n_2}{n_1}$ in the corrected version.

¹ The exposition follows Bouckaert and Frank (2004) closely.
² The code is available at http://www.dcs.shef.ac.uk/~mac/t-test.pl
³ Note that the test is used in conjunction with the Student t distribution with $n - 1$ degrees of freedom.
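To make the calculation concrete, the following is a minimal Perl sketch of the statistic (Perl being the language used for the thesis implementation); it is an illustration only, not the t-test.pl script referred to in the footnote above, and the accuracy values and train/test sizes below are invented for the example.

```perl
#!/usr/bin/perl
# Minimal sketch of the corrected re-sampled t-test statistic (Equation G.1),
# following Bouckaert and Frank (2004). All input values below are invented
# purely for illustration.
use strict;
use warnings;

# Accuracies of algorithm A and algorithm B over n runs (illustrative).
my @acc_a = (0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.81, 0.80, 0.79, 0.82);
my @acc_b = (0.78, 0.77, 0.80, 0.79, 0.80, 0.76, 0.78, 0.79, 0.77, 0.80);

my $n1 = 900;    # number of instances used for training (assumed)
my $n2 = 100;    # number of instances used for testing (assumed)

my $n = scalar @acc_a;                               # total number of runs
my @d = map { $acc_a[$_] - $acc_b[$_] } 0 .. $n - 1; # per-run differences d_i

# Mean of the differences.
my $mean = 0;
$mean += $_ for @d;
$mean /= $n;

# Estimate of the variance of the differences.
my $var = 0;
$var += ($_ - $mean) ** 2 for @d;
$var /= ($n - 1);

# Corrected re-sampled t statistic: the 1/n of the standard statistic is
# replaced by (1/n + n2/n1).
my $t = $mean / sqrt((1 / $n + $n2 / $n1) * $var);

printf "t = %.4f (compare against a t distribution with %d degrees of freedom)\n",
       $t, $n - 1;
```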
APPENDIX H
Factor Analysis

This appendix provides further details on the process of factor analysis (FA) used by Biber (1988), as it was felt that a lengthy digression on statistics in the main body of the thesis would be inappropriate. Note that this appendix is only designed to give a flavour of the strengths and weaknesses of the technique.

FA has traditionally been grouped as a technique within multivariate statistics. Multivariate statistics, as the name suggests, looks at the patterns of relationships between variables. Other techniques, along with FA, belonging to the multivariate statistics group are cluster analysis and multidimensional scaling (Chatfield and Collins, 1980). FA examines correlations between variables, and seeks to describe observed correlations in terms of underlying factors. Factors, in this sense, are hypothetical variables on which individuals (individual people, individual texts, and so on) can differ.

FA has been used extensively in psychology to explore the underlying dimensions of personality, for example the so-called "Big Five" personality types (Coolican, 2004).¹ Often in this kind of research, participants were presented with a list of several hundred "trait adjectives" (for example, talkative, quiet). Correlations between all these trait adjective variables were then calculated, forming a matrix, and factors were then identified using matrix algebra. The factors were then named (that is, interpreted by the researcher) by identifying the particular characteristics of the identified variables. For example, if we were conducting personality research and identified a factor which had the positive variables diligent, tidy, punctual, and frugal, and that same factor had the negative variables tardy, messy, spendthrift, and lazy, we might want to call the factor "Responsible vs Irresponsible".² There are two main stages in FA: first, constructing a correlational matrix, and second, manipulating that matrix to identify factors.

¹ The "Big Five" types are: Extraversion, Agreeableness, Conscientiousness, Neuroticism and Openness to Experience.
² The earliest use of FA, Spearman (1904), was in identifying dimensions of intelligence rather than personality.

H.1 Constructing a Correlational Matrix

Biber (1988) identified sixty-seven features and investigated the frequency of these features across different genres (see Figure 2.4 on page 17 for a selection of the features used and Appendix D on page 259 for a more comprehensive listing). Four hundred and eighty-one text documents were used (that is, all the genres). The correlations between each variable pair were then placed in a correlation matrix (that is, a matrix of size $67 \times 67$, consisting of 4489 correlations).

A correlation is a measure of the extent to which a change in one random variable corresponds to a change in another random variable. There are two things to consider when talking about correlations: the strength of the correlation and its direction (positive or negative). The correlation between two random variables can range from +1 to -1. A +1 correlation is a perfect positive correlation (that is, the two variables always appear together). A -1 correlation is a perfect negative correlation (that is, they never occur together). A 0.8 correlation is a strong, positive correlation (that is, the two variables are likely to appear together). A -0.8 correlation is a strong, negative correlation (that is, the two variables are unlikely to appear together). It is assumed that features which frequently occur together share one (or more) communicative functions (that is, that the correlations indicate some underlying linguistic dimension, rather than just being correlations).

Individual correlations were calculated using Pearson's correlation, a statistic that assumes the data is normally distributed.³ The procedure (described in Oakes (1998), which this account follows closely) involves identifying the following quantities from the genre frequency tables:

1. The sum of all values of variable 1 ($\sum x$).
2. The sum of all values of variable 2 ($\sum y$).
3. The sum of the squares of $x$ ($\sum x^2$).
4. The sum of the squares of $y$ ($\sum y^2$).
5. The sum of the products over all data pairs ($\sum xy$).
6. The number of documents ($N$).
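In terms of these six quantities, Pearson's correlation coefficient can be computed with the standard computational formula below; this is a sketch that assumes Equation H.1 takes the form given in Oakes (1998):

$$r = \frac{N\sum xy - \sum x \sum y}{\sqrt{\left(N\sum x^2 - (\sum x)^2\right)\left(N\sum y^2 - (\sum y)^2\right)}}$$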
Equation H.1 (again based on O AKES (1998)) shows how Pearson’s correlation coefficient can be easily computed. (H.1) 3 The correlation matrix used by B IBER (1988) is published as an Appendix to that book. 275 A PPENDIX H: FACTOR A NALYSIS H.2 Identifying Factors from a Correlational Matrix There are numerous methods available for performing factor analysis (see C ATTELL (1979) for a review of techniques and applications). The method used in B IBER (1988) was principal factor analysis (also known as common factor analysis). Note that as B IBER (1988) did not provide extensive details of the factor analysis technique used, this account is rather generic. Further, FA is a complex technique and detailed descriptions of its operation are largely skirted around in the literature concerned with Multidimensional analysis (see L EE (1999)). A good, book length treatment, aimed at non-mathematicians is K LINE (1994). Factor analysis can be broken down into several steps: 1. A correlational matrix is produced (see above). 2. Identify (extract) factors using matrix manipulation techniques. There are two approaches to identifying factors, one uses geometry (that is, treating the relationships between co-efficients as angles) and one uses matrix algebra (this is the method implemented in SPSS and other statistics software (K LINE, 1994)). 3. Identify the optimum number of factors. Any number of factors (less than or equal to the number of variables) can be extracted, but if the number of factors is equal to the number of variables, then the explanatory power of the identified factors is questionable. Normally, the first few factors are most important, and the remaining factors discarded. There is no single technique for identifying a cut off point for the number of factors, but various heuristic techniques have been developed. Once the number of factors has been decided upon, they are interpreted and named. . B IBER (1988) used statistical software (SPSS) to produce the correlation matrix and perform the FA. SPSS automatically identifies the optimal number of factors using a heuristic, although this decision can be overridden by the researcher if required. SPSS also provides several options for rotating the matrix, including the varimax method employed by (B IBER, 1988). 276 Bibliography A DAMNAN (c 690). Life of St Columba. 1995, Penguin, London. A MIS , M. (1985). Money. Penguin, London. A RGAMON , S., K OPPE , M., F INE , J., AND S HIMON , A. (2003). Gender, Genre and Writing Style in Formal Written Texts. Text, 23(3). A RISTOTLE (c 340 BC). Aristotle on the Art of Fiction: An English Translation of the Poetics. 1962, Cambridge University Press. A RMSTRONG , E. (1991). The Potential of Cohesion Analysis in the Analysis and Treatment of Aphasic Discourse. Clinical Linguistics and Phonetics, 5(1). A RTSTEIN , R. AND P OESIO , M. (2005). Bias Decreases in Proportion to the Number of Annotators. In The Proceedings of the 10th Conference on Formal Grammar and the 9th Meeting on Mathematics of Language, pages 141–150. ATKINSON , D. (1992). The Evolution of Medical Research Writing from 1735 to 1985: The Case of the Edinburgh Medical Journal. Applied Linguistics, 13:337–374. A UDEN , W. (1935). Collected Poems. 1991, Vintage, London. B AL , M. (1985). Narratology: Introduction to the Theory of Narrative. University of Toronto Press, Toronto. B ATTLE , M. (2004). Library: An Unquiet History. Vintage, London. B EDE (c 700). The Age of Bede. 1998, Penguin,, London. B IBER , D. (1988). 