Automated Vs. Human Analysis of Two English Plays

A Comparative Study:
Automated Vs. Human Analysis of Two English Plays
An MA Thesis
Submitted to the Institute of Language Studies and Translation
Faculty of Arts
Alexandria University
By:
Mervat Mahmoud Ali Ahmed
Under the Supervision of
Prof. Zeinab M. Raafat
Professor of English Literature
Faculty of Arts
Alexandria University
Associate Prof. Sameh A. Ansary
Associate Prof. of Computational Linguistics
Department of Phonetics and Linguistics
Faculty of Arts
Alexandria University
Acknowledgements
Before all, I thank Allah for guiding me throughout my life, showing me the right path and
giving me the strength to pursue my studies in a field so close to my heart.
I wish to express my deepest gratitude and sincere appreciation to Prof. Zeinab M. Raafat. She
has been and will always be my mentor. It has been an honour to work with her and learn from
her academic expertise and personal dedication.
I would also like to express my utmost thanks and appreciation to Prof. Sameh A. Ansary for his
precious help, support and patience in supervision. His continuous guidance has been invaluable.
I will never cease to learn from his expertise in the field of computational linguistics.
I am deeply indebted to Prof. Hassan A. Taman, who opened my eyes and mind to the
boundless realm of Applied Linguistics. His sincere love for and dedication to his work were the
main force pushing this research forward. I only wish he were still with us so that this work
could make him proud. May God bless his soul.
I would like to thank my entire amazing family for their continuous love and support. I would
like to thank Mohamed, my husband and soul mate, who has always been supportive and
encouraging, putting up with my long working hours and mood swings. I could not have done it
without him. I would also like to thank my mother, my brother Mohamed, my lovely sisters, Mai
and Maha, and my aunt for being there for me at all times.
My thanks also go to all my friends and colleagues at work for their constant encouragement.
To my mother, who has been and will always remain my anchor, my sail and
my guiding star. Her belief in me and her constant prayers were the main forces
pushing me, as well as this work, forward “to see the light”.
Table of Contents
______________________________________________________________________________
List of Abbreviations …………………………………………………………………………...vii
List of Tables …………………………………………………………………………………...viii
List of Figures …………………………………………………………………………………..ix
Abstract …………………………………………………………………………………………x
Introduction ………………………………………………………………………………… 1
Chapter 1 Theoretical Background: Computers and Text Analysis …………………… 4
1.1 Introduction ………………………………………………………………………… 4
1.2 Argument for and against Computer-Aided Text Analysis ………………………… 5
1.3 Natural Language Processing and Computational Linguistics ……………………… 9
1.3.1 Computational Syntax and Semantics ………………………………………… 10
1.3.2 Computational Discourse ……………………………………………………… 11
1.4 Concordances ………………………………………………………………………… 12
1.5 Application Areas for Computational Linguistics ………………………………… 14
1.5.1 Lexicography …………………………………………………………………… 14
1.6 Overview of Computer-Aided Discourse Analysis ………………………………… 16
1.7 Overview of Computer-Assisted Stylistic Analysis ………………………………… 19
1.8 Comparing Human Analysis to Computer-Aided Analysis ………………………… 28
1.9 Corpus Linguistics …………………………………………………………………… 28
1.10 CATA Software Selection ………………………………………………………… 30
1.11 T-LAB Selection …………………………………………………………………… 32
1.11.1 Introduction to T-LAB ……………………………………………………… 32
1.11.2 T-LAB Pre-processing Steps ………………………………………………… 35
1.11.2.1 Corpus Normalization and Disambiguation Operation ……………… 35
1.11.2.2 Linguistic Dictionaries and Lemmatization ………………………… 36
1.11.2.3 Corpus Segmentation ………………………………………………… 37
1.11.2.4 Multi-Word and Stop-Word detection …………….…. 38
1.11.2.5 Vocabulary building and Key-Terms selection …....... 39
Chapter 2 Data Analysis I: “She Stoops to Conquer” ………………………….…… 40
2.1 Introduction ……………………………………………………………….……….. 40
2.2 Part One: Human Analysis ……………………………………………….……….. 40
2.2.1 Register ………………………………………………………….………. 40
2.2.1.1 Marlow with Kate ………………………………………….….. 41
2.2.1.2 Kate with Marlow ……………………………………………... 42
2.2.1.3 Marlow with Mr. Hardcastle ……………………………….…... 44
2.2.1.4 Mr. Hardcastle with Marlow …………………………………... 47
2.2.2 Signs of Formality and Informality …………………………………….… 49
2.2.3 Dialect ……………………………………………………………………. 53
2.2.4 Repetition ………………………………………………………………… 56
2.2.5 Slang ……………………………………………………………………… 58
2.2.6 Naming ………...…………………………………………………………. 60
2.2.7 Figurative Language ……………………………………………………… 62
2.2.8 Archaic Language ……………………………………………………........ 66
2.3 Part Two: Computer-Aided Analysis ………………………………………………. 67
2.3.1 Introduction ………………………………………………………………. 67
2.3.2 “She Stoops to Conquer” CATA Results that agree with Human Analysis .. 67
2.3.2.1 Register ……………………………………………………………68
2.3.2.2 Repetition …………………………………………………………74
2.3.2.3 Dialect and Slang …………………………………………………77
2.3.3 “She Stoops to Conquer” CATA Additional Semantic Contribution ……... 80
2.3.3.1 Importance of Word Frequencies …………………………………80
2.3.3.1.1 Fortune and Marriage theme ……………………………80
2.3.3.2 Importance of Sequence Analysis Tool …………………………..85
2.3.3.2.1 The structure “But_a” …………………………………..85
2.3.3.3 Importance of thematic and cluster analysis ……………………..88
2.3.3.3.1 Parents-Children Relationship theme ……………..88
Chapter 3 Data Analysis II: “The Caretaker” ……………………………………95
3.1 Introduction ………………………………………………………………………95
3.2 Part One: Human Analysis ……………………………………………………….95
3.2.1 Repetition ……………………………………………………………….95
3.2.2 Rhythm …………………………………………………………………99
3.2.3 Register ………………………………………………………………..101
3.2.4 Status Marked through Language ……………………………………..103
3.2.5 Pause and Silence ……………………………………………………...104
3.2.6 Turn-taking Technique ………………………………………………...108
3.2.7 Long Dialogues and Small Talk ……………………………………….109
3.2.8 Question Forms ………………………………………………………..111
3.2.9 Naming and Pronouns ……………..…………………………………..112
3.2.10 Stage Directions and Body Language …………………………………113
3.3 Part Two: Computer-Aided Analysis ……………………………………………116
3.3.1 Introduction ……………………………………………………………116
3.3.2 “The Caretaker” CATA Results that agree with Human Analysis ….. 116
3.3.2.1 Repetition ……………………………………………………116
3.3.2.2 Register ……………………………………………………… 122
3.3.2.3 Turn-taking and Interrogatives ……………………………… 128
3.3.2.4 Pause ………………………………………………………… 130
3.3.2.5 Silence ………………………………………………………. 131
3.3.2.6 Signs of Formality and Informality ………………………….131
3.3.3 “The Caretaker” CATA Additional Semantic Contribution ………….136
3.3.3.1 Importance of Comparison between Word Pairs ……………136
3.3.3.2 Importance of List of Word Frequencies ……………………138
3.3.3.3 Importance of thematic and cluster analysis ………..……….143
Conclusion ………………………………………………………………………….145
Appendix [1] ……………………………………………………………………......147
Appendix [2] ………………………………………………………………………..153
References ………………………………………………………………………….160
List of Abbreviations
________________________________________________________________________
AI      Artificial Intelligence
CATA    Computer-Assisted Text Analysis
CALL    Computer Assisted Language Learning
CDA     Critical Discourse Analysis
CL      Computational Linguistics
CU      Context Units
HMM     Hidden Markov Models
HTML    Hyper Text Markup Language
IR      Information Retrieval
KWIC    Key Word in Context
LU      Lexical Units
MT      Machine Translation
NLP     Natural Language Processing
OED     Oxford English Dictionary
SL      Source Language
SLT     Spoken-Language Translation
TRP     Transition Relevance Place
UNLP    Universal Networking Language Project
List of Tables
____________________________________________________________________________
Table (1): Word Net Semantic Relations …… 11
Table (2): Country dialect in “She Stoops to Conquer” – Part 1 …… 54
Table (3): Country dialect in “She Stoops to Conquer” – Part 2 …… 55
Table (4): T-LAB word association tool: “Marlow” and “Hardcastle” in “She Stoops to Conquer” …… 69
Table (5): T-LAB Concordance of the lemma “Fellow” in “She Stoops to Conquer” …… 70
Table (6): T-LAB word association tool: “Marlow” and “Madam” in “She Stoops to Conquer” …… 71
Table (7): T-LAB word association tool: “Marlow” and “Child” in “She Stoops to Conquer” …… 73
Table (8): T-LAB word association tool: “Tony” and “Ecod” in “She Stoops to Conquer” …… 76
Table (9): T-LAB Concordance of the lemma “Servant” in “She Stoops to Conquer” …… 78
Table (10): T-LAB Concordance of the lemma “Diggory” in “She Stoops to Conquer” …… 79
Table (11): T-LAB Modeling of emerging themes tool: “Fortune” theme Part 1 …… 81
Table (12): T-LAB Modeling of emerging themes tool: “Fortune” theme Part 2 …… 83
Table (13): T-LAB Modeling of emerging themes tool: “Fortune” theme Part 3 …… 84
Table (14): Syntactic analysis of “but a” structure in “She Stoops to Conquer” …… 86
Table (15): T-LAB Concordance of the structure “But a” in “She Stoops to Conquer” …… 87
Table (16): T-LAB Modeling of emerging themes tool: “Age” theme in “She Stoops to Conquer” …… 89
Table (17): T-LAB thematic cluster analysis in “She Stoops to Conquer” …… 91
Table (18): Davies dialect in “The Caretaker” …… 101
Table (19): Mick’s slang in “The Caretaker” …… 102
Table (20): T-LAB Concordance of the lemma “Black” in “The Caretaker” …… 117
Table (21): T-LAB Modeling of emerging themes tool: “Jenkins” theme in “The Caretaker” …… 119
Table (22): T-LAB Modeling of emerging themes tool: “Sidcup” theme in “The Caretaker” …… 121
Table (23): T-LAB Concordance of the lemma “Davies” in “The Caretaker” …… 123
Table (24): T-LAB Modeling of emerging themes tool: “Mick” theme in “The Caretaker” …… 126
Table (25): T-LAB word association tool: “Davies” and “Mick” in “The Caretaker” …… 129
Table (26): T-LAB word association tool: “Mate” and “Davies” in “The Caretaker” …… 133
Table (27): T-LAB Concordance of the lemma “Boy” in “The Caretaker” …… 134
Table (28): T-LAB Modeling of emerging themes tool: “Call” theme in “The Caretaker” …… 135
Table (29): T-LAB word association tool: “Brother” and “Mick” in “The Caretaker” …… 137
Table (30): T-LAB word association tool: “Davies”, “Brother” and “Mick” …… 138
Table (31): T-LAB Modeling of emerging themes tool: “Bed” theme in “The Caretaker” …… 140
Table (32): T-LAB Modeling of emerging themes tool: “Good” theme in “The Caretaker” …… 141
Table (33): T-LAB thematic cluster analysis in “The Caretaker” …… 143
List of Figures
____________________________________________________________________________
Figure (1): T-LAB automatic lemmatization …… 37
Figure (2): T-LAB word association tool: “Marlow” in “She Stoops to Conquer” …… 68
Figure (3): T-LAB sequence analysis tool with “Marlow” in “She Stoops to Conquer” …… 71
Figure (4): T-LAB sequence analysis tool with “Ecod” in “She Stoops to Conquer” …… 74
Figure (5): T-LAB tool for comparison between word pairs: “Tony” and “Fellow” in “She Stoops to Conquer” …… 75
Figure (6): T-LAB pie chart: Percentage of “Ecod” with “Tony” and “Fellows” in “She Stoops to Conquer” …… 75
Figure (7): T-LAB key contexts for thematic words tool in “She Stoops to Conquer” …… 78
Figure (8): T-LAB word association tool: “Fortune” in “She Stoops to Conquer” …… 80
Figure (9): T-LAB bar chart: Percentage of “happiness” with “love” and “fortune” in “She Stoops to Conquer” …… 85
Figure (10): T-LAB sequence analysis tool with “But_a” in “She Stoops to Conquer” …… 86
Figure (11): T-LAB word association tool: “Black” in “The Caretaker” …… 116
Figure (12): T-LAB sequence analysis tool with “Name” in “The Caretaker” …… 118
Figure (13): T-LAB sequence analysis tool with “Pause” in “The Caretaker” …… 130
Figure (14): T-LAB sequence analysis tool with “Silence” in “The Caretaker” …… 131
Figure (15): T-LAB word association tool: “Mate” in “The Caretaker” …… 132
Figure (16): T-LAB tool for comparison between word pairs: “Brother” and “Mick” in “The Caretaker” …… 136
Figure (17): T-LAB bar chart: Percentage of “Worry” with “Brother” and “Mick” in “The Caretaker” …… 137
Figure (18): T-LAB sequence analysis tool with “Bed” in “The Caretaker” …… 139
Abstract
_____________________________________________________________________________
For centuries, text analysis has depended on human analysts, relying entirely on human
intuition to analyze written and spoken discourse. This began to change about 50 years ago.
With the advance of technology and the spread of computers into many fields, linguists
started to ask a new question: why not use computers in text analysis?
Since the 1960s, the use of computers in text analysis has been gaining ground. However, as
is the case with all new ideas and methods, many linguists attacked it. They had doubts: Can
computer software really assist in text analysis? If so, to what extent? Even if it helps with
ordinary texts, can it help in the stylistic analysis of literary texts?
In this thesis, I try to answer a vital question: can computer software assist in the stylistic
analysis of literary texts today? To answer this question, a text analysis software package called
T-LAB is used to assist in a stylistic analysis of two literary texts, and its output is then
compared to a purely human stylistic analysis of the same texts carried out beforehand.
The two literary texts analyzed are English plays: the 18th-century “She Stoops to Conquer”
by Oliver Goldsmith and the 20th-century “The Caretaker” by Harold Pinter.