Sentence complexity and clause linking in African academic writing:
a critical empirical comparison

Josef Schmied
Chair English Language & Linguistics
Chemnitz University of Technology
www.tu-chemnitz.de/phil/english/schmied
[email protected]

Conference of the International Association for World Englishes (IAWE)
World Englishes and World's Languages
December 3-5, 2008
City University of Hong Kong, Hong Kong SAR, China
IAWE08

Outline: Aims | Databases/Corpora | Concepts & Methods | Comparison: ICE | Comparison: Nordic | Outlook

Aims

• compare discourse features of African Englishes
• discuss key concepts of complexity and cohesion
• present a pilot study of new methodologies of analysis
  - a top-down/text-based automatic text-analysis of statistical complexity variables using ComplexAna
  - a bottom-up/item-based human analysis of different adverb-types
• use an old (ICE-EA) and a new corpus (NORDIC Journal) and compare results
• test standard hypotheses on differences between Kenyan and Tanzanian English and discuss East African English in a larger African perspective

1. Databases/Corpora

• International Corpus of English – East Africa (1990-96):
  a stratified corpus of English as a Second Language: 500 text-types of 2000 words each
• Nordic Journal of African Studies:
  English as an Academic lingua franca, mainly for African scholars

Appendix 6: List of written texts from Tanzania (word count)

PRINTED

Category                                       Texts                   Words
Informational: Learned
  Humanities                                   W2A001T – W2A010T      20.172
  Social Science                               W2A011T – W2A020T      20.151
  Natural Science                              W2A021T – W2A027T      20.114
  Technology/Agriculture/Environmental dev.    W2A031T – W2A040T      20.148
  total                                                               80.585
Informational: Popular
  Humanities                                   W2B001T – W2B010T      20.133
  Social Science                               W2B011T – W2B020T      20.223
  Natural Science                              W2B021T – W2B24T        6.542
  Technology/Agriculture/Small Industry        W2B031T – W2B040T      20.065
  General                                      W2BGEN1T – W2BGEN8T    13.789
  total                                                               80.752
Informational: Reportage
  Splash                                       W2C001T – W2C0010T     20.018
  Reportage/Features                           W2C011T – W2C020T      20.139
  total                                                               40.157
Instructional
  administrative/regulatory                    W2D001T – W2D010T      20.120
Persuasive
  Institutional                                W2E001T – W2E010T      20.078
  Personal Column                              W2E011T – W2E020T      20.125
  total                                                               40.203

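The category totals in Appendix 6 can be checked against the individual word counts. The following is a minimal sketch, assuming that the dots in the counts are thousands separators; the names `APPENDIX_6`, `parse_count` and `check_totals` are illustrative and not part of the original material.

```python
# Word counts per category as listed in Appendix 6.
# Assumption: the dot is a thousands separator ("20.172" = 20172 words).
APPENDIX_6 = {
    "Informational: Learned": (["20.172", "20.151", "20.114", "20.148"], "80.585"),
    "Informational: Popular": (["20.133", "20.223", "6.542", "20.065", "13.789"], "80.752"),
    "Informational: Reportage": (["20.018", "20.139"], "40.157"),
    "Persuasive": (["20.078", "20.125"], "40.203"),
}


def parse_count(text):
    """Convert a count such as '20.172' to the integer 20172."""
    return int(text.replace(".", ""))


def check_totals(table):
    """Return {category: (computed sum, listed total)} for every category."""
    return {
        name: (sum(parse_count(c) for c in counts), parse_count(total))
        for name, (counts, total) in table.items()
    }


if __name__ == "__main__":
    for name, (computed, listed) in check_totals(APPENDIX_6).items():
        print(f"{name}: computed {computed}, listed {listed}")
```

Under this reading every listed total equals the sum of its subcategories, which supports the thousands-separator interpretation.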
2. Concepts & Methods

automatic corpus processing:
• POS tagging using Penn Treebank + TreeTagger
• ComplexAna: Complexity Analyser
  - morphosyntactic (type/token) and semantic (unknown words in WordNet)
  - with flexible parameter weight

human corpus analysis:
• adjuncts: linking, modal, evaluative, domain-specific
• searches with AntConc for 181 adverbs

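The two kinds of indicator named above (a type/token measure and the share of words unknown to a lexicon) and the idea of a flexible parameter weight can be illustrated with a small sketch. This is not ComplexAna's implementation: the tiny word set `KNOWN_WORDS` is a hypothetical stand-in for a WordNet lookup, the sketch checks all tokens rather than nominals only, and the function names and default weights are assumptions made for illustration.

```python
import re

# Hypothetical stand-in for a WordNet lookup: a tiny set of "known" words.
KNOWN_WORDS = {"language", "corpus", "text", "writer", "sentence"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def unknown_share(words, lexicon):
    """Share of the given words that the lexicon does not contain."""
    return sum(w not in lexicon for w in words) / len(words) if words else 0.0


def complexity_score(text, lexicon=KNOWN_WORDS, w_morph=0.5, w_sem=0.5):
    """Weighted sum of a morphosyntactic and a semantic indicator.

    The weights are illustrative; changing them shifts the balance
    between the type/token indicator and the unknown-word indicator.
    """
    tokens = tokenize(text)
    return w_morph * type_token_ratio(tokens) + w_sem * unknown_share(tokens, lexicon)
```

For example, `complexity_score("The corpus, the text!")` combines a type/token ratio of 3/4 with an unknown-word share of 2/4, because "the" is not in the toy lexicon.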
2.1. Automatic analysis with ComplexAna

ComplexAna
• POS-tags texts
• counts types/tokens
• identifies nominal items
• processes stoplist(s)
• searches WordNet for nominals
• calculates single score of semantic complexity

2.2. Human analysis with AntConc

Table: Modal adjuncts in Huddleston/Pullum 2006:768

strong
  i    assuredly, certainly, clearly, definitely, incontestably, indubitably,
       ineluctably, inescapably, manifestly, necessarily, obviously, patently,
       plainly, surely, truly, unarguably, unavoidably, undeniably, undoubtedly,
       unquestionably
  ii   apparently, doubtless, evidently, presumably, seemingly
  iii  arguably, likely, probably
  iv   conceivably, maybe, perhaps, possibly
weak

Evaluative adjuncts in Huddleston/Pullum 2006:771

Adverb list with tags (excerpt)
Text columns: BW1A, BW2L, CM1L, CM2H, ET1C, ET2H, GH1L, KE1L, KE2E

Adverb            tag
accordingly       v3
accurately        v8
actually          v4
additionally      link
admittedly        v4
alternatively     link
analytically      s
apparently        m9
artificially      s
astonishingly     m9
asymmetrically    s
asymptotically    s
automatically     s
autonomously      s
basically         v2
briefly           v2
carefully         v9

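The item-based search can be sketched as a lookup of each word form in the tagged adverb list, with counts kept per tag. This is only an illustration of the counting step, not of AntConc itself: `ADVERB_TAGS` holds a few entries from the list shown on the slide, and the mapping of tag prefixes to classes (m = modal, s = specific, v = evaluative) is an assumption based on the tag names.

```python
import re
from collections import Counter

# A few entries from the adverb list on the slide (not the full 181 adverbs).
ADVERB_TAGS = {
    "accordingly": "v3", "actually": "v4", "additionally": "link",
    "alternatively": "link", "apparently": "m9", "automatically": "s",
    "basically": "v2", "briefly": "v2", "carefully": "v9",
}


def tag_class(tag):
    """Map a tag to a class name (assumed: m* modal, s specific, v* evaluative)."""
    if tag == "link":
        return "link"
    return {"m": "modal", "s": "specific", "v": "evaluative"}[tag[0]]


def count_adverbs(text, tags=ADVERB_TAGS):
    """Count listed adverbs in the text, per tag and per class."""
    by_tag, by_class = Counter(), Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        tag = tags.get(word)
        if tag is not None:
            by_tag[tag] += 1
            by_class[tag_class(tag)] += 1
    return by_tag, by_class
```

A purely form-based count like this cannot tell whether an adverb is actually used as an adjunct in a given sentence, which is why the slide describes this step as human analysis.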
3. Comparison: Ke and Tz in ICE-EA

Hypotheses:
• Complexity ICE-KE > ICE-TZ
• Clause linking by adjuncts ICE-KE > ICE-TZ
including a few surprises

3.1. Complexity of informational learned texts in ICE-EA

informational learned

COMPLEXITY variables                            ldhumK   ldhumT  ldnatsK  ldnatsT  ldsocsK  ldsocsT ldtechsK ldtechsT
Number of Tokens                                 22305    22378    22210    22264    22154    22678    22098    22386
Number of Words                                  19831    19785    19484    19496    19694    19727    19788    19751
Maximum number of words in a sentence               89      119       85      109      103       98       87      111
Mean number of words in a sentence                  22       23       19       24       24       23       22       23
Number of nouns in text                           5672     6039     6513     6252     6062     6251     6363     6569
Number of nouns considered (not in stoplist)      1941     1739     2137     1899     1797     1706     1746     1793
Nouns known to WordNet (%)                       82.12    85.68    83.11    83.78    89.43    88.45     88.2    90.57
Nouns unknown to WordNet (%)                     17.88    14.32    16.89    16.22    10.57    11.55     11.8     9.43
Nouns not in frequency list (%)                   59.2    58.77    65.28    61.98    55.48    54.34    55.73    53.88
Maximum length of a considered noun                 37       38       26       32       28       37       58       39
Mean length of a considered noun                  7.84     7.82     7.62     7.77     7.86     8.01     7.78     7.85
Number of commas                                   803      781      841      815      878     1054      793      874
Max. number of commas/sentence                       8        7       11        7       12       16       10       12
Max. degree of noun specification                   16       16       18       16       16       15       16       16
Degree of Semantic Specialization of text         8.27     8.17     8.35     8.33     8.36     8.31     8.18     8.31
Degree of Semantic Difficulty                    21.95    21.47    23.56    22.31    20.42    20.07    20.82    20.07

COMPLEXITY variables                                Σ KE        Σ TZ        Ø KE        Ø TZ
Number of Tokens                                   88767       89706    22191.75     22426.5
Number of Words                                    78797       78759    19699.25    19689.75
Maximum number of words in a sentence                364         437          91      109.25
Mean number of words in a sentence                87.215      93.001   21.803787   23.250357
Number of nouns in text                            24610       25111      6152.5     6277.75
Number of nouns considered (not in stoplist)        7621        7137     1905.25     1784.25
Nouns known to WordNet (%)                        342.86      348.48      85.715       87.12
Nouns unknown to WordNet (%)                       57.14       51.52      14.285       12.88
Nouns not in frequency list (%)                   235.69      228.97     58.9225     57.2425
Maximum length of a considered noun                  149         146       37.25        36.5
Mean length of a considered noun                    31.1       31.45   7.7750675   7.8624543
Number of commas                                    3315        3524      828.75         881
Max. number of commas/sentence                        41          42       10.25        10.5
Max. degree of noun specification                     66          63        16.5       15.75
Degree of Semantic Specialization of text         33.162      33.122   8.2906088   8.2805363
Degree of Semantic Difficulty                     86.746      83.923   21.686461   20.980642

Hypothesis supported: KE more complex

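The Σ and Ø columns of the learned-texts table are sums and means over the four Kenyan and the four Tanzanian subcorpora. A minimal sketch of that aggregation for the "Number of Tokens" row, assuming that the final letter of each subcorpus label marks the variety (K = KE, T = TZ); the function names are illustrative.

```python
# Token counts of the eight informational-learned subcorpora (from the table).
TOKENS = {
    "ldhumK": 22305, "ldhumT": 22378, "ldnatsK": 22210, "ldnatsT": 22264,
    "ldsocsK": 22154, "ldsocsT": 22678, "ldtechsK": 22098, "ldtechsT": 22386,
}


def by_variety(values):
    """Group values by the final letter of the label (K = KE, T = TZ)."""
    groups = {"KE": [], "TZ": []}
    for label, value in values.items():
        groups["KE" if label.endswith("K") else "TZ"].append(value)
    return groups


def sum_and_mean(values):
    """Return {variety: (sum, mean)} over the subcorpora of each variety."""
    return {
        variety: (sum(xs), sum(xs) / len(xs))
        for variety, xs in by_variety(values).items()
    }


if __name__ == "__main__":
    for variety, (total, mean) in sum_and_mean(TOKENS).items():
        print(variety, total, mean)
```

For rows that are themselves means or percentages, the table's Ø values appear to be computed from unrounded per-text figures, so recomputing them from the rounded cells gives slightly different decimals.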
Complexity of informational popular texts in ICE-EA

informational-popular

COMPLEXITY variables                            ppgenT   pphumK   pphumT  ppnatsK  ppnatsT  ppsocsK  ppsocsT  pptechK  pptechT
Number of Tokens                                 15156    22382    22965    22217     7246    22249    22163    21958    22037
Number of Words                                  13487    19713    19784    19608     6390    19640    19487    19665    19429
Maximum number of words in a sentence               82       84       96       85       66       80      110       84       72
Mean number of words in a sentence                  23       22       21       20       23       21       24       22       23
Number of nouns in text                           4172     5927     6176     5714     2129     5885     6211     6192     6945
Number of nouns considered (not in stoplist)      1619     2510     2729     1986      982     2184     1807     1929     1943
Nouns known to WordNet (%)                       85.48    78.92    75.45    87.71    84.42    86.26    82.01    87.92    79.26
Nouns unknown to WordNet (%)                     14.52    21.08    24.55    12.29    15.58    13.74    17.99    12.08    20.74
Nouns not in frequency list (%)                   55.4    63.98    65.52    61.08    60.49     58.7    59.21    58.74    61.81
Maximum length of a considered noun                 53       42       43       33       24       31       33       33       41
Mean length of a considered noun                  7.49     7.16     7.13     7.33     7.40     7.54     7.34     7.43     7.51
Number of commas                                   523      986     1183      829      299      959      806      782      877
Max. number of commas/sentence                       9        8       13       11        8        9       10       11       12
Max. degree of noun specification                   16       15       15       17       17       17       17       17       18
Degree of Semantic Specialization of text         8.40     8.44     8.56     8.36     8.37     8.44     8.39     8.37     8.42
Degree of Semantic Difficulty                    21.06    23.05    23.76    22.05    22.17    21.73    22.26    21.56    23.43

COMPLEXITY variables                                Σ KE        Σ TZ        Ø KE        Ø TZ
Number of Tokens                                   88806       89567       22202       17913
Number of Words                                    78626       78577       19657       15715
Maximum number of words in a sentence                333         426       83.25        85.2
Mean number of words in a sentence                84.673      114.28       21.17       22.86
Number of nouns in text                            23718       25633      5929.5      5126.6
Number of nouns considered (not in stoplist)        8609        9080     2152.25        1816
Nouns known to WordNet (%)                        340.81      406.62       85.20       81.32
Nouns unknown to WordNet (%)                       59.19       93.38     14.7975      18.676
Nouns not in frequency list (%)                    242.5      302.43       60.63       60.49
Maximum length of a considered noun                  139         194       34.75        38.8
Mean length of a considered noun                  29.462       36.87        7.37        7.37
Number of commas                                    3556        3688         889       737.6
Max. number of commas/sentence                        39          52        9.75        10.4
Max. degree of noun specification                     66          83        16.5        16.6
Degree of Semantic Specialization of text         33.606       42.14        8.40        8.43
Degree of Semantic Difficulty                     88.389      112.69       22.10       22.54

Hypothesis doubted: TZ more complex!
→ less professional journalists?

3.2. Clause linkers

ICE-EA spoken

             Dia (1)     Dia (2)      Dialog    MonoScri      spoken
            KE    TZ    KE    TZ    KE    TZ    KE    TZ    KE    TZ
link       109    73    22     1   131    74   111    57   242   131
Σm1          1     0     1     0     2     0     2     0     4     0
Σm3          6     2     0     0     6     2     2     6     8     8
Σm6         13     3     0     1    13     4     8     9    21    13
Σm7          1     0     0     0     1     0     0     0     1     0
Σm9         49    18    16     0    65    18    31    41    96    59
modal       70    23    17     1    87    24    43    56   130    80
spec        17    12     4     2    21    14    27    11    48    25
Σv1         81    32    11     0    92    32    79    45   171    77
Σv2         52    13     8     3    60    16    13    11    73    27
Σv3         45    24    13     2    58    26    29    24    87    50
Σv4        216    23    50     5   266    28    20    56   286    84
Σv5         73    38    13     3    86    41    62    38   148    79
Σv6          9     4     0     0     9     4    17     6    26    10
Σv7        335   120    78    14   413   134   142   183   555   317
Σv8         29    20     4     3    33    23    34    42    67    65
Σv9         66    32     6     7    72    39    29    22   101    61
eval       906   306   183    37  1089   343   425   427  1514   770
total     1102   414   226    41  1328   455   606   551  1934  1006

Dia (1) and Dia (2) are the subcolumns DiaPriv and DiaPub (their order is not recoverable from the source); Dialog = Dia (1) + Dia (2); spoken = Dialog + MonoScri.

v4: 215 actually
v7: 268 only

Hypothesis supported: more clause linkers in KE

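The spoken table reports raw counts, and the sizes of the Kenyan and Tanzanian spoken subcorpora are not given with it. One size-independent way to read the figures, offered here only as an illustration and not as part of the original analysis, is the share of each adverb class within a variety's total. The sketch uses the class rows of the "spoken" column pair; `SPOKEN`, `column_totals` and `class_shares` are illustrative names.

```python
# Class totals for ICE-EA spoken as (KE, TZ), from the "spoken" column pair.
SPOKEN = {
    "link": (242, 131),
    "modal": (130, 80),
    "spec": (48, 25),
    "eval": (1514, 770),
}


def column_totals(rows):
    """Sum the KE and the TZ counts over all adverb classes."""
    ke = sum(k for k, _ in rows.values())
    tz = sum(t for _, t in rows.values())
    return ke, tz


def class_shares(rows):
    """Share of each class in its variety's total, as (KE share, TZ share)."""
    ke_total, tz_total = column_totals(rows)
    return {name: (k / ke_total, t / tz_total) for name, (k, t) in rows.items()}


if __name__ == "__main__":
    print(column_totals(SPOKEN))
    for name, (ke, tz) in class_shares(SPOKEN).items():
        print(f"{name}: KE {ke:.3f}, TZ {tz:.3f}")
```

The computed column totals match the "total" row of the table (1934 and 1006), which also serves as a consistency check on the class rows.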
Clause linkers

ICE-EA written

              NonPrinted      Pr (1)      Pr (2)      Pr (3)      Pr (4)      Pr (5)     Printed     written
                KE    TZ    KE    TZ    KE    TZ    KE    TZ    KE    TZ    KE    TZ    KE    TZ    KE    TZ
link            37   157    95    88    59    66    15    21    20    21    21    17   224   213   261   370
Σm1              0     0     4     7     3     1     1     0     2     3     2     0    13    11    13    11
Σm3              1     7     3     7     4     2     0     0     2     0     3     0    12     9    13    16
Σm6              9     8    14    43    20     7     3     9     9     7     0     0    52    66    61    74
Σm7              0     1     1     2     0     0     0     0     2     2     0     1     3     5     3     6
Σm9             17    27    19    26    18    24    12    10    27    35    12    14    96   109   113   136
modal           27    43    41    85    45    34    16    19    42    47    17    15   176   200   203   243
specific        14    13    17    18     9     8     1     6     4    12     2     4    39    48    53    61
Σv1             26    48    52    58    58    72    34    24    28    47    42    36   220   237   246   285
Σv2              6     8    17    12     7    10     2     3     2     4     6     2    35    31    41    39
Σv3              5    46    28    31    18    26     7    16    17     9     4     0    84    82    89   128
Σv4             28    26    19    10    15    15     3     3    12    10     4     6    54    44    82    70
Σv5             32    72    65    56    68    56    21    15    22    28    12    19   201   174   233   246
Σv6              3    11     5     4     2     4     0     2     0     2     8     3    16    15    19    26
Σv7            136   204    89   115   120   123    80    70    53    90    79    89   428   487   564   691
Σv8             27    29    32    38    42    29    15    17    18    22    13    14   130   120   157   149
Σv9             18    24    23    29    24    19     3     3    28    21     9    15    94    87   112   111
evaluative     281   468   330   353   354   354   165   153   180   233   177   184  1262  1277  1543  1745
total          359   681   483   544   467   462   197   199   246   313   217   220  1701  1738  2060  2419

Pr (1) to Pr (5) are the printed subcategories InfLearned, InfPop, InfReport, Persuasiv and Creative (their order is not recoverable from the source); written = NonPrinted + Printed.

TZ more linkers!!

4. Comparison: Ke and Tz in NordicJ

Hypotheses:
• Complexity ICE-KE > ICE-TZ
• Clause linking by adjuncts ICE-KE > ICE-TZ
including a few surprises

4.1. Complexity of KE/TZ articles in NordicJournal Corpus

ComplexAna Scores                                KE01h     KE02h     TZ01h     TZ02h    CM all     UK01h    mean21
Tokens                                           10279      6000      7882      4449    8355.3      5709      7552
Types                                             8650      5092      6690      3792    7063.6      4780      6372
Max. words per sentence                             92        96        86        69    118.75       128       101
Mean words per sentence                          15.99     21.67     18.23     19.35     21.25     20.43     20.49
nouns                                             3099      1732      1822      1350    2524.4      1958   2262.43
nouns considered                                  1062       616       730       576    969.81      1958    987.76
nouns known to WordNet (%)                       68.64     83.93     86.03     72.22     78.56     70.84     78.36
nouns unknown to WordNet (%)                     31.36     16.07     13.97     27.78     21.44     29.16     21.64
nouns not in frequency list (%)                  64.97     52.11     52.74     63.54     62.07     66.70     60.12
Max. length noun                                    63        77        18        24        27        21     33.17
Mean length noun                                  6.71      7.54      7.07      7.13      7.18      5.64      6.95
commas                                             375       219       437       173    397.06       208    360.28
Max. commas in a sentence                            7         9         8        19    15.688         9     13.99
Max. degree sem. specialization of a noun           14        14        15        15    14.875        14     14.99
Sem. specialization of the text                   8.28      8.45      8.24      8.09      8.26      8.34      8.24
Degree of Semantic Difficulty                    24.13     20.24     19.77     23.31     22.44     23.86     22.16

4.2. Clause Linkers

Conjuncts in the NordicJournal Corpus

ClauseLink          KE01h    KE02h    TZ01h    TZ02h  CMall16    UK01h   mean22
Conjuncts
but                    24        8       66       30       15        6     19.1
although               12        6        4        2        4        1      3.5
while                  48        4        4        6        5        7     13.8
if                     12        2       20        8        7        5     14.2
whether                 8        4        4        2        1        6      4.2
because                 4        4       50       10        7        7     11.8
sum conjuncts         126       58      158       58       43       36     67.8

in order to, since: values as extracted, cell positions not recoverable from the source: 18 30 10 | 2 4.0 5 4 7.6

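Counting the conjuncts listed in the table, including the multi-word item "in order to", can be sketched with whole-word regular-expression matching. This is an illustration of a form-based search in general and not the procedure used for the table; `CONJUNCTS` repeats the items from the table and `count_conjuncts` is a hypothetical helper name.

```python
import re
from collections import Counter

# The conjuncts listed in the table.
CONJUNCTS = ["but", "although", "while", "if", "whether", "because",
             "in order to", "since"]


def count_conjuncts(text, items=CONJUNCTS):
    """Count whole-word, case-insensitive matches of each item.

    Multi-word items may be separated by any whitespace, so an item
    that is broken across a line still counts.
    """
    counts = Counter()
    for item in items:
        pattern = r"\b" + r"\s+".join(re.escape(part) for part in item.split()) + r"\b"
        counts[item] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts
```

A form-based count cannot separate functions of the same form, for example temporal and causal "since", so ambiguous items still need manual inspection of the concordance lines.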
Adjuncts in the NordicJournal Corpus

ClauseLink                 KE01h    KE02h    TZ01h    TZ02h  CMall16    UK01h   mean22
sum adjuncts                 188       90       40       54       41       21     66.7
sum conjuncts+adjuncts       314      148      198      112       84       57    134.5

Adjuncts searched: firstly, secondly, on the one hand, on the other hand, finally, lastly, also, furthermore, however, moreover, similarly, nevertheless, though, yet, anyway, otherwise, accordingly, consequently, therefore, thus

Values for the individual adjuncts, in extraction order (cell positions are not recoverable from the source):
after firstly, secondly: 4 4 4
after on the one hand ... however: 6 60 38 10 68 22 14 16 2 10
after moreover, similarly: 3 1 1 1 1 1 2.4 4.0 14 2 7 7 9 3 2
after nevertheless, though: 4
after yet ... accordingly: 2 2 1.5 1.3 1.0 1.5 4.0 2.0 1 4 2.0 4.7 4 2 2 2 2 2 3.0 2.0 5.8 2.6 2 3 6 4 1 1 2.4 12.9 12 9 6.6
after consequently, therefore, thus: 2 38 22 4 8 8 1 27.0 2.0 11.8 2 8 1 2

5. Outlook

practical:
• expand the data base to achieve significance: NORDICJournal and ICEweb
• compare varieties in ICE, incl. diachronic and text-type variation
• use similar methodology for Specialised and Popular Academic English (SPACE) Corpus

theoretical:
• sequence of ESL author preferences in New Englishes, e.g. cohesion from formal explicit to semantic implicit
• reader adaptation where possible and necessary
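The outlook mentions expanding the data base to achieve significance, without naming a test. One option, shown here only as an assumption about how such a check could look, is the two-term log-likelihood statistic (G2) that is widely used for comparing an item's frequency in two corpora. The example values are the comma counts and word counts reported for the informational-learned texts (KE: 3315 commas in 78797 words; TZ: 3524 in 78759); the function name is illustrative.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-term log-likelihood (G2) for one item's frequency in two corpora."""
    total_freq = freq_a + freq_b
    total_size = size_a + size_b
    # Expected frequencies if the item were equally frequent in both corpora.
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


CRITICAL_P05 = 3.84  # chi-square critical value, 1 degree of freedom, p < 0.05

if __name__ == "__main__":
    g2 = log_likelihood(3315, 78797, 3524, 78759)
    print(f"G2 = {g2:.2f}, above the p < 0.05 threshold: {g2 > CRITICAL_P05}")
```

The same function can be applied to any adverb or conjunct count once the word counts of the two subcorpora being compared are known.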