Stylistics for A

AntConc EMAIL Workshop
Using AntConc to explore a small corpus of emails
In this seminar we will be investigating a small corpus (approx 12000 words) of email messages
(referred to as EMAIL in this handout). We will mainly be doing this by comparing our email corpus
with a corpus of written British English called FLOB.
1) Start up AntConc.
2) We have to set AntConc to ignore tags, which are special codes inside the files. To do this, click
on Global Settings on the menu bar, then click on Tag Settings, and then select Hide Tags, as shown
below (all other options are fine). Then click Apply.
3) We also need to tell AntConc to include numbers (we’ll see why later). To do this, click on
Global Settings, click on Token (Word) Definition, and select Number in the Number Token
Classes box. Then click on Apply.
4) Now we need to tell AntConc which file we’re working with.
• Select File and then Open File(s)…
• A standard file-open box will appear. Navigate to the file that contains the Email corpus we will
be using today, which should be on your PC desktop.
• Click on EMAILtagged.txt, which will highlight it in blue. Then click on Open.
• You will see that EMAILtagged.txt is now listed in the main AntConc window.
Word Frequency Lists
5) Create a word frequency list.
Click on the Word List tab.
Make sure you tick the “Treat all data as lowercase” box (note: this is very important; if you
don’t tick it, AntConc will treat the and THE as if they were two different words). Now click Start.
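To see why the lowercase setting matters, here is a minimal Python sketch of a word frequency count. It is not how AntConc itself works internally; the function name word_list, the sample sentence, and the token pattern (runs of letters and digits, so that numbers count as words) are all assumptions made for illustration.

```python
import re
from collections import Counter

def word_list(text, fold_case=True):
    """Count word tokens; letters and digits both count as token characters."""
    if fold_case:
        text = text.lower()  # the effect of treating all data as lowercase
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return Counter(tokens)

# Hypothetical sample text, not taken from the EMAIL corpus.
sample = "The meeting is at 2 and THE room is booked. See you at the meeting."

folded = word_list(sample)
unfolded = word_list(sample, fold_case=False)

print(folded.most_common(3))
print(unfolded["the"], unfolded["The"], unfolded["THE"])
```

Without case folding, the three spellings of "the" are counted as three separate words, so each one drops down the frequency list.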
TASK>> Use the AntConc Wordlist screen to fill in Table 1 below. Write down the top 20 words
in the EMAIL corpus and their % frequencies in the table provided (we’ve actually done some of
them for you to save time).
The % figure is calculated as follows:
%Frequency = raw frequency ÷ total no. of word tokens x 100
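As a worked example of the formula, the short Python sketch below applies it to the three raw frequencies already filled in at the top of Table 1. The total of 12,730 word tokens is an assumption chosen only because it is consistent with those filled-in percentages; use the figure that AntConc actually shows you.

```python
def percent_frequency(raw_frequency, total_tokens):
    """%Frequency = raw frequency / total number of word tokens x 100."""
    return raw_frequency / total_tokens * 100

# ASSUMPTION: illustrative token total, not the figure reported by AntConc.
ASSUMED_TOTAL_TOKENS = 12730

for word, raw in [("the", 373), ("i", 300), ("to", 291)]:
    print(f"{word}: {percent_frequency(raw, ASSUMED_TOTAL_TOKENS):.2f}%")
```

With this assumed total, the three values round to the 2.93, 2.36 and 2.29 given in the table.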
The AntConc display tells you the total number of word tokens in the file: look at the info bar
above the word list.
Table 1: TOP 20 words in EMAIL and FLOB

        EMAIL                   FLOB
Rank    Word    Raw    %        Word    Raw      %
1       the     373    2.93     the     64470    6.28
2       i       300    2.36     of      33951    3.31
3       to      291    2.29     and     27159    2.64
4       and                     to      26940    2.62
5       a                       a       22973    2.24
6       you                     in      20737    2.02
7       it                      that    10748    1.05
8       of                      is      10378    1.01
9       in                      was     10135    0.99
10      s                       it      9626     0.94
11      on                      for     9307     0.91
12      that                    he      7961     0.78
13      for                     s       7621     0.74
14      have                    as      7470     0.73
15      we                      with    7099     0.69
16      do                      on      7089     0.69
17      n                       i       7011     0.68
18      t                       be      6497     0.63
19      is                      his     5794     0.56
20      not                     by      5392     0.53

EMAIL values already worked out for rows below rank 3 (match each to its word using your AntConc
word list): Raw 148, 130, 130, 129, 111, 102; % 1.16, 1.09, 1.02, 1.02, 1.01, 0.87, 0.80.
QUESTION>> Why have we gone to the trouble of working out % frequencies?
TASK>> The table also contains the top 20 words from the FLOB corpus of written English.
Compare the words and frequencies for the two corpora and make a note of any
differences/similarities.
Do your results allow you to say anything about the language of emails?
Concordances
6) You may have noticed that the EMAIL word list contains the letters s, n, and t. In order to
understand why a particular item appears in a wordlist, it’s often useful to investigate using
concordances.
A concordance is a list of all of the occurrences of a certain word, in the context of the sentence(s) that
word occurs in. You can look at a concordance list simply by clicking on any word in the word
frequency list (your cursor should change to a ‘pointing-finger’ icon when you move it over a word).
Alternatively, you can click on the Concordance tab to take you to the concordance screen.
Then type the word you want in the box labelled Search Term, and click Start.
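The idea of a concordance can be sketched in a few lines of Python. This is only an illustration of the concept, not AntConc's own method; the function name concordance, the context width, and the sample text are assumptions.

```python
import re

def concordance(text, term, width=25):
    """Return one line per hit of `term`, with `width` characters of context on each side."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the search term lines up in a column.
        lines.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return lines

# Hypothetical sample text, not taken from the EMAIL corpus.
sample = "I have to go now. We have a meeting at 2. Do you have time?"

for line in concordance(sample, "have"):
    print(line)
```

Each printed line shows one occurrence of the search term in square brackets with its surrounding context, which is the same layout you see on the concordance screen.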
TASK>> Even though you may be able to guess, use concordances to find the meaning of s, n and
t in the EMAIL corpus. Are there any issues regarding the s item in the wordlist?
7) Further investigation of have.
The % frequency and ranking of have are higher in EMAIL than they are in FLOB. This difference in
frequencies could mean that ‘have’ is an interesting word to investigate in EMAIL, and we’re going to
do this using the sort and cluster functions.
From the wordlist click on have to get a concordance. We can sort the concordance into alphabetical
order of what comes before or after have. The control for this is at the bottom of the window. You
can use the up and down buttons to change the basis of the concordance sort. 1R means one word to
the right of have, 1L means one word to the left of have, and so on. For now, we want the 1R sort,
so select 1R and then press Sort.
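The 1R and 1L sorts can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not AntConc's implementation: the helper names kwic and sort_hits, the context span of four words, and the sample tokens are all hypothetical.

```python
def kwic(tokens, term, span=4):
    """Collect (left context, hit, right context) for every occurrence of `term`."""
    hits = []
    for i, token in enumerate(tokens):
        if token == term:
            hits.append((tokens[max(0, i - span):i], token, tokens[i + 1:i + 1 + span]))
    return hits

def sort_hits(hits, position="1R"):
    """Sort hits alphabetically by the word at e.g. 1R (first right) or 1L (first left)."""
    offset = int(position[0]) - 1
    def key(hit):
        left, _, right = hit
        # Reverse the left context so index 0 is the word nearest the hit.
        side = right if position.endswith("R") else left[::-1]
        return side[offset] if offset < len(side) else ""
    return sorted(hits, key=key)

# Hypothetical sample tokens, not taken from the EMAIL corpus.
tokens = "i have to go now we have a meeting soon do you have time".split()
hits = kwic(tokens, "have")

for left, hit, right in sort_hits(hits, "1R"):
    print(" ".join(left), f"[{hit}]", " ".join(right))
```

Sorting on 1R groups together the lines where the same word follows have, which is what makes recurring patterns easier to spot.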
TASK>> This sorted format should help you to notice more easily how have is used in EMAIL.
Make a note of any patterns you observe. Are there any patterns in the word-class that follows have?
It is also sometimes useful to look at clusters.
Click on the Clusters tab. Set the cluster max and min size to 10 and 2 respectively (as shown to
the right), and set the min cluster frequency to 3.
Click on Start.
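The cluster settings above can be illustrated with a small Python sketch that counts word sequences of 2 to 10 words and keeps those occurring at least 3 times. It assumes the search word is the first word of each cluster, which may not match the setting in your copy of AntConc; the function name clusters and the sample tokens are also hypothetical.

```python
from collections import Counter

def clusters(tokens, term, min_size=2, max_size=10, min_freq=3):
    """Count word sequences that start with `term` and keep the frequent ones."""
    counts = Counter()
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            # ASSUMPTION: the search word sits at the left edge of the cluster.
            if tokens[start] == term:
                counts[" ".join(tokens[start:start + size])] += 1
    return [(cluster, n) for cluster, n in counts.most_common() if n >= min_freq]

# Hypothetical sample tokens, not taken from the EMAIL corpus.
tokens = "i have to go i have to leave we have to talk i have a plan".split()

print(clusters(tokens, "have"))
```

Only sequences that reach the minimum frequency survive, so longer clusters tend to drop out unless a phrase is genuinely repeated.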
TASK>> Make a note of any patterns of usage of have that you notice when looking at clusters.
Keywords
9) Now we’re going to compare the wordlists for EMAIL and FLOB using the Keyword List function.
This allows us to find out which words appear more or less often in the Email corpus than would be
expected by chance alone when compared against the reference corpus (FLOB). These words are
called “keywords”.
Click on Keyword List to access the comparison tool.
We have to tell AntConc what we want to compare EMAIL with. This is called setting the reference
corpus: the “reference corpus” is whatever our standard of comparison is. In this case our
standard of comparison is FLOB – a corpus of British written English. We set this up using Tool
Preferences. Click on Tool Preferences on the menu bar, then click on Keyword List (see the
diagram below).
[Diagram: Keyword List preferences, with these callouts: make sure this box is ticked; click here to
select your reference corpus; the files you’ve selected appear here; click here to access the
Keyword List preferences.]
First, click the “treat all data as lowercase” box. This is very important!
Then click Choose Files – this takes you to a file-open box just like the one you use to load EMAIL,
only this time we are loading a reference corpus. Select flob.txt and press Open, and the file’s name
will appear as shown. Then click Apply.
Now, press Start and AntConc will generate a list of words which are statistically more common in
the EMAIL corpus than in FLOB. (This might take a few seconds.)
The keyword lists should look something like this:
The keywords are sorted by their keyness, a statistical measure of how unlikely it is that a word’s
over-representation in EMAIL is down to chance alone. The higher the keyness score, the more
confident we can be that the keyword is a characteristic of the data rather than a chance occurrence.
A text might have hundreds of keywords, but often the twenty or thirty words with the highest
keyness are the most useful.
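To give a feel for how a keyness score can be calculated, here is a minimal Python sketch using the log-likelihood statistic, which is one commonly used keyness measure. This handout does not say which statistic your copy of AntConc is set to use, so treat the choice as an assumption. The corpus totals are also assumptions: rough back-calculations from the percentages in Table 1, not figures reported by AntConc.

```python
import math

def log_likelihood(freq_study, total_study, freq_ref, total_ref):
    """Log-likelihood keyness of one word: study corpus versus reference corpus."""
    combined = freq_study + freq_ref
    total = total_study + total_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = total_study * combined / total
    expected_ref = total_ref * combined / total
    score = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:
            score += observed * math.log(observed / expected)
    return 2 * score

# ASSUMPTIONS: illustrative corpus sizes, back-calculated from Table 1.
EMAIL_TOKENS = 12730
FLOB_TOKENS = 1026592

# "i" occurs 300 times in EMAIL and 7011 times in FLOB (raw figures from Table 1).
print(round(log_likelihood(300, EMAIL_TOKENS, 7011, FLOB_TOKENS), 2))
```

A word with the same relative frequency in both corpora scores zero, and the score grows as the two relative frequencies move apart. Note that the score on its own does not say in which corpus the word is more common, so you still need to compare the % frequencies.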
10) The next stage in keyword analysis is to try to understand why the words in the list are key.
Remember, keywords are just words that appear (statistically) more often in one text or corpus than
they do in another (comparison) text or corpus. The statistical test that calculates the keyness merely
indicates that the difference is unlikely to be a fluke. It is up to you as analysts to work out whether
that statistical keyness equates to linguistic salience, i.e. are the keywords telling you anything
interesting about the EMAIL data they come from? We therefore need to look at how the words are
used in more context. Keywords, then, can be seen as a list of possibilities for further, more focused
analysis of a text or corpus.
11) Further investigation of soon.
TASK>> Use the sort and cluster functions (described in the have example above) to find the
patterns of usage of soon in EMAIL (tip: try sorting one word to the left of soon).
12) Further investigation of 2
Return to the Keyword list by clicking on the Keyword List tab.
Next, click on 2 to get a concordance (you might have to scroll down a bit to find it).
QUESTION>> How is 2 used in EMAIL? (You’ll probably have guessed the answer to this.)
Now click on Concordance Plot to see the dispersion of 2 across EMAIL.
QUESTION>> What does this tell us about the usage of 2 in EMAIL?
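Dispersion simply means where in the file the hits fall. The Python sketch below marks the position of each hit along a line of text standing in for the whole file; it is only an illustration of the idea, and the function names hit_positions and plot_line, the line width, and the sample tokens are assumptions.

```python
def hit_positions(tokens, term):
    """Return the token index of every occurrence of `term`."""
    return [i for i, token in enumerate(tokens) if token == term]

def plot_line(positions, total_tokens, width=40):
    """Draw the file as a row of dashes, with a bar wherever a hit falls."""
    cells = ["-"] * width
    for position in positions:
        cells[position * width // total_tokens] = "|"
    return "".join(cells)

# Hypothetical sample tokens, not taken from the EMAIL corpus.
tokens = "see you at 2 then lunch at 2 or later".split()
positions = hit_positions(tokens, "2")

print(plot_line(positions, len(tokens), width=10))
```

Bars bunched together in one part of the line would suggest that the word is concentrated in a small part of the file rather than spread evenly through it.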
13) TASK>> Using concordance lines and any of the other tools in AntConc, investigate any of the
other keywords from your list.
QUESTION>> Does your investigation reveal anything about the language use in EMAIL that
you didn’t find when you compared the wordlists?
14) General discussion points:
• What are the problems with drawing conclusions about emails in general from our findings
today?
• Is this a representative corpus of emails?
• Is FLOB a good choice of comparison corpus?
• What could we do to make a future research project into emails better?
• Did you find through the course of this seminar any other problems with this kind of
investigation?
– Notes –