word lists

ENG 626
CORPUS APPROACHES TO LANGUAGE STUDIES
exploring frequencies in texts
Bambang Kaswanti Purwo
Adolphs, Svenja (2006) Ch. 3
techniques and practices in data analysis
▪ quantitative exploration of texts and text collections
⇒ different types of wordlists
how the wordlists can be used for contrastive studies of
different texts
role of frequency information
in relation to
characterization of the whole texts or collections of texts
▪ generating hypotheses
frequency lists to inform the
generation of hypotheses
and research questions
▪ testing hypotheses
electronic text analysis to
test existing hypotheses
in any area that deals with
the use of language
▪ facilitating manual processes not necessarily motivated
by a particular research question
from “manual” to “automated”
e.g. extraction of frequency info
some of the software resources to facilitate the research process
▪ software packages
to facilitate the manipulation and analysis of electronic texts
▫ the generation of frequency counts
▫ comparisons of frequency information in different texts
▫ different formats of concordance outputs
[including Key Word In Context (KWIC)]
» [free of charge via internet]
◊ The Compleat Lexical Tutor (Tom Cobb)
◊ VIEW: Variation in English Words and Phrases (Mark Davies)
» [commercial]
◊ WordSmith Tools (Mike Scott)
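A Key Word In Context (KWIC) display can be sketched in a few lines. This is only an illustration of the output format, not how any of the packages above work internally; the function name `kwic` and the `width` parameter are assumptions made for this example.

```python
import re

def kwic(text, node, width=30):
    """Return one KWIC line per occurrence of the node word in the text."""
    lines = []
    # Match the node as a whole word, ignoring case.
    pattern = r"\b" + re.escape(node) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the node words line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

sample = ("This chapter moves from the discussion of design and development "
          "of electronic text resources to techniques and practices in data analysis.")

for line in kwic(sample, "and", width=25):
    print(line)
```

Each occurrence of the node word appears on its own line, with a fixed window of context on either side.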
basic information about the text
most software packages
▪ allow textual data to be sorted into concordance outputs
▪ produce some basic information about the text or collection
of texts
▫ average sentence length
▫ word length
▫ number of paragraphs
▫ number of individual running words (tokens)
▫ number of different words (types)
▫ number of lexical items and number of grammatical
items (in tagged corpora)
» type-token ratio
some of the info can be expressed in terms of ratios:
ratio between grammatical and lexical items in the text
(lexical density)
the type-token ratio
▪ to gain some basic understanding of the lexical variation
within the text
tokens: the number of running words in a text
types: the number of different words
This chapter moves from the discussion of design and
development of electronic text resources to techniques
and practices in data analysis.
How many tokens? 21
How many types? 19
The type-token ratio: divide number of tokens by number of types
21/19 = 1.11
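The count above can be reproduced with a short sketch. The tokenizer is a deliberately simple assumption (lowercase, letters and apostrophes only), and the ratio follows the definition used on this slide (tokens divided by types); many tools report the inverse, types divided by tokens, so it is worth checking which convention a package uses.

```python
import re

def tokenize(text):
    # Simplifying assumption: lowercase the text and keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())

sentence = ("This chapter moves from the discussion of design and "
            "development of electronic text resources to techniques "
            "and practices in data analysis.")

tokens = tokenize(sentence)   # running words
types = set(tokens)           # different words
ratio = len(tokens) / len(types)  # slide's convention: tokens / types

print(len(tokens), len(types), round(ratio, 2))  # 21 19 1.11
```

The two repeated words (of, and) account for the difference between 21 tokens and 19 types.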
What is it for?
→ to assess the level of complexity of a particular text or text
collection (e.g. comparisons between documents for different
types of audiences)
the higher the type-token ratio the less varied the text
watch out:
the overall size of the text(s) on which the ratio is based
→ compare type-token ratios of text(s) of similar length
⇒ textual complexity
▪ sentence and word length
▪ linguistic analysis of grammatical structure
▪ semantic fields of the individual items
» word lists
● single words
frequency of a word or phrase in different text types
is important for the description of the context of use
(e.g. for English language teaching)
▪ various word lists exist in the ELT context
e.g. Academic Word List (Coxhead 2000)
▪ spoken vs. written discourse
▪ American vs. British English
word list
▪ frequency order
▪ alphabetical order
▪ lemmatized format
▪ grammatical tags
▪ other analytical tags
word list to account for
▪ individual items
▪ recurrent sequences of
two or more items
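A minimal sketch of a word list for individual items, shown in frequency order and in alphabetical order. The sample text is made up for illustration, and the tokenizer is the same simplifying assumption as before (lowercase, letters and apostrophes only).

```python
import re
from collections import Counter

def word_list(text):
    """Count how often each word form occurs in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Invented sample text, not taken from any corpus.
text = "I know what you mean you know I think you know"
counts = word_list(text)

# Frequency order: highest count first, ties broken alphabetically.
by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
# Alphabetical order.
alphabetical = sorted(counts.items())

print(by_frequency)
print(alphabetical)
```

The same counts feed both views; only the sort key changes.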
lemmatized frequency lists
group together words from the same lemma
(all grammatical inflections of a word:
e.g. say, said, saying, says)
▪ often variations of meaning between different variants of
the lemma (Stubbs 1996, Tognini-Bonelli 2001)
▪ [ELT] beneficial to teach all forms of one lemma together
and give priority to the most frequently used form
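Grouping word forms under a lemma can be sketched as follows. The lookup table `LEMMA_OF` is hypothetical and covers only the say/said/saying/says example; a real analysis would rely on a lemmatizer or a tagged corpus. The token sequence is invented.

```python
from collections import Counter, defaultdict

# Hypothetical lemma lookup covering only the example on this slide.
LEMMA_OF = {"say": "say", "said": "say", "saying": "say", "says": "say"}

def lemmatized_list(tokens, lemma_of=LEMMA_OF):
    """Group word-form counts under their lemma."""
    groups = defaultdict(Counter)
    for tok in tokens:
        lemma = lemma_of.get(tok, tok)  # unknown forms count as their own lemma
        groups[lemma][tok] += 1
    return groups

# Invented token sequence for illustration.
tokens = "she said he says they said you say i said".split()
groups = lemmatized_list(tokens)

say_forms = groups["say"]
total = sum(say_forms.values())                       # frequency of the lemma
most_frequent_form = say_forms.most_common(1)[0][0]   # form to prioritize

print(total, most_frequent_form, dict(say_forms))
```

Keeping the per-form counts inside each lemma group makes it possible to see which form is the most frequent one.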
Table 3.1: basic information from a frequency list
ten most frequent items in the
▪ spoken CANCODE corpus
▪ written component (BNC)
some of the key differences between the two discourse
modes are highlighted:
▪ both contain mainly grammatical items
▪ the spoken corpus includes the personal pronouns I and you
(interactive nature of the spoken discourse)
▪ Yeah – listener response tokens in conversation
● recurrent continuous sequences
other terms: “lexical bundles” (Biber et al. 1999)
“clusters” (Scott 1996)
corpus research:
a large proportion of language is phrasal in nature
(observable tendency for particular items to co-occur
in a non-random fashion)
collocation: attraction between two words (Ch. 4)
[overall length to be determined at the outset; e.g. WordSmith Tools]
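Extracting recurrent continuous sequences of a fixed length can be sketched as below. The sequence length `n` is set at the outset, as noted above; the function name, the `min_freq` threshold, and the sample tokens are assumptions made for this example.

```python
from collections import Counter

def recurrent_sequences(tokens, n, min_freq=2):
    """Return n-word continuous sequences that occur at least min_freq times."""
    # Slide a window of n tokens across the text.
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(g) for g in grams)
    return [(seq, c) for seq, c in counts.most_common() if c >= min_freq]

# Invented sample, not corpus data.
tokens = "you know what i mean i think you know what i mean".split()

print(recurrent_sequences(tokens, 2))
print(recurrent_sequences(tokens, 4))
```

Sequences that occur only once are dropped, so only recurrent sequences remain.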
Table 3.2: ten most frequent two-word, three-word, and four-word recurrent sequences in the CANCODE corpus
most of the sequences are concerned with
▪ the management of discourse
▪ the deictics: you and I
▪ attempt to establish mutual understanding:
know what I mean, I know, I think, do you think, etc.
● comparing frequencies in text collections of different sizes
How to compare the frequencies of individual items in two
corpora of different sizes?
▪ represent them as a percentage of the overall number of
words in the respective corpora
▪ use a norming technique of frequency counts
▫ divide the raw frequency of individual items by the total
number of words in a text
▫ we need to decide on an appropriate number of words
which forms the basis of the norm
▫ multiply the results by this figure
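The norming steps can be sketched as a single function. The corpus sizes are the 35,000-word and five-million-word figures that appear later in these slides; the raw counts and the norming base of one million words are invented for illustration.

```python
def normalized_frequency(raw_count, corpus_size, base=1_000_000):
    """Frequency per `base` words: (raw_count / corpus_size) * base."""
    # Multiplying before dividing is algebraically the same and
    # avoids small floating-point rounding errors.
    return raw_count * base / corpus_size

# Hypothetical raw counts of one item in two corpora of different sizes.
small = normalized_frequency(350, 35_000)        # 35,000-word corpus
large = normalized_frequency(30_000, 5_000_000)  # five-million-word corpus

print(small, large)  # 10000.0 6000.0
```

Although the raw count is far higher in the larger corpus, the normed figures show that the item is relatively more frequent in the smaller one.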
» keywords
◊ keywords = items that occur
▪ either with a significantly higher frequency (positive keywords)
▪ or with a significantly lower frequency (negative keywords)
in a text or collection of texts when compared to a larger
reference corpus (Scott 1997)
◊ keywords are identified on the basis of
▪ statistical comparisons of word frequency lists derived from
the target corpus and the reference corpus
▪ [via a chi-square or a log-likelihood analysis] each item in the
target corpus is compared to its equivalent in the reference
corpus and the statistical significance of the difference is calculated
→ to generate words that are characteristic or uncharacteristic
of a given target corpus
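A log-likelihood keyness calculation can be sketched as follows. The slides do not give the formula, so this uses the two-term form commonly applied to word frequency comparison; the function names, the sample counts, and the cut-off of 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom) are assumptions, not the settings of any particular package.

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Two-term log-likelihood for one item in a target and a reference corpus."""
    total = size_target + size_ref
    combined = freq_target + freq_ref
    # Expected frequencies if the item were equally likely in both corpora.
    expected_target = size_target * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

def keyness(freq_target, size_target, freq_ref, size_ref, critical=3.84):
    """Label an item as a positive keyword, negative keyword, or not key."""
    ll = log_likelihood(freq_target, size_target, freq_ref, size_ref)
    if ll < critical:
        return "not key", ll
    higher_in_target = freq_target / size_target > freq_ref / size_ref
    return ("positive" if higher_in_target else "negative"), ll

# Invented counts; corpus sizes of 35,000 and 5,000,000 words.
print(keyness(50, 35_000, 100, 5_000_000))
print(keyness(1, 35_000, 5_000, 5_000_000))
```

The score measures how far the observed frequencies depart from what equal usage would predict; the direction of the difference then separates positive from negative keywords.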
● single keywords
◊ on the basis of a 35,000-word corpus:
the spoken language of health professionals
◊ five-million-word CANCODE corpus of general spoken English
a study of telephone calls made to the British advice helpline
provided by the National Health Service (NHS Direct)
→ the data from the medical consultations was recorded
▪ most frequent items in both corpora are grammatical items
▪ distribution of personal pronouns
Health Service “other-oriented”: you most frequent
▪ the reverse frequency order of you and I
▪ right in Health Service, yeah in CANCODE
both are listener response tokens
▫ right signals a more transactional nature
▫ yeah signals an interactional nature (encourages the speaker to continue with
the turn)
▪ comparison of frequency lists can help in the
characterization of different spoken genres
▪ keyword analysis (below), based on a log-likelihood
calculation, is better suited to highlight the main elements
that are characteristic of a particular text or collection
of texts
Table 3.4 shows the top 10 positive keywords
the list gives a better idea of the content of the texts
in the HP corpus
▪ reference to medication (antibiotics)
▪ ailments (diarrhoea)
▪ the nature of the discourse (information)
▪ the mode of the discourse (call)
▪ the medical context (NHS, Direct)
the keywords that mark listener response
in an advice-giving setting (ok, okay)
patient-oriented nature (you, your)
Table 3.5 confirms the result of the analysis of
positive keywords
▪ the discourse in the HP corpus is oriented towards the hearer
who phones in with a health problem → you, your
third person pronouns – negative keywords (low in HP corpus)
▪ past tense verb “was” is also a negative keyword
HP reports current medical concerns in the present tense
▪ laughter ([laughs]) significantly more frequent in CANCODE
→ HP: relatively serious nature of medical consultation
● key sequences
analysis of keywords can be extended to include
extended recurrent sequences
Table 3.6 key sequences provides us with even stronger
evidence of the particular domain of HP discourse
▪ quite a few of the recurrent sequences are part of an “automated response”
marking the beginning of telephone interaction with NHS Direct
▪ other sequences relate to the gathering of basic information
about the caller
▪ the most significant negative key sequence in the HP corpus: I don’t know
(professionals providing knowledge and advice)