Introduction to CLAN analyses - Humanities Lab, Lund University

Victoria Johansson
Språk- och litteraturcentrum, Lingvistik
[email protected]; [email protected]
17 March 2013
Contents
1 Introduction
2 CHILDES-environment
2.1 CHILDES
2.2 CHAT
2.3 CLAN
2.4 Download CLAN
3 Transcriptions
3.1 Example of a cha-file
3.2 Small guide to the transcription symbols
4 Start CLAN and settings
4.1 Your own account
4.2 Start CLAN
4.3 Settings in CLAN
5 Look at a transcript
6 FREQ-analyses
6.1 Flags
6.2 Newer versions of CLAN
6.3 Wild cards
6.4 Summary of the CLAN commands
6.5 Search for specific word
6.6 Search for specific words
7 KWAL-analyses
8 MLU-analyses
9 CHIP-analyses
10 COMBO-analyses
11 Lexical diversity using VOCD
12 Lexical density
12.1 Using COMBO to tag
1 Introduction
This is a short introduction to CLAN analysis, based on earlier guidelines in Swedish that were developed
and adapted for various courses given at the Humanities Lab, Lund University.
The aim of this guideline is to introduce some of the most common and useful analyses that can
be performed using CLAN.
The introduction works with the corpus Lara, which is accessible through the WEB data in CLAN.
Open CLAN, go to the menu “Window” and choose “WEB data”. In the window that opens, choose
the directory “Eng-UK”, and then “Lara”. In order to work with the files as is done in this guideline,
the files must be downloaded (I have kept all the file names, and just added “lara.” in front of them).
2 CHILDES-environment
2.1 CHILDES
CHILDES is an acronym for the Child Language Data Exchange System. This is a network of
mainly child language researchers from all over the world who use the transcription standard
CHAT and the analysis tools CLAN.
CHILDES researchers often share their corpora on child language development in first and
second languages (from all over the world), as well as bilingual and clinical data. On the CHILDES
webpage, you will also find information on different methods of transcribing and coding.
The program and the tools are free to use, and work on several platforms (Mac, PC, Unix). The
program is continuously updated.
2.2 CHAT
The special transcription standard is called Codes for the Human Analysis of Transcripts, CHAT.
This manual will tell you how to transcribe a simple file according to the CHAT standard.
You can also run some analyses on plain text files that do not follow this standard, for
instance if you want to count word frequencies.
2.3 CLAN
When you want to run analyses on your transcription, you will use the programs, or the program
package, called CLAN (Computerized Language ANalysis). The programs will mainly help you
to perform various forms of frequency counts: number of words, morphemes, utterances, t-units,
combinations and co-occurrences of words and phrases, and lexical diversity.
2.4 Download CLAN
The program can be downloaded from the CHILDES homepage:
http://childes.psy.cmu.edu/
In addition to this guideline, you will be helped by the manuals about CHAT (the transcription
standard) and CLAN (the program package for analysing the CHAT transcriptions), which are
available for free at the same homepage.
3 Transcriptions
The transcriptions in this exercise follow the so-called CHAT format. It is possible to perform certain
analyses on files that do not follow this format, but we will not deal with that in this
guideline, and the more powerful analyses can only be done with transcripts in CHAT format.
Files following the CHAT format end with .cha, and are often called chat-files.
These exercises are based upon chat-files that you will find in the exercise material at Språk- och
litteraturcentrum:
Gu-material/Humanistlaboratoriet/CreatingalinguisticcorpusVT13/CLAN/Laradata
If you do this lab exercise somewhere else, you can download the files for this exercise like
this:
1. Open the CLAN-program.
2. Go to the Output-window.
3. Choose the menu “Window”.
4. Then choose “WEB data”
5. A window will open. Choose “Eng-UK” here.
6. Then choose the folder “Lara”
7. Click on the files here to download them (you may have to repeat the procedure for each file
individually)
8. When you see a file on the screen, then choose File and Save as.
9. Save the files to a directory on your computer (preferably, choose the same names as the files
have above).
10. When you define the button “Working” below, you will have to set this to the directory where
you have saved these files.
3.1 Example of a cha-file
@Begin
@Languages: eng
@Participants: CHI Lara Child , MOT Mother
@ID: eng|Lara|CHI|||||Child|||
@ID: eng|Lara|MOT|||||Mother|||
@Date: 09-AUG-1997
@Location: Nottingham , England
@Comment: Filename Lara.3-02-24.45
@Situation: playing with mum and Amy
@Comment: time morning
@Comment: duration 45 minutes
@Comment: transcribed by Sarah Fletcher and checked by Caroline Rowland
*MOT: we’re playing houses with those two .
*MOT: are they going in there ?
*MOT: don’t they want a duvet ?
*CHI: what ?
*MOT: don’t they want a duvet ?
*CHI: oh yeah .
*MOT: do you want a <bobble in> [>] ?
*CHI: <this is a blanket> [<] duvet .
*MOT: okay .
*MOT: do you want a bobble in ?
*CHI: but they +//. [+ IN]
*CHI: no .
*MOT: <is it> [//] is it a bit hot with your hair like that ?
*CHI: no .
*MOT: do you want a bobble in like mummy ?
*CHI: no .
*CHI: well [/] well .
*CHI: this a picnic blanket and I will let they [*] have it .
%err: they = them
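The layout of the example above (header lines starting with @, speaker lines starting with *, and dependent tiers starting with %) can be sketched in a few lines of Python. This is only an illustration of the file structure, not part of CLAN: the function name parse_chat is made up for this sketch, and it ignores details such as utterances that continue over several lines.

```python
def parse_chat(text):
    """Split CHAT text into header lines and utterances with their dependent tiers."""
    headers, utterances = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            # Header line, e.g. @Begin or @Participants
            headers.append(line)
        elif line.startswith("*"):
            # Speaker (main) tier: *CODE: utterance
            speaker, _, content = line[1:].partition(":")
            utterances.append({"speaker": speaker, "text": content.strip(), "tiers": {}})
        elif line.startswith("%") and utterances:
            # Dependent tier: belongs to the preceding speaker tier
            name, _, content = line[1:].partition(":")
            utterances[-1]["tiers"][name] = content.strip()
    return headers, utterances


sample = """@Begin
@Participants: CHI Lara Child , MOT Mother
*MOT: are they going in there ?
*CHI: this a picnic blanket and I will let they [*] have it .
%err: they = them"""

headers, utterances = parse_chat(sample)
for u in utterances:
    print(u["speaker"], "|", u["text"], "|", u["tiers"])
```

Note how the %err tier ends up attached to the child's utterance directly above it, which is the relation between a dependent tier and its speaker tier.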
3.2 Small guide to the transcription symbols
Table 1: Key to the CHAT transcriptions
*               Every speaker line is introduced by a star, followed by a three-letter code
                indicating the speaker. The code is unique for each speaker in the
                transcript, but in this project we used the code *SBJ for all the Subjects,
                and *INV for the Investigator. This facilitated later analysis.
                (In the transcription examples the star (*) in front of the three-letter code has been
                excluded.)
%               Starts a dependent tier, containing comments or coding relating to the preceding
                speaker tier, e.g. %ces, indicating a center-embedded clause on the previous line.
#               Pause of unspecified length.
[]              Square brackets denote a clarification of some kind. In our case we have
                mainly used them for correcting misspelled words in writing and to translate (for
                comparative reasons) a spoken word form into its written
                equivalent. This means that the form that is generally included in the
                analysis is the form within square brackets. Example: ja [: jag] (‘I’).
[?]             Denotes that the transcriber is uncertain of the previous word or utterance.
                If the uncertainty covers more than one word, all the words are enclosed in
                angle brackets. Example: <hunden som> [?] (‘the dog that’).
[!]             Denotes that the previous word(s) are emphasized. Example: jag såg henne [!] (‘I saw
                her [!]’), where ‘henne’ is emphasized.
+               Denotes a word boundary, e.g. in compounds. For more information, see
                Section ?? about the transcription of words.
?               A question mark at the end of an utterance indicates question intonation.
!               An exclamation mark at the end of an utterance indicates a stressed utterance,
                an imperative clause or an interjection.
xxx             Unintelligible speech.
<word(s)>[/]    Retracing without correction. Example: han [/] han säger (‘he [/] he says’)
                and <han som> [/] han som är där (‘<he who> [/] he who is there’).
<word(s)>[//]   Retracing with correction. Example: han [//] hon säger
                (‘he [//] she says’).
<word(s)>[///]  Retracing with reformulation.
                Example: <han säger att> [///] jag tycker att man kan säga att
                (‘<he says that> [///] I think you can say that’).
+...            Trailing off.
4 Start CLAN and settings
First, some practicalities.
4.1 Your own account
When you use CLAN you will sometimes have to create so-called output files, containing the results of the
analysis. These can only be saved in “your place” on the server (this is the only place
where the computers in the university’s PC room will let you save files). This is how to find “your place”:
1. Go to the Start menu (bottom left of the screen).
2. Choose “Den här datorn” (This computer).
3. Choose the drive marked (H:). (Its name contains your login on the computer,
which is different for every user.) It is probably called something like “Studentserver (student.ht.lu.se)(H))”. This is where your output files will go. You can also save everything else
here.
4.2 Start CLAN
Start CLAN like this (valid for the PC room in the Humanities Lab):
1. Go to the START menu (bottom left of the screen).
2. Choose “Alla program”.
3. Choose “SOL” in the list. If “SOL” is missing you can go directly to the next step.
4. Choose “CLAN” in the list.
5. When the program opens, two windows will be visible. The smaller one is the so-called Commands
window, and the bigger one is the so-called output window. If the Commands window doesn’t
open, you can go to the menu “Window” (found at the top of the output window) and choose
“Commands”, or press the shortcut Ctrl+D.
If CLAN is not found this way you can also choose START/Den här datorn/C-drive/CHILDES/CLAN.
4.3 Settings in CLAN
When you have started CLAN you have to set up the Commands window so that you work with the
right files. Check this every now and then, especially when you start working with a new set
of transcriptions.
In the Commands window there are four buttons:
• Working
• Output
• Lib
• Mor lib
Check that the settings are as described below, and correct them if necessary:
Working The Working button tells the program which files you want to work with. We will start by
working with CHAT files from a longitudinal corpus of one girl, Lara.
1. Click the button “Working”. A window opens.
2. At the bottom of the window you first have to choose Drives. Choose the one called X:\\staff.ht.lu.se\GU-material
(It can be difficult to see the whole name).
3. In the window you now have to find the folder “Humanistlaboratoriet” and click on it (you may
have to scroll down to find it).
4. Then click on “CreatingalinguisticCorpusVT13”
5. Click on “CLAN”.
6. Click on “Laradata”
7. Click on the button “Select Directory” to select this directory.
Output The Output button will tell the program where output files should be saved. These should
go to your place on the server.
1. Click on the button “Output”. A window opens.
2. Choose “Drives” in the bottom of the window. Choose the one called h:\\student\stud-xxx .
3. Click on the button “Select Directory”
Lib and Mor lib The buttons Lib and Mor lib tell the program in which library it should look
for information (e.g. specific language files). You very rarely have to change this. The preset value
is probably C:\CHILDES\CLAN\lib. If this is not the case, do as follows (once for each
button):
1. Click on the button “Lib”/“Mor lib”. A window opens.
2. First choose “Drives” at the bottom of the window. Choose the drive called C:\\
3. Then, above, choose the folder called “CHILDES”.
4. Click on “CLAN”.
5. Click on “lib”.
6. Click on the button “Select Directory”
5 Look at a transcript
Start by opening a transcript.
1. Go to the START menu in the bottom left of the screen.
2. Choose “Den här datorn” (This computer)
3. Choose the volume called “GU-material på staff.ht.lu.se”
4. Choose “Humanistlaboratoriet”
5. Choose “CreatingaLinguisticCorpusVT13”
6. Choose “CLAN”
7. Choose “Laradata”
8. Open the transcript lara.3-02-24.45.cha
9. Look at the transcript, and leave it open while you move on with the analyses below.
6 FREQ-analyses
The FREQ program will give you information on the most frequent words. You can choose between
using the program on all words in a file, or just one word, or a certain selection of words. You can also
choose between doing the analysis on one or more files, or to perform the analysis on only one speaker.
1. Write the following command in the Commands window. Be careful to use spaces, capitals and
other signs correctly! When you’re done, click the button “Run” in the Commands window.
freq lara.3-02-24.45.cha
This command tells the computer the following:
freq                   means that you want to do a frequency analysis.
lara.3-02-24.45.cha    is a file name; by writing it you indicate that this is the file
                       that you would like to analyse (in this case: to find out which words
                       are used in the file).
Since you don’t give any other specifications or restrictions, the program will count
all the words in all the utterances in the file (i.e. everything that Lara (the child)
says, as well as what the adults in the file say).
2. Now we will write a command where we only study Lara’s utterances. This time we will look
at all files with Lara’s transcriptions. Write the following in the commands window and click
“Run”.
freq +u +t*chi lara*.cha > freqLara.txt
+u              (‘unify’) means that the results from all the files you include in the
                calculation will be summarized.
+t*chi          means that you only want to count Lara’s utterances,
                i.e. only the tiers starting with *chi. If you want to exclude all Lara’s
                utterances, but include all the rest, you should instead write -t*chi.
lara*.cha       means that you include all the files that have a file name starting with
                “lara” and ending with “.cha”. The star (*) is called the wild card and
                stands for one or more characters (letters/numbers).
                (Read more in section 6.3 about Wild cards.)
>               means that the result will be sent to a file (and thus is not directly
                visible on the screen).
freqLara.txt    is the name of the result file (i.e. the output) created when you run
                this frequency search. You can give the output file any name you want.
                The file will end up in the folder that you have specified under the
                “Output” button in the Commands window.
                We give the file the extension “.txt” so that it will open with Notepad
                (or a similar program; it can also be opened with CLAN or Word).
Tip You can always choose whether you want your results directly on the screen (in which
case you don’t write anything special in the Commands window) or whether you want them
to be saved in a file. In the latter case you write > and make up a name for your output file.
Make sure the name is one word!
If you make new searches, make sure to give your output files new names as well, otherwise
the new file will replace the old one.
There are two main reasons for making output files. First, the output window is too
small to show long outputs (like a frequency list of words from a long file). Second,
you may want to go back to the files later, or to compare several output files
with each other.
3. Now open the output file “freqLara.txt” you have created. You do this by going to your folder
(Startmenyn/Den här datorn/Nätverksenhet (H:)).
4. Look at the output file. What does it show?
5. Look at the end of the document. Which summary measures do you find there? What can they be used
for?
6. Now, try to make CLAN perform a frequency count on all utterances except Lara’s, in the file
lara.3-03-16.35.cha. Which command should you use? (Try it out!)
7. Did you succeed?
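To get a feeling for what a frequency count restricted to one speaker does, here is a minimal Python sketch. It is not how FREQ is implemented: the function name word_frequencies is made up, tokens are simply split on spaces, and CLAN's special handling of retracings and other CHAT codes is not reproduced (so the two tokens in "well [/] well" are both counted here).

```python
from collections import Counter


def word_frequencies(lines, speaker=None):
    """Count words on speaker tiers; optionally restrict the count to one speaker code."""
    counts = Counter()
    for line in lines:
        if not line.startswith("*"):
            continue  # skip header lines (@) and dependent tiers (%)
        code, _, content = line[1:].partition(":")
        if speaker is not None and code.lower() != speaker.lower():
            continue
        for token in content.split():
            # Keep only tokens that begin with a letter (drops ". ? [/]" and similar)
            if token[0].isalpha():
                counts[token.lower()] += 1
    return counts


lines = [
    "*MOT: do you want a bobble in ?",
    "*CHI: no .",
    "*MOT: do you want a bobble in like mummy ?",
    "*CHI: no .",
    "*CHI: well [/] well .",
]

chi = word_frequencies(lines, speaker="chi")
types, tokens = len(chi), sum(chi.values())
print(chi.most_common())
print("types:", types, "tokens:", tokens, "TTR:", types / tokens)
```

Leaving out the speaker argument counts all speakers together, which corresponds to running the analysis without any tier restriction.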
6.1 Flags
All CLAN programs have so-called flags. These are used to specify the search in various ways.
You have already used some flags, e.g. +t and +u.
8. Write this in the commands window
freq
In the output window you will now see all the flags that can be combined with
FREQ. You can always type the name of a program (e.g. KWAL, MLU, COMBO or CHIP)
to see all the flags connected to that program. Many flags are the same for several programs,
but not all.
9. Now, test the flag +o in combination with a previous command. What is the difference in output?
10. Then, test the flag +d in combination with a previous command. What is the difference in
output?
11. Test some of the other flags for FREQ by repeating any of the previous commands we have
used and adding another flag. Just remember that the file name should always be at the end of the
command.
6.2 Newer versions of CLAN
In the recent (March 2013) version of CLAN, the output of freq is by default divided by
speaker, i.e., if a transcript contains more than one speaker, the freq output will be counted for
each speaker separately. In order to get the same output as above, you will have to add the flag
+o3, as below.
If you have a new version of CLAN, test this command with and without +o3:
freq +u +o +o3 lara.3-03*
What is the difference between that command and this:
freq +u +o lara.3-03*
6.3 Wild cards
The CLAN programs offer great possibilities to use the so-called wild card. This means
that you use a star (*) instead of one or more letters/numbers. In this way, you can include
several things at the same time. Above, you saw how to search all Lara’s files by writing
lara*.cha.
Wild cards can be used in file names, but also to replace one or more characters when you
search for a specific word and its forms.
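The idea of a wild card can be tried out with Python's standard fnmatch module. This is only an analogy, not CLAN's own matching: in Python's shell-style patterns the star also matches zero characters, and the helper name select is made up for this sketch.

```python
from fnmatch import fnmatchcase


def select(names, pattern):
    """Return the names that match a shell-style wild card pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Wild card in a file name: every file starting with "lara" and ending with ".cha"
files = ["lara.3-02-24.45.cha", "lara.3-03-16.35.cha", "freqLara.txt"]
print(select(files, "lara*.cha"))

# Wild card in a word: every word ending in "ing"
words = ["playing", "going", "want", "blanket"]
print(select(words, "*ing"))
```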
6.4 Summary of the CLAN commands
A CLAN command is built this way:
Program name    (opt. flags)    filename.cha    (opt. send the output to file)
freq            +u +d4          ma30_20.cha     > freqmarkus
6.5 Search for specific word
If you want to search for a specific word, you can do this with the flag +s directly followed by the
word you want to count.
Try this command:
freq +sthis +t*chi lara.3-00-00.45.cha
What is the result?
Now, try to formulate a command that investigates how the other speakers in the same file use the same
word, this. Did it work?
6.6 Search for specific words
You may want to look for several words at the same time. You can do this by creating a so-called include
file.
Do like this:
1. Open a new, empty CLAN-document
2. List the words you want to look at, one word at each line, for instance:
this
that
it
3. Then go to the menu “File”, and choose “Save as”. Give the file a name, for instance dempron.cut
4. Save the file somewhere in your directory (on (H:)).
5. Then go back to CLAN, open the Commands window and redefine the button “Lib” so that it
points at your directory (i.e., where you have saved the file “dempron.cut”).
6. Then write the following command:
freq +s@dempron.cut +t*chi lara.3-00-00.45.cha
7. What does the output look like?
8. Now, try to make a new include file that looks at all the personal pronouns in the file. Investigate
how many the child produces and how many the other speakers produce. Is there a relation?
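The logic of an include file (a plain list with one word per line that restricts what is counted) can be sketched in Python as below. The function names are made up for this illustration, and the include=False option is an assumption that mirrors excluding the listed words instead of including them.

```python
from collections import Counter


def load_word_list(text):
    """Read a word list with one word per line, as in the include file described above."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def count_listed(words, word_list, include=True):
    """Count only the listed words (include=True) or only the unlisted ones (include=False)."""
    return Counter(w for w in words if (w in word_list) == include)


dempron = load_word_list("this\nthat\nit\n")
words = "is it a bit hot with your hair like that".split()

print(count_listed(words, dempron))                 # only the words in the list
print(count_listed(words, dempron, include=False))  # everything except the listed words
```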
7 KWAL-analyses
The program KWAL (Key Word And Line) is good to use if you want to see the context of a word.
1. Select one word of Lara’s that you want to investigate. (Check a transcription file to make sure
she uses it.) Below we will look at the word ‘blue’.
2. Write the command below. Make sure to write the word you want to search for directly after the
flag ‘+s’, e.g. ‘+sblue’.
kwal +w2 -w1 +sblue +t*chi +u lara*.cha
+w2          means that you will see the two utterances following the keyword you have
             chosen.
-w1          means that one utterance before your keyword will be shown.
+sWORD       means that you search for a certain keyword.
             s is short for string; replace WORD with your keyword.
+t*chi       the analysis will only be performed on Lara’s utterances.
lara*.cha    the analysis will be performed on all of Lara’s files.
+u           the analysis includes all files at once (u = ‘unify’).
You can vary the commands. Choose to include more or less context (use +w and -w and change
the number following these flags). Just as with the FREQ-analysis you can do the analysis on
one file or many files.
3. You can use Wild cards in KWAL. Try for instance this command:
kwal +t*chi +s*ing lara*
What does it give you?
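What the context flags do can be illustrated with a small Python sketch that returns each hit together with a number of utterances before and after it. This is a simplified stand-in for KWAL, not its implementation; the function name keyword_in_context and its parameters before and after are made up, with defaults chosen to resemble -w1 and +w2.

```python
def keyword_in_context(utterances, keyword, speaker, before=1, after=2):
    """Return, for each hit, the hit utterance plus its surrounding utterances."""
    hits = []
    for i, (code, text) in enumerate(utterances):
        if code != speaker or keyword not in text.lower().split():
            continue
        start = max(0, i - before)
        # The slice is simply cut off at the start or end of the transcript
        hits.append(utterances[start:i + after + 1])
    return hits


utterances = [
    ("MOT", "is it a bit hot with your hair like that ?"),
    ("CHI", "no ."),
    ("MOT", "do you want a bobble in like mummy ?"),
    ("CHI", "no ."),
    ("CHI", "well [/] well ."),
    ("CHI", "this a picnic blanket and I will let they [*] have it ."),
]

for hit in keyword_in_context(utterances, "well", "CHI"):
    for code, text in hit:
        print(f"*{code}: {text}")
```

Here only one utterance follows the hit, because the sample ends before a second following utterance is available.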
8 MLU-analyses
The MLU program (MLU = Mean Length of Utterance) is used to study quantitative development
in terms of words, and syntactic development. The output from this program is the number of morphemes (if
the transcription is coded for that) per utterance for every speaker. If the transcription is not coded
for morphemes, you will instead get the number of words per utterance.
The result of an MLU analysis is thus the mean length (in morphemes or words) of an utterance. In
CLAN, every new speaker turn is an utterance.
1. Now count MLU for Lara’s utterances in the file ‘lara.3-03-09.45.cha’.
mlu -t%mor +t*chi lara.3-03-09.45.cha
Earlier, the MLU program always counted the words on the speaker line, but a few years ago
the program was changed to calculate MLU from the %mor tier instead. This is why
you have to specify -t%mor in the command (since Lara’s transcripts do not have any %mor tier).
-t%mor     means that we do not want MLU to be counted using the %mor tier;
           we want to use the speaker tier (main tier) instead.
+t*chi     means that we only want MLU for the speaker Lara.
           You can in principle leave this out; you will then get MLU separately for
           all speakers.
2. The result is shown in the output window. MLU is the measure called Ratio of morphemes over
utterances.
MLU for Speaker                       specifies the speaker.
Number of utterances                  the number of utterances in the transcription file.
Morphemes                             the number of morphemes in the transcription file
                                      if the file is coded for morphemes; otherwise
                                      this will show the number of words.
Ratio of morphemes over utterances    the MLU measure (i.e. the number of morphemes
                                      divided by the number of utterances).
Standard deviation                    Don’t put too much stress on this term
                                      if you don’t know statistics. The measure tells
                                      you how much the data varies.
                                      A high standard deviation tells you
                                      that the file consists of, on the one hand,
                                      many long utterances and, on the other hand,
                                      many short utterances, with not so
                                      many utterances in between.
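The measures in this output can be recomputed by hand. The Python sketch below counts words per utterance (as when there is no %mor tier) for four of the child's utterances from the example transcript. It is an illustration only: the function name is made up, and using the population standard deviation is an assumption, since this guide does not say which variant CLAN reports.

```python
from statistics import mean, pstdev


def mlu_in_words(utterances):
    """Return (number of utterances, number of words, mean length, standard deviation)."""
    # Count only tokens that begin with a letter, so ". ?" and codes are left out
    lengths = [len([t for t in u.split() if t[0].isalpha()]) for u in utterances]
    return len(lengths), sum(lengths), mean(lengths), pstdev(lengths)


chi_utterances = ["what ?", "oh yeah .", "no .", "no ."]
n, total, ratio, sd = mlu_in_words(chi_utterances)
print("Number of utterances:", n)
print("Words:", total)
print("Ratio of words over utterances:", ratio)
print("Standard deviation:", round(sd, 3))
```

With five words spread over four utterances the ratio is 5/4 = 1.25, and the small standard deviation reflects that all four utterances are about equally short.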
9 CHIP-analyses
CHIP can be used to investigate to what extent the same words and utterances are used by different
speakers. This can for instance be used to see how much different speakers repeat each other (see footnote 1).
1. The idea of CHIP is to compare two specific speakers to each other. You have to say who is
the adult and who is the child (of course, you can also compare two adults).
After this, you can choose whether you are interested in self-repetitions (i.e. when a speaker repeats herself),
or only in repetitions of what someone else has said. We will choose the latter alternative here.
2. Write the following command:
chip +bMOT +cCHI -ns lara.3-02-24.45.cha > larachip1
+bMOT    tells the program which speaker is the adult; here MOT (Mother).
+cCHI    tells the program which speaker is the child; here CHI (Lara).
-ns      excludes output that is self-repetition.
The output is rather complicated, so it is sent to a file that we call larachip1.
3. Open the output file “larachip1” (see footnote 2). What does it look like?
4. Now, look at the bottom of the file. Here you will find a rather complicated table. You can
concentrate on the following:
%_OVERLAP    The percentage of utterances that overlap (in any way) with
             utterances from another speaker.
             The following measures are only calculated on this
             overlapping percentage.
%_ADD_OPS    The percentage of the utterances where the speaker added something
             (addition) to what the previous speaker has said.
%_DEL_OPS    The percentage of the utterances where the speaker deleted something
             (deletion) compared to what the previous speaker has said.
%_EXA_OPS    The percentage of the utterances where the speaker exactly repeated
             something that the previous speaker has said.
5. If you study the output in the file “larachip1”, can you then say who is mostly expanding the
previous speaker’s utterances?
6. Check the CLAN-manual on how to use the CHIP program in more ways.
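The notions of overlap, addition, deletion and exact repetition can be made concrete with a deliberately simplified Python sketch that compares one utterance with the utterance before it. This is not CHIP's actual algorithm, which this guide does not describe in detail; the function name and the word-by-word comparison are assumptions made for illustration.

```python
def compare_to_previous(previous, response):
    """Compare a response with the previous speaker's utterance, word by word."""
    prev, resp = previous.split(), response.split()
    return {
        "overlap": any(w in prev for w in resp),         # at least one shared word
        "added": [w for w in resp if w not in prev],     # words new in the response
        "deleted": [w for w in prev if w not in resp],   # words left out in the response
        "exact": prev == resp,                           # exact repetition
    }


result = compare_to_previous("don't they want a duvet", "this is a blanket duvet")
print(result)
```

Counting such outcomes over all pairs of adjacent utterances, and dividing by the number of overlapping pairs, gives percentages of the same kind as those in the table above.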
Footnote 1: Thanks to Sofia Strömbergsson and Erika Ljung who wrote the paper Automatisk analys av interaktionsmönster i
vuxen-barnsamtal. Institutionen för logopedi, audiologi och foniatri, Lunds universitet, vt 2005.
Footnote 2: You can for instance open it in Excel!
10 COMBO-analyses
The program COMBO is used for many things; among other things it can be used to analyse how
words (word pairs) occur together, or to find out which words never occur together.
1. Write the following command to find out if Lara uses the expression I have:
combo +sI^have +t*chi lara*.cha
+sWORD       indicates which word you want to search for.
WORD^WORD    the sign ^ is used between the two words you are interested in.
             You can also use wild cards to investigate if the words occur
             together, but not directly next to each other. Then write:
             WORD^*^WORD
2. What is the result? Does Lara use this expression?
3. Maybe there are other occasions, when the words I and have occur together but not next to
each other. How would you write a command to investigate this?
4. Maybe Lara uses the words I and have together, but not necessarily in that order. To check whether
two words occur together, independent of order, you can use +x. Write for instance:
combo +sI^have +x +t*chi lara*.cha
5. Maybe Lara sometimes uses I without have. To find all the cases when a word occurs without
another word, you should use ^!. Test for instance:
combo +sI^!have +x +t*chi lara*.cha
6. Read more about COMBO in the CLAN manual!
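The difference between the searches in this section (two words directly next to each other, one word followed later by the other, and both words in any order) can be illustrated in Python. The three function names are made up, and the sketch only mimics the kind of question you ask; it does not reproduce COMBO's search syntax or its exact matching rules.

```python
def adjacent(words, first, second):
    """True if `first` is immediately followed by `second`."""
    return any(a == first and b == second for a, b in zip(words, words[1:]))


def followed_later(words, first, second):
    """True if `second` occurs anywhere after `first` (the adjacent case counts too)."""
    return first in words and second in words[words.index(first) + 1:]


def both_any_order(words, first, second):
    """True if both words occur in the utterance, regardless of order."""
    return first in words and second in words


utterance = "this a picnic blanket and I will let they have it".split()
print(adjacent(utterance, "I", "have"))
print(followed_later(utterance, "I", "have"))
print(both_any_order(utterance, "have", "I"))
```

In this utterance the two words occur together but not next to each other, so only the last two checks succeed.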
11 Lexical diversity using VOCD
Lexical diversity measures how many different words there are in a text, and can thus tell you something
about the lexical variation of a certain speaker/writer. Often this has been measured using the so-called
TTR measure (type/token ratio), which is found at the bottom of a frequency count in CLAN,
but this is not so useful if you want to compare texts of different lengths. You can read more about
this in the CLAN manual, under the VOCD section.
Using VOCD (Vocabulary Diversity) is really very simple. Keep the following in mind:
• The text should have at least 50 words.
• The transcription should have been checked with CHECK, and passed without remarks.
• You can only run VOCD on one file at a time. (If you have several speakers in one and the
same file, you will have to specify every speaker, and every speaker must have uttered at
least 50 words.)
In this example we work with a data set from the Spencer project, found at:
Gu-material/Humanistlaboratoriet/Logopedlabb/CLAN-data/Spencer/Swedish
Then write the command:
vocd wg01fAES*.cha
The program will run through the file and make some counts. In principle, the VOCD counts the
value three times and the mean value of these three calculations will be the “D” value in the end. At
the very bottom of the file you will see the header D_optimum average, which will give you the result
57.81 (in this case).
VOCD RESULTS SUMMARY
====================
Command line: vocd Macintosh HD:Users:victoria:
Documents:Spencerstudien:Alla:Allasamlade:wg01fAES.cha
File name: Macintosh HD:Users:victoria:
Documents:Spencerstudien:Alla:Allasamlade:wg01fAES.cha
Types,Tokens,TTR: <72,122,0.590164>
D_optimum values: <57.46, 58.60, 57.36>
D_optimum average: 57.81
What ‘D’ (the value that VOCD calculates) really is remains a bit unclear (but read more under
section 9.25 in the huge CLAN manual). You can calculate VOCD for different speakers and texts
and compare them to each other. The value can be used to
compare vocabulary between different groups, but one should be careful about making, for example, cross-linguistic
comparisons, since differences in morphology can make the values differ considerably.
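Two of the numbers in the VOCD summary above are easy to check by hand: the TTR is the number of types divided by the number of tokens, and the D_optimum average is the mean of the three D_optimum values. The short Python check below uses only the figures printed in that summary; how VOCD arrives at each D_optimum value is not covered here.

```python
# Figures copied from the VOCD results summary above
types, tokens = 72, 122
d_values = [57.46, 58.60, 57.36]

ttr = types / tokens                       # type/token ratio
d_average = sum(d_values) / len(d_values)  # mean of the three D_optimum values

print(f"TTR: {ttr:.6f}")
print(f"D_optimum average: {d_average:.2f}")
```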
12 Lexical density
Lexical density, or information packaging, measures the percentage of content words (mainly the word
classes nouns, verbs and adjectives) among all words in a text, i.e. it measures how large a percentage
of all the words in the text the nouns, adjectives and verbs constitute. Often some lexical adverbs (‘fast’)
are also included in this category of content words, for instance adverbs derived from adjectives (e.g.
‘slowly’).
If the words in the texts are not tagged for parts of speech, one may wonder how to calculate this.
But while the content words are unlimited in number, the function words are usually easy to find: pronouns,
conjunctions and subjunctions, prepositions and the rest can be collected from a grammar book, for
instance. So the procedure for calculating lexical density with CLAN is to make a list of all
the function words (it does not have to be all the function words in the language; listing the ones
occurring in the corpus is enough for this purpose), and then use a negative include file search to sort
out all words that are not function words.
You can for instance do like this:
1. Start by creating a new include file where you list all the function words you can think of (use
grammar books, but also make a frequency count of all the most common words in your corpus;
there will most likely be a lot of function words among the top 50 words or so).
2. The include file is created by starting CLAN and selecting File and then New. In this new document
you list all the words, one on every line. List pronouns, prepositions, con/subjunctions, the most
common numerals and interjections. List (if you have decided to do so) the most common
adverbs: ‘not’, ‘where’, ‘how’, ‘actually’, etc.
3. Save the file, for instance with the name ‘functionwords.cut’, in the same folder where CLAN’s
lib files are saved. And don’t forget to give the file name the extension ‘.cut’.
4. Then make a frequency count on your material using the file. In this frequency count you should
exclude everything that you have listed in your function word file. Make sure to send the output
to a file (otherwise it will most likely be so long that you will not be able to see the whole result).
If you let the output be a word list where you see the words alphabetically with no frequency
information, it will be easy for you to notice whether there are many function words you have
forgotten (these should be picked out and put on your function word list). Repeat this procedure until
you are sure that there are no function words left on the list.
The command can look like this:
freq +d1 -s@functionwords.cut filenames.cha > functionoutput
5. Then open the file, which in this case is called ‘functionoutput’, and check if there are any function
words on this list. If you find any, then add them to your file functionwords.cut.
6. In the end you will have a complete list of all the function words that exist in your corpus. (If you
later run that list against another corpus, you will probably have to add more function words
that were not in your original corpus.)
7. Now, once you have a complete function word list, run it against your corpus, file by file, and
note how many function words and content words every file contains (you get content words if
you run a negative frequency count with the include list, and function words if you run a positive
frequency count).
Something like this to calculate all the content words:
freq +d4 -s@functionwords.cut filenames.cha > contentwordoutput
Or this to calculate all the function words:
freq +d4 +s@functionwords.cut filenames.cha > functionwordsoutput
8. Also, make sure to run “ordinary” frequency counts to find out the total number of words in every
file:
freq +d4 filenames.cha > allwordsoutput
9. Put the results in, for instance, an Excel file, and based on the total number of words in the file
and the number of content words, calculate what percentage of the words are content words.
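The final calculation is a simple percentage: the number of content words divided by the total number of words, times 100. The Python sketch below shows the whole idea on one utterance from the example transcript. The function name and the tiny function word list are made up for illustration; a real list would be much longer, as described in the steps above.

```python
def lexical_density(words, function_words):
    """Return (content words, total words, percentage of content words)."""
    content = [w for w in words if w.lower() not in function_words]
    return len(content), len(words), 100 * len(content) / len(words)


# A small, illustrative function word list (not a complete one)
function_words = {"is", "it", "a", "with", "your", "like", "that"}
words = "is it a bit hot with your hair like that".split()

n_content, n_total, density = lexical_density(words, function_words)
print(n_content, "content words out of", n_total, "=", density, "percent")
```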
12.1 Using COMBO to tag
COMBO can also be used to tag and code your file.
In this small exercise we will code all the tokens of the word ‘we’ that the mother utters in the small
file “engtest.cha” (found in the exercise material in Humanistlaboratoriet/CreatingalinguisticcorpusVT13/CLAN/Exercises).
Do like this:
1. Open a new window in CLAN (go to the File menu and select “New”).
2. Write the word “we” in the file.
3. Then choose File again, and select “Save as...”.
4. Save the file with the name “we.cut”. Preferably, save the file in the same folder as the depfile
(e.g. in the CLAN program folder). If you don’t do this you will (later) have to set the Mor lib
button in the Commands window to the folder where you have put your “we.cut” file.
5. Then open yet another new window in CLAN, by choosing File and selecting “New”.
6. In this file you should write exactly as below, and be careful with quotation marks, full stops,
etc. You must have a tab between the three groups of words (thus NO regular spaces here).
"@we.cut"	"$we"	"%cod:"
7. This line will make the program look for the include file “we.cut” (i.e. the file you
created previously), and then code everything that is listed in that file (i.e. in this case only
the word ‘we’) using the code $we on a specific coding tier called %cod. The program will
automatically generate this coding tier in the cases where the word ‘we’ occurs. In a minute we
will see how it works.
8. Now, save this file with the name “taggawe.cut”. Save it in the same folder as
the file “we.cut”.
9. Return to CLAN and open the Commands window. Redefine the Mor lib button so that it points
at the folder where you have saved the files “we.cut” and “taggawe.cut”.
10. Then write this command:
combo +s@taggawe.cut +t*MAM +d4 engtest.cha
11. In this case the file name is “engtest.cha”, since we are working with that file.
12. The command says that you want to use COMBO to call the include file “taggawe.cut” (you
call this file with the flag +s@). Then you want to search all the utterances by the
mother (+t*MAM). The flag +d4 says that if a match is found (i.e. if the mother says ‘we’),
then CLAN should perform what is found in the file “taggawe.cut” (i.e. write a coding tier %cod
with the code $we).
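The effect of this tagging (a %cod tier with the code $we added under every utterance by the mother that contains ‘we’) can be sketched in Python as follows. This only imitates the result described above and is not how COMBO works internally; the function name add_code_tier and the three sample lines are made up, since the content of “engtest.cha” is not shown in this guide.

```python
def add_code_tier(lines, speaker, word, code):
    """Insert a %cod tier after each utterance by `speaker` that contains `word`."""
    out = []
    for line in lines:
        out.append(line)
        if line.startswith(f"*{speaker}:"):
            tokens = line.split(":", 1)[1].lower().split()
            if word in tokens:
                out.append(f"%cod:\t{code}")
    return out


# Made-up sample lines, only for illustration
lines = ["*MAM: shall we go now ?", "*CHI: no .", "*MAM: okay ."]
for line in add_code_tier(lines, "MAM", "we", "$we"):
    print(line)
```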