Introduction to CLAN analyses
Victoria Johansson
Språk- och litteraturcentrum, Lingvistik
[email protected]; [email protected]
17 March 2013

Contents
1 Introduction
2 CHILDES-environment
   2.1 CHILDES
   2.2 CHAT
   2.3 CLAN
   2.4 Download CLAN
3 Transcriptions
   3.1 Example of a cha-file
   3.2 Small guide to the transcription symbols
4 Start CLAN and settings
   4.1 Your own account
   4.2 Start CLAN
   4.3 Settings in CLAN
5 Look at a transcript
6 FREQ-analyses
   6.1 Flags
   6.2 Newer versions of CLAN
   6.3 Wild cards
   6.4 Summary of the CLAN commands
   6.5 Search for specific word
   6.6 Search for specific words
7 KWAL-analyses
8 MLU-analyses
9 CHIP-analyses
10 COMBO-analyses
11 Lexical diversity using VOCD
12 Lexical density
   12.1 Using COMBO to tag

1 Introduction
This is a short introduction to CLAN analysis. It is based on previous guidelines in Swedish that were developed and adapted for courses of various kinds given at the Humanities Lab, Lund University. The aim of this guideline is to introduce some of the most common and useful analyses that can be performed using CLAN.
The introduction works with the corpus Lara, which is accessible through the WEB data in CLAN. Open CLAN, go to the menu “Window” and choose “WEB data”. In the window that opens, choose the directory “Eng-UK”, and then “Lara”. In order to work with the files as is done in this guideline, the files must be downloaded (I have kept all the file names, and just added “lara.” in front of them).

2 CHILDES-environment
2.1 CHILDES
CHILDES is an acronym for the Child Language Data Exchange System. It is a network of mainly child language researchers from all over the world who use the transcription standard CHAT and the analysis tools CLAN. CHILDES researchers often share their corpora on child language development in first and second languages (from all over the world), as well as bilingual and clinical data. On the CHILDES webpage, you will also find information on different methods of transcribing and coding. The program and the tools are free to use, and work on several platforms (Mac, PC, Unix). The program is continuously updated.

2.2 CHAT
The transcription standard is called Codes for the Human Analysis of Transcripts, CHAT. This manual will tell you how to transcribe a simple file according to the CHAT standard. You can also run some analyses on files (in plain text format) that do not follow this standard, for instance if you want to count word frequencies.

2.3 CLAN
When you want to run analyses on your transcription, you will use the program package called CLAN (Computerized Language ANalysis).
The programs will mainly help you to perform various forms of frequency counts: number of words, morphemes, utterances, t-units, combinations and co-occurrences of words and phrases, and lexical diversity.

2.4 Download CLAN
The program can be downloaded from the CHILDES homepage. In addition to this guideline, you will be helped by the manuals about CHAT (the transcription standard) and CLAN (the program package to analyse the CHAT transcriptions), which are available for free at the CHILDES homepage:
http://childes.psy.cmu.edu/

3 Transcriptions
The transcriptions in this exercise follow the so-called CHAT format. It is possible to perform certain analyses on files that do not follow this format, although we will not deal with this in this guideline. However, the more powerful analyses can only be done with transcripts in CHAT format. Files following the CHAT format end with .cha, and are often called chat-files. These exercises are based upon chat-files that you will find in the exercise material at Språk- och litteraturcentrum:
Gu-material/Humanistlaboratoriet/CreatingalinguisticcorpusVT13/CLAN/Laradata
If you do this lab exercise somewhere else, you can download the files for this exercise like this:
1. Open the CLAN program.
2. Go to the Output window.
3. Choose the menu “Window”.
4. Then choose “WEB data”.
5. A window will open. Choose “Eng-UK” here.
6. Then choose the folder “Lara”.
7. Click on the files here to download them (you may have to repeat the procedure for each file individually).
8. When you see a file on the screen, choose File and Save as.
9. Save the files to a directory on your computer (preferably, choose the same names as the files have above).
10. When you define the button “Working” below, you will have to set this to the directory where you have saved these files.
3.1 Example of a cha-file
@Begin
@Languages: eng
@Participants: CHI Lara Child , MOT Mother
@ID: eng|Lara|CHI|||||Child|||
@ID: eng|Lara|MOT|||||Mother|||
@Date: 09-AUG-1997
@Location: Nottingham , England
@Comment: Filename Lara.3-02-24.45
@Situation: playing with mum and Amy
@Comment: time morning
@Comment: duration 45 minutes
@Comment: transcribed by Sarah Fletcher and checked by Caroline Rowland
*MOT: we’re playing houses with those two .
*MOT: are they going in there ?
*MOT: don’t they want a duvet ?
*CHI: what ?
*MOT: don’t they want a duvet ?
*CHI: oh yeah .
*MOT: do you want a <bobble in> [>] ?
*CHI: <this is a blanket> [<] duvet .
*MOT: okay .
*MOT: do you want a bobble in ?
*CHI: but they +//. [+ IN]
*CHI: no .
*MOT: <is it> [//] is it a bit hot with your hair like that ?
*CHI: no .
*MOT: do you want a bobble in like mummy ?
*CHI: no .
*CHI: well [/] well .
*CHI: this a picnic blanket and I will let they [*] have it .
%err: they = them

3.2 Small guide to the transcription symbols
Table 1: Key to the CHAT transcriptions

*
    Every speaker line is introduced by a star, followed by a three-letter code indicating the speaker. The code is unique for each speaker in the transcript, but in this project we used the code *SBJ for all the Subjects, and *INV for the Investigator. This facilitated later analysis. (In the transcription examples the star (*) in front of the three-letter code has been excluded.)
%
    Starts a dependent tier, containing comments or coding relating to the preceding speaker tier, e.g. %ces, indicating a center-embedded clause on the previous line.
#
    Pause of unspecified length.
[]
    Square brackets denote a clarification of some kind. In our case we have mainly used them for correcting misspelled words in writing and for translating (for comparative reasons) a spoken word form into its written equivalent. This means that the form that is generally included in the analysis is the form within square brackets. Example: ja [: jag] (‘I’).
[?]
    Denotes that the transcriber is uncertain of the previous word or utterance. If the uncertainty covers more than one word, all the words are enclosed in angle brackets. Example: <hunden som> [?] (‘the dog that’).
[!]
    Denotes that the previous word(s) are emphasized. Example: jag såg henne [!] (‘I saw her [!]’), where ‘henne’ is emphasized.
+
    Denotes a word boundary, e.g. in compounds. For more information, see Section ?? about the transcription of words.
?
    A question mark at the end of an utterance indicates question intonation.
!
    An exclamation mark at the end of an utterance indicates a stressed utterance, an imperative clause or an interjection.
xxx
    Unintelligible speech.
<word(s)>[/]
    Retracing without correction. Examples: han [/] han säger (‘he [/] he says’) and <han som> [/] han som är där (‘<he who> [/] he who is there’).
<word(s)>[//]
    Retracing with correction. Example: han [//] hon säger (‘he [//] she says’).
<word(s)>[///]
    Retracing with reformulation. Example: <han säger att> [///] jag tycker att man kan säga att (‘<he says that> [///] I think you can say that’).
+...
    Trailing off.

4 Start CLAN and settings
First, some practicalities.

4.1 Your own account
When you use CLAN you will sometimes have to create so-called output files, which contain the results of the analysis. If you make these, they can only be saved at “your place” on the server (this is the only place where the computers in the university’s PC room will let you save files). This is how to find “your place”:
1. Go to the Start menu (bottom left of the screen).
2. Choose “Den här datorn” (This computer).
3. Choose the drive marked (H:). (Its name contains your login on the computer, which is different for every user.) It is probably called something like “Studentserver (student.ht.lu.se)(H))”.
This is where your output files will go. You can also save everything else here.
4.2 Start CLAN
Start CLAN like this (valid for the PC room in the Humanities Lab):
1. Go to the START menu (bottom left of the screen).
2. Choose “Alla program”.
3. Choose “SOL” in the list. If “SOL” is missing you can go directly to the next step.
4. Choose “CLAN” in the list.
5. When the program opens, two windows will be visible. The smaller one is the so-called Commands window, and the bigger one is the so-called Output window. If the Commands window doesn’t open, you can go to the menu “Window” (found at the top of the Output window) and choose “Commands”, or press the shortcut Ctrl+D.
If CLAN is not found this way you can also choose START/Den här datorn/C-drive/CHILDES/CLAN.

4.3 Settings in CLAN
When you have started CLAN you have to set up the Commands window so that you work with the right files. You have to check this every now and then, especially if you start working with a new set of transcriptions. In the Commands window there are four buttons:
• Working
• Output
• Lib
• Mor lib
Check that the settings are as below, and correct them if necessary.

Working
The Working button tells the program which files you want to work with. We will start to work with CHAT files from a longitudinal corpus of one girl, Lara.
1. Click the button “Working”. A window opens.
2. At the bottom of the window you first have to choose Drives. Choose the one called X:\\staff.ht.lu.se\GU-material (it can be difficult to see the whole name).
3. In the window you now have to find the folder “Humanistlaboratoriet” and click on it (you may have to scroll down to find it).
4. Then click on “CreatingalinguisticCorpusVT13”.
5. Click on “CLAN”.
6. Click on “Laradata”.
7. Click on the button “Select Directory” to select this folder.

Output
The Output button tells the program where output files should be saved. These should go to your place on the server.
1. Click on the button “Output”. A window opens.
2. Choose “Drives” at the bottom of the window. Choose the one called h:\\student\stud-xxx .
3. Click on the button “Select Directory”.

Lib and Mor lib
The buttons Lib and Mor lib tell the program in which library it should look for information (e.g. on specific language files). You very rarely have to change this. The preset is probably C:\CHILDES\CLAN\lib. If this is not the case, do the following (once for each button):
1. Click on the button “Lib”/“Mor lib”. A window opens.
2. First choose “Drives” at the bottom of the window. Choose the drive called C:\\
3. Then, above, choose the folder called “CHILDES”.
4. Click on “CLAN”.
5. Click on “lib”.
6. Click on the button “Select Directory”.

5 Look at a transcript
Start by opening a transcript.
1. Go to the START menu at the bottom left of the screen.
2. Choose “Den här datorn” (This computer).
3. Choose the volume called “GU-material på staff.ht.lu.se”.
4. Choose “Humanistlaboratoriet”.
5. Choose “CreatingaLinguisticCorpusVT13”.
6. Choose “CLAN”.
7. Choose “Laradata”.
8. Open the transcript lara.3_02-24.45.cha
9. Look at the transcript, and leave it open while you move on with the analyses below.

6 FREQ-analyses
The FREQ program gives you information on the most frequent words. You can run the program on all words in a file, on just one word, or on a certain selection of words. You can also choose between analysing one or more files, and you can restrict the analysis to a single speaker.
1. Write the following command in the Commands window. Be careful to use spaces, capitals and other signs correctly! When you’re done, click the button “Run” in the Commands window.
   freq lara.3-02-24.45.cha

This command tells the computer the following:

freq
    means that you want to do a frequency analysis.
lara.3-02-24.45.cha
    is a file name. By writing it you indicate that this is the file that you would like to analyse (in this case: find out which words are used in the file).

Since you don’t give any other specifications or restrictions, the program will count all the words in all the utterances in the file (i.e. everything that Lara (the child) says, as well as everything the adults in the file say).

2. Now we will write a command where we only study Lara’s utterances. This time we will look at all files with Lara’s transcriptions. Write the following in the Commands window and click “Run”.

   freq +u +t*chi lara*.cha > freqLara.txt

+u
    (‘unify’) means that the results from all the files you include in the calculation will be summarized.
+t*chi
    means that you only want to count Lara’s utterances, i.e. only the tiers starting with *chi. If you want to exclude all of Lara’s utterances, but include all the rest, you should instead write -t*chi.
lara*.cha
    means that you include all the files that have a file name starting with “lara” and ending with “.cha”. The star (*) is called the wild card and stands for one or more characters (letters/numbers). (Read more in section 6.3 about wild cards.)
>
    means that the result will be sent to a file (and thus is not directly visible on the screen).
freqLara.txt
    is the name of the result file (i.e. the output) created when you run this frequency search. You can give the output file any name you want. The file will end up in the folder that you have specified under the “Output” button in the Commands window.
We give the file the extension “.txt” so that it will open with Notepad (or a similar program; it can also be opened with CLAN or Word).

Tip
You can always choose whether you want your results directly on the screen (in which case you don’t write anything special in the Commands window) or saved in a file. In the latter case you write > and make up a name for your output file. Make sure the name is a single word! If you make new searches, make sure to rename your output files as well, otherwise the new file will replace the old one.
There are two main reasons for making output files. First, the Output window is too small to show long outputs (like a frequency list of words from a long file). Second, you may want to go back to the files later, or to compare several output files with each other.

3. Now open the output file “freqLara.txt” that you have created. You do this by going to your folder (Startmenyn/Den här datorn/Nätverksenhet (H:)).
4. Look at the output file. What does it show?
5. Look at the end of the document. Which summaries do you find there? What can they be used for?
6. Now, try to make CLAN perform a frequency count on all utterances except Lara’s, in the file lara.3-03-16.35.cha. Which command should you use? (Try it out!)
7. Did you succeed?

6.1 Flags
All CLAN programs have so-called flags. These are used to specify the search in various ways. You have already used some flags, e.g. +t and +u.
8. Write this in the Commands window:

   freq

In the Output window you will now see all flags that can be combined with FREQ. You can always type the name of a program (e.g. KWAL, MLU, COMBO or CHIP) to see all the flags connected to that program. Many flags are the same for several programs, but not all.
9. Now, test the flag +o in combination with a previous command. What is the difference in output?
10. Then, test the flag +d in combination with a previous command. What is the difference in output?
11. Test some of the other flags for FREQ by repeating any of the previous commands we have used and adding another flag. Just remember that the file name should always come at the end of the command.

6.2 Newer versions of CLAN
In recent (March 2013) versions of CLAN, the output of freq is by default divided by speaker, i.e., if a transcript contains more than one speaker, the freq output will be counted for each speaker separately. In order to get the same output as above, you will have to add the flag +o3. If you have a new version of CLAN, test this command with and without +o3:

   freq +u +o +o3 lara.3-03*

What is the difference between that command and this one?

   freq +u +o lara.3-03*

6.3 Wild cards
The CLAN programs offer great possibilities to use the so-called wild card. This means that you use a star (*) in place of one or more letters/numbers. In this way, you can include several things at the same time. Above, you saw how to search all of Lara’s files by writing lara*.cha. Wild cards can be used in file names, but also to replace one or more characters when you search for a specific word and its forms.

6.4 Summary of the CLAN commands
A CLAN command is built this way:

Program name   (opt. flags)   filename.cha   (opt. send the output to file)
freq           +u +d4         ma30_20.cha    > freqmarkus

6.5 Search for specific word
If you want to search for a specific word, you can use the flag +s directly followed by the word you want to count. Try this command:

   freq +sthis +t*chi lara.3-00-00.45.cha

What is the result? Now, try to formulate how you would investigate how other speakers in the same file use the same word, this. Did it work?

6.6 Search for specific words
You may want to look for several words at the same time. You can do this by creating a so-called include file. Do like this:
1. Open a new, empty CLAN document.
2. List the words you want to look at, one word on each line, for instance:
   this
   that
   it
3. Then go to the menu “File”, and choose “Save as”. Give the file a name, for instance dempron.cut
4. Save the file somewhere in your directory (at (H)).
5. Then go back to CLAN, open the Commands window and redefine the button “Lib” so that it points at your directory (i.e., where you have saved the file “dempron.cut”).
6. Then write the following command:

   freq [email protected] +t*chi lara.3-00-00.45.cha

7. What does the output look like?
8. Now, try to make a new include file that looks at all the personal pronouns in the file. Investigate how many the child produces and how many the other speakers produce. Is there a relation?

7 KWAL-analyses
The program KWAL (Key Word And Line) is good to use if you want to see the context of a word.
1. Select one of Lara’s words that you want to investigate. (Check a transcription file to make sure she uses it.) Below we will look at the word ‘blue’.
2. Write the command below. Make sure to write the word you want to search for directly after the flag ‘+s’, e.g. ‘+sblue’.

   kwal +w2 -w1 +sblue +t*chi +u lara*.cha

+w2
    means that you will see the two utterances following the keyword you have chosen.
-w1
    means that one utterance before your keyword will be shown.
+sWORD
    means that you search for a certain keyword. s is short for string; replace WORD with your keyword.
+t*chi
    the analysis will only be performed on Lara’s utterances.
lara*.cha
    the analysis will be performed on all of Lara’s files.
+u
    the analysis includes all files at once (u = ‘unify’).

You can vary the commands. Choose to include more or less context (use +w and -w and change the number following these flags). Just as with the FREQ analysis, you can do the analysis on one file or on many files.
3. You can use wild cards in KWAL. Try for instance this command:

   kwal +t*chi +s*ing lara*

What does it give you?

8 MLU-analyses
The MLU program (MLU = mean length of utterance) is used to study the quantitative development of words, and the syntactic development.
The output from this program is the number of morphemes (if the transcription is coded for that) per utterance for every speaker. If the transcription is not coded for morphemes, you will instead get the number of words for every utterance. The result of an MLU analysis is thus the mean length (in morphemes or words) of an utterance. In CLAN, every new speaker turn is an utterance.
1. Now count MLU for Lara’s utterances in the file ‘lara.3-03-09.45.cha’.

   mlu -t%mor +t*chi lara.3-03-09.45.cha

Earlier, the MLU program always counted the words on the speaker line, but a few years ago the program changed, and it now calculates MLU using the %mor-tier by default. This is why you have to specify -t%mor in the command (since Lara’s transcripts do not have any %mor-tier).

-t%mor
    means that we do not want MLU to be counted using the %mor-tier, but that we want to use the speaker tier/main tier.
+t*chi
    means that we only want MLU for the speaker Lara. You can in principle leave this out, and you will then get MLU for each of the speakers.

2. The result is shown in the Output window. MLU is the measure called Ratio of morphemes over utterances.

MLU for Speaker
    specifies the speaker.
Number of utterances
    the number of utterances in the transcription file.
Morphemes
    the number of morphemes in the transcription file, if the file is coded for morphemes; otherwise this will show the number of words.
Ratio of morphemes over utterances
    is the MLU measure (i.e. the number of morphemes divided by the number of utterances).
Standard deviation
    Don’t put too much stress on this term if you don’t know statistics. The measure tells you how much the data varies. A high standard deviation tells you that the file consists of many long utterances on the one hand and many short utterances on the other, but not so many utterances in between.

9 CHIP-analyses
CHIP can be used to investigate to what extent the same words and utterances are used by different speakers.
This can for instance be used to see how much different speakers repeat each other. (Thanks to Sofia Strömbergsson and Erika Ljung, who wrote the paper Automatisk analys av interaktionsmönster i vuxen-barnsamtal. Institutionen för logopedi, audiologi och foniatri, Lunds universitet, vt 2005.)
1. The idea of CHIP is to compare two specific speakers to each other. You will have to say who is the adult and who is the child (of course, you can also compare two adults). After this, you can specify whether you are interested in self-repetitions (i.e. when a speaker repeats herself), or only in repetitions of what someone else has said. We will choose the latter alternative here.
2. Write the following command:

   chip +bMOT +cCHI -ns lara.3_02-24.45.cha > larachip1

+bMOT
    tells the program which speaker is the adult; here MOT (Mother).
+cCHI
    tells the program which speaker is the child; here CHI (Lara).
-ns
    here you want to exclude output that is self-repetition.

The output will be rather complicated, so it is sent to a file that we call larachip1.
3. Open the output file “larachip1” (you can for instance open it in Excel!). What does it look like?
4. Now, look at the bottom of the file. Here you will find a rather complicated table. You can concentrate on the following:

%_OVERLAP
    The percentage of utterances that overlap (in any way) with utterances from another speaker. The following analyses will only be counted on this percentage overlap.
%_ADD_OPS
    The percentage of the utterances where the speaker added something (addition) to what the previous speaker has said.
%_DEL_OPS
    The percentage of the utterances where the speaker deleted something (deletion) compared to what the previous speaker has said.
%_EXA_OPS
    The percentage of the utterances where the speaker exactly repeated something that the previous speaker has said.

5. If you study the output in the file “larachip1”, can you say who is mostly expanding the previous speaker’s utterances?
6. Check the CLAN manual on how to use the CHIP program in more ways.

10 COMBO-analyses
The program COMBO is used for many things. Among other things it can be used to analyse how words (or word pairs) occur together, or to find out which words never occur together!
1. Write the following command to find out if Lara uses the expression I have:

   combo +sI^have +t*chi lara*.cha

+sWORD
    indicates which word you want to search for.
WORD^WORD
    ^ is used between the two words you are interested in. You can also use wild cards to investigate whether the words occur together, but not directly next to each other. Then write: WORD^*^WORD

2. What is the result? Does Lara use this expression?
3. Maybe there are other occasions when the words I and have occur together but not next to each other. How would you write a command to investigate this?
4. Maybe Lara uses the words I and have together, but not necessarily in that order. To check whether two words occur together, independent of order, you can use +x. Write for instance:

   combo +sI^have +x +t*chi lara*.cha

5. Maybe Lara sometimes uses I without have. To find all the cases when a word occurs without another word, you should use ^!. Test for instance:

   combo +sI^!have +x +t*chi lara*.cha

6. Read more about COMBO in the CLAN manual!

11 Lexical diversity using VOCD
Lexical diversity measures how many different words there are in a text, and can thus tell you something about the lexical variation for a certain speaker/writer. This has often been measured using the so-called TTR measure (type/token ratio), which is found at the bottom of a frequency count in CLAN, but this measure is not so useful if you want to compare texts of different lengths. You can read more about this in the CLAN manual, in the VOCD section.
Using VOCD (Vocabulary Diversity) is really very simple. You should keep the following in mind:
• The text should have at least 50 words.
• The transcription should have been checked with CHECK, and passed without remarks.
• You can only run VOCD on one file at a time. (If you have several speakers in one and the same file, you will have to specify every speaker, and every speaker must have uttered at least 50 words.)
In this example we work with a data set from the Spencer project, found at:
Gu-material/Humanistlaboratoriet/Logopedlabb/CLAN-data/Spencer/Swedish
Then write the command:

   vocd wg01fAES*.cha

The program will run through the file and make some counts. In principle, VOCD calculates the value three times, and the mean of these three calculations is the final “D” value. At the very bottom of the output you will see the header D_optimum average, which gives you the result 57.81 (in this case).

VOCD RESULTS SUMMARY
====================
Command line: vocd Macintosh HD:Users:victoria:
   Documents:Spencerstudien:Alla:Allasamlade:wg01fAES.cha
File name: Macintosh HD:Users:victoria:
   Documents:Spencerstudien:Alla:Allasamlade:wg01fAES.cha
Types,Tokens,TTR: <72,122,0.590164>
D_optimum values: <57.46, 58.60, 57.36>
D_optimum average: 57.81

What ‘D’ (the value that VOCD calculates) really is remains a bit unclear (but read more under section 9.25 in the huge CLAN manual). You can calculate VOCD for different speakers and texts and compare them to each other. The value can be used to compare the vocabulary of different groups, but you should be careful about making cross-linguistic comparisons, for example, since differences in morphology can make the values differ considerably.

12 Lexical density
Lexical density, or information packaging, measures the percentage of content words (mainly the word classes nouns, verbs and adjectives) among all words in a text, i.e. it tells you how large a share of all the words in the text the nouns, adjectives and verbs make up. Often some lexical adverbs (‘fast’) are also included in this category of content words, for instance adverbs derived from adjectives (e.g. ‘slowly’).
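The definition above can be made concrete with a small calculation outside CLAN. The following is only a sketch in Python, under stated assumptions: the function name lexical_density, the example sentence and the short function-word list are all made up for illustration, and content words are simply taken to be every word that is not on the function-word list.

```python
def lexical_density(tokens, function_words):
    """Return the percentage of tokens that are content words.

    A token counts as a content word if it is NOT in the function-word
    list (an assumption made for this illustration).
    """
    if not tokens:
        raise ValueError("the text contains no words")
    function_set = {word.lower() for word in function_words}
    content = [tok for tok in tokens if tok.lower() not in function_set]
    return 100.0 * len(content) / len(tokens)


# Hypothetical example text and hypothetical function-word list.
tokens = "this is a picnic blanket and I will let them have it".split()
function_words = ["this", "is", "a", "and", "I", "will", "them", "it"]

density = lexical_density(tokens, function_words)
print(f"{density:.1f}% content words")  # 4 of 12 words -> prints "33.3% content words"
```

Which words belong on the function-word list is a decision you make for your own corpus; the list here is far too short for real use.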
If the words in the texts are not tagged for parts of speech, one may wonder how to calculate this. But while the content words form an open class, the function words are usually easy to find: pronouns, conjunctions and subjunctions, prepositions and the rest can be collected from a grammar book, for instance. So the procedure for calculating lexical density with CLAN is to make a list of all the function words (it does not have to contain all the function words in the language; listing the ones occurring in the corpus is enough for this purpose), and then to use a negative include file search to sort out all words that are not function words. You can for instance do like this:
1. Start by creating a new include file where you list all the function words you can think of (use grammar books, but also make a frequency count of the most common words in your corpus; there will most likely be a lot of function words among the top 50 words or so).
2. The include file is created by starting CLAN and selecting File and then New. In this new document you list all the words, one on every line. List pronouns, prepositions, con/subjunctions, the most common count words and interjections. List (if you have decided to do so) the most common adverbs: ‘not’, ‘where’, ‘how’, ‘actually’, etc.
3. Save the file, for instance with the name ‘functionwords.cut’, in the same folder where you have saved CLAN’s lib-file. And don’t forget to give the file name the extension ‘.cut’.
4. Then make a frequency count on your material using the file. In this frequency count you should exclude everything that you have listed in your function word file. Make sure to send the output to a file (otherwise it will most likely be so long that you will not be able to see the whole result). If you let the output be a word list where you see the words alphabetically with no frequency information, it will be easy for you to notice whether there are many function words you have forgotten (these should be picked out and put on your function word list). Repeat this procedure until you are sure that there are no function words left on the list. The command can look like this:

   freq +d1 [email protected] filenames.cha > functionoutput

5. Then open the file, which in this case is called ‘functionoutput’, and check if there are any function words on this list. If you find any, add them to your file functionwords.cut.
6. In the end you will have a complete list of all the function words that exist in your corpus. (If you later run that list against another corpus, you will probably have to add more function words that were not in your original corpus.)
7. Now, once you have a complete function word list, run it against your corpus, file by file, and note how many function words and content words every file contains (you get the content words if you run a negative frequency count that excludes the words on the list, and the function words if you run a positive frequency count that includes only the words on the list). Something like this to count all the content words:

   freq +d4 [email protected] filenames.cha > contentwordoutput

Or this to count all the function words:

   freq +d4 [email protected] filenames.cha > functionwordsoutput

8. Also, make sure to run “ordinary” frequency counts to find the total number of words in every file:

   freq +d4 filenames.cha > allwordsoutput

9. Put the results in an Excel file, for instance, and use the total number of words in the file and the number of content words to calculate what percentage of the words are content words.

12.1 Using COMBO to tag
COMBO can also be used to tag and code your file.
In this small exercise we will code all the tokens of the word ‘we’ that the mother utters in the small file “engtest.cha” (found in the exercise material in Humanistlaboratoriet/CreatingalinguisticcorpusVT13/CLAN/Exercises). Do like this:
1. Open a new window in CLAN (go to the File menu and select “New”).
2. Write the word “we” in the file.
3. Then choose File again, and select “Save as...”.
4. Save the file with the name “we.cut”. Preferably you save the file in the same folder as the depfile (e.g. in the CLAN program folder). If you don’t do this you will have to (later) set the Mor lib button in the Commands window to point at the folder where you have put your “we.cut” file.
5. Then open yet another new window in CLAN, by choosing File and selecting “New”.
6. In this file you should write exactly as below. Be careful with quotation marks, full stops, etc. You have to put a tab between the three groups of words (thus NO regular spaces here).

   "@we.cut"	"$we"	"%cod:"

7. This line will make the program look for the include file “we.cut” (i.e. the file you created previously), and then code everything that is listed in that file (in this case only the word ‘we’) using the code $we on a specific coding tier called %cod. The program will automatically generate this coding tier in the cases where the word ‘we’ occurs. In a minute we will see how it works.
8. Now, save this file with the name “taggawe.cut”. Save it in the same folder as the file “we.cut”.
9. Return to CLAN and open the Commands window. Redefine the Mor lib button so that it points at the folder where you have saved the files “we.cut” and “taggawe.cut”.
10. Then write this command:

   combo [email protected] +t*MAM +d4 engtest.cha

11. In this case the file name is “engtest.cha”, since we are working with that file.
12. The command says that you want to use COMBO to call for the include file “taggawe.cut” (you call for this file by the command +@s). Then you want to search all the utterances by the mother (+t*MAM). The flag +d4 says that if a match is found (i.e. if the mother says ‘we’), then CLAN should perform what is specified in the file “taggawe.cut” (i.e. write a coding tier %cod with the code $we).
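
To make the effect of the tagging step easier to picture, here is a rough Python sketch of what the exercise describes: after every utterance by one speaker that contains a given word, a %cod tier with a code is added. This is not CLAN and makes no claim about the exact layout of COMBO’s output; the function name tag_word and the three transcript lines are hypothetical, and the sketch assumes that each utterance fits on one line.

```python
def tag_word(lines, speaker, word, code):
    """Add a %cod tier after each main tier of `speaker` that contains `word`.

    Simplified illustration: one utterance per line, whole-word match,
    case-insensitive.
    """
    prefix = f"*{speaker}:"
    tagged = []
    for line in lines:
        tagged.append(line)
        if line.startswith(prefix):
            utterance_words = line[len(prefix):].lower().split()
            if word.lower() in utterance_words:
                tagged.append(f"%cod:\t{code}")
    return tagged


# Hypothetical mini-transcript, not taken from engtest.cha.
transcript = [
    "*MAM:\twe can play houses .",
    "*CHI:\twe can .",
    "*MAM:\tokay .",
]

for line in tag_word(transcript, "MAM", "we", "$we"):
    print(line)
```

In this sketch only the mother’s first utterance gets a %cod tier: the child’s ‘we’ is skipped because the search is restricted to the speaker MAM, and the mother’s second utterance does not contain the word.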