Computer-assisted Text Analysis: Transcription

How should an audio record be transcribed?
– Not usually phonetically ...
– nor include extraneous noise
– though coughs, pauses, hesitation etc. are of interest to the conversational analyst!

Before we look at how to transcribe, let’s look at two transcriptions of the same audio …

CATA- BLOCK 1 Pt2

Pre-Transcription

COMPARE “repaired” and more detailed transcription [3]

Sanitized:
Normally, after heavy rain, when you’re driving along the road, you see far away a series of stripes formed like a bow made out of seven colours (or, rather, it’s a series of colours because they are hard to separate)

Closer to original:
Normally after + very heavy rain + or something like that + and + you’re driving along the road + and + far away + you see + well + er + a series of stripes formed like a bow + an arch ++ very far away + er + seven colours but ++ I guess you hardly ever see seven it’s just a + a series of + colours which + they seem to be separate but if you try to look for the separate [kz] colours they always seem + very hard + to separate + if you see what I mean

• pauses, repetitions, incomplete sentences, fillers, slips of the tongue ... (redundancy ... managing turn-taking; “pre-delicate hitch”)
... but semantically the same?

[3] Brown & Yule (1983) Discourse Analysis, Cambridge UP, p 18

Transcription audio -> text

See supplementary files (Website):
– B1-RecordingandTranscribing.pdf
  – Useful info. from Calgary Uni.
– Raw-SanitSegregTranscript
  – Segment from Hierarchies task (POOC) in three transcript versions
    » LET’S LOOK!
– TranscribingSites
  – Web references for some sites giving advice, and “Do’s & Don’ts”
  – … the most important of which is:
  – http://www.esds.ac.uk/qualidata/create/transcription.asp
    Read this first: it gives general guidelines from QUALIDATA (ESRC Data Archive).

Computer-assisted Text Analysis: Vocabulary

Computer utilities/programs can be used whatever the form of textual analysis.
– We’ll use HAMLET … easy and useful for analysis.

FIRST THING: Vocabulary List
– Word (and/or word-sense) count
– “word” = every distinct set of letters separated by spaces (so ‘tree’ and ‘trees’ are distinct)
– “word-sense” includes all inflexions and variants of common root (am, art, is, are .. were .. been … i.e. BE+)
– A word reduced to its root + inflexions has been “lemmatised”

Computer-assisted Text Analysis: Zipf’s Law

The distribution of the frequency of words is:
– HIGHLY skew, with l-o-n-g tail:
  • A few words appear VERY frequently
  • A HUGE number appear very rarely
– The more frequent, the shorter (usually)
– Most common are …
  • The, I, you, and it ….
– The frequency of a word is inversely proportional to its statistical rank r

Computer-assisted Text Analysis: Zipf’s Law

[Figure: Frequency against Rank … Zipf’s Law]

Norms of English

Written/Spoken English
Would be useful to have sample of spoken and written English to act as yard-stick against which to assess word-frequency of one’s textual corpus.

There is ☺
– British National Corpus (Univ. Lancaster): a 100,000,000 word electronic databank sampled from the whole range of present-day English, spoken and written
– Rank-ordered and alphabetical frequency lists
  Includes discussions of a number of thematic frequency lists such as colour terms, female vs. male terms, etc.
  • http://www.comp.lancs.ac.uk/computing/research/ucrel/bncfreq/

Comparison against BNC

Example: Police-Officer#2 from EDDEP
• Interviewer components removed
• All 113 distinct words $ 1% in order of frequency in interview compared to Lancs data [file in Lancs-EDDEPC2-Freq.pdf]
• Remarkable how such a small file produces marked similarities with Lancs. Corpus
• Marked differences emboldened
  – Give flavour of significant material
  – Compare “Leftover List” in GI-III

Comparison: ED-PC2 with BNC
see Lancs-EDDEPC2-Freq file for longer list

Lancs Uni Spoken English
Word    Part    /p.m.
the     Det     39605
I       Pron    29448
you     Pron    25957
and     Conj    25210
it      Pron    24508
a       Det     18637
's      Verb    17677
to      Inf     14912
of      Prep    14550
that    DetP    14252

Edinburgh Dep. Police2
Word    freq    %
the     232     27.6
and     156     18.5
that    121     14.4
you     121     14.4
to      119     14.1
a       101     12
I       92      10.9
in      90      10.7
of      79      9.4
is      78      9.3
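
A vocabulary list of the kind described under "FIRST THING: Vocabulary List", together with the rank ordering that Zipf's Law refers to, can be sketched in a few lines of Python. This is only an illustration under stated assumptions and is not how HAMLET itself works: the function names vocabulary_list and zipf_table are hypothetical, and folding case and stripping ASCII punctuation from the edges of each token are choices made here, not taken from the slides. No lemmatisation is attempted, so 'tree' and 'trees' remain distinct words.

```python
from collections import Counter
import string


def vocabulary_list(text):
    """Count distinct words, a word being any run of characters between spaces.

    Assumptions (not from the slides): case is folded and ASCII punctuation at
    the edges of a token is stripped, so "Road," and "road" count as one word.
    Words are not lemmatised: 'tree' and 'trees' are counted separately.
    """
    counts = Counter()
    for token in text.split():
        word = token.strip(string.punctuation).lower()
        if word:  # skip tokens that were punctuation only
            counts[word] += 1
    return counts


def zipf_table(counts):
    """Return rows of (rank, word, freq, per_thousand, rank * freq).

    Rows are ordered from most to least frequent. Under Zipf's Law the
    product rank * freq would stay roughly constant across rows.
    """
    total = sum(counts.values())
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, 1000 * freq / total, rank * freq))
    return rows


if __name__ == "__main__":
    # Sample text: the start of the "sanitized" transcript quoted earlier.
    sample = ("Normally, after heavy rain, when you're driving along the road, "
              "you see far away a series of stripes formed like a bow made out "
              "of seven colours")
    for row in zipf_table(vocabulary_list(sample))[:5]:
        print(row)
```

A sample this short cannot show the Zipf pattern; the rank * freq column only becomes informative on a full transcript or corpus. The per-thousand figure would need to be multiplied by 1000 before setting it beside a per-million list such as the Lancaster one.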