Computer-assisted Text Analysis: Transcription

How should an audio record be transcribed?
– Not usually phonetically ...
– nor including extraneous noise
– though coughs, pauses, hesitations etc. are of interest
to the conversation analyst!
Before we look at how to transcribe,
let’s look at two transcriptions of
the same audio …
CATA- BLOCK 1 Pt2
Pre-Transcription
COMPARE the “repaired” and the more detailed transcription [3]
Sanitized:
Normally, after heavy rain, when you’re driving along the road, you see far away
a series of stripes formed like a bow made out of seven colours (or, rather,
it’s a series of colours because they are hard to separate)
Closer to original:
Normally after + very heavy rain + or something like that + and + you’re driving
along the road + and + far away + you see + well + er + a series of stripes
formed like a bow+an arch ++ very far away+ er + seven colours but ++ I
guess you hardly ever see seven it’s just a + a series of + colours which
+they seem to be separate but if you try to look for the separate [kz] colours
they always seem + very hard+ to separate + if you see what I mean
• pauses, repetitions, incomplete sentences, fillers, slips of the tongue
... (redundancy ... managing turn-taking; “pre-delicate hitch”)
... but semantically the same?
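
As a rough illustration of how the "closer to original" version relates to the sanitized one, the sketch below strips the "+" pause markers and a few fillers from a fragment of the transcript above. The filler list and the function name are assumptions made for this example only; real sanitizing also involves human judgement (repairing sentences, dropping repetitions) that this does not attempt.

```python
# Hypothetical filler list, chosen for this example only.
FILLERS = {"er", "erm", "um", "well"}

def strip_pauses_and_fillers(raw):
    """Remove '+' pause markers and listed fillers, then tidy the spacing."""
    without_pauses = raw.replace("+", " ")
    kept = [w for w in without_pauses.split() if w.lower() not in FILLERS]
    return " ".join(kept)

raw = "far away + you see + well + er + a series of stripes"
print(strip_pauses_and_fillers(raw))
# prints: far away you see a series of stripes
```

Note that treating "well" as a filler is itself an analytic decision: a conversation analyst might want to keep it.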
[3] Brown & Yule (1983) Discourse Analysis, Cambridge UP, p. 18
Transcription: audio -> text
See supplementary files (Website):
– B1-RecordingandTranscribing.pdf
  – Useful info from Calgary Uni.
– Raw-SanitSegregTranscript
  – Segment from Hierarchies task (POOC) in three transcript versions
  » LET’S LOOK!
– TranscribingSites
  – Web references for some sites giving advice, and “Dos & Don’ts”
  – … the most important of which is:
  – http://www.esds.ac.uk/qualidata/create/transcription.asp
    Read this first: it gives general guidelines from
    QUALIDATA (ESRC Data Archive).
Computer-assisted Text Analysis: Vocabulary
Computer utilities/programs can be used whatever
the form of textual analysis.
– We’ll use HAMLET … easy and useful for analysis
FIRST THING: Vocabulary List
– Word (and/or word-sense) count
– “word” = every distinct set of letters separated by spaces
(so ‘tree’ and ‘trees’ are distinct)
– “word-sense” includes all inflexions and variants of a
common root (am, art, is, are ... were ... been ..., i.e. BE+)
– A word grouped with its inflexions under its root has been
“lemmatised”
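
The difference between a word count and a word-sense count can be sketched in a few lines of Python. This is not how HAMLET works internally; the tiny lemma table, the function names, and the decision to lower-case the text are all assumptions made for illustration.

```python
from collections import Counter

# Tiny hand-made lemma table, purely illustrative.
LEMMAS = {
    "am": "be", "art": "be", "is": "be", "are": "be",
    "were": "be", "been": "be", "trees": "tree",
}

def tokenize(text):
    """A 'word' is any run of characters separated by spaces (lower-cased here)."""
    return text.lower().split()

def word_counts(text):
    """Count every distinct word form separately ('tree' and 'trees' differ)."""
    return Counter(tokenize(text))

def lemma_counts(text):
    """Count word-senses: each form is first mapped to its root, if listed."""
    return Counter(LEMMAS.get(w, w) for w in tokenize(text))

sample = "the tree is tall and the trees are tall"
print(word_counts(sample))
print(lemma_counts(sample))
```

In the sample, 'tree' and 'trees' are counted once each as words, but twice together as the word-sense 'tree'; likewise 'is' and 'are' merge under 'be'.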
Computer-assisted Text Analysis: Zipf’s Law
The distribution of word frequencies is:
– HIGHLY skewed, with a l-o-n-g tail:
• A few words appear VERY frequently
• A HUGE number appear very rarely
– The more frequent a word, the shorter it is (usually)
– Most common are …
• the, I, you, and, it …
– The frequency of a word is
inversely proportional to its
statistical rank r
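
The rank-frequency relation stated above, roughly f(r) ≈ C / r, can be checked on any text with a short sketch like the following. The function names and the toy token list are assumptions for this example, and taking C to be the count of the top-ranked word is a simplification; real corpora follow the law only approximately.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count) rows, most frequent first (ties alphabetical)."""
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]

def zipf_prediction(top_count, rank):
    """Count expected at a given rank if frequency were exactly top_count / rank."""
    return top_count / rank

# Toy data, built so that it happens to fit the law closely.
tokens = "a a a a a a b b b c c d".split()
rows = rank_frequency(tokens)
for rank, word, count in rows:
    print(rank, word, count, zipf_prediction(rows[0][2], rank))
```

Comparing the observed count with the predicted one at each rank shows how closely a given text follows the law.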
Computer-assisted Text Analysis: Zipf’s Law
[Figure: frequency plotted against rank (Zipf’s Law)]
Norms of Written/Spoken English
It would be useful to have a sample of spoken and written
English to act as a yardstick against which to assess
the word frequencies of one’s textual corpus
There is ☺
– British National Corpus (Univ. Lancaster): a 100,000,000-word
electronic databank sampled from the whole range
of present-day English, spoken and written
– Rank-ordered and alphabetical frequency lists. Includes
discussions of a number of thematic frequency lists such as
colour terms, female vs. male terms, etc.
• http://www.comp.lancs.ac.uk/computing/research/ucrel/bncfreq/
Comparison against BNC
Example: Police-Officer#2 from EDDEP
• Interviewer components removed
• All 113 distinct words ≥ 1%, in order of frequency
in the interview, compared to Lancs data [file in
Lancs-EDDEPC2-Freq.pdf]
• Remarkable how such a small file produces
marked similarities with the Lancs corpus
• Marked differences emboldened
– Give flavour of significant material
– Compare “Leftover List” in GI-III
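
A comparison of this kind can be sketched as follows: convert each word's count in one's own text to a rate per million words, divide by the reference rate, and flag large departures. The two reference rates are the Lancs spoken-English figures for 'the' and 'that'; the sample counts, the total of 10,000 tokens, the function names, and the factor-of-two threshold are all hypothetical and are not the procedure used to produce the emboldened differences.

```python
def per_million(count, total_tokens):
    """Convert a raw count into a rate per million words."""
    return count * 1_000_000 / total_tokens

def compare_to_norm(sample_counts, total_tokens, norm_per_million,
                    ratio_threshold=2.0):
    """Return (word, rate per million, ratio to norm, marked?) rows.

    A word is 'marked' when its rate is at least ratio_threshold times
    higher or lower than the norm (the threshold is an arbitrary choice).
    """
    rows = []
    for word, count in sample_counts.items():
        if word not in norm_per_million:
            continue  # no reference value to compare against
        observed = per_million(count, total_tokens)
        ratio = observed / norm_per_million[word]
        marked = ratio >= ratio_threshold or ratio <= 1 / ratio_threshold
        rows.append((word, round(observed), round(ratio, 2), marked))
    return rows

norm = {"the": 39605, "that": 14252}   # Lancs spoken English, per million
counts = {"the": 400, "that": 300}     # hypothetical sample counts
for row in compare_to_norm(counts, 10_000, norm):
    print(row)
```

With these made-up counts, 'the' sits close to the norm while 'that' is flagged as occurring about twice as often as expected.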
Comparison: ED-PC2 with BNC
see Lancs-EDDEPC2-Freq file for longer list
Lancs Uni Spoken English        Edinburgh Dep. Police2
Word    Part    /p.m.           Word    freq    %
the     Det     39605           the     232     27.6
I       Pron    29448           and     156     18.5
you     Pron    25957           that    121     14.4
and     Conj    25210           you     121     14.4
it      Pron    24508           to      119     14.1
a       Det     18637           a       101     12
's      Verb    17677           I       92      10.9
to      Inf     14912           in      90      10.7
of      Prep    14550           of      79      9.4
that    DetP    14252           is      78      9.3