The Reliability of Formant Measurements
in High Quality Audio Data: The Effect of
Agreeing Measurement Procedures
Martin Duckworth, Kirsty McDougall,
Gea de Jong, Linda Shockey
Introduction
• Formant measurement is implicitly required
by law in UK speaker comparison
cases
• Measurements on analogue spectrograms
had to be made by hand and eye
• Measurements on digital spectrograms
can be assisted by formant trackers; LPC
is common
Introduction
• How replicable are measurements by eye
on digital spectrograms?
• If LPC tracking is used, what can lead to
variability?
− Software settings
− Point at which data is extracted
Study Aims
• What is required in order to make
measurements more replicable?
• If software (but not method) is held
constant and data is high quality, can
different laboratories make the same F1-3
measurements?
• If the method of analysis is the same, does
this lead to statistically improved reliability
between laboratories?
Aims continued
• We are aiming to find a reliable means of
obtaining formant values
• We are examining reliability, not validity
Data
• read speech from Cambridge DyViS
database
• male
• Standard Southern British English
• aged 18-25
• 40 speakers: Set 1 (20 speakers)
Set 2 (20 speakers)
Data
• 6 monophthongs: /iː, æ, ɑː, ɔː, ʊ, uː/
• 6 repetitions per vowel per speaker
• elicited in hVd contexts in sentences:
It’s a warning we’d better HEED today.
It’s only one loaf, but it’s all Peter HAD today.
We worked rather HARD today.
We built up quite a HOARD today.
He insisted on wearing a HOOD today.
He hates contracting words, but he said a WHO’D today.
Measurements
• Analysts from 3 labs – Cambridge,
Plymouth, Reading
• Task: to measure F1, F2, F3 for each
vowel token using Praat
• Set 1 – using individual, but constrained, methods
• Set 2 – after a meeting at which a single
method is agreed
Set 1 Methods
• Measure the formants at a relatively early
point in the vowel
• Measure formants over no more than 5
glottal pulses
• Use either:
− LPC tracking checked against the
spectrogram or
− hand/eye measures
Set 2 Method
• Measure towards the start of the vowel
• Measure in a relatively steady early part of
the vowel
• Measure around the vowel's maximum
intensity
• Use a single time slice
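One way to read the criteria above as a procedure, offered as a minimal sketch and not as the method the labs actually used: restrict the search to an early portion of the vowel and take the single analysis frame with the highest intensity. The function name `pick_time_slice`, the `early_fraction` parameter and its 0.5 default are illustrative assumptions; the slides do not define "early" numerically, and the judgement of steadiness was made by the analysts and is not modelled here.

```python
def pick_time_slice(times, intensities, vowel_start, vowel_end, early_fraction=0.5):
    """Return the time (s) of the highest-intensity frame in the early part of a vowel.

    times, intensities: parallel sequences of frame times (s) and intensities (dB).
    early_fraction: hypothetical cut-off; the slides do not specify a value.
    """
    if len(times) != len(intensities):
        raise ValueError("times and intensities must have the same length")
    early_end = vowel_start + early_fraction * (vowel_end - vowel_start)
    # Keep only frames inside the early part of the vowel.
    candidates = [(level, t) for t, level in zip(times, intensities)
                  if vowel_start <= t <= early_end]
    if not candidates:
        raise ValueError("no frames fall in the early part of the vowel")
    # max() on (intensity, time) pairs picks the loudest frame;
    # ties resolve to the later of the tied frames.
    _, best_time = max(candidates)
    return best_time
```

Recording the returned time alongside each measurement makes it possible for another analyst to revisit the same slice.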
Set 2 Method (continued)
• Use the LPC formant tracker adjusted for
best visual fit
• When values generated by Praat are
judged by visual inspection to be incorrect,
replace them by correct values from a
time-slice immediately preceding or
following the slice being measured.
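The replacement rule above can be sketched as follows. This is an illustration under stated assumptions, not the labs' tooling: the analyst's visual judgement is stood in for by a hypothetical callable `is_incorrect`, and trying the preceding slice before the following one is an arbitrary choice, since the slides allow either.

```python
def measure_with_fallback(track, index, is_incorrect):
    """Return (value_hz, slice_index_used) for one formant at one time slice.

    track: formant values (Hz), one per time slice.
    is_incorrect: callable(slice_index) -> bool standing in for the analyst's
        visual judgement (hypothetical; the real judgement is made by eye).
    """
    if not is_incorrect(index):
        return track[index], index
    # Try the immediately preceding slice, then the immediately following one.
    for neighbour in (index - 1, index + 1):
        if 0 <= neighbour < len(track) and not is_incorrect(neighbour):
            return track[neighbour], neighbour
    raise ValueError("no acceptable value in the adjacent slices")
```

Returning the index that was actually used keeps a trace of every substitution, which matters if the measurement is to be replicated later.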
Results: HAD, F1
[Figure: F1 of HAD as measured by Lab1, Lab2 and Lab3, for Set 1 and Set 2]
Statistical Analysis
• 3 formants × 6 vowels × 2 datasets
= 36 tests
• Two-way ANOVA
- repeated measures on the factor Lab (3)
- between-groups factor Speaker (20)
• If Lab significant at p < 0.05:
Pairwise comparisons with Sidak
correction
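For reference, the arithmetic behind the test count and the Šidák correction can be sketched as below. The function names are illustrative, and the slides do not say whether the statistics package reported adjusted p-values or applied an adjusted threshold, so both forms are shown.

```python
def sidak_adjust(p_values):
    """Sidak-adjusted p-values for a family of m comparisons: 1 - (1 - p)**m."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


def sidak_alpha(alpha, m):
    """Per-comparison threshold that keeps the family-wise error rate at alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


n_tests = 3 * 6 * 2          # formants x vowels x datasets
n_pairs = 3 * (3 - 1) // 2   # Lab1 vs Lab2, Lab1 vs Lab3, Lab2 vs Lab3
```

With three labs there are three pairwise comparisons per test, so an unadjusted p-value must fall below roughly 0.017 (`sidak_alpha(0.05, 3)`) to count as significant at a family-wise level of 0.05.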
Results: HAD, F1
[Figure: F1 of HAD as measured by Lab1, Lab2 and Lab3, for Set 1 and Set 2, annotated with test results]
• Set 1 – Lab: significant (pairwise comparisons: p = 0.001, 0.000, 0.000)
• Set 2 – Lab: significant but pairwise comparisons NS
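Results: HAD, F2
[Figure: F2 of HAD as measured by Lab1, Lab2 and Lab3, for Set 1 and Set 2, annotated with test results]
• Set 1 – Lab: not significant (pairwise comparisons NS)
• Set 2 – Lab: not significant (pairwise comparisons NS)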
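Results: HAD, F3
[Figure: F3 of HAD as measured by Lab1, Lab2 and Lab3, for Set 1 and Set 2, annotated with test results]
• Set 1 – Lab: significant (pairwise comparisons: one NS, two at p = 0.000)
• Set 2 – Lab: not significant (pairwise comparisons NS)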
Summary - HAD
          Set 1              Set 2
          F1    F2    F3     F1    F2    F3
Lab       sig   NS    sig    sig   NS    NS     (main effect)
1 vs 2    sig   NS    NS     NS    NS    NS     (pairwise comparisons)
1 vs 3    sig   NS    sig    NS    NS    NS
2 vs 3    sig   NS    sig    NS    NS    NS
• Improvement from Set 1 to Set 2: no pairwise comparison is significant in Set 2
• Set 2: good news
Effect of Lab - 6 vowels
          Set 1              Set 2
          F1    F2    F3     F1    F2    F3
heed      sig   NS    sig    sig   NS    sig
had       sig   NS    sig    sig   NS    NS
hard      sig   sig   sig    NS    sig   sig
hoard     sig   sig   sig    sig   sig   NS
who’d     sig   sig   NS     sig   sig   sig
hood      sig   sig   sig    NS    sig   NS
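As a quick tally of the Lab main-effect results above, the following sketch counts how many of the 18 tests per dataset were significant. The outcomes are transcribed from the slide; the names `lab_effect` and `count_significant` are illustrative.

```python
# Lab main-effect outcomes transcribed from the results above
# (order: F1, F2, F3); True = significant, False = NS.
SIG, NS = True, False

lab_effect = {
    "Set 1": {
        "heed": (SIG, NS, SIG), "had": (SIG, NS, SIG), "hard": (SIG, SIG, SIG),
        "hoard": (SIG, SIG, SIG), "who'd": (SIG, SIG, NS), "hood": (SIG, SIG, SIG),
    },
    "Set 2": {
        "heed": (SIG, NS, SIG), "had": (SIG, NS, NS), "hard": (NS, SIG, SIG),
        "hoard": (SIG, SIG, NS), "who'd": (SIG, SIG, SIG), "hood": (NS, SIG, NS),
    },
}


def count_significant(results):
    """Number of vowel-by-formant tests with a significant Lab effect."""
    return sum(sum(row) for row in results.values())


set1_sig = count_significant(lab_effect["Set 1"])  # 15 of 18
set2_sig = count_significant(lab_effect["Set 2"])  # 11 of 18
```

The count of significant Lab effects drops from 15 to 11 of 18 once the method is agreed, although a significant main effect does not by itself mean that any pair of labs differs, as the HAD F1 result for Set 2 shows.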
Influence of Speaker
• Lab × Speaker interaction significant
(p < 0.05) for F1-F3 of all 6 vowels
for both Set 1 and Set 2
→ certain speakers lead to measurement
differences among labs
for example…
F3 of HARD (Set 2)
means by speaker
[Figure: mean F3 of HARD by speaker for each lab, Set 2]
• Agreement across labs in most cases,
but certain individuals lead to
measurement differences among labs
Difficult cases: subject 42 F3
[Spectrograms of three HARD tokens from Subject 42]
• Subject 42 HARD4: F3 = 2219 Hz
• Subject 42 HARD2: F3 = 2579 Hz
• Subject 42 HARD6: F3 = 3325 Hz
Difficult cases: subject 43 F3
Visual inspection vs formant tracker
[Spectrograms: Subject 43 HARD1 (F3?) and HARD2 (F3?), each shown for visual inspection and with the tracker overlaid]
The effect of intraspeaker
variability, possibly voice quality
• This can affect:
− The visibility of formants
− The functioning of the LPC tracker
for example…
The effect of intraspeaker
variability
[Spectrograms of "..had today.": Subject 37 HAD1 (F1 = ??) and Subject 37 HAD6 (F1)]
Discussion: Laboratory Effects
• Do different laboratories produce different
formant values? YES
• Does replicating the measurement method
reduce these differences? YES
• Could these be reduced further? YES
Other sources of variability
• Settings (e.g. number of poles; number of
formants in Praat)
• The exact point in the vowel at which the
measure is taken
• The ‘readability’ of the spectrogram, which
can be affected by speaker characteristics
Conclusion
• Developing standard ways of collecting
formant values could assist comparisons
between experts in casework
• If records are kept of time points,
software and settings, then the
measurement process can be replicated
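A record of the kind suggested above could look like the following minimal sketch. The class name `FormantRecord`, its field names and the JSON serialisation are assumptions made for illustration; the slides do not prescribe a record format, and the example values are placeholders, not data from the study.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class FormantRecord:
    """One measurement plus what is needed to repeat it (hypothetical layout)."""
    speaker: str
    token: str
    time_point_s: float          # time slice at which the formants were read
    software: str                # analysis program and version, as free text
    settings: dict = field(default_factory=dict)     # analysis settings used
    formants_hz: dict = field(default_factory=dict)  # e.g. {"F1": ..., "F2": ...}


def to_json(record):
    """Serialise a record so it can be stored alongside the case notes."""
    return json.dumps(asdict(record), sort_keys=True)


def from_json(text):
    """Rebuild a record from its JSON form."""
    return FormantRecord(**json.loads(text))


# Placeholder values for illustration only.
example = FormantRecord(
    speaker="S00", token="HARD1", time_point_s=0.25, software="Praat",
    settings={"number_of_formants": 5}, formants_hz={"F1": 700.0},
)
```

Storing the time point and settings with each value means a second analyst can return to the same slice with the same configuration instead of re-deriving the measurement from scratch.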
Acknowledgements
• IAFPA Research Grant for travel expenses
• Economic and Social Research
Council UK for funding the DyViS
Project ‘Dynamic Variability in
Speech: A Forensic Phonetic
Study of British English’ [RES-000-23-1248]
• Other members of the DyViS project –
Francis Nolan and Toby Hudson