On Use of Temporal Dynamics of Speech for Language Identification

Andre Adami
Petr Schwarz
Pavel Matejka
Hynek Hermansky
Anthropic Signal Processing Group
http://www.asp.ogi.edu
2003 NIST LID Evaluation
AGA 4/28/2003
OGI-4 – ASP System
• Goal
  – Convert the speech signal into a sequence of discrete sub-word units that can characterize the language
• Approach
  – Use temporal trajectories of speech parameters to obtain the sequence of units
  – Model the sequence of discrete sub-word units using an N-gram language model
• Sub-word units
  – TRAP-derived American English phonemes
  – Symbols derived from prosodic cues dynamics
  – Phonemes from OGI-LID
[Figure: system diagram. The speech signal is segmented into units; the unit sequence is scored against a target-language model (+) and a background model (-), and the two scores are combined into the final score.]
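The scoring scheme above (a unit sequence evaluated against a target-language model and a background model) can be sketched as a toy bigram language-model scorer. The unit inventory, the add-one smoothing, and the training sequences below are illustrative assumptions, not the system's actual models:

```python
from collections import Counter
from math import log

def train_ngrams(sequences, n=2):
    """Count n-grams and their (n-1)-gram contexts over unit sequences."""
    grams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] * (n - 1) + list(seq)
        for i in range(len(padded) - n + 1):
            gram = tuple(padded[i:i + n])
            grams[gram] += 1
            contexts[gram[:-1]] += 1
    return grams, contexts

def log_likelihood(seq, grams, contexts, vocab_size, n=2):
    """Add-one-smoothed log-likelihood of a unit sequence."""
    padded = ["<s>"] * (n - 1) + list(seq)
    ll = 0.0
    for i in range(len(padded) - n + 1):
        gram = tuple(padded[i:i + n])
        ll += log((grams[gram] + 1) / (contexts[gram[:-1]] + vocab_size))
    return ll

# Hypothetical unit sequences for a "target" language and a background pool.
target_train = [list("ababab"), list("abab")]
backgr_train = [list("bbaa"), list("baba"), list("aabb")]
tg, tc = train_ngrams(target_train)
bg, bc = train_ngrams(backgr_train)

test_seq = list("ababa")
V = 3  # unit inventory size, including <s>
# Final score: target log-likelihood minus background log-likelihood.
score = log_likelihood(test_seq, tg, tc, V) - log_likelihood(test_seq, bg, bc, V)
```

A sequence matching the target language's unit statistics yields a positive score; the real system applies the same idea with 3-gram models over the sub-word units.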
American English Phoneme Recognition
[Figure: conventional short-term analysis classifies a frequency-by-time spectral slice into a phone, while the temporal patterns paradigm classifies the phone from long time trajectories within each frequency band.]
• Phoneme set
  – 39 American English phonemes (CMU-like)
• Phoneme Recognizer
  – trained on NTIMIT
  – TRAP (Temporal Patterns) based
  – Speech segments for training obtained from energy-based speech/nonspeech segmentation
• Modeling
  – 3-gram language model
English Phoneme System
[Figure: TRAP architecture. Per-band classifiers (Band Classifier 1 … Band Classifier N) operate on time trajectories in each frequency band; a merger combines their outputs and a Viterbi search produces the phoneme sequence.]
• Temporal trajectories
  – 23 mel-scale frequency bands
  – 1 s segments of the log-energy trajectory
• Band classifiers
  – MLP (101x300x39)
  – Hidden unit nonlinearities: sigmoids
  – Output nonlinearities: softmax
• Merger
  – MLP (897x300x39)
• Viterbi search
  – Penalty factor tuned so that deletions = insertions
• Training
  – NTIMIT
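The shape flow of this architecture can be sketched with randomly initialized networks (untrained and purely illustrative): 23 band trajectories of 101 frames each pass through 101x300x39 band MLPs, and their 23 x 39 = 897 concatenated posteriors feed the 897x300x39 merger:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP: sigmoid hidden units, softmax outputs."""
    h = 1.0 / (1.0 + np.exp(-(x @ w1 + b1)))   # sigmoid hidden layer
    z = h @ w2 + b2
    e = np.exp(z - z.max())
    return e / e.sum()                          # softmax posteriors

n_bands, traj_len, n_hid, n_phones = 23, 101, 300, 39

# One randomly initialized 101x300x39 MLP per critical band (untrained sketch).
band_nets = [
    (rng.normal(0, 0.1, (traj_len, n_hid)), np.zeros(n_hid),
     rng.normal(0, 0.1, (n_hid, n_phones)), np.zeros(n_phones))
    for _ in range(n_bands)
]
# Merger MLP: 23 bands x 39 posteriors = 897 inputs.
merger = (rng.normal(0, 0.1, (n_bands * n_phones, n_hid)), np.zeros(n_hid),
          rng.normal(0, 0.1, (n_hid, n_phones)), np.zeros(n_phones))

# A 1 s log-energy trajectory (101 frames at 10 ms) in each of the 23 bands.
trajectories = rng.normal(size=(n_bands, traj_len))
band_posteriors = np.concatenate(
    [mlp(t, *net) for t, net in zip(trajectories, band_nets)])
phone_posteriors = mlp(band_posteriors, *merger)
```

The 101-frame input corresponds to a 1 s trajectory at a 10 ms frame rate; per-frame phone posteriors from the merger would then drive the Viterbi search.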
Prosodic Cues Dynamics
• Technique
  – Using prosodic cues (intensity and pitch trajectories) to derive the sub-word units
• Approach
  – Segment the speech signal at the extrema of the trajectories (zero-crossings of the derivative) and at the onsets and offsets of voicing
  – Label the segment by the direction of change of the parameter within the segment
Class   Temporal Trajectory Description
1       rising f0 and rising energy
2       rising f0 and falling energy
3       falling f0 and rising energy
4       falling f0 and falling energy
5       unvoiced segment
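A minimal sketch of this segmentation and labeling, assuming f0 = 0 marks unvoiced frames and using the class numbering from the table above:

```python
import numpy as np

def prosodic_tokens(f0, energy):
    """Segment at voicing on/offsets and at extrema (zero-crossings of the
    derivative) of the f0 and energy trajectories, then label each segment
    1-5 as in the table above.  Frames with f0 == 0 are treated as unvoiced
    (an assumption of this sketch)."""
    voiced = f0 > 0
    bounds = {0, len(f0)}
    bounds |= {i for i in range(1, len(f0)) if voiced[i] != voiced[i - 1]}
    for track in (f0, energy):
        d = np.diff(track)
        # zero-crossings of the derivative = local extrema of the trajectory
        bounds |= {i + 1 for i in range(len(d) - 1) if d[i] * d[i + 1] < 0}
    edges = sorted(bounds)
    tokens = []
    for a, b in zip(edges[:-1], edges[1:]):
        if not voiced[a:b].any():
            tokens.append(5)                      # unvoiced segment
        else:
            f0_up = f0[b - 1] >= f0[a]            # direction of change
            en_up = energy[b - 1] >= energy[a]
            tokens.append({(True, True): 1, (True, False): 2,
                           (False, True): 3, (False, False): 4}[(f0_up, en_up)])
    return tokens

# Toy trajectories: silence, rising f0/energy, falling f0/energy, silence.
f0 = np.array([0.0, 0.0, 100, 110, 120, 115, 105, 0.0, 0.0])
en = np.array([0.1, 0.1, 0.5, 0.6, 0.7, 0.6, 0.5, 0.1, 0.1])
toks = prosodic_tokens(f0, en)   # [5, 1, 4, 5]
```

The resulting token stream is then modeled with an N-gram language model, exactly like the phoneme strings.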
Prosodic Cues Dynamics
• Duration
  – The duration of the segment is characterized as “short” (less than 8 frames) or “long”
  – 10 symbols
• Broad-phonetic-category (BFC)
  – Finer labeling achieved by estimating the broad-phonetic category (vowel+diphthong+glide, schwa, stop, fricative, flap, nasal, and silence) coinciding with each prosodic segment
  – BFC TRAPs trained on NTIMIT are used for deriving the broad phonetic categories
  – 61 symbols
• 3-gram language model
• BFC TRAPs Setup
  – Input temporal vectors
    • 15 bark-scale frequency band energies
    • 1 s segments of the log-energy trajectory
    • Mean and variance normalized
    • Dimension reduction: DCT
  – Band classifiers
    • MLP (15x100x7)
    • Hidden units: sigmoid
    • Output units: softmax
  – Merger
    • MLP (105x100x7)
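The input-vector preparation (mean and variance normalization of a trajectory, then DCT dimension reduction to the 15 coefficients the band classifiers consume) might look like this; the 101-frame toy trajectory is an assumption:

```python
import numpy as np
from scipy.fft import dct

def trap_input(trajectory, n_coeffs=15):
    """Mean/variance-normalize a 1 s log-energy trajectory and reduce its
    dimension with a DCT, keeping the first n_coeffs coefficients
    (15 here, matching the 15x100x7 band classifiers)."""
    t = np.asarray(trajectory, dtype=float)
    t = (t - t.mean()) / (t.std() + 1e-12)   # mean/variance normalization
    return dct(t, norm="ortho")[:n_coeffs]   # DCT dimension reduction

traj = np.sin(np.linspace(0, 3 * np.pi, 101))  # toy 101-frame trajectory
vec = trap_input(traj)
```

After normalization the zeroth DCT coefficient (proportional to the mean) is essentially zero, so the retained coefficients describe the trajectory's shape rather than its level.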
Minimum DCF
OGI-4 – ASP System

[Figure: minimum DCF for the 3 s, 10 s, and 30 s conditions. EERs at 30 s: OGI-LID 41.4%, TRAP-derived phonemes 19.3%, prosody cues dynamics 32.1%, fusion 17.8%.]
OGI-4 – ASP System

[Figure: overall evaluation results for the fused system; EER30s = 17.8%]
Post-Evaluation – Phoneme System
• Speech/nonspeech segmentation using silence classes from TRAP-based classification
• TRAPs classifier
  – Temporal trajectory duration: 400 ms
  – 3 bands as the input trajectory for each band classifier, to exploit the correlation between adjacent bands
    • The trajectories of the 3 bands are projected onto a DCT basis (20 coefficients)
  – Viterbi search tuned for language identification
• Training data
  – CallFriend training and development sets
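One reading of the 3-band DCT projection, sketched with an explicitly constructed orthonormal DCT basis; the 10 ms frame rate (41 frames for 400 ms) and the per-band packing of the 20 coefficients are assumptions of this sketch:

```python
import numpy as np

def dct_basis(n_frames, n_coeffs):
    """Orthonormal DCT-II basis: each row is one cosine basis vector."""
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(n_frames)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_frames))
    basis[0] *= np.sqrt(1 / n_frames)
    basis[1:] *= np.sqrt(2 / n_frames)
    return basis

# 400 ms at a 10 ms frame rate -> 41 frames (frame rate is an assumption).
n_frames, n_coeffs = 41, 20
B = dct_basis(n_frames, n_coeffs)

# Trajectories of 3 adjacent bands, each projected onto the 20-vector basis
# and concatenated into one classifier input (one reading of the slide; the
# exact packing is not specified).
bands = np.random.default_rng(1).normal(size=(3, n_frames))
features = (bands @ B.T).reshape(-1)
```

Projecting onto a truncated DCT basis both shortens the input vector and smooths the trajectory, since the discarded high-order cosines carry the fastest fluctuations.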
Post-Evaluation – Phoneme System
[Figure: post-evaluation phoneme-system results; EER30s = 12.7%, a 34% relative improvement]
Post-Evaluation – Prosodic Cues System
• No energy-based segmentation
  – Unvoiced segments longer than 2 seconds are considered non-speech
• No broad-phonetic-category labeling applied
  – Rate of change plus the quantized duration (10 tokens)
• Training data
  – CallFriend training and development sets
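A sketch of the resulting tokenization: 5 segment classes times a short/long duration bit gives the 10 tokens, and unvoiced segments over 2 s are dropped as non-speech. The frame length and the token numbering are assumptions of this sketch:

```python
def to_tokens(segments, frame_ms=10, short_max_frames=8, nonspeech_s=2.0):
    """Map (class, n_frames) prosodic segments to 10 tokens:
    token = 2*(class-1) + (0 if short else 1), with classes 1-5 as in the
    earlier table and 'short' meaning fewer than 8 frames.  Unvoiced
    segments (class 5) longer than 2 s are treated as non-speech and
    dropped.  Frame length and token numbering are assumptions."""
    tokens = []
    for cls, n_frames in segments:
        if cls == 5 and n_frames * frame_ms / 1000.0 > nonspeech_s:
            continue                               # non-speech: discard
        long_flag = 0 if n_frames < short_max_frames else 1
        tokens.append(2 * (cls - 1) + long_flag)
    return tokens

# (class, frames): a 3 s pause, a short class-1, a long class-4, a short pause.
segs = [(5, 300), (1, 5), (4, 12), (5, 6)]
toks = to_tokens(segs)   # [0, 7, 8]
```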
Post-Evaluation – Prosodic Cues System

[Figure: post-evaluation prosodic-cues results; EER30s = 22.2%, a 30% relative improvement]
Fusion - 30 sec condition
• Fusing the scores from the prosodic cues system
  – with TRAP-derived phonemes: EER30s = 10.5% (17% relative improvement)
  – with OGI-LID derived phonemes: EER30s = 6.6% (14% relative improvement)
• TRAP-derived phoneme system fused with OGI-LID
  – EER30s = 6.2% (19% relative improvement)
• All three systems fused
  – EER30s = 5.7% (26% relative improvement)
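The fusion step can be illustrated as a simple weighted sum of per-language scores from the individual systems; the weights and scores below are hypothetical, and the slide does not specify the actual fusion backend:

```python
def fuse(scores, weights):
    """Linear score fusion: weighted sum of each system's per-language
    scores, picking the best-scoring language.  Weights are illustrative."""
    fused = {}
    for lang in scores[0]:
        fused[lang] = sum(w * s[lang] for w, s in zip(weights, scores))
    return max(fused, key=fused.get), fused

# Hypothetical per-language scores from the three systems.
trap_phones = {"english": 2.1, "spanish": -0.3}
ogi_lid     = {"english": 1.4, "spanish": 0.2}
prosody     = {"english": 0.6, "spanish": 0.5}
best, fused = fuse([trap_phones, ogi_lid, prosody], [0.5, 0.3, 0.2])
```

Because the three systems exploit different cues (phonotactics vs. prosodic dynamics), their errors are partly uncorrelated, which is why even a linear combination improves over the best single system.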
Conclusions
• Sequences of discrete symbols derived from speech dynamics provide useful information for characterizing the language
• Two techniques for deriving the sequences of symbols were investigated
  – segmentation and labeling based on prosodic cues
  – segmentation and labeling based on TRAP-derived phonetic labels
• The introduced techniques combine well with each other as well as with more conventional language ID techniques