On Use of Temporal Dynamics of Speech for Language Identification
Andre Adami, Petr Schwarz, Pavel Matejka, Hynek Hermansky
Anthropic Signal Processing Group, http://www.asp.ogi.edu
2003 NIST LID Evaluation, AGA, 4/28/2003

OGI-4 – ASP System
• Goal
– Convert the speech signal into a sequence of discrete sub-word units that can characterize the language
• Approach
– Use temporal trajectories of speech parameters to obtain the sequence of units
– Model the sequence of discrete sub-word units with an N-gram language model
• Sub-word units
– TRAP-derived American English phonemes
– Symbols derived from prosodic-cue dynamics
– Phonemes from OGI-LID
[System diagram: speech signal → segmentation into units → target language model (+) and background model (−) → score]

American English Phoneme Recognition
[Diagram: short-term time–frequency analysis feeding a temporal-patterns (TRAP) phone classifier]
• Phoneme set
– 39 American English phonemes (CMU-like), trained on NTIMIT
• Phoneme recognizer
– TRAP (Temporal Patterns) based
– Speech segments for training obtained from energy-based speech/non-speech segmentation
• Modeling
– 3-gram language model

English Phoneme System
[Diagram: Band Classifier 1 … Band Classifier N over the time–frequency plane → Merger → Viterbi search]
• Temporal trajectories
– 23 mel-scale frequency bands
– 1 s segments of the log-energy trajectory
• Band classifiers
– MLP (101x300x39)
– Hidden-unit nonlinearities: sigmoids
– Output nonlinearities: softmax
• Merger
– MLP (897x300x39)
• Viterbi search
– Penalty factor tuned so that deletions = insertions
• Training
– NTIMIT

Prosodic Cues Dynamics
• Technique
– Use prosodic cues (intensity and pitch trajectories) to derive the sub-word units
• Approach
– Segment the speech signal at the inflection points of the trajectories (zero-crossings of the derivative) and at the onsets and offsets of voicing
– Label each segment by the direction of change of the parameter
within the segment:

Class  Temporal trajectory
1      rising f0 and rising energy
2      rising f0 and falling energy
3      falling f0 and rising energy
4      falling f0 and falling energy
5      unvoiced segment

Prosodic Cues Dynamics (cont.)
• Duration
– The duration of each segment is characterized as "short" (less than 8 frames) or "long"
– Yields 10 symbols
• Broad-phonetic category (BFC)
– Finer labeling achieved by estimating the broad-phonetic category (vowel+diphthong+glide, schwa, stop, fricative, flap, nasal, and silence) coinciding with each prosodic segment
– BFC TRAPs trained on NTIMIT are used for deriving the broad phonetic categories
– Yields 61 symbols
• 3-gram language model
• BFC TRAPs setup
– Input temporal vectors: 15 bark-scale frequency-band energies, 1 s segments of the log-energy trajectory, mean- and variance-normalized, dimension reduction by DCT
– Band classifiers: MLP (15x100x7), sigmoid hidden units, softmax outputs
– Merger: MLP (105x100x7)

OGI-4 – ASP System
[Bar chart: minimum DCF for the 3 s, 10 s, and 30 s conditions]
– OGI-LID: EER30s = 41.4%
– TRAPs phonemes: EER30s = 19.3%
– Prosodic cues dynamics: EER30s = 32.1%
– Fusion: EER30s = 17.8%

OGI-4 – ASP System
[DET curves for the fused system] EER30s = 17.8%

Post-Evaluation – Phoneme System
• Speech/non-speech segmentation using silence classes from the TRAP-based classification
• TRAPs classifier
– Temporal-trajectory duration: 400 ms
– 3 bands as the input trajectory for each band classifier, to exploit the correlation between adjacent bands
• The trajectories of the 3 bands are projected onto a DCT basis (20 coefficients)
– Viterbi search tuned for language identification
• Training data
– CallFriend training and development sets

Post-Evaluation – Phoneme System
[DET curves] 34% relative improvement, EER30s = 12.7%
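The prosodic segmentation and labeling described above can be sketched as follows. This is a minimal illustration, not the evaluated system: `prosodic_symbols` is a hypothetical helper that assumes frame-rate f0 and energy trajectories (with f0 = 0 marking unvoiced frames) and emits one of the ten symbols, i.e. the five trajectory classes from the table combined with the short/long duration label.

```python
def prosodic_symbols(f0, energy, short_max=8):
    """Hypothetical sketch of the prosodic segmentation/labeling:
    cut where the derivative changes sign (the slide's "inflection
    points") and at voicing onsets/offsets, then label each segment
    by the direction of change of f0 and energy within it."""
    sign = lambda x: (x > 0) - (x < 0)
    voiced = [f > 0 for f in f0]           # f0 == 0 marks unvoiced frames
    bounds = {0, len(f0)}
    for traj in (f0, energy):
        d = [sign(b - a) for a, b in zip(traj, traj[1:])]
        for i in range(len(d) - 1):
            # zero-crossing of the derivative -> segment boundary
            if d[i] != 0 and d[i + 1] != 0 and d[i] != d[i + 1]:
                bounds.add(i + 1)
    for i in range(len(voiced) - 1):
        if voiced[i] != voiced[i + 1]:     # voicing onset/offset
            bounds.add(i + 1)
    cuts = sorted(bounds)
    symbols = []
    for a, b in zip(cuts, cuts[1:]):
        if not any(voiced[a:b]):
            cls = 5                        # class 5: unvoiced segment
        else:
            key = (f0[b - 1] >= f0[a], energy[b - 1] >= energy[a])
            cls = {(True, True): 1, (True, False): 2,
                   (False, True): 3, (False, False): 4}[key]
        dur = "S" if (b - a) < short_max else "L"  # "short" = under 8 frames
        symbols.append(f"{cls}{dur}")
    return symbols
```

For example, a trajectory pair that rises, peaks, falls, and then goes unvoiced yields three segments: rising f0/rising energy (class 1), falling f0/falling energy (class 4), and unvoiced (class 5).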
Post-Evaluation – Prosodic Cues System
• No energy-based segmentation
– Unvoiced segments longer than 2 seconds are considered non-speech
• No broad-phonetic-category labeling applied
– Rate of change plus the quantized duration (10 tokens)
• Training data
– CallFriend training and development sets

Post-Evaluation – Prosodic Cues System
[DET curves] 30% relative improvement, EER30s = 22.2%

Fusion – 30 sec condition
• Fusing the scores from the prosodic cues system
– with TRAP-derived phonemes: EER30s = 10.5% (17% relative improvement)
– with OGI-LID-derived phonemes: EER30s = 6.6% (14% relative improvement)
• TRAP-derived phoneme system fused with OGI-LID: EER30s = 6.2% (19% relative improvement)
• All three systems fused: EER30s = 5.7% (26% relative improvement)

Conclusions
• Sequences of discrete symbols derived from speech dynamics provide useful information for characterizing the language
• Two techniques for deriving the sequences of symbols were investigated
– segmentation and labeling based on prosodic cues
– segmentation and labeling based on TRAP-derived phonetic labels
• The introduced techniques combine well with each other as well as with more conventional language-ID techniques
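The scoring scheme shared by all of the systems above, an n-gram language model over the discrete symbol sequence, with the score taken as target-model likelihood minus background-model likelihood, can be sketched as below. This is an illustrative stand-in, not the evaluated implementation: the function names are hypothetical and add-alpha smoothing is assumed in place of whatever smoothing the actual 3-gram models used.

```python
import math
from collections import Counter

def train_ngrams(sequences, n=3):
    # Count n-grams and their (n-1)-gram contexts over symbol sequences.
    counts, contexts = Counter(), Counter()
    for seq in sequences:
        for i in range(n - 1, len(seq)):
            gram = tuple(seq[i - n + 1:i + 1])
            counts[gram] += 1
            contexts[gram[:-1]] += 1
    return counts, contexts

def sequence_logprob(seq, counts, contexts, vocab_size, n=3, alpha=1.0):
    # Add-alpha smoothed n-gram log-probability of a symbol sequence.
    lp = 0.0
    for i in range(n - 1, len(seq)):
        gram = tuple(seq[i - n + 1:i + 1])
        p = (counts[gram] + alpha) / (contexts[gram[:-1]] + alpha * vocab_size)
        lp += math.log(p)
    return lp

def lid_score(seq, target, background, vocab_size, n=3):
    # Per the system diagram: target-model score minus background-model score.
    return (sequence_logprob(seq, *target, vocab_size, n=n)
            - sequence_logprob(seq, *background, vocab_size, n=n))
```

A positive `lid_score` means the symbol sequence matches the target-language model better than the background model; in the evaluated systems the symbols would be the TRAP-derived phonemes (39 or 61 classes) or the prosodic tokens (10 classes).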