Recognition of Phoneme Strings using TRAP Technique

Phoneme Recognition using
Temporal Patterns
Petr Schwarz, Pavel Matějka
Brno University of Technology, Czech Republic
OGI School of Science and Engineering at OHSU, USA
E-mail: [email protected], [email protected]
September 23-24 2003
M4 meeting Delft
Outline
• The goal
• Experimental setup and system
• Baseline experiment with MFCC and MFCC multi-frame
• Comparison of conventional MFCC and novel TempoRAl
Patterns (TRAPs) features under well-matched and
mismatched conditions
• Optimization of TRAPs for our task
• New three-band TRAPs system
• Implementation and distribution of the SW
• Conclusions and future work…
The goal
• For many applications, speech needs to be transcribed into
discrete symbols.
• A very reliable phoneme recognizer, not only for the meeting
domain
• No language constraints
• suitable as a front end to LVCSR, for keyword spotting,
speaker recognition, language recognition or recognition of
out-of-vocabulary words
Comparison of several techniques for automatic
recognition of unconstrained context-independent phonemes
Experimental setup
• Two databases: TIMIT and NTIMIT
- all SA records are removed
- databases down-sampled to 8000 Hz
- 412 speakers for training, 50 for CV, 168 for test
• The phoneme set contains 39 phonemes
- very similar to CMU/MIT phoneme set
- closures are merged with bursts (bcl b → b)
• Experimental system is NN/HMM hybrid
- phoneme insertion penalty tuned so that the numbers of
inserted and deleted phonemes are equal
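The insertion/deletion balance mentioned above can be checked by aligning the recognized phoneme string with the reference. The following is a minimal sketch, assuming a plain minimum-edit-distance alignment with unit costs; the scoring tool actually used is not named on the slides, and the function names `align_counts` and `phoneme_error_rate` are hypothetical.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a
    minimum-edit-distance alignment of two phoneme lists."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j] = (total errors, subs, dels, ins) for ref[:i] vs hyp[:j]
    cost = [[None] * cols for _ in range(rows)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)  # reference only: all deletions
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)  # hypothesis only: all insertions
    for i in range(1, rows):
        for j in range(1, cols):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, n)          # match, no error
            else:
                diagonal = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Tuples compare by total error count first.
            cost[i][j] = min(diagonal, deletion, insertion)
    return cost[-1][-1][1:]


def phoneme_error_rate(ref, hyp):
    """PER in percent: (S + D + I) / number of reference phonemes."""
    subs, dels, ins = align_counts(ref, hyp)
    return 100.0 * (subs + dels + ins) / len(ref)


ref = "pau hh ae l ow pau".split()
hyp = "pau hh eh l ow ow pau".split()
print(align_counts(ref, hyp), phoneme_error_rate(ref, hyp))
```

Under this sketch, tuning would mean re-running the decoder with different penalty values until the deletion and insertion counts summed over the tuning set are roughly equal.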
Experimental system
[Figure: block diagram. Feature extraction (MFCC, TRAPS) → posterior probability estimator (which classifier?) → Viterbi decoder; per-phoneme posteriors (aa, ae, …, z); example output: pau hh ae l ow pau]
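The decoding stage of such a hybrid can be sketched as a Viterbi search over per-frame phoneme posteriors. This is a minimal illustration, assuming one-state phoneme models, a single log-domain penalty added whenever the path switches phoneme, and no division by phoneme priors; none of these details is given on the slide, and `viterbi_phonemes` is a hypothetical name.

```python
import math


def viterbi_phonemes(posteriors, labels, insertion_penalty=0.0):
    """Decode a phoneme string from per-frame posteriors.

    posteriors: list of frames, each a list of probabilities (one per label).
    insertion_penalty: log-domain value added on every phoneme change;
    negative values discourage inserting phonemes.
    """
    floor = 1e-12  # avoid log(0)
    n = len(labels)
    score = [math.log(max(p, floor)) for p in posteriors[0]]
    backpointers = []
    for frame in posteriors[1:]:
        best_prev = max(range(n), key=lambda k: score[k])
        new_score, pointers = [], []
        for k in range(n):
            stay = score[k]
            switch = score[best_prev] + insertion_penalty
            if stay >= switch:
                prev, best = k, stay
            else:
                prev, best = best_prev, switch
            new_score.append(best + math.log(max(frame[k], floor)))
            pointers.append(prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back the best state sequence.
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    # Collapse runs of the same state into one phoneme (a simplification:
    # the same phoneme twice in a row cannot be recognized).
    output = []
    for s in path:
        if not output or output[-1] != labels[s]:
            output.append(labels[s])
    return output


labels = ["pau", "hh", "ae"]
frames = [[0.8, 0.1, 0.1], [0.45, 0.5, 0.05], [0.8, 0.1, 0.1]]
print(viterbi_phonemes(frames, labels, 0.0))
print(viterbi_phonemes(frames, labels, -2.0))
```

With no penalty the single noisy middle frame produces an inserted phoneme; a sufficiently negative penalty suppresses it, which is the effect the penalty tuning relies on.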
Which classifier, GMM or NN?
• HMM-GMM and HMM-NN with one-state models
• MFCC + Δ + ΔΔ features
• Number of parameters is increased until the decrease
in phoneme error rate (PER) is negligible (<0.5 %)
System    PER [%]    Parameters
GMM       42.0       788736
NN        41.6       31200
NN doesn’t degrade performance compared to GMM
+ 2 % absolute by merging
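As a sanity check on the parameter counts, the NN figure of 31200 is consistent with a three-layer perceptron with 39 inputs (MFCC + Δ + ΔΔ), 400 hidden neurons and 39 phoneme outputs, counting weights only. The topology is an inference from the numbers, not something this slide states, and `mlp_weight_count` is a hypothetical helper.

```python
def mlp_weight_count(n_inputs, n_hidden, n_outputs, with_biases=False):
    """Number of trainable parameters in a one-hidden-layer perceptron."""
    count = n_inputs * n_hidden + n_hidden * n_outputs
    if with_biases:
        count += n_hidden + n_outputs
    return count


# Assumed topology: 39 features in, 400 hidden neurons, 39 phonemes out.
nn_params = mlp_weight_count(39, 400, 39)
gmm_params = 788736  # value from the table
print(nn_params, gmm_params / nn_params)
```

Under this assumption the NN reaches a comparable PER with roughly 25 times fewer parameters than the GMM system.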
Single frame and multi-frame input
with MFCC – FeatureNet
• Subsequent frames are joined together
• The context size is increased to find the minimal PER
• Hidden layers of 300, 400 and 500 neurons were tested: the differences are minimal, but 400 is the best
frames    PER [%]
1         41.6
5         37.5
[Figure: PER [%] versus number of frames (0 to 15); annotation: PER = 37.5 %]
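Joining subsequent frames can be sketched as follows. This is a minimal illustration, assuming a window centred on the current frame and repetition of the first and last frame at the utterance edges; the edge handling and the name `stack_frames` are assumptions, not details from the slides.

```python
def stack_frames(features, context):
    """Join each frame with its neighbours into one input vector.

    features: list of frames, each a list of floats (e.g. 39 MFCC values).
    context: odd number of frames in the window, centred on the current one.
    """
    if context % 2 == 0:
        raise ValueError("context must be odd so the window can be centred")
    half = context // 2
    last = len(features) - 1
    stacked = []
    for t in range(len(features)):
        window = []
        for offset in range(-half, half + 1):
            index = min(max(t + offset, 0), last)  # repeat edge frames
            window.extend(features[index])
        stacked.append(window)
    return stacked


# Ten dummy frames of 39 coefficients; frame t holds the value t everywhere.
frames = [[float(t)] * 39 for t in range(10)]
stacked = stack_frames(frames, 5)
print(len(stacked), len(stacked[0]))
```

With 39 coefficients per frame and a 5-frame context, the classifier input grows from 39 to 195 values.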
TempoRAl Patterns
1. Frequency-localized posterior probabilities of phonemes
are estimated from the temporal evolution of critical-band
energies within a single critical band.
2. These estimates are passed to another class-posterior
estimator, which estimates the overall phoneme probability
from the probabilities in the individual critical bands.
[Figure: band classifiers 1, 2, …, N, one per critical band]
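The two stages can be illustrated on the data-handling side: cutting a temporal vector out of one critical band, and concatenating the per-band posteriors for the second estimator. This is a sketch under stated assumptions: a 10 ms frame step (so 1 s is about 101 frames), 15 critical bands, and edge frames repeated at utterance boundaries. The classifiers themselves are omitted, and `trap_vector` and `merger_input` are hypothetical names.

```python
def trap_vector(band_energies, t, band, length):
    """Temporal pattern of one critical band, centred on frame t.

    band_energies: list of frames, each a list of critical-band energies.
    length: odd number of frames in the pattern.
    """
    if length % 2 == 0:
        raise ValueError("length must be odd so the pattern can be centred")
    half = length // 2
    last = len(band_energies) - 1
    return [band_energies[min(max(t + k, 0), last)][band]
            for k in range(-half, half + 1)]


def merger_input(band_posteriors):
    """Concatenate per-band phoneme posteriors into one vector for the
    second-stage estimator."""
    return [p for band in band_posteriors for p in band]


# Dummy energies: 200 frames, 15 bands (both numbers are assumptions).
energies = [[t * 100 + b for b in range(15)] for t in range(200)]
pattern = trap_vector(energies, 100, 3, 101)  # about 1 s at a 10 ms step
print(len(pattern), pattern[0], pattern[50], pattern[-1])
```

In a full system, each pattern would go through its band classifier, and `merger_input` would collect the resulting posteriors from all bands for the final estimator.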
TRAP system scheme
[Figure: TRAP system scheme with normalization (Norm) blocks; phoneme posteriors (aa, ae, …, z); example output: pau hh ae l ow pau]
MFCC and TRAP on
well-matched conditions
• Training and testing data are from the same database
• Similar performance of multi-frame MFCC and 1 s long TRAPs
• Improvement can be obtained when the TRAP length is optimized
PER [%]           TIMIT    NTIMIT
MFCC39            41.6     55.6
MFCC39 5frames    37.5     49.0
TRAP 1sec         37.9     49.6
MFCC and TRAP on mismatched
conditions
• Training and testing data are from different databases
• TRAP system yielded better results in both mismatched
conditions
• It is better to train the system on corrupted speech
than on clean speech
PER [%]           TIMIT/NTIMIT    NTIMIT/TIMIT
MFCC39            80.9            63.4
MFCC39 5frames    80.1            75.7
TRAP 1sec         75.0            56.6
Effect of length of TRAP
• The original TRAP length was kept at 1 second to be sure that
it covers all information about the phoneme in the critical band,
but this length is not optimal
• A 300 ms context is the best for the TIMIT database
[Figure: PER [%] versus TRAP length, time [ms] from 0 to 1000; annotation: PER = 36.1 %]
Effect of mean and variance
normalization
• The experiment was performed on the original 1 second long TRAPs
• Significant degradation caused by both normalizations can be
seen in well-matched conditions
• Mean normalization always helps in mismatched conditions;
the benefit of variance normalization is less clear
Normalization / PER [%]    TIMIT    NTIMIT    TIMIT/NTIMIT    NTIMIT/TIMIT
None                       37.9     49.6      75.0            56.6
Mean                       40.5     51.8      73.5            54.7
Mean & variance            42.6     53.2      74.8            54.1
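The two normalizations compared above can be sketched on a single temporal vector. This is a minimal illustration, assuming the statistics are computed over each temporal pattern separately; where the statistics are actually estimated is not stated on the slides, and `normalize_trap` is a hypothetical name.

```python
import math


def normalize_trap(vector, use_variance=False):
    """Subtract the mean of a temporal pattern and, optionally,
    scale it to unit variance."""
    mean = sum(vector) / len(vector)
    centred = [x - mean for x in vector]
    if not use_variance:
        return centred
    variance = sum(x * x for x in centred) / len(centred)
    std = math.sqrt(variance)
    if std == 0.0:
        return centred  # constant pattern: nothing to scale
    return [x / std for x in centred]


pattern = [1.0, 2.0, 3.0, 4.0]
shifted = [x + 10.0 for x in pattern]  # same pattern with a constant offset
print(normalize_trap(pattern))
print(normalize_trap(shifted))
print(normalize_trap(pattern, use_variance=True))
```

Mean subtraction makes a pattern insensitive to a constant offset, which is one plausible reason it helps when the training and test channels differ.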
TRAP with more than one critical
band
• Three neighboring temporal vectors were merged
together and sent to one classifier
[Figure: posterior probabilities of phonemes for each triple of bands]
System          PER [%]
TRAPS           36.1
3 band TRAPS    33.7
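Building the input for one three-band classifier can be sketched as a concatenation of the temporal vectors of three neighbouring bands. This is a sketch under assumptions: triples slide by one band, a 300 ms pattern corresponds to 31 frames at a 10 ms step, and there are 15 bands. The slides give none of these numbers, and the function names are hypothetical.

```python
def three_band_vector(band_energies, t, centre_band, length):
    """Concatenate the temporal patterns of bands centre-1, centre, centre+1."""
    n_bands = len(band_energies[0])
    if not 1 <= centre_band <= n_bands - 2:
        raise ValueError("centre band needs a neighbour on each side")
    if length % 2 == 0:
        raise ValueError("length must be odd so the pattern can be centred")
    half = length // 2
    last = len(band_energies) - 1
    vector = []
    for band in (centre_band - 1, centre_band, centre_band + 1):
        for k in range(-half, half + 1):
            frame = min(max(t + k, 0), last)  # repeat edge frames
            vector.append(band_energies[frame][band])
    return vector


def band_triples(n_bands):
    """All triples of neighbouring bands, sliding by one band (an assumption)."""
    return [(c - 1, c, c + 1) for c in range(1, n_bands - 1)]


energies = [[t * 100 + b for b in range(15)] for t in range(100)]
vector = three_band_vector(energies, 50, 7, 31)
print(len(vector), len(band_triples(15)))
```

Each classifier input is three times longer than in the single-band system, which is in line with the higher computational cost of the three-band version noted later in the talk.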
Implementation and distribution of
the SW: phnrec
• Early experiments were performed with a set of scripts
interconnecting executables (trapper, QuickNet, HTK, …), which
are still used for the training.
• Phoneme recognition is done in phnrec, which contains:
– feature extraction (HTK-compatible MFCC, FeatureNet, TRAPS)
from files or microphone
– posterior-probability estimator (NN, compatible with QuickNet
nets)
– Viterbi decoder, which can also work on-line with a fixed delay.
• Very good as a black box for people who want to treat
speech-to-phoneme transcription as a front end
phnrec (2)
• Source code for Linux and an EXE for Windows are available
free of charge for research.
• Available with nets trained on US-English (TIMIT) and
Czech (SpeechDat-E).
• More languages to come (also some Language ID
experiments running in Brno)
• Works on-line
http://www.fit.vutbr.cz/speech/sw/phnrec.html
Conclusion
• A TRAP-based phoneme recognizer was built and compared
to MFCC.
• Properties of TRAPs were studied and TRAPs were
optimized for phoneme recognition
• A new multi-band TRAPs approach was tested and its
benefit was shown
• The recognizer was successfully evaluated in a language
identification task
• Easy-to-use software was written and is available to the
research community.
But …
• Adaptation to meeting data is necessary (training on clean
TIMIT is not good at all), updating the distribution on
www.
• Tests on ICSI, IDIAP and Brno data (which phonemes are
going to work best for us, CzEnglish?)
• Applications: LID already tested, keyword spotting and
LVCSR (some papers at Eurospeech make use of
phoneme strings).
• Phoneme lattices
• Real-time issues (the 1-band version runs OK on a
reasonable machine, the 3-band version does not) – NN weight pruning?
THE END
• A demo during the break.
• Please download phnrec, test it and
comment !!!
• Questions ?