Voicing Features
Horacio Franco, Martin Graciarena
Andreas Stolcke, Dimitra Vergyri, Jing Zheng
STAR Lab, SRI International
Phonetically Motivated Features
• Problem:
  – Cepstral coefficients fail to capture many discriminative cues.
  – The front end is optimized for traditional Mel cepstral features.
  – Front-end parameters are a compromise solution across all phones.
Phonetically Motivated Features
• Proposal:
  – Enrich the Mel cepstral feature representation with phonetically motivated features from independent front ends.
  – Optimize each specific front end to improve discrimination.
  – Robust broad-class phonetic features provide "anchor points" in acoustic-phonetic decoding.
  – General framework for multiple phonetic features; first approach: voicing features.
Voicing Features
• Voicing feature algorithms:
1. Normalized peak autocorrelation (PA). For time frame $X$:
   $R_{xx}(i) = E[X(t)\,X(t-i)]$
   $PA = \max_i \{R_{xx}(i)\} / R_{xx}(0)$
   with the max computed over the pitch region 80 Hz to 450 Hz.
2. Entropy of high-order cepstrum (EC) and linear spectrum (ES). Let
   $LSPEC = |DFT(X)|^2$ and $CEPS = IDFT(\log(LSPEC))$.
   With $H(Y)$ the entropy of $Y$,
   $H(Y) = -\sum_f P(|Y(f)|^2)\,\log P(|Y(f)|^2)$, where
   $P(|Y(f)|^2) = |Y(f)|^2 / \sum_f |Y(f)|^2$,
   then $EC = H(CEPS)$ and $ES = H(LSPEC)$,
   with the entropy computed over the pitch region 80 Hz to 450 Hz.
   (A numpy sketch of the PA and EC computations follows below.)
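The two features above can be sketched directly in numpy. This is a minimal illustration, not the SRI front end: the sampling rate, the epsilon floors, and the function names are assumptions, and windowing/pre-emphasis details are omitted.

```python
import numpy as np

# Assumed constants: 16 kHz sampling; pitch search region 80-450 Hz.
FS = 16000
LAG_MIN, LAG_MAX = FS // 450, FS // 80   # pitch period range in samples

def peak_autocorrelation(frame):
    """PA = max_i R_xx(i) / R_xx(0), max taken over the pitch lag region."""
    x = frame - frame.mean()
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # R_xx(i), i >= 0
    return r[LAG_MIN:LAG_MAX + 1].max() / r[0]

def entropy_of_cepstrum(frame):
    """EC = entropy of the cepstrum restricted to the pitch quefrency region."""
    lspec = np.abs(np.fft.rfft(frame)) ** 2            # LSPEC = |DFT(X)|^2
    ceps = np.fft.irfft(np.log(lspec + 1e-10))         # CEPS = IDFT(log LSPEC)
    y2 = ceps[LAG_MIN:LAG_MAX + 1] ** 2                # |Y(f)|^2 on pitch region
    p = y2 / y2.sum()                                  # P(|Y(f)|^2)
    return -np.sum(p * np.log(p + 1e-12))              # H(Y)
```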
Voicing Features
3. Correlation with template and DP alignment [Arcienega, ICSLP'02].
   The Discrete Logarithmic Fourier Transform (DLFT) over the frequency
   band $[f_1, f_2]$ for speech signal $x(n)$:
   $Y_i = \frac{1}{N} \sum_n x(n)\, e^{-j w_i n}$, where
   $w_i = 2\pi T_s\, e^{\ln(f_1) + i\, d\ln f}$ and
   $d\ln f = \frac{\ln(f_2) - \ln(f_1)}{N - 1}$.
   If $IT$ is an impulse train, the template is $T = |DLFT(IT)|^2$ and the
   signal is $Y = |DLFT(X)|^2$.
   The correlation of frame $j$ with the template is
   $R_{yt}(i, j) = E[Y(f, j)\, T(f - i)]$,
   and the DP-optimal correlation is $CT_{DP} = \max_i \{R_{yt}(i, j) / R_{yt}(0, j)\}$,
   with the max computed over the pitch region 80 Hz to 450 Hz.
   (A sketch follows below.)
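A rough sketch of the DLFT and the template correlation, again illustrative only: FS and the number of bins N are assumed values, the impulse-train period would be chosen or varied in practice, and the dynamic-programming smoothing across frames j is omitted.

```python
import numpy as np

FS, N = 16000, 128           # assumed sampling rate and DLFT grid size
F1, F2 = 80.0, 450.0         # pitch band [f1, f2]

def dlft_power(x):
    """|DLFT(x)|^2 on a log-spaced frequency grid over [F1, F2]."""
    dlnf = (np.log(F2) - np.log(F1)) / (N - 1)
    w = 2 * np.pi / FS * np.exp(np.log(F1) + np.arange(N) * dlnf)   # w_i
    Y = np.exp(-1j * np.outer(w, np.arange(len(x)))) @ x / len(x)
    return np.abs(Y) ** 2

def template_correlation(frame, impulse_train):
    """max_i R_yt(i) / R_yt(0): on a log-frequency axis a pitch change is a
    shift, so one impulse-train template covers the whole pitch range."""
    Y, T = dlft_power(frame), dlft_power(impulse_train)
    r = np.correlate(Y, T, mode="full")
    return r.max() / r[N - 1]            # index N-1 is the zero-shift term
```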
Voicing Features
• Preliminary exploration of voicing features:
  – Best feature combination: peak autocorrelation + entropy of cepstrum.
  – Complementary behavior of the autocorrelation and entropy features for high and low pitch:
    Low pitch: time periods are well separated, so the correlation is well defined.
    High pitch: harmonics are well separated, so the cepstrum is well defined.
Voicing Features
• Graph of voicing features:
  [Figure: voicing feature trajectories for phone-aligned speech; phone labels: w er k ay n d ax f, s:, aw, th ax v, dh ey ax r.]
Voicing Features
• Integration of Voicing Features:
1 - Juxtaposing Voicing Features:
  • Juxtapose the two voicing features to the traditional Mel cepstral feature vector (MFCC) plus delta and delta-delta features (MFCC+D+DD), as sketched below.
  • Voicing feature front end: use the same MFCC frame rate and optimize the temporal window duration.
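A minimal sketch of the juxtaposition step, assuming 39-dimensional MFCC+D+DD frames and the two voicing features computed at the same 10 ms frame rate (all dimensions here are assumptions):

```python
import numpy as np

def juxtapose(mfcc_d_dd, pa, ec):
    """Append the two voicing features to each MFCC+D+DD frame.

    mfcc_d_dd: (T, 39) cepstral features; pa, ec: (T,) voicing features
    computed over a longer analysis window centered on the same frames.
    Returns a (T, 41) juxtaposed feature.
    """
    return np.hstack([mfcc_d_dd, pa[:, None], ec[:, None]])
```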
Voicing Features
• Train on the small Switchboard database (64 hours); test on Dev 2001. WER for both sexes.
• Features: MFCC+D+DD, 25.6 ms frames every 10 ms.
• VTL and speaker mean/variance normalization. Genone acoustic model: non-cross-word, MLE trained, gender dependent. Bigram LM.

Window Length Optimization           WER
Baseline                             41.4%
Baseline + 2 voicing (25.6 ms)       41.2%
Baseline + 2 voicing (75 ms)         40.7%
Baseline + 2 voicing (87.5 ms)       40.5%
Baseline + 2 voicing (100 ms)        40.4%
Baseline + 2 voicing (112.5 ms)      41.2%
Voicing Features
2 – Voiced/Unvoiced Posterior Features:
  • Use the posterior voicing probability as a feature, computed from a 2-state HMM (a forward-backward sketch appears below). The juxtaposed feature dimension is 40.
  • Similar setup as before; males-only results.

Recognition Systems              WER
Baseline                         39.2%
Baseline + voicing posterior     39.7%

  • Soft V/UV transitions may not be captured, because the posterior feature behaves much like a binary feature.
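A sketch of one way to obtain the V/UV posterior from a 2-state HMM via forward-backward smoothing. The emission model (per-frame log-likelihoods from some voiced/unvoiced classifier) and the transition probabilities are assumptions, not the exact SRI configuration.

```python
import numpy as np

def vuv_posterior(loglik, log_trans, log_prior):
    """Frame-level P(voiced) from a 2-state HMM by forward-backward.

    loglik:    (T, 2) per-frame log-likelihoods for [unvoiced, voiced]
    log_trans: (2, 2) log transition matrix, rows = from-state
    log_prior: (2,)   log initial state probabilities
    """
    T = len(loglik)
    fwd = np.zeros((T, 2))
    bwd = np.zeros((T, 2))
    fwd[0] = log_prior + loglik[0]
    for t in range(1, T):
        fwd[t] = loglik[t] + np.logaddexp(fwd[t - 1, 0] + log_trans[0],
                                          fwd[t - 1, 1] + log_trans[1])
    for t in range(T - 2, -1, -1):
        bwd[t] = np.logaddexp(log_trans[:, 0] + loglik[t + 1, 0] + bwd[t + 1, 0],
                              log_trans[:, 1] + loglik[t + 1, 1] + bwd[t + 1, 1])
    post = fwd + bwd
    post -= np.logaddexp(post[:, 0], post[:, 1])[:, None]   # normalize per frame
    return np.exp(post[:, 1])          # posterior probability of "voiced"
```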
Voicing Features
3 – Window of Voicing Features + HLDA:
  • Juxtapose MFCC features with a window of voicing features around the current frame (a windowing sketch appears below).
  • Apply dimensionality reduction with HLDA; the final feature has 39 dimensions.
  • Same setup as before, with MFCC+D+DD+3rd diffs. Both sexes.
  • The HLDA baseline is 1.5% abs. better; voicing improves it by a further 1%.

Recognition Systems                          WER %
Baseline + HLDA                              39.9
Baseline + 1 frame, 2 voicing + HLDA         39.5
Baseline + 5 frames, 2 voicing + HLDA        38.9
Baseline + 9 frames, 2 voicing + HLDA        39.5
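A sketch of the windowing step under the slide's best setting (5 frames x 2 voicing features = 10 extra dimensions). The HLDA transform A is assumed to have been estimated elsewhere; only its application is shown.

```python
import numpy as np

def window_stack(voicing, k=2):
    """Stack voicing features from frames t-k .. t+k: (T, 2) -> (T, 2*(2k+1))."""
    T = len(voicing)
    padded = np.pad(voicing, ((k, k), (0, 0)), mode="edge")  # replicate edges
    return np.hstack([padded[i:i + T] for i in range(2 * k + 1)])

def hlda_project(cepstra, voicing, A):
    """Juxtapose cepstra with the voicing window, then reduce to 39 dims.

    A: (39, D) HLDA matrix, with D = cepstral dim + 10 in this setting.
    """
    feats = np.hstack([cepstra, window_stack(voicing)])
    return feats @ A.T
```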
Voicing Features
4 – Delta of Voicing Features + HLDA:
  • Use delta and delta-delta features instead of a window of voicing features; apply HLDA to the juxtaposed feature (a regression-delta sketch appears at the end of this slide).
  • Same setup as before, with MFCC+D+DD+3rd diffs. Males only.

Recognition Systems                             WER
Baseline + HLDA                                 37.5%
Baseline + voicing + delta voicing + HLDA       37.6%

  • The reason may be that variability in the voicing features produces noisy deltas.
  • The HLDA weighting of the "window of voicing features" is similar to an average.
---------------------------------------------------------------------------------
The best overall configuration was MFCC+D+DD+3rd diffs. and 10 voicing features + HLDA.
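For reference, a standard regression-based delta computation (the window half-width K=2 is an assumption); the slide suggests frame-to-frame variability of the voicing features makes these deltas noisy.

```python
import numpy as np

def deltas(feat, K=2):
    """Regression deltas: d_t = sum_k k*(x_{t+k} - x_{t-k}) / (2*sum_k k^2)."""
    T = len(feat)
    padded = np.pad(feat, ((K, K), (0, 0)), mode="edge")  # replicate edge frames
    num = sum(k * (padded[K + k:T + K + k] - padded[K - k:T + K - k])
              for k in range(1, K + 1))
    return num / (2 * sum(k * k for k in range(1, K + 1)))

# Delta-deltas are the same operation applied to the deltas:
# dd = deltas(deltas(voicing))
```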
Voicing Features
• Voicing Features in the SRI CTS Eval Sept '03 System:
  – Adaptation of MMIE cross-word models with/without voicing features.
  – Used the best configuration of voicing features.
  – Train on the full SWBD+CTRANS data; test on Eval '02.
  – Features: MFCC+D+DD+3rd diffs.+HLDA.
  – Adaptation: 9 full-matrix MLLR transforms.
  – Adaptation hypotheses from an MLE non-cross-word model, PLP front end with voicing features.

Recognition Systems           WER
Baseline EVAL                 25.6%
Baseline EVAL + voicing       25.1%
Voicing Features
• Hypothesis Examples:

REF:           OH REALLY WHAT WHAT KIND OF PAPER
HYP BASELINE:  OH REALLY WHICH WAS KIND OF PAPER
HYP VOICING:   OH REALLY WHAT WHAT KIND OF PAPER

REF:           YOU KNOW HE'S JUST SO UNHAPPY
HYP BASELINE:  YOU KNOW YOU JUST I WANT HAPPY
HYP VOICING:   YOU KNOW HE'S JUST SO I WANT HAPPY
Voicing Features
• Error analysis:
  – In one experiment, 54% of speakers got a WER reduction (some up to 4% abs.); the remaining 46% showed a small WER increase.
  – A more detailed study of speaker-dependent performance is still needed.
• Implementation:
  – Implemented a voicing feature engine in the DECIPHER system.
  – Fast computation: one FFT and two IFFTs per frame yield both voicing features (see the sketch below).
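To illustrate the operation count, a sketch of how both features can share one FFT per frame: the power spectrum feeds one inverse FFT for the autocorrelation (Wiener-Khinchin, giving PA) and one inverse FFT of its log for the cepstrum (giving EC). Constants and names are assumptions as before.

```python
import numpy as np

def voicing_pair(frame, lag_min, lag_max):
    """PA and EC from one FFT and two IFFTs per frame."""
    pspec = np.abs(np.fft.rfft(frame)) ** 2       # the one FFT: power spectrum
    # IFFT #1: autocorrelation via Wiener-Khinchin (circular; zero-pad the
    # frame beforehand if a linear autocorrelation is needed).
    acorr = np.fft.irfft(pspec)
    ceps = np.fft.irfft(np.log(pspec + 1e-10))    # IFFT #2: cepstrum
    pa = acorr[lag_min:lag_max + 1].max() / acorr[0]
    y2 = ceps[lag_min:lag_max + 1] ** 2
    p = y2 / y2.sum()
    return pa, -np.sum(p * np.log(p + 1e-12))
```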
Voicing Features
• Conclusions:
  – Explored how to represent and integrate the voicing features for best performance.
  – Achieved a 1% abs. (~2% rel.) gain in the first pass (small training set) and a >0.5% abs. (2% rel.) gain in higher rescoring passes (full training set) of the DECIPHER LVCSR system.
• Future work:
  – Further explore feature combination/selection.
  – Develop more reliable voicing features; the current features do not always reflect actual voicing activity.
  – Develop other phonetically derived features (vowels/consonants, occlusion, nasality, etc.).