B
b
iy
b-iy
b
sil
b
A
ey2
ey1
sil
FFT
word boundaries
word boundary found
by free alignment
(a) alignment across wor d boundar i es
C
B
A
prob(d)
d (duration)
( b) dur at i on dependent wor d penal t i es
C
A
B
C
B
A
C
A
A
B
Positive Training on Correct Path (C A B)
Negative Training on Incorrect Path (C A A B)
( c) t r ai ni ng on t he s ent ence l evel
Fi gur e 2: Var i ous t echni ques t o i mpr ove s ent ence l evel r ecogni t i on per for mance
Speaker Dependent (CMU Alph Data)
500/2500 t r ai n, 100/500 cr os s val i dat i on, 400/2000 t es t s ent ences /wor ds
our
s peaker
SPHINX[HFW91] MS-TDNN[ HFW91]
MS- TDNN
mjmt
96.0
97. 5
98.5
mdbs
83. 9
89. 7
91. 1
maem
{
{
94. 6
f caw
{
{
98. 8
gt
{
{
86. 9
f ee
{
{
91. 0
Speaker Independent (Resource M
anagement Spel l -M
ode)
109 ( ca. 11000) t r ai n, 11 ( ca. 900) t es t s peaker ( wor ds ) .
SPHI NX[ HH92]
our MS- TDNN
+ Senone
gender specic
88. 7
90. 4
90. 8
92. 0
Tabl e 1: W
or d accur acy ( i n % on t he t es t s et s ) on s peaker dependent and s peaker
i ndependent connect ed l et t er t as ks .
M. Fanty and R. Cole. Spoken l etter recogni ti on. In Proceedi ngs of t he Neural
Denver, November 1990.
[ Haf92]
P. Haner. Connecti oni st Word-Level Cl assi cati on i n Speech Recogni ti on.
In Proc. IEEE Int ernat i onal Conf erence on Acoust i cs, Speech, and Si gnal
Processi ng. IEEE, 1992.
[ HFW
91]
P. Haner, M. Franzi ni , and A. W
ai bel . Integrati ng Ti me Al i gnment and
Neural Networks f or Hi gh Perf ormance Conti nuous Speech Recogni ti on. In
Proc. Int . Conf . on Acoust i cs, Speech, and Si gnal Processi ng. IEEE, 1991.
[ HH92]
M.-Y. Hwang and X. Huang. Subphoneti c Model i ng wi th Markov States Senone. In Proc. IEEE Int ernat i onal Conf erence on Acoust i cs, Speech, and
Si gnal Processi ng, pages I33 { I37. IEEE, 1992.
[ HW
90]
J. Hampshi re and A. W
ai bel . A Novel Objecti ve Functi on f or Improved
Phoneme Recogni ti on Usi ng Ti me Del ay Neural Networks. IEEE Transact i ons on Neural Net works , June 1990.
P. Haner and A. W
ai bel . Mul ti -state Ti me Del ay Neural Networks f or ConSpeech Recogni ti on. In NIPS(4). Morgan Kauf man, 1992.
W
ai bel . Mul ti -Speaker/Speaker-Independent Archi tectures f or
me Del ay Neural Network. In Proc. IEEE Int ernat i onal
s, Speech, and Si gnal Processi ng. IEEE, 1993.
and R. A. Col e. Speaker-i ndependent Phoneti c
gl i sh Letters. In Proceedi ngs of t he IJCNN
ami c Programmi ng Al gori thmf or Con-
ons on Acoust i cs, Speech, and
984.
K. Lang. Phoneme
nsact i ons on
[FC90]
In Proceedings of the Int ernat i onal Conf erence on Acoust i cs, Speech and Si gnal Processi ng, Toronto, Ontari o, Canada, May 1991. IEEE.
Inf ormat i on Processi ng Syst ems Conf erence NIPS,
Phoneme Layer
Hidden Layer
Gender
Selection
...
I nput Layer
F F T
Fi gur e 4: A networ k ar chi t ect ur e wi t h gender - s peci c and s har ed connect i ons . Onl y
t he f r ont - end TDNNi s s hown.
i ng, cr os s - val i dat i on and t es t s et , r es pect i vel y. The DARPARes our ce Management
Spel l - Mode Dat a wer e us ed f or speaker i ndependent t es t i ng. Thi s dat a bas e
cont ai ns about 1700 s ent ences , s pel l ed by 85 mal e and 35 f emal e s peaker s . The
s peech of 7 mal e and 4 f emal e s peaker s was s et as i de f or t he t es t s et , one s ent ence
f r omal l 109 and al l s ent ences f r om6 t r ai ni ng s peaker s wer e us ed f or cr os s val i
Tabl e 1 s ummar i zes our r es ul t s . Wit h t he hel p of t he t r ai ni ng t echni q
above we wer e abl e t o out per f or mpr evi ous l y r epor t ed [ HFW91] s
r es ul t s as wel l as t he HMM- bas ed SPHI NX Sys t em.
5
SUMMARY A ND FU TU R E WOR K
W
e have pr es ent ed a connect i oni s t s peech r ecogni t i on s
connect ed l et t er r ecogni t i on. Newt r ai ni ng t echni q
l evel r ecogni t i on enabl ed our MS- TDNNt o out
ki nd as wel l as a s t at e- of - t he ar t HMM- bas
s peci c s ubnet s , we ar e exper i ment i ng
\i nt er nal s peaker model s " f or a mor e
t he f ut ur e we wi l l al s o exper i
Acknowl edgements
The aut hor s gr at ef ul l y ackn
and DARPA. W
e wi s h t o t
McNai r f or keepi ng ou
t he i deas pr es en
R efe re
[ CFGJ91]
In
100.0
(b)
(c)
95.0
(a)
90.0
(d)
(e)
85.0
80.0
0
20
40
60
80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420
Fi gur e 3: Lear ni ng cur ves ( a = boot s t r appi ng, b, c =wor d l evel ( excer pt ed wor ds ) ,
d, e =s ent ence l evel t r ai ni ng ( cont i nuous s peech) ) on t he t r ai ni ng ( ) , cr os s val i dat i on ( -) and t es t s et ( x) f or t he s peaker - i ndependent RMSpel l - Mode dat a.
3
GE N D E R S PE CIF I C S U BN E T S
As t r ai ght f or war d appr oach t o bui l di ng a mor e s peci al i zed s ys t em
two ent i r el y i ndi vi dual networ ks f or mal e and f emal e s peaker
t he gender of a s peaker i s known, dur i ng t es t i ng i t
\gender i dent i cat i on networ k", whi ch i s s i m
put uni t s r epr es ent i ng mal e and f emal e s pe
networ k cl as s i es t he s peaker 's gen
ul ar i zed networ k i mpr oved t he
s ee t abl e 1) t o 91. 3%. How
connect i ons at t he l
wor ked even b
cat i
i n t he s ame way as t he phoneme boundar i es wi t hi n a wor d. Fi gur e 2( a) s hows an
exampl e i n whi ch t he wor d t o r ecogni ze i s s ur r ounded by a s i l ence and a `B' , t hus
t he l ef t and r i ght cont ext ( f or al l wor ds t o be r ecogni zed) i s t he phoneme ` sil' and
` b' , r es pect i vel y. The gr ay s haded ar ea i ndi cat es t he ext ens i on neces s ar y t o t he
DTWal i gnment . The di agr ams hows how a new boundar y f or t he begi nni ng of
t he wor d ` A' i s f ound. As i ndi cat ed i n gur e 3, t hi s t echni ques i mpr oves cont i
r ecogni t i on s i gni cant l y, but i t does n' t hel p f or excer pt ed wor ds .
2. 2 WORDDURATI ON DEPENDENT PENALI ZI NG OF
I NSERTI ONANDDELETI ONERRORS
I n \cont i nuous t es t i ng mode", i ns t ead of l ooki ng at wor d uni t s t he wel l - known \One
St age DTW" al gor i t hm[ Ney84] i s us ed t o nd an opt i mal pat h t hr ough an uns peci ed s equence of wor ds . The s hor t and conf us abl e Engl i s h l et t er s caus e many wor d
i ns er t i on and del et i on er r or s , s uch as \T E" vs . \T" or \O" vs . \O O", t her
pr oper dur at i on model i ng i s es s ent i al .
As s ugges t ed i n [ HW92] , mi ni mumphoneme dur at i on can be enf or ced
dupl i cat i on". I n addi t i on, we ar e model i ng a dur at i on and wor d de
P enw ( d) = log( k +probw ( d) ) , wher e t he pdf prob w ( d) i s app
t r ai ni ng dat a and k i s a s mal l cons t ant t o avoi d zer o pr
added t o t he accumul at ed s cor e AS of t he s ear ch p
whenever i t cr os s es t he boundar y of a wor d w i n
r i t hm, as i ndi cat ed i n gur e 2( b) . The r at i o i nuence of t he dur at i on penal ty, i s anot
i s no s t r ai ght f or war d mat hemat i cal l
of t he \wei ght " w t o t he i ns er
gr adi ent des cent , whi ch c
i . e. we ar e t r yi ng t o m
2. 3
ERRORB
Us ual l y t he MS- TDNNi s t r ai ned t o
t i nuous l y s poken s ent ences .
t r ai ni ng on t he s ent ence
\C A B", i n whi c
al i gnme
copi ed f r omt he Phoneme Layer i nt o t he wor d model s of t he DTWLayer , wher e an
opt i mal al i gnment pat h i s f ound f or each wor d. The act i vat i ons al ong t hes e pat hs
ar e t hen col l ect ed i n t he wor d out put uni t s . Al l uni t s i n t he DTWand W
or d Layer
ar e l i near and have no bi as es . 15 ( 25 t o 100) hi dden uni t s per f r ame wer e us ed
f or s peaker - dependent ( - i ndependent ) exper i ment s , t he ent i r e 26 l et t er networ k has
appr oxi mat el y 5200 ( 8600 t o 34500) par amet er s .
Trai ni ng s t ar t s wi t h \boot s t r appi ng", dur i ng whi ch onl y t he f r ont - end TDNN
us ed wi t h xed phoneme boundar i es as t ar get s . I n a s econd phas e, t r ai ni n
f or med wi t h wor d l evel t ar get s . Phoneme boundar i es ar e f r eel y al i gn
wor d boundar i es i n t he DTWl ayer . The er r or der i vat i ves ar e ba
t he wor d uni t s t hr ough t he al i gnment pat h and t he f r ont - en
The choi ce of s ens i bl e obj ect i ve f unct i ons i s of gr e
( y1 ; : : : ; yn) t he out put and T = ( t1 ; : : : ; t n ) t h
t he phoneme l evel ( boot s t r appi ng) , t her e
t i me, r epr es ent i ng t he cor r ect pho
s ee why t he s t andar d Mean
f or \1- out - of - n" codi n
f or a t ar get ( 1:
th
W
ord Layer
27 word uni ts
(onl y 4 shown)
DTWLayer
uni ty wei ghts
5 ti me del ays
-
27 word templ ates
(onl y 4 shown)
Phonem
e Layer
59 phoneme uni ts
(onl y 9 shown)
Hi dden Layer
3 ti me del ays
-
15 hi dden uni ts
I nput Layer
-
16 melscal e FFT coecients
Fi gur e 1: The MS- TDNNr ecogni zi ng t he excer pt ed wor d ` B' . Onl y t he act i vat i ons
f or t he wor ds ` SIL' , ` A' , ` B' , and ` C' ar e s hown.
cl as s i ed by anot her networ k.
I n t hi s paper , we pr es ent t he MS- TDNN as a
connect i oni s t s peech r ecogni t i on s ys t emf or connect ed l et t er r ecogni t i on. Af t er des cr i bi ng t he bas el i ne ar chi t ect ur e, t r ai ni ng t echni ques ai med at i mpr ovi ng s ent ence
l evel per f or mance and ar chi t ect ur es wi t h gender - s peci c s ubnet s ar e i nt r od
Basel i ne Archi tecture. Ti me Del ay Neur al Networ ks ( TDNNs ) can combi
r obus t nes s and di s cr i mi nat i ve power of Neur al Net s wi t h a t i me- s hi f t
chi t ect ur e t o f or mhi gh accur acy phoneme cl as s i er s [ WHH+ 89]
TDNN( MS- TDNN) [ HFW91, Haf 92, HW92] , an ext ens i on of t he TD
of cl as s i f yi ng wor ds ( r epr es ent ed as s equences of phonemes ) by
ear t i me al i gnment pr ocedur e ( DTW) i nt o t he TDNNar chi
an MS- TDNNi n t he pr oces s of r ecogni zi ng t he excer pt
16 mel s cal e FFT coeci ent s at a 10- ms ec f r ame r a
t ut e a s t andar d TDNN, whi ch us es s l i di ng w
t o comput e a s cor e f or each phoneme ( s t
t i ons i n t he \Phoneme Layer ". I n
i s model ed by a s equence of pho
Advances in Neural Informati on Proces s i ng Sys t ems 5 (NI PS*5).
San Mat eo, CA: Mor gan Kauf mann Publ i s her s , 1993
Connecte d Le t t e r Re c ognit i on wi t h a
Mult i -St a t e Ti me De l ay Ne ur a l Ne t wor k
and Al ex W
ai bel
School of Comput er Sci ence
Car negi e Mel l on Uni ver s i ty
Pi t t s bur gh, PA15213- 3891, USA
Hermann Hi l d
Abs tr ac t
The Mul t i - St at e Ti me Del ay Neur al Networ k ( MS- TDNN) i nt egr at es a nonl i near t i me al i gnment pr ocedur e ( DTW) and t he hi ghaccur acy phoneme s pot t i ng capabi l i t i es of a TDNNi nt o a connect i oni s t s peech r ecogni t i on s ys t emwi t h wor d- l evel cl as s i cat i on a
er r or backpr opagat i on. W
e pr es ent an MS- TDNN f or r ecogni z
cont i nuous l y s pel l ed l et t er s , a t as k char act er i zed by a
hi ghl y conf us abl e vocabul ar y. Our MS- TDNNachi eves
wor d accur acy on s peaker dependent /i ndependent
f or mi ng pr evi ous l y r epor t ed r es ul t s on t he s am
pos e t r ai ni ng t echni ques ai med at i mpr ovi
mance, i ncl udi ng f r ee al i gnment ac
r at i on model i ng and er r or backp
t han t he wor d l evel . Ar chi t
i zed on a s ubs et of
1
I NTRODUCT
The r ecogni t i on of s p
pr oper
© Copyright 2026 Paperzz