
Using TDT Data to Improve BN
Acoustic Models
Long Nguyen and Bing Xiang
STT Workshop
Martigny, Switzerland, Sept. 5-6, 2003
Overview
 TDT Data
 Selection procedure
– Data pre-processing
– Lightly-supervised decoding
– Selection
 Experimental results
 Conclusion & future work
TDT Data
 TDT2:
– Jan 1998 – June 1998
– Four sources (ABC, CNN, PRI, VOA)
– 1034 shows, 633 hrs
 TDT3:
– Oct 1998 – Dec 1998
– Six sources (ABC, CNN, MNB/MSN, NBC, PRI, VOA)
– 731 shows, 475 hrs
 TDT4:
– Oct 2000 – Jan 2001
– Same six sources as in TDT3
– 425 shows, 294 hrs
Selection Procedure
[Flow diagram] The TDT captions are normalized (CC => SNOR) and converted to STM reference files. The H4 LM is combined with the TDT captions to build a biased LM. The TDT audio is decoded with the H4 AM and the biased LM, producing recognition hypotheses (SNOR). Sclite scores the hypotheses against the STM references; the resulting SGML alignment drives the selection of the final transcripts.
Closed Caption Format
 TDT audio has closed-caption (CC) transcripts
– TDT2: *.sgm
– TDT3: *.src_sgm
– TDT4: *.tkn_sgm or *.src_sgm (in a different tagging scheme)
 Example: 20001103_2100_2200_VOA_ENG.src_sgm
<DOC>
<DOCNO> VOA20001103.2100.0345 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 11/03/2000 21:05:45.18 </DATE_TIME>
<BODY>
<TEXT>
US share prices closed mixed, Friday. The DOW Jones Industrial average ended the day
down 63 points. The NASDAQ Composite Index was 23 points higher. I'm John
Bashard, VOA News.
</TEXT>
</BODY>
<END_TIME> 11/03/2000 21:06:04.42 </END_TIME>
</DOC>
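
For concreteness, a minimal sketch of pulling such stories out of a *.src_sgm file. The tag names follow the example above; the regex-based approach (the CC files are SGML, not well-formed XML) and the Latin-1 encoding are assumptions:

import re

# Sketch: extract stories from a TDT *.src_sgm closed-caption file.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)

def parse_src_sgm(path):
    with open(path, encoding="latin-1") as f:   # assumed encoding
        data = f.read()
    stories = []
    for doc in DOC_RE.findall(data):
        def tag(name):
            m = re.search(rf"<{name}>(.*?)</{name}>", doc, re.DOTALL)
            return m.group(1).strip() if m else None
        stories.append({
            "docno": tag("DOCNO"),       # e.g. VOA20001103.2100.0345
            "start": tag("DATE_TIME"),   # e.g. 11/03/2000 21:05:45.18
            "end": tag("END_TIME"),
            "text": tag("TEXT"),         # the caption text itself
        })
    return stories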
CC to SNOR
 Normalize CC to SNOR transcripts for LM training
– Break into sentences
– ‘Verbalize’ numbers (63 => SIXTY THREE)
– Normalize acronyms and abbreviations (US => U. S.)
– Etc.
U. S. SHARE PRICES CLOSED MIXED FRIDAY
THE DOW JONES INDUSTRIAL AVERAGE ENDED THE DAY DOWN
SIXTY THREE POINTS
THE NASDAQ COMPOSITE INDEX WAS TWENTY THREE POINTS
HIGHER
I'M JOHN BASHARD V. O. A. NEWS
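
A minimal sketch of these normalization steps, assuming nothing about the site's actual rules; the number and acronym tables are illustrative stubs (teens and larger numbers are just spelled digit by digit here):

import re

# Sketch of CC -> SNOR normalization: sentence breaking, number
# verbalization, and acronym expansion, per the steps listed above.
ONES = ["ZERO", "ONE", "TWO", "THREE", "FOUR", "FIVE",
        "SIX", "SEVEN", "EIGHT", "NINE"]
TENS = {"2": "TWENTY", "3": "THIRTY", "4": "FORTY", "5": "FIFTY",
        "6": "SIXTY", "7": "SEVENTY", "8": "EIGHTY", "9": "NINETY"}
ACRONYMS = {"US": "U. S.", "VOA": "V. O. A."}   # illustrative stub

def verbalize(num):
    """'63' -> 'SIXTY THREE'; teens/large numbers fall back to digits."""
    if len(num) == 1:
        return ONES[int(num)]
    if len(num) == 2 and num[0] in TENS:
        return TENS[num[0]] + ("" if num[1] == "0" else " " + ONES[int(num[1])])
    return " ".join(ONES[int(d)] for d in num)

def cc_to_snor(text):
    # Break into sentences at terminal punctuation, then normalize words.
    for sent in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = [verbalize(w) if w.isdigit()
                 else ACRONYMS.get(w.upper(), w.upper())
                 for w in re.findall(r"[A-Za-z']+|\d+", sent)]
        if words:
            yield " ".join(words)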
CC and SNOR to STM
 Convert CC (and SNOR) to STM format for
scoring/aligning later
20001103_2100_2200_VOA_ENG 1 S8 345.000
364.000 <o,f0,unknown> u. s. share prices closed
mixed friday the dow jones industrial average ended the
day down sixty three points the nasdaq composite index
was twenty three points higher i'm john bashard v. o. a.
news
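
A small sketch of emitting one such STM record; the field layout (file, channel, speaker, begin, end, label, transcript) follows the example above, and the helper name is hypothetical:

# Sketch: format one STM record from SNOR lines; STM text is lower-cased.
def stm_record(show_id, channel, speaker, begin, end, snor_lines,
               label="<o,f0,unknown>"):
    text = " ".join(snor_lines).lower()
    return f"{show_id} {channel} {speaker} {begin:.3f} {end:.3f} {label} {text}"

print(stm_record("20001103_2100_2200_VOA_ENG", 1, "S8", 345.0, 364.0,
                 ["U. S. SHARE PRICES CLOSED MIXED FRIDAY"]))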
Lightly-supervised Decoding
 Start with a reasonable Hub4 system
– Acoustic models: ML-trained on H4-141hrs corpus
– Language models: Trigrams estimated on the 1998-2000 data
subset of the GigaWord corpus
 Build biased LMs by adding the TDT data with larger weights
(sketched after this list)
– Three LMs, one each for TDT2, TDT3, and TDT4
– 40k-word lexicon including all new words found in TDT that
have phonetic pronunciations
 Decode each show separately (as if it’s a new test set)
– N-Best decoder followed by N-Best rescoring using SI
acoustic models (GD, band-specific)
– Decode again after adapting acoustic models
– Total runtime is about 10xRT
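
The slides do not say how the larger weights were applied; one common way to bias an LM toward in-domain text is count merging, where the TDT caption n-gram counts are scaled before being pooled with the general counts. A minimal sketch of that idea, with the weight purely illustrative:

from collections import Counter

# Sketch: bias an LM toward the TDT captions by count merging.
def biased_counts(general_ngrams, tdt_ngrams, tdt_weight=10):
    counts = Counter(general_ngrams)
    for ng, c in tdt_ngrams.items():
        counts[ng] += tdt_weight * c   # captions count 10x (illustrative)
    return counts

# Toy bigram counts:
general = {("share", "prices"): 50, ("dow", "jones"): 5}
tdt = {("dow", "jones"): 3, ("nasdaq", "composite"): 2}
print(biased_counts(general, tdt))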
Alignment
 Use sclite to align hypotheses with CC transcripts to
take advantage of the time-stamped word alignments
stored in the SGML output
C,"u.","u.",345.270+345.440:C,"s.","s.",345.440+345.610:C,"share","share",345.610+3
45.840:C,"prices","prices",345.840+346.290:C,"closed","closed",346.290+346.630:
C,"mixed","mixed",346.630+347.000:C,"friday","friday",347.000+347.490:C,"the","t
he",347.490+347.580:C,"dow","dow",347.580+347.800:C,"jones","jones",347.800+
348.110:C,"industrial","industrial",348.110+348.720:C,"average","average",348.720
+349.130:C,"ended","ended",349.130+349.350:C,"the","the",349.350+349.430:C,"d
ay","day",349.430+349.610:C,"down","down",349.610+350.020:C,"sixty","sixty",35
0.020+350.410:C,"three","three",350.410+350.680:C,"points","points",350.680+351
.220:C,"the","the",351.840+351.940:C,"nasdaq","nasdaq",351.940+352.500:C,"com
posite","composite",352.500+352.960:C,"index","index",352.960+353.330:C,"was","
was",353.330+353.490:C,"twenty","twenty",353.490+353.800:C,"three","three",353.
800+354.020:C,"points","points",354.020+354.380:C,"higher","higher",354.380+35
4.860:C,"i'm","i'm",355.710+355.900:C,"john","john",355.900+356.180:I,,"burr",35
6.180+356.310:S,"bashard","shard",356.310+356.860:C,"v.","v.",356.860+357.010:
C,"o.","o.",357.010+357.160:C,"a.","a.",357.150+357.300:C,"news","news",357.300
+358.060
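
A sketch of turning such a string back into word tuples; the entry pattern (TAG,"ref","hyp",start+end, entries joined by colons) is inferred from the example above, and entries that do not carry both fields and times are simply skipped:

import re

# Sketch: parse a sclite word-alignment string into
# (tag, ref, hyp, start, end) tuples; tag is C/S/I/D.
ENTRY_RE = re.compile(r'([CSID]),"?([^"]*)"?,"?([^"]*)"?,([\d.]+)\+([\d.]+)')

def parse_alignment(line):
    words = []
    for entry in line.split(":"):
        m = ENTRY_RE.match(entry)
        if m:
            tag, ref, hyp, start, end = m.groups()
            words.append((tag, ref, hyp, float(start), float(end)))
    return words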
Selection Strategy
 Search through the SGML file to select
– Utterances having no errors
– Phrases of 3+ contiguous correct words
 In effect, use only the subset of words on which the CC
transcripts and the decoder’s hypotheses agree (sketched below)
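
A sketch of these two rules, applied to the (tag, ref, hyp, start, end) tuples produced by a parser like parse_alignment above:

# Sketch: keep error-free utterances whole; otherwise keep runs of
# 3 or more contiguous correct ('C') words.
def select_segments(words, min_run=3):
    if words and all(tag == "C" for tag, *_ in words):
        return [words]                   # rule 1: error-free utterance
    runs, run = [], []
    for w in words:
        if w[0] == "C":
            run.append(w)
        else:
            if len(run) >= min_run:
                runs.append(run)         # rule 2: 3+ correct in a row
            run = []
    if len(run) >= min_run:
        runs.append(run)
    return runs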
Selection Results
 The amount of data selected from TDT2, TDT3
and TDT4 (in hours)
Set    Raw    Transcribed   Cor. Utts   Cor. Utts & Phrases
TDT2   633    425           143         305
TDT3   475    328           119         241
TDT4   294    213           73          156
All    1402   966           335         702
 Only 68% of the TDT audio has CC transcripts (966/1402 hrs)
– Determined from long passages of contiguous insertion errors
in the alignment (speech with no corresponding caption text)
 Selection yield rate is 72% (702/966 hrs), or a 50% yield
rate relative to the amount of raw audio data
Scalability
 Trained 4 sets of acoustic models
– ML, HLDA-SAT only
– (not quite ready to use MMI training yet)
 System parameters grow as more data is added if the
thresholds and/or criteria for speaker clustering, state
clustering, and Gaussian mixing stay fixed.
Training Set        Amount    #spkrs   #cbks   #gauss
h4                  141 hrs   7k       6k      164k
h4+tdt4             297 hrs   12k      13k     354k
h4+tdt4+tdt2        602 hrs   23k      26k     720k
h4+tdt4+tdt2+tdt3   843 hrs   31k      34k     983k
Experimental Results
 Tested on the BN dev03 test set (h4d03)
 Used same RT03 Eval LMs
 Doubling the data (150 => 300 hrs) provided a 0.7%
abs reduction in adapted WER
 Doubling again (300 => 600 hrs) provided an
additional 0.6% abs reduction
AM trained on   SI     Adapt 1   Adapt 2
141 hrs         17.2   13.0      12.7
297 hrs         15.4   12.2      12.0
602 hrs         14.7   11.6      11.4
843 hrs         14.5   ?         ?
Un-Adapted Results in Detail
 Significant reduction across all shows when adding
the TDT4 data to the Hub4 BN data
Set       ABC    CNN    MSN    NBC    PRI    VOA    All
141 hrs   15.5   22.3   13.2   13.6   12.1   25.6   17.2
297 hrs   13.8   21.4   11.5   11.8   10.6   22.9   15.4
602 hrs   13.6   20.0   11.7   10.9   10.3   21.3   14.7
Adapted Results in Detail
 No noticeable reduction observed for the MSN and
NBC shows when adding the TDT2 data. [These two
types of shows were not part of the TDT2 corpus]
Set       ABC    CNN    MSN    NBC    PRI    VOA    All
141 hrs   11.4   18.6   9.8    11.5   9.7    15.6   12.7
297 hrs   10.8   18.0   9.0    10.2   9.2    15.0   12.0
602 hrs   10.3   16.7   9.0    10.0   8.8    14.0   11.4
Summary
 Proposed an effective strategy for automatic
selection of BN audio data having closed-caption
transcripts as (additional) acoustic training data
 68% of the TDT audio data are captioned
 Selection yield rate is 72% of the captioned data
 Adding 450 hrs of data selected from the TDT2 and
TDT4 corpora provides a 1.3% abs reduction in WER
on the BN dev03 test set
Future Work
 Obtain results when adding the TDT3 data
 Improve the biased LMs and retry
 Understand the differences/errors in aligning the
hypotheses and the closed captions to refine the
selection criteria
 Cooperate with other sites to speed up and improve
the data selection process
 Use MMI training with this large amount of training data