
LANGUAGE MODEL TRANSFORMATION APPLIED TO LIGHTLY SUPERVISED
TRAINING OF ACOUSTIC MODEL FOR CONGRESS MEETINGS
Tatsuya Kawahara
Masato Mimura
Yuya Akita
Kyoto University, Academic Center for Computing and Media Studies
Sakyo-ku, Kyoto 606-8501, Japan
[email protected]
ABSTRACT
For effective training of acoustic and language models for spontaneous speech such as meetings, it is important to exploit texts that are available on a large scale but may not be faithful transcripts of the utterances. We have proposed a language model transformation scheme to cope with the differences between verbatim transcripts of spontaneous utterances and human-made transcripts such as those in proceedings. In this paper, we investigate its application to lightly supervised training of the acoustic model. By transforming the corresponding text in the proceedings, we can generate a very constrained model that predicts the actual utterances. Experimental evaluation with the transcription system for the Japanese Congress meetings demonstrated that the proposed scheme generates accurate labels for acoustic model training and thus achieves ASR (Automatic Speech Recognition) performance comparable to that obtained with manual transcripts.
Index Terms— speech recognition, language model, acoustic
model, lightly supervised training
1. INTRODUCTION
Recently, the major target of LVCSR (Large-Vocabulary Continuous Speech Recognition) systems has been shifting to
spontaneous speech such as conversations, lectures and meetings. One of the most fundamental problems in training an
acoustic model for this kind of spontaneous speech is the insufficient amount of training data to cover the wide variation of acoustic features. Preparing a large speech corpus is difficult and costly, because transcribing utterances with many disfluencies requires far more manual effort than transcribing read speech from prepared materials.
To address this problem, the approach of lightly supervised training has been proposed and investigated [1]. This approach exploits texts that are not faithful transcripts of what was spoken but have been "cleaned" by human transcribers, and uses them to generate labels for acoustic model training. Most conventional works made use of closed-caption texts of broadcasts in two ways: first, to generate a biased language model
to automatically transcribe the speech, and then to filter the recognition hypotheses by aligning them with the caption texts. On the other hand, Chan et al. [2] reported that the second step of data selection was not useful and that all the data should be used instead. Nguyen et al. [3] pointed out that it is vital to obtain the words that the recognizer would have failed to recognize with a baseline model.
Considering these findings, in this paper we address effective label generation for lightly supervised training. We have been developing an automatic transcription system for the Japanese Congress (Diet) meetings [4]. Unlike the European Parliament Plenary Sessions (EPPS), which were targeted by the TC-STAR project [5][6][7], the great majority of the sessions in the Japanese Congress are committee meetings. They are more interactive, and thus more spontaneous, than plenary sessions. This means that there are many differences between the actual utterances and the final text in the proceedings or minutes; in our preliminary analysis, the edit distance between the two is around 11%. Lightly supervised training has recently been explored for the EPPS by Paulik et al. [8], but they simply use the final text edition (FTE) as it is, although they also try to use the translated version.
We present a method that estimates, from the proceedings text, a language model that accurately predicts the actual utterances. We have previously proposed a scheme of language model transformation and demonstrated its effectiveness in baseline language modeling [4]. In this paper, the scheme is applied to transcript generation for lightly supervised training of the acoustic model, so that we do not have to prepare manual transcripts for updating the model.
The remainder of this paper is organized as follows. Section 2 gives an overview of the proposed scheme of language
model transformation. In Section 3, we describe lightly supervised training of the acoustic model based on the language model transformation. Experimental evaluations with
the transcription system for the Japanese Congress meetings
are presented in Section 4. Finally, Section 5 concludes the
paper.
2. STATISTICAL TRANSFORMATION OF
LANGUAGE MODEL
We have proposed a scheme of language model transformation to cope with the differences between spontaneous utterances (verbatim text: V) and human-made transcripts (written-style text: W) [9]. In this scheme, the two are regarded as different languages and statistical machine translation (SMT) is applied. It can be applied in both directions: to convert a faithful transcript of the spoken utterances to a document-style text, and to recover the faithful transcript from the human-made text.
The decoding process is formulated in the same manner
as SMT, which is based on the following Bayes’ rule.
p(W|V) = p(W) · p(V|W) / p(V)    (1)

p(V|W) = p(V) · p(W|V) / p(W)    (2)
Here the denominator is usually ignored in decoding. Note that the process of uniquely determining V (eq. 2) is much more difficult than the cleaning process (eq. 1), because there are more arbitrary choices in this direction; for example, fillers can be inserted almost anywhere in (eq. 2), while all fillers are simply removed in (eq. 1). Therefore, we are more interested in estimating the statistical language model of V than in recovering the text of V itself. Thus, we derive the following estimation formula.
p(V) = p(W) · p(V|W) / p(W|V)    (3)
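Equation (3) follows directly from writing the joint probability of V and W in both directions; a minimal derivation:

```latex
% p(V,W) can be factored in two ways (Bayes' rule):
%   p(V,W) = p(V|W) p(W) = p(W|V) p(V)
% Solving the right-hand equality for p(V) gives eq. (3):
\begin{align}
  p(V\mid W)\,p(W) &= p(W\mid V)\,p(V) \\
  \Rightarrow\quad p(V) &= p(W)\cdot\frac{p(V\mid W)}{p(W\mid V)}
\end{align}
```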
The key point of this scheme is that the available text size
of the document-style texts W is much larger than that of
the verbatim texts V needed for training ASR systems. For
the Congress meetings, we have huge archives of proceedings texts. Therefore, we fully exploit their statistics p(W ) to
estimate the language model p(V ) for ASR.
The transformation is actually performed on occurrence
counts of N-grams as below.
Ngram(v_1^n) = Ngram(w_1^n) · p(v|w) / p(w|v)    (4)
Here v and w denote individual transformation patterns. We model substitution w→v, deletion of w, and insertion of v, taking their contextual words into account.1 Ngram(w_1^n) is an N-gram entry (count) containing such a pattern, and it is revised to Ngram(v_1^n). Estimation of the conditional probabilities p(v|w) and p(w|v) requires an aligned corpus of verbatim transcripts and their corresponding document-style texts. We have constructed this "parallel" corpus using a part of the proceedings of the Japanese Congress meetings.
1 Unlike ordinary SMT, permutation of words is not considered in this transformation.
Fig. 1. Procedure overview of the proposed method: the final texts in the proceedings (W) are aligned with verbatim transcripts to build a parallel corpus, from which the transformation models p(V|W) and p(W|V) are trained; combined with p(W) estimated from the huge proceedings archives, the LM transformation yields p(V), which is used for ASR-based label generation for acoustic model (AM) training.
The conditional probabilities are estimated by counting the corresponding patterns observed in this corpus. Neighboring words are taken into account in defining the transformation patterns for precise modeling. For example, insertion of a filler "ah" is modeled as {w = (w−1, w+1) → v = (w−1, ah, w+1)}, and the N-gram entries affected by this insertion are revised. A smoothing technique based on POS (Part-Of-Speech) information is introduced to mitigate the data sparseness problem. Please refer to [4] for implementation details and evaluation. The approach was shown to be effective in language modeling for the automatic meeting transcription system.
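As a concrete illustration of how eq. (4) revises N-gram counts, the sketch below applies a context-dependent filler-insertion pattern to bigram counts. It is a minimal sketch under simplifying assumptions (bigrams only, a toy pattern table, English stand-in words); the function and variable names are ours, not those of the actual implementation.

```python
from collections import Counter

def apply_filler_insertion(bigram_counts, patterns):
    """Minimal sketch of eq. (4) for filler-insertion patterns.

    bigram_counts: written-style bigram counts, e.g. from one turn's proceedings text.
    patterns: maps a context (w_prev, w_next) to a list of
              (filler, p_v_given_w, p_w_given_v) tuples estimated on the
              aligned parallel corpus of verbatim and proceedings texts.
    Returns verbatim-style counts in which part of each affected bigram's
    mass is rerouted through the inserted filler, scaled by p(v|w)/p(w|v).
    """
    verbatim = Counter(bigram_counts)
    for (w_prev, w_next), count in bigram_counts.items():
        for filler, p_vw, p_wv in patterns.get((w_prev, w_next), []):
            mass = count * p_vw / p_wv          # scaling ratio of eq. (4)
            verbatim[(w_prev, filler)] += mass  # new bigrams created by the insertion
            verbatim[(filler, w_next)] += mass
            verbatim[(w_prev, w_next)] -= mass  # corresponding written-style mass removed
    return verbatim

# toy usage: a filler "ah" inserted between "I" and "think" with p(v|w)=0.3, p(w|v)=1.0
counts = Counter({("I", "think"): 10.0, ("think", "so"): 10.0})
patterns = {("I", "think"): [("ah", 0.3, 1.0)]}
print(apply_filler_insertion(counts, patterns))
```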
3. LIGHTLY SUPERVISED TRAINING OF
ACOUSTIC MODEL
In this paper, we apply the language model transformation scheme to lightly supervised training of the acoustic model. For the Congress meetings, we have large archives of speech that are not faithfully transcribed but have edited texts in the proceedings. Since it is not possible to uniquely recover the original verbatim transcript from the proceedings text, as mentioned in the previous section, we instead generate a dedicated language model from the corresponding proceedings text and use it to decode the speech. As a result of ASR, we expect to obtain a verbatim transcript with high accuracy.
The whole process is depicted in Fig. 1. For each turn of the meetings (typically ten seconds to three minutes, on average about one minute), we compute N-gram counts from the corresponding final text in the proceedings. We adopt the turn as the processing unit because a whole session (typically two to five hours) is too long and contains a variety of topics and speakers; a preliminary experimental comparison is presented in the next section. Automatic topic segmentation [10] could be incorporated for more precise modeling, but we did not use it in this work.
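A minimal sketch of this per-turn counting step is given below, using whitespace tokenization on an English stand-in sentence; the actual system would operate on morphologically analyzed Japanese, and the names here are illustrative only.

```python
from collections import Counter

def count_ngrams(turn_text, max_order=3):
    """Count 1- to max_order-grams from one turn's proceedings text.
    Whitespace tokenization stands in for the morphological analysis
    a Japanese system would actually require."""
    tokens = ["<s>"] + turn_text.split() + ["</s>"]
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# toy usage on a stand-in sentence from a turn
print(count_ngrams("I would like to ask about the budget")[("like", "to", "ask")])
```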
The N-gram entries and counts are then converted to the verbatim style using the transformation model (eq. 4); here we count up to trigrams. Insertion of fillers and omission of particles in the N-gram chain are also modeled in this process, taking their context into account.
Then, ASR is conducted using the dedicated model to produce a verbatim transcript. The model is very constrained, yet it is still expected to predict spontaneous phenomena such as filler insertions. For acoustic model training, phone sequences are required. To cope with pronunciation variation in spontaneous speech, possible pronunciation variation patterns (surface forms) are predicted based on our generalized statistical model [11] and added to the dictionary used in ASR. The best phone hypothesis is used as the label for standard HMM training based on the ML (Maximum Likelihood) criterion. For discriminative training, such as with the MPE (Minimum Phone Error) criterion, we also generate competing hypotheses using a baseline language model. It is also possible to explore further correction of the ASR result by referring to the cleaned text [12].
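To summarize the procedure in this section, the sketch below strings the steps together for a batch of turns. The callables passed in (count_ngrams, transform_counts, build_turn_lm, decode, best_phone_sequence) are hypothetical placeholders for the corpus tools and the ASR decoder, so this is an outline of the data flow rather than a runnable system.

```python
def generate_training_labels(turns, count_ngrams, transform_counts,
                             build_turn_lm, decode, best_phone_sequence):
    """Sketch of the turn-by-turn lightly supervised label generation loop.

    turns: iterable of (audio, proceedings_text) pairs, one per speaker turn.
    The five callables are stand-ins for the real components: N-gram counting,
    the statistical LM transformation of Sec. 2, turn-level LM construction,
    LVCSR decoding with the variant-augmented lexicon, and phone-label extraction.
    """
    labels = []
    for audio, proceedings_text in turns:
        counts = count_ngrams(proceedings_text, max_order=3)     # counts from the final text
        verbatim_counts = transform_counts(counts)                # eq. (4): to verbatim style
        turn_lm = build_turn_lm(verbatim_counts)                  # very constrained, turn-specific LM
        hypothesis = decode(audio, turn_lm)                       # ASR with the dedicated model
        labels.append((audio, best_phone_sequence(hypothesis)))   # label for ML training of HMMs
    return labels
```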
4. EXPERIMENTAL EVALUATION
The proposed scheme has been implemented and evaluated with the automatic transcription system for the Japanese Congress meetings. The training corpus for the language transformation model and for the baseline acoustic model (via standard supervised training) was built from dozens of meetings held in 2003-2005. It consists of 134 hours of speech, which were faithfully transcribed and aligned with the final text in the proceedings to train the transformation model. The text size is about 1.8M words.
The acoustic features were 12-dimensional MFCCs and power, together with their Δ and ΔΔ. CMN (Cepstral Mean Normalization), CVN (Cepstral Variance Normalization), and VTLN (Vocal Tract Length Normalization) were performed. The acoustic model consists of triphone HMMs with 5000 shared states, each having 32 Gaussian components. We also incorporate MPE training.
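For illustration, a front-end along these lines can be sketched with librosa; this is a minimal approximation under assumed frame settings (VTLN and the exact power term are omitted), not the system's actual front-end.

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    """Sketch of the feature extraction: 12 MFCCs + log power, with deltas and
    delta-deltas, followed by per-utterance CMN/CVN (VTLN is omitted here)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)           # (12, T)
    log_power = np.log(librosa.feature.rms(y=y) + 1e-10)         # (1, T)
    static = np.vstack([mfcc, log_power])                        # 13 static features
    feats = np.vstack([static,
                       librosa.feature.delta(static),            # Δ
                       librosa.feature.delta(static, order=2)])  # ΔΔ -> 39 dims
    # cepstral mean and variance normalization over the utterance
    feats = (feats - feats.mean(axis=1, keepdims=True)) / (feats.std(axis=1, keepdims=True) + 1e-10)
    return feats.T  # (T, 39)
```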
4.1. Evaluation on Label Generation
The additional training data for lightly supervised training of the acoustic model were prepared from meetings held in 2006 and 2007. During these years, a general election was held and the prime minister and cabinet members were replaced, so many of the major speakers in the meetings differ from those in the training corpus. In this experiment, we aim to update the acoustic model using these speech data. The data consist of 26 sessions and 91 hours of speech in total.
Table 1. Accuracy of transcripts for acoustic model training

unit      method              Corr.    Acc.
session   baseline            82.3%    79.5%
session   proceedings         83.6%    81.3%
session   proc. + baseline    86.5%    83.9%
session   proposed method     85.6%    83.3%
turn      proceedings         86.1%    83.5%
turn      proc. + filler      88.7%    86.2%
turn      proposed method     91.6%    89.9%
The proposed method was applied to generate transcripts for these data. We first investigate the accuracy of the generated transcripts, measured by word accuracy against faithful transcripts that were manually prepared for this evaluation. We compared the turn-based modeling with the case using a single model for a whole session. For the whole session, we can generate a biased language model by interpolating the final text of the proceedings with the baseline model; this is the conventional method widely used in previous works. Note that applying this method to the turn-based modeling is not practical, because there are a huge number of turns and the biased language model is very large (1.6GB in this experiment). On the other hand, the proposed language model transformation method generates a very compact model (100MB) and thus can easily be applied to many turn units.
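The correct (Corr.) and accuracy (Acc.) figures in Table 1 follow the usual ASR scoring definitions, Corr = (N - S - D)/N and Acc = (N - S - D - I)/N, over a word-level alignment. Below is a minimal sketch of computing them; it is a hypothetical helper of ours, not necessarily the scoring tool actually used.

```python
def word_corr_acc(reference, hypothesis):
    """Levenshtein-align reference and hypothesis word lists and return
    (percent correct, percent accuracy) as in standard ASR scoring."""
    n, m = len(reference), len(hypothesis)
    # d[i][j] = minimum edit cost between reference[:i] and hypothesis[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # match / substitution
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    # backtrack to count substitutions, deletions and insertions
    s = dele = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if reference[i - 1] == hypothesis[j - 1] else 1):
            s += 0 if reference[i - 1] == hypothesis[j - 1] else 1
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dele += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    corr = 100.0 * (n - s - dele) / n
    acc = 100.0 * (n - s - dele - ins) / n
    return corr, acc

# toy usage
print(word_corr_acc("i would ah like to ask".split(), "i would like to ask well".split()))
```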
For comparison within the turn-based modeling, we generated two different language models dedicated to label generation. One model (proceedings) simply used the corresponding final text of the proceedings; this corresponds to the conventional method such as [8]. The other (proc. + filler) incorporated filler entries into the language model. Specifically, we added all filler entries to the lexicon and assigned them the unigram statistics of the training corpus by discounting the pause entry in the proceedings model. This is a simplified version of the proposed method, focusing only on fillers and disregarding their context.
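A minimal sketch of this filler-augmented unigram adjustment follows; the pause token name and the amount of mass moved are illustrative assumptions, as the paper does not specify them.

```python
def add_filler_unigrams(proc_unigrams, filler_unigrams, pause="<pause>", discount=0.5):
    """Move a fraction of the pause unigram mass in the proceedings-based model
    to filler entries, distributing it in proportion to the fillers' unigram
    statistics in the training corpus. Probabilities stay normalized because
    mass is only moved, not created."""
    probs = dict(proc_unigrams)
    mass = discount * probs.get(pause, 0.0)
    probs[pause] = probs.get(pause, 0.0) - mass
    total = sum(filler_unigrams.values())
    for filler, p in filler_unigrams.items():
        probs[filler] = probs.get(filler, 0.0) + mass * p / total
    return probs

# toy usage
proc = {"<pause>": 0.10, "budget": 0.05}
fillers = {"ah": 0.6, "eh": 0.4}
print(add_filler_unigrams(proc, fillers))
```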
Table 1 lists the word percent correct (Corr.) and accuracy (Acc.). It is apparent that the turn-based models result in significantly higher accuracy than the whole-session models, because they provide stronger constraints. With the proposed language model transformation method, we could recall 92% of the words, and the accuracy is higher than that obtained by simply using the proceedings text by 6% absolute. It is also confirmed that the proposed statistical model outperforms the simple filler-insertion model.
4.2. Evaluation on ASR
Then, we investigate the effect of the lightly supervised training on ASR using another test set of two sessions of the budget committee, which were held in February 2008. They constitute 2.5 hours of speech comprising 121 turns.
Table 2. ASR performance for the test set (WER)

acoustic model       ML training   MPE training
baseline model       14.6%         13.2%
proposed method      14.2%         12.6%
reference (manual)   14.1%         12.4%
The language model for ASR was trained by applying the transformation to the large archives of proceedings, whose text size is 87M words. The vocabulary size is 52K. Our decoder Julius (http://julius.sourceforge.jp/) was used for LVCSR.
Table 2 shows the recognition performance in Word Error Rate (WER) for ML training and MPE training. The baseline is the result of the baseline model without incorporating the additional data for acoustic model training. The proposed method of lightly supervised training reduced the WER by 0.6% absolute with MPE training, which is a statistically significant improvement. For reference, the acoustic model was also trained on the same data using the manual transcripts, and the recognition performance was much the same as that of the proposed method. This result demonstrates that we can update the acoustic model without manual transcription.3
In this evaluation setup, the size of the additional data was smaller than that of the baseline training corpus, since the primary goal was updating the model. We expect the benefit of the proposed method to be larger when a much larger amount of data is collected over a longer period.
5. CONCLUSIONS
We have presented a language model transformation scheme applied to lightly supervised training of the acoustic model. The experimental evaluation confirmed that the proposed method can generate accurate labels for model training, so that the acoustic model can be updated effectively, achieving performance comparable to that obtained with manual transcripts. The technique is vital for updating the acoustic model to reflect changes in the membership of the Congress, together with updating the language model to reflect new topics, which can also be done automatically with the transformation model.
We will investigate the effect of the scheme over the long term once the ASR module is deployed in the Japanese Congress. We also plan to apply the proposed scheme to the lecture transcription system.
Acknowledgment: The work is partially supported by the Strategic Information and Communications R&D Promotion Programme (SCOPE), Ministry of Internal Affairs and Communications, Japan.
3 We could not conduct evaluation of the other methods in Table 1 in lightly supervised training, since it would take too much time.
6. REFERENCES

[1] L. Lamel, J. Gauvain, and G. Adda. Investigating lightly supervised acoustic model training. In Proc. IEEE-ICASSP, volume 1, pages 477–480, 2001.
[2] H. Y. Chan and P. Woodland. Improving broadcast news transcription by lightly supervised discriminative training. In Proc. IEEE-ICASSP, volume 1, pages 737–740, 2004.

[3] L. Nguyen and B. Xiang. Light supervision in acoustic model training. In Proc. IEEE-ICASSP, volume 1, pages 185–188, 2004.

[4] Y. Akita and T. Kawahara. Topic-independent speaking-style transformation of language model for spontaneous speech recognition. In Proc. IEEE-ICASSP, volume 4, pages 33–36, 2007.

[5] C. Gollan, M. Bisani, S. Kanthak, R. Schlueter, and H. Ney. Cross domain automatic transcription on the TC-STAR EPPS corpus. In Proc. IEEE-ICASSP, volume 1, pages 825–828, 2005.

[6] J. Loeoef, M. Bisani, C. Gollan, G. Heigold, B. Hoffmeister, C. Plahl, R. Schlueter, and H. Ney. The 2006 RWTH parliamentary speeches transcription system. In Proc. TC-STAR Workshop on Speech-to-Speech Translation, pages 133–138, 2006.

[7] B. Ramabhadran, O. Siohan, L. Mangu, G. Zweig, M. Westphal, H. Schulz, and A. Soneiro. The IBM 2006 speech transcription system for European parliamentary speeches. In Proc. INTERSPEECH, pages 1225–1228, 2006.

[8] M. Paulik and A. Waibel. Lightly supervised acoustic model training on EPPS recordings. In Proc. INTERSPEECH, pages 224–227, 2008.

[9] Y. Akita and T. Kawahara. Efficient estimation of language model statistics of spontaneous speech via statistical transformation model. In Proc. IEEE-ICASSP, volume 1, pages 1049–1052, 2006.

[10] Y. Akita, Y. Nemoto, and T. Kawahara. PLSA-based topic detection in meetings for adaptation of lexicon and language model. In Proc. INTERSPEECH, pages 602–605, 2007.

[11] Y. Akita and T. Kawahara. Generalized statistical modeling of pronunciation variations using variable-length phone context. In Proc. IEEE-ICASSP, volume 1, pages 689–692, 2005.

[12] S. Petrik and G. Kubin. Reconstructing medical dictations from automatically recognized and non-literal transcripts with phonetic similarity matching. In Proc. IEEE-ICASSP, volume 4, pages 1125–1128, 2007.