LYU0103: Speech Recognition Techniques for Digital Video Library
Supervisor: Prof. Michael R. Lyu
Students: Gao Zheng Hong, Lei Mo
Outline of Presentation
- Project objectives
- ViaVoice recognition experiments
- Speech information processor
- Audio information retrieval
- Summary
Our Project Objectives
- Speech recognition
- Audio information retrieval
Last Term’s Work
- Extracted the audio channel (stereo, 44.1 kHz) from MPEG video files into wave files (mono, 22 kHz)
- Segmented the wave files into sentences by detecting frame energy (a minimal sketch follows below)
- Performed real-time dictation with IBM ViaVoice, a speech recognition engine developed by IBM
- Developed a visual training tool
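For illustration, a minimal Python version of the energy-based sentence segmentation; the 20 ms frame length, the 10%-of-mean-energy threshold, and the 15-frame silence gap are assumptions, since the slides do not record the exact values used:

    import wave
    import numpy as np

    def split_sentences(path, frame_ms=20, thresh_ratio=0.1, min_silence=15):
        """Cut a mono 16-bit PCM wave file at runs of low-energy frames.
        Returns (start, end) times in seconds for each sentence."""
        with wave.open(path, "rb") as w:
            sr = w.getframerate()
            x = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
        flen = int(sr * frame_ms / 1000)
        n = len(x) // flen
        frames = x[:n * flen].reshape(n, flen).astype(np.float64)
        energy = (frames ** 2).mean(axis=1)          # mean squared amplitude
        silent = energy < thresh_ratio * energy.mean()
        spans, start, quiet = [], None, 0
        for i, s in enumerate(silent):
            if not s and start is None:
                start = i                             # sentence begins
            quiet = quiet + 1 if s else 0
            if start is not None and quiet >= min_silence:
                spans.append((start, i - min_silence + 1))
                start, quiet = None, 0                # sentence ends
        if start is not None:
            spans.append((start, n))
        return [(a * flen / sr, b * flen / sr) for a, b in spans]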
Visual Training Tool
(Screenshot: video window, dictation window, and text editor)
IBM ViaVoice Experiments
- Employed 7 student helpers to produce transcripts of 77 news video clips
- Conducted four experiments:
  - Baseline measurement
  - Trained model measurement
  - Slow-down measurement
  - Indoor news measurement
Baseline Measurement
- Measures the ViaVoice recognition accuracy on TVB news video
- Testing set: 10 video clips
- The segmented wave files are dictated
- The Hidden Markov Model Toolkit (HTK) is used to score the accuracy (a sketch of the metric follows)
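HTK aligns the dictated words against the reference transcript and reports accuracy as (N − S − D − I)/N, where S, D, and I are substitutions, deletions, and insertions. A small Python sketch of that statistic (illustrative only, not HTK code):

    def word_accuracy(ref, hyp):
        """Accuracy = (N - S - D - I) / N from the best alignment
        of hypothesis `hyp` against reference transcript `ref`."""
        R, H = len(ref), len(hyp)
        # dp[i][j] = (cost, subs, dels, ins) for ref[:i] vs hyp[:j]
        dp = [[None] * (H + 1) for _ in range(R + 1)]
        for i in range(R + 1):
            dp[i][0] = (i, 0, i, 0)
        for j in range(1, H + 1):
            dp[0][j] = (j, 0, 0, j)
        for i in range(1, R + 1):
            for j in range(1, H + 1):
                c, s, d, n = dp[i - 1][j - 1]        # match / substitution
                best = (c, s, d, n) if ref[i - 1] == hyp[j - 1] \
                    else (c + 1, s + 1, d, n)
                c, s, d, n = dp[i - 1][j]            # deletion
                if c + 1 < best[0]:
                    best = (c + 1, s, d + 1, n)
                c, s, d, n = dp[i][j - 1]            # insertion
                if c + 1 < best[0]:
                    best = (c + 1, s, d, n + 1)
                dp[i][j] = best
        _, s, d, n = dp[R][H]
        return (R - s - d - n) / R

    # e.g. 2 substitutions + 1 insertion over 4 reference words -> 0.25
    print(word_accuracy("the news at six".split(),
                        "a news at sick tonight".split()))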
Trained Model Measurement
- Measures the accuracy of ViaVoice after training it with its own correctly recognized words
- 10 video clips are segmented and dictated
- The correctly dictated words of the training set are fed back to ViaVoice through the SMAPI function SmWordCorrection
- The "baseline measurement" procedure is repeated after training to obtain the recognition performance
- The procedure is repeated using 20 video clips
Slow Down Measurement
- Investigates the effect of slowing down the audio channel
- The segmented wave files in the testing set are resampled by ratios of 1.05, 1.1, 1.15, 1.2, 1.3, 1.4, and 1.6 (see the sketch below)
- The "baseline measurement" procedure is then repeated
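A minimal sketch of the slow-down step, assuming the simple approach of stretching the waveform by linear interpolation while keeping the sample rate (which also lowers the pitch; whether the actual tool preserved pitch is not stated in the slides). The file names are hypothetical:

    import wave
    import numpy as np

    def slow_down(in_path, out_path, ratio):
        """Stretch a mono 16-bit wave file by `ratio` (>1 slows it
        down) via linear interpolation, keeping the sample rate."""
        with wave.open(in_path, "rb") as w:
            params = w.getparams()
            x = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
        t = np.linspace(0, len(x) - 1, int(len(x) * ratio))
        y = np.interp(t, np.arange(len(x)), x).astype(np.int16)
        with wave.open(out_path, "wb") as w:
            w.setparams(params)        # nframes is fixed up on close
            w.writeframes(y.tobytes())

    for r in (1.05, 1.1, 1.15, 1.2, 1.3, 1.4, 1.6):
        slow_down("seg001.wav", f"seg001_x{r}.wav", r)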
Indoor News Measurement
- Eliminates the effect of background noise
- Only indoor news reporter sentences are selected
- The test set is dictated using the untrained model
- The procedure is repeated using the trained model
Experimental Results

Experiment                         Accuracy (max. performance)
Baseline                           25.27%
Trained model                      25.87% (trained with 20 videos)
Slow speech                        25.67% (max. at ratio = 1.15)
Indoor speech (untrained model)    35.22%
Indoor speech (trained model)      36.31% (trained with 20 videos)

Overall recognition results (ViaVoice, TVB News)
Experimental Results (cont.)

Trained video number    Accuracy
Untrained               25.27%
10 videos               25.87%
20 videos               25.82%

Result of the trained model with different numbers of training videos

Ratio          1      1.05   1.1    1.15   1.2    1.3    1.4    1.5
Accuracy (%)   25.27  25.46  25.63  25.67  25.82  17.18  12.34  4.04

Result of using different slow-down ratios
Analysis of Experimental Results
- Trained model: about 1% accuracy improvement
- Slowing down the speech: about 1% accuracy improvement
- Indoor speech is recognized much better
- Mandarin: the estimated baseline accuracy is about 70%, far higher than for Cantonese
Experiment Conclusions
- Four reasons for the low accuracy:
  - Language model mismatch
  - Voice channel mismatch
  - The broadcast speech is very fast and some characters are not articulated clearly
  - The audio volume of the video clips is too loud
- The first two reasons are the most critical
Speech Recognition Approach
- We cannot do much acoustic model training through the ViaVoice API
- Training is speaker dependent
- There is a great difference between the news audio and the speech ViaVoice was trained on
- A tool to adapt the acoustic model is not currently available
- Manual editing is therefore necessary to produce correct subtitles
Speech Information Processor (SIP)
(Screenshot: media player, text editor, and audio information panel)
Main Features
- Media playback
- Real-time dictation
- Word time information
- Dynamic recognition text editing
- Audio scene change detection
- Audio segment classification
- Gender classification
System Chart
Timing Information Retrieval
- Uses the ViaVoice Speech Manager API (SMAPI) with asynchronous callbacks
- The recognized text arrives in a basic unit called a "firm word"
- SIP builds an index storing the position and time of each firm word (a sketch of the index follows)
- The corresponding firm word is highlighted during video playback
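A minimal sketch of such an index, assuming each SMAPI callback delivers a firm word with its character position and utterance time (the FirmWord fields here are illustrative names of ours, not SMAPI's):

    import bisect
    from dataclasses import dataclass

    @dataclass
    class FirmWord:
        text: str
        char_pos: int     # offset in the editor text
        start_ms: int     # utterance time reported by the engine
        end_ms: int

    class WordIndex:
        def __init__(self):
            self.words = []         # kept sorted by start_ms
            self.starts = []
        def add(self, w):
            i = bisect.bisect(self.starts, w.start_ms)
            self.starts.insert(i, w.start_ms)
            self.words.insert(i, w)
        def word_at(self, t_ms):
            """Firm word to highlight at playback time t_ms, or None."""
            i = bisect.bisect_right(self.starts, t_ms) - 1
            if i >= 0 and self.words[i].end_ms >= t_ms:
                return self.words[i]
            return None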
Dynamic Index Alignment
- Editing the recognized result may change the firm word structure
- The word index must be updated accordingly
- SIP captures the WM_CHAR event of the text editor
- It then searches for the modified words and updates the corresponding index entries (see the sketch below)
- In practice, binary search gives good response time
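A sketch of the index update on a WM_CHAR edit, continuing the FirmWord index above; it assumes firm words are also ordered by character position (they arrive in reading order), so the first affected entry can be located by binary search:

    def on_edit(index, pos, delta):
        """Shift char offsets of firm words at or after an edit at
        `pos` that inserted (delta > 0) or deleted (delta < 0)
        characters. Binary search finds the first affected word."""
        lo, hi = 0, len(index.words)
        while lo < hi:                     # O(log n) locate
            mid = (lo + hi) // 2
            if index.words[mid].char_pos < pos:
                lo = mid + 1
            else:
                hi = mid
        for w in index.words[lo:]:         # shift the tail
            w.char_pos += delta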
Time Index Alignment Example
(Screenshots: before editing, during editing, and after editing)
Audio Information Panel
- The entire clip is divided into segments separated by audio scene changes
- SIP classifies the segments into three categories: male, female, and non-speech
- Clicking a segment previews it
Audio Information Retrieval
Detection of Audio Scene Changes: Motivations
- Segments with different properties can be handled differently
- Unsupervised learning can be applied to the different clusters
- Serves as an assistant tool for video scene change detection
Bayesian Information Criterion (BIC)
- Gaussian distributions model the input stream
- Maximum likelihood detects the turns
- BIC makes the decision
Principle of BIC
- The Bayesian information criterion (BIC) is a likelihood criterion
- Its main principle is to penalize the likelihood by the model complexity
Detection of a single change point using BIC

H0: x1, x2, ..., xN ~ N(μ, Σ)
H1: x1, ..., xi ~ N(μ1, Σ1) and xi+1, ..., xN ~ N(μ2, Σ2)

The maximum likelihood ratio between the two hypotheses is:

R(i) = (N/2) log|Σ| - (N1/2) log|Σ1| - (N2/2) log|Σ2|

where N1 = i, N2 = N - i, and Σ, Σ1, Σ2 are the sample covariance matrices of the whole window and of its two halves.
Detection of a single change point using BIC (cont.)

The difference between the BIC values of the two models is:

BIC(i) = R(i) - λP,  where P = (1/2)(d + d(d+1)/2) log N

is the complexity penalty, d is the dimension of the feature vectors, and λ is the penalty weight (λ = 1 in theory). A change point is detected at i if BIC(i) > 0 (a code sketch follows).
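A direct transcription of this test in Python, assuming X holds one MFCC-like feature vector per row (the feature dimension and λ are whatever the front end provides; this is a sketch, not the project's exact implementation):

    import numpy as np

    def bic_single_change(X, lam=1.0):
        """One-change-point BIC test over window X (N x d, d >= 2).
        Returns (i, bic) for the best candidate; bic <= 0 means no
        change was detected."""
        N, d = X.shape
        P = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
        ld = lambda S: np.linalg.slogdet(S)[1]       # log|covariance|
        full = ld(np.cov(X.T, bias=True))
        best_i, best = None, -np.inf
        for i in range(d + 1, N - d - 1):            # keep halves full rank
            r = (0.5 * N * full
                 - 0.5 * i * ld(np.cov(X[:i].T, bias=True))
                 - 0.5 * (N - i) * ld(np.cov(X[i:].T, bias=True)))
            if r - lam * P > best:
                best_i, best = i, r - lam * P
        return best_i, best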
Detection of multiple change points by BIC

a. Initialize the interval [a, b] with a = 1, b = 2
b. Detect whether there is a single change point in [a, b] using BIC
c. If there is no change in [a, b]:
       set b = b + 1
   otherwise:
       let t be the change point detected;
       set a = t + 1 and b = a + 1
d. Go to step (b) while frames remain (a code sketch follows)
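The same growing-window scan in Python, reusing bic_single_change from the previous sketch; the `grow` step is an assumption of ours (the slide grows the window by one frame, which is the same search done more finely):

    def detect_audio_changes(X, lam=1.0, grow=50):
        """Sequentially scan X (N x d) for multiple change points.
        Returns 0-based frame indices of the detected changes."""
        changes, a, b = [], 0, 2 * grow
        while b <= len(X):
            i, bic = bic_single_change(X[a:b], lam)
            if i is not None and bic > 0:
                t = a + i                  # global index of the change
                changes.append(t)
                a, b = t + 1, t + 1 + 2 * grow   # restart after it
            else:
                b += grow                  # grow the window and retry
        return changes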
Advantages of the BIC approach
- Robustness
- No thresholds to tune
- Optimality
Comparison of different algorithms
Gender Classification: Motivation and Purpose
- Allows different speech analysis algorithms for each gender
- Facilitates speech recognition by cutting the search space in half
- Helps us build gender-dependent recognition models and train the system better
Gender Classification
(Figure: example male and female segments)
Speech/Non-Speech Classification
- Motivation
- One method we used: pitch tracking (see the sketch below)
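An illustrative pitch-tracking classifier: autocorrelation pitch per frame, then a segment is called speech when enough frames are clearly periodic. The 0.5 periodicity and 30% voiced-frame cutoffs are assumptions of ours, not measured values from the project:

    import numpy as np

    def frame_pitch(frame, sr, fmin=60, fmax=400):
        """Autocorrelation pitch for one frame: (f0, strength).
        Strength near 1 means strongly periodic (voiced speech)."""
        frame = frame - frame.mean()              # promotes to float
        ac = np.correlate(frame, frame, "full")[len(frame) - 1:]
        if ac[0] <= 0:
            return 0.0, 0.0
        ac = ac / ac[0]
        lo = int(sr / fmax)
        hi = min(int(sr / fmin), len(ac) - 1)
        lag = lo + int(np.argmax(ac[lo:hi]))
        return sr / lag, float(ac[lag])

    def looks_like_speech(x, sr, frame_ms=32, voiced_ratio=0.3):
        """Label a segment as speech when enough frames carry pitch."""
        n = int(sr * frame_ms / 1000)
        frames = [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
        if not frames:
            return False
        voiced = sum(1 for f in frames if frame_pitch(f, sr)[1] > 0.5)
        return voiced / len(frames) > voiced_ratio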
Speech/Non-Speech Classification
(Figure: example speech and non-speech segments)
Summary
- ViaVoice training experiments
- Speech recognition editing
- Dynamic index alignment
- Audio scene change detection
- Speech classification
- Integrated the above functions into the Speech Information Processor (SIP)
Q&A