LYU0103 Speech Recognition Techniques for Digital Video Library

Supervisor: Prof. Michael R. Lyu
Students: Gao Zheng Hong, Lei Mo
Outline of Presentation
- Project objectives
- ViaVoice recognition experiments
- Speech recognition editing tool
- Audio scene change detection
- Speech classification
- Summary

Our Project Objectives
- Audio information retrieval
- Speech recognition

Last Term’s Work
- Extracted the audio channel (stereo, 44.1 kHz) from MPEG video files into wave files (mono, 22 kHz)
- Segmented the wave files into sentences by detecting their frame energy (a sketch follows below)
- Developed a visual training tool
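A minimal sketch of this energy-based segmentation, assuming 16-bit mono PCM input; the frame size, energy threshold, and pause length are illustrative values, not the project's actual parameters:

```python
import numpy as np
import wave

def energy_segments(path, frame_ms=20, threshold=1e-3, min_silence_frames=15):
    """Split a mono wave file into (start_sec, end_sec) speech segments,
    cutting wherever the short-time frame energy stays low for a while."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float64) / 32768.0       # scale to [-1, 1]
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)                  # short-time energy
    voiced = energy > threshold

    segments, start, silence = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i                                # a sentence begins
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:            # long pause: cut here
                segments.append((start * frame_ms / 1000,
                                 (i - silence + 1) * frame_ms / 1000))
                start, silence = None, 0
    if start is not None:                                # tail segment
        segments.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))
    return segments
```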

Visual Training Tool
[Screenshot: video window, dictation window, and text editor]
IBM ViaVoice Experiments
- Employed 7 student helpers to produce transcripts of 77 news video clips
- Four experiments:
  - Baseline measurement
  - Trained model measurement
  - Slow down measurement
  - Indoor news measurement

Baseline Measurement
- Measures the ViaVoice recognition accuracy on TVB news video
- Testing set: 10 video clips
- The segmented wave files are dictated
- The Hidden Markov Model Toolkit (HTK) is employed to score the accuracy (formula below)
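HTK scores the dictated text against the reference transcript by minimum-edit-distance alignment and reports word accuracy as

Accuracy = (N - D - S - I) / N × 100%

where N is the number of words in the reference and D, S, and I are the deletion, substitution, and insertion counts of the alignment.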

Trained Model Measurement
- Measures the accuracy of ViaVoice after it is trained on its own correctly recognized words
- 10 video clips are segmented and dictated
- The correctly dictated words of the training set are used to train ViaVoice through the SMAPI function SmWordCorrection
- The "baseline measurement" procedure is repeated after training to obtain the recognition performance
- The procedure is then repeated using 20 video clips
Slow Down Measurement
- Investigates the effect of slowing down the audio channel
- The segmented wave files in the testing set are resampled by ratios of 1.05, 1.1, 1.15, 1.2, 1.3, 1.4, and 1.6 (a resampling sketch follows)
- The "baseline measurement" procedure is then repeated
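A minimal sketch of the slow-down step using scipy (the tool actually used in the project is not stated): stretching the waveform by the ratio and replaying it at the original sample rate slows the audio down, though this simple method also lowers the pitch by the same ratio.

```python
from scipy.io import wavfile
from scipy.signal import resample

def slow_down(src, dst, ratio):
    """Time-stretch a mono wave file by `ratio` (e.g. 1.15 = 15% slower)
    by resampling to ratio * len samples and keeping the sample rate."""
    rate, samples = wavfile.read(src)
    stretched = resample(samples, int(len(samples) * ratio))
    wavfile.write(dst, rate, stretched.astype(samples.dtype))

slow_down("segment.wav", "segment_x1.15.wav", 1.15)
```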

Indoor News Measurement
- Eliminates the effect of background noise
- Only indoor news reporter sentences are selected
- The test set is dictated using the untrained model
- The procedure is repeated using the trained model

Experimental Results
Overall recognition results (ViaVoice, TVB News):

Experiment                         Accuracy (max. performance)
Baseline                           25.27%
Trained model                      25.87% (trained with 20 videos)
Slow speech                        25.67% (max. at ratio = 1.15)
Indoor speech (untrained model)    35.22%
Indoor speech (trained model)      36.31% (trained with 20 videos)
Experimental Results (cont.)
Result of the trained model with different numbers of training videos:

Trained videos    Accuracy
Untrained         25.27%
10 videos         25.87%
20 videos         25.82%

Result of using different slow-down ratios:

Ratio          1      1.05   1.1    1.15   1.2    1.3    1.4    1.5
Accuracy (%)   25.27  25.46  25.63  25.67  25.82  17.18  12.34  4.04
Analysis of Experimental Results
- Trained model: about 1% accuracy improvement
- Slowing down the speech: about 1% accuracy improvement
- Indoor speech is recognized much better
- Mandarin: estimated baseline accuracy is about 70%, far higher than for Cantonese

Speech Processor
- Training does not increase the accuracy significantly
- The recognition result therefore needs manual editing
- Word timing information is also important

Editing Functionality
- The recognition result is organized in a basic unit called a "firm word"
- The timing information is retrieved from the speech engine
- The timing of every firm word is recorded in an index (a sketch follows below)
- The corresponding firm word is highlighted during video playback
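A minimal sketch of such a time index; the names (FirmWord, TimeIndex) and the millisecond fields are hypothetical, not the actual SMAPI structures:

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class FirmWord:
    text: str
    start_ms: int   # when the word starts in the audio
    end_ms: int     # when the word ends

class TimeIndex:
    """Maps playback time to the firm word being spoken."""
    def __init__(self, words):
        self.words = sorted(words, key=lambda w: w.start_ms)
        self.starts = [w.start_ms for w in self.words]

    def word_at(self, t_ms):
        """Return the firm word covering time t_ms, for highlighting."""
        i = bisect_right(self.starts, t_ms) - 1
        if i >= 0 and self.words[i].end_ms > t_ms:
            return self.words[i]
        return None
```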

Dynamic Time Index Alignment
- While editing the recognition result, the firm word structure may change
- The time index must be updated to match the new firm words
- In the speech processor, the time index is realigned with the firm words whenever the user edits the text (a sketch follows below)
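One plausible realignment policy, sketched with the hypothetical FirmWord/TimeIndex structures above (the slides do not specify the exact rule): the edited words inherit the time range of the firm words they replace, split in proportion to word length.

```python
def realign(index, first, last, new_text):
    """Replace firm words index.words[first:last+1] with the words of
    new_text, distributing the replaced time range by word length."""
    span = index.words[first:last + 1]
    start, end = span[0].start_ms, span[-1].end_ms
    tokens = new_text.split()
    total = sum(len(t) for t in tokens) or 1
    new_words, t = [], start
    for tok in tokens:
        dur = (end - start) * len(tok) // total   # proportional share
        new_words.append(FirmWord(tok, t, t + dur))
        t += dur
    if new_words:
        new_words[-1].end_ms = end                # absorb rounding slack
    index.words[first:last + 1] = new_words
    index.starts = [w.start_ms for w in index.words]
```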

Time Index Alignment Example
[Figure: the firm word time index before, during, and after an edit]
Motivation for Doing Speech Segmentation and Classification
- Gender classification can help us build gender dependent models
- Scene change detection from the video content alone is not accurate enough, so we need audio scene change detection as an assisting tool

Flow Diagram of Audio Information Retrieval System
[Flowchart] The audio signal from the news' audio channel flows through:
- Feature extraction (MFCC)
- Segmentation (by MFCC variance) to find audio scene changes
- Speech/non-speech classification (speech if the continuous vowel contour > 30%)
- Speech: male/female classification (by a 256-component GMM), then speaker identification/classification (by clustering)
- Non-speech: music detection by pattern matching
Feature Extraction by MFCC
- The first operation applied to the raw audio input data
- MFCC stands for "mel-frequency cepstral coefficients"
- Human perception of the frequency of sound does not follow a linear scale (see the mel scale below)
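The mel scale expresses this nonlinearity; a commonly used form maps a frequency f in Hz to

mel(f) = 2595 log10(1 + f / 700)

so that equal steps in mel are perceptually equal pitch steps. MFCCs are the cepstral coefficients computed from a filter bank spaced evenly on this scale.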

Detection of Audio Scene Change by
Bayesian Information Criterion (BIC)
- The Bayesian information criterion (BIC) is a likelihood criterion
- We maximize the likelihood function separately for each model M and obtain L(X, M)
- The main principle is to penalize the system by the model complexity (the criterion is given below)
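For reference, the standard form of the criterion for a model M with #(M) free parameters fitted to N samples X is

BIC(M) = log L(X, M) - (λ/2) · #(M) · log N

with λ = 1 in the original definition; the model with the larger BIC value is preferred.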

Detection of a Single Point Change Using BIC
We define
H0: x1, x2, ..., xN ~ N(μ, Σ)
to be the hypothesis that the whole sequence contains no change, and
H1: x1, ..., xi ~ N(μ1, Σ1); xi+1, ..., xN ~ N(μ2, Σ2)
to be the hypothesis that a change occurs at time i.
The maximum likelihood ratio is defined as
R(i) = N log|Σ| - N1 log|Σ1| - N2 log|Σ2|
where N1 = i and N2 = N - i are the lengths of the two segments.

Detection of a Single Point Change Using BIC (cont.)
The difference between the BIC values of the two models can be expressed as
BIC(i) = R(i) - λP
P = (1/2) (d + d(d+1)/2) log N
where d is the dimension of the feature vectors. A scene change is detected when the BIC value is positive.
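A minimal numpy sketch of this single change-point test, following the formulas on these slides (full-covariance Gaussians; λ defaults to 1):

```python
import numpy as np

def bic_single_change(X, lam=1.0):
    """Return (best_index, best_bic) for the most likely change point in
    the feature sequence X (N frames x d dims); best_bic <= 0 means no
    change is detected."""
    N, d = X.shape
    penalty = lam * 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    logdet = lambda Z: np.linalg.slogdet(np.cov(Z.T))[1]
    r_full = N * logdet(X)
    best_i, best_bic = None, -np.inf
    for i in range(d + 1, N - d):          # both segments need >= d+1 frames
        bic = r_full - i * logdet(X[:i]) - (N - i) * logdet(X[i:]) - penalty
        if bic > best_bic:
            best_i, best_bic = i, bic
    return best_i, best_bic
```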

Detection of Multiple Point Changes by BIC
a. Initialize the interval [a, b] with a = 1, b = 2
b. Detect whether there is a single changing point in the interval [a, b] using BIC
c. If there is no change in [a, b]:
       let b = b + 1
   else:
       let t be the changing point detected;
       assign a = t + 1, b = a + 1
d. Go to step (b) if necessary
(A sketch of this scan follows.)
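A minimal sketch of this growing-window scan, reusing bic_single_change from the previous sketch; the 0-based indexing and the restart window width are illustrative choices:

```python
def detect_changes(X, lam=1.0):
    """Scan X frame by frame with a growing window [a, b); restart the
    window just after each detected change point (indices into X)."""
    changes, a, b = [], 0, 2
    N = len(X)
    while b <= N:
        i, bic = bic_single_change(X[a:b], lam)
        if i is not None and bic > 0:      # change confirmed inside window
            t = a + i
            changes.append(t)
            a, b = t + 1, t + 3            # restart: a = t+1, b = a+2
        else:
            b += 1                         # grow the window and retry
    return changes
```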
Advantages of the BIC Approach
- Robustness
- Threshold-free
- Optimality

Comparison of Different Algorithms for Audio Scene Change Detection
Gender Classification
- The means and covariances of the male and female feature vectors are quite different
- So we can model each gender with a Gaussian Mixture Model (GMM), as sketched below
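A minimal sketch of GMM-based gender classification with scikit-learn, assuming MFCC frame matrices are already extracted; the 256-component size comes from the flow diagram, while the diagonal covariance is an assumption:

```python
from sklearn.mixture import GaussianMixture

def train_gender_models(male_mfcc, female_mfcc, n_components=256):
    """Fit one diagonal-covariance GMM per gender on MFCC frames."""
    male = GaussianMixture(n_components, covariance_type="diag").fit(male_mfcc)
    female = GaussianMixture(n_components, covariance_type="diag").fit(female_mfcc)
    return male, female

def classify_gender(mfcc, male, female):
    """Pick the model with the higher average log-likelihood per frame."""
    return "male" if male.score(mfcc) > female.score(mfcc) else "female"
```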
[Figure: male and female histograms of feature values (frequency count vs. value)]
Music/Speech Classification by Pitch Tracking
- Speech has a more continuous pitch contour than music
- A speech clip usually has a 30%-55% continuous contour, whereas silence or music has 1%-15%
- Thus, we classify a clip as speech when more than 20% of its contour is continuous (a sketch follows)
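A minimal sketch of this decision rule, assuming a pitch tracker has already produced one f0 value per frame (0 meaning unvoiced); the 10 Hz continuity tolerance is an assumption:

```python
import numpy as np

def continuous_contour_ratio(f0, tol_hz=10.0):
    """Fraction of frames whose pitch continues smoothly from the
    previous voiced frame."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    cont = voiced[1:] & voiced[:-1] & (np.abs(np.diff(f0)) < tol_hz)
    return cont.mean()

def is_speech(f0):
    return continuous_contour_ratio(f0) > 0.20   # >20% rule from the slides
```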

[Figure: pitch frequency vs. number of frames for speech and for music]
Summary
- ViaVoice training experiments
- Speech recognition editing tool
- Dynamic time index alignment
- Audio scene change detection
- Speech classification
- Integrated the above functions into a speech processor

Future Work
- Classify indoor and outdoor news to further process the video clips
- Train gender dependent models for the ViaVoice engine; a gender dependent model may increase the recognition accuracy
