Hybrid Time-Scale Modification of Audio

Patrick-André Savard, Philippe Gournay
and Roch Lefebvre
Université de Sherbrooke, Québec, Canada


Problem description
Prior art
◦ Synchronized overlap-add with fixed synthesis (SOLAFS)
◦ Improved phase vocoder

Hybrid time-scale modification
◦ High level algorithm
◦ Classification
◦ Main algorithm
◦ Mode transition
Performance evaluation
◦ Classification performance
◦ Subjective testing results


What is time-scale modification?
Subject of interest:
◦ Subjective quality of time-scaled signals

Existing methods:
◦ Time vs frequency approaches
◦ High quality results on specific types of signals

TSM applied to various signal types
◦ Can be speech, music, or mixed-type signals

There is a need for a more “universal” method
Synchronized overlap-add (SOLAFS)

[Figure: SOLAFS block diagram — windows of length WLEN are read from the input signal at analysis step Sa, delayed, and overlap-added at synthesis step Ss to form the output signal]
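As the diagram suggests, SOLAFS copies input windows and overlap-adds them at a fixed synthesis step, placing each copy near its nominal analysis position but shifted by the offset that best correlates with what has already been synthesized, so that waveforms add in phase. A minimal sketch (window length, hop sizes, offset search range, and Hann windowing are illustrative assumptions, not the authors' exact parameters; here `alpha > 1` compresses the signal):

```python
import numpy as np

def solafs(x, alpha, wlen=256, ss=128, kmax=64):
    """Minimal SOLAFS sketch: fixed synthesis hop ss; each copied input
    window starts near int(pos_out * alpha), with a small offset search
    (0..kmax-1) maximizing cross-correlation with the synthesis tail so
    that segments are added in phase (pitch is preserved)."""
    win = np.hanning(wlen)
    n_out = int(len(x) / alpha)
    y = np.zeros(n_out + wlen)
    norm = np.zeros(n_out + wlen)
    pos_out = 0
    while pos_out + wlen < n_out and int(pos_out * alpha) + kmax + wlen <= len(x):
        base = int(pos_out * alpha)
        tail = y[pos_out:pos_out + wlen]          # what is already synthesized
        best_k, best_c = 0, -np.inf
        for k in range(kmax):                     # offset search
            c = np.dot(x[base + k:base + k + wlen], tail)
            if c > best_c:
                best_c, best_k = c, k
        y[pos_out:pos_out + wlen] += win * x[base + best_k:base + best_k + wlen]
        norm[pos_out:pos_out + wlen] += win       # track window gain for OLA
        pos_out += ss
    return y[:n_out] / np.maximum(norm[:n_out], 1e-8)
```

Dividing by the accumulated window gain keeps the output amplitude close to the input's regardless of the overlap ratio.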




Improved phase vocoder
◦ Based on the block-by-block STFT analysis/synthesis model
◦ STFT phases are updated so as to preserve instantaneous frequencies
◦ STFT amplitudes are preserved

[Figure: processing chain — input windowed at analysis hop Ra with FFT size N, FFT, STFT modification stage, IFFT, overlap-add and gain control at synthesis hop Rs]

Improvements (phase locking in the STFT modification stage):
◦ Peak detection
◦ Define regions of influence (ROIs)
◦ Compute instantaneous frequencies for the peaks
◦ Update the peak phases
◦ Apply phase locking to the ROIs
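The STFT modification stage can be sketched as the classic per-bin phase propagation: measure each bin's instantaneous frequency from the phase increment between analysis frames, then advance the synthesis phase by that frequency times the new hop. The peak-picking and region-of-influence phase-locking improvements listed above are omitted for brevity; the FFT size, hops, and Hann window are assumptions:

```python
import numpy as np

def phase_vocoder(x, stretch, n=1024, ra=256):
    """Basic phase-vocoder TSM sketch (no peak phase-locking): STFT
    amplitudes are kept; phases are propagated so each bin's
    instantaneous frequency is preserved at synthesis hop rs."""
    rs = int(round(ra * stretch))
    win = np.hanning(n)
    omega = 2 * np.pi * np.arange(n // 2 + 1) / n   # bin centre freqs (rad/sample)
    starts = range(0, len(x) - n, ra)
    y = np.zeros(rs * len(starts) + n)
    norm = np.zeros_like(y)
    phase_prev = phase_syn = None
    pos_out = 0
    for pos_in in starts:
        spec = np.fft.rfft(win * x[pos_in:pos_in + n])
        phase = np.angle(spec)
        if phase_prev is None:
            phase_syn = phase
        else:
            # heterodyned phase increment, wrapped to [-pi, pi] ("princarg")
            dphi = phase - phase_prev - omega * ra
            dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
            inst_freq = omega + dphi / ra           # instantaneous frequency
            phase_syn = phase_syn + inst_freq * rs  # advance by synthesis hop
        phase_prev = phase
        frame = np.fft.irfft(np.abs(spec) * np.exp(1j * phase_syn))
        y[pos_out:pos_out + n] += win * frame       # overlap-add
        norm[pos_out:pos_out + n] += win ** 2       # squared-window gain
        pos_out += rs
    return y / np.maximum(norm, 1e-8)               # gain control
```

The final division implements the "overlap-add and gain control" box: it compensates the accumulated squared-window gain at the new hop.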




High level algorithm
◦ Uses a frame-by-frame model
◦ Each frame goes through a classifier
◦ Signals identified as monophonic are processed using SOLAFS
◦ Signals identified as polyphonic or noisy are processed using the phase vocoder

[Flowchart: read input frame → classify signal → monophonic: process samples using SOLAFS; polyphonic or noisy: process samples using the phase vocoder → write output frame]
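The per-frame dispatch can be sketched as follows; the three callables are assumed interfaces for illustration, not the authors' code:

```python
import numpy as np

def hybrid_tsm(frames, classify, process_solafs, process_pv, txcorr=0.6):
    """Per-frame dispatch sketch: classify() returns the frame's maximum
    normalized cross-correlation Rmax; monophonic frames (Rmax >= txcorr)
    are processed with SOLAFS, polyphonic/noisy frames with the phase
    vocoder. classify/process_solafs/process_pv are assumed interfaces."""
    out = []
    for frame in frames:
        if classify(frame) >= txcorr:   # monophonic -> SOLAFS
            out.append(process_solafs(frame))
        else:                           # polyphonic or noisy -> phase vocoder
            out.append(process_pv(frame))
    return np.concatenate(out)
```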

Classification
Goal:
◦ Discriminate monophonic/polyphonic/noise signals
Method used:
◦ Test the maximum of the normalized cross-correlation (C.C.) measure in SOLAFS for each analysis window

[Figure: normalized cross-correlation per synthesis window for a music signal and a speech signal — music shows low to medium C.C.; unvoiced speech shows low and high C.C.; voiced speech shows high C.C.]
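The classification measure can be sketched as the maximum normalized cross-correlation over a lag range; in the actual method this value comes from the SOLAFS alignment search itself, and the lag bounds below are illustrative assumptions:

```python
import numpy as np

def rmax(frame, lag_min=20, lag_max=160):
    """Maximum normalized cross-correlation of a frame with its lagged
    copy over lags lag_min..lag_max-1 (assumed search range). High values
    suggest a periodic (monophonic) signal; low values suggest polyphonic
    or noise-like content."""
    best = 0.0
    for lag in range(lag_min, lag_max):
        a, b = frame[:-lag], frame[lag:]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        if denom > 0:
            best = max(best, np.dot(a, b) / denom)
    return best
```

A voiced (periodic) frame scores near 1 at a lag equal to its pitch period, while noise stays low at every lag, matching the behaviour shown in the figure.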



Main algorithm
◦ Default method: SOLAFS
◦ Switches to the phase vocoder when Rmax < Txcorr
◦ Constraint on the minimum length of a SOLAFS synthesis segment

[Figure: frame-level decisions — when Rmax < Txcorr within a frame, processing switches from SOLAFS to the phase vocoder; a SOLAFS segment shorter than the minimum length is discarded and that region is processed with the phase vocoder instead]
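The switching rule with the minimum-segment constraint can be sketched over a sequence of per-frame Rmax values; the frame-count threshold below is an assumption for illustration:

```python
def select_modes(rmax_values, txcorr=0.6, min_solafs_frames=2):
    """Mode-selection sketch: SOLAFS is the default; a frame with
    Rmax < Txcorr switches to the phase vocoder. A SOLAFS run shorter
    than min_solafs_frames (an assumed frame count) is discarded and
    re-processed with the phase vocoder, per the minimum-segment
    constraint."""
    modes = ['solafs' if r >= txcorr else 'pv' for r in rmax_values]
    i = 0
    while i < len(modes):
        if modes[i] == 'solafs':
            j = i
            while j < len(modes) and modes[j] == 'solafs':
                j += 1                       # find the end of the SOLAFS run
            if j - i < min_solafs_frames:    # too short: discard the run
                for k in range(i, j):
                    modes[k] = 'pv'
            i = j
        else:
            i += 1
    return modes
```

The constraint avoids very short SOLAFS bursts, which would trigger frequent (and audible) mode transitions.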


Mode transition: SOLAFS to phase vocoder
◦ Last SOLAFS synthesis window: synthesis padded with input samples
◦ Phase vocoder initialization: based on matching input/output samples
◦ Gain control: more padding needed; the synthesis is further padded with input samples and windowed to reproduce a phase vocoder output, so that the first phase vocoder synthesis window overlaps coherently

[Figure: output signal padded with input samples after the last SOLAFS synthesis window; the previously padded synthesis is padded further and windowed; the first phase vocoder synthesis window then overlap-adds coherently]

Mode transition: phase vocoder to SOLAFS
◦ The current frame’s first analysis window is out of phase with the current output signal
◦ Assume that the current input frame contains a stationary signal
◦ The first input window is then one phase vocoder analysis step ahead
◦ The first SOLAFS segment is overlap-added at the last phase vocoder synthesis step
◦ SOLAFS synthesis samples (after the first OLA region) replace the synthesis samples obtained by the phase vocoder

[Figure: previous and current frames — the current frame’s first analysis window is not in phase with the current output; the first SOLAFS synthesis window is approximately in phase with it; subsequent SOLAFS synthesis windows follow]
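At the transition point, the first SOLAFS segment is blended onto the last phase vocoder synthesis position, after which SOLAFS samples simply replace the phase vocoder output. A minimal sketch of that splice; the linear cross-fade length and shape are assumptions (the actual method relies on the windowed overlap-add and the approximate phase match described above):

```python
import numpy as np

def pv_to_solafs_transition(y_pv, seg, pos, wlen=256):
    """Splice sketch: blend the first SOLAFS synthesis segment seg onto
    the phase vocoder output y_pv at position pos over wlen samples
    (assumed linear fade); beyond the overlap region, SOLAFS samples
    replace the phase vocoder output."""
    fade_in = np.linspace(0.0, 1.0, wlen)
    out = np.empty(pos + len(seg))
    out[:pos] = y_pv[:pos]                     # phase vocoder output kept
    # overlap region: phase vocoder fades out, SOLAFS fades in
    out[pos:pos + wlen] = (1 - fade_in) * y_pv[pos:pos + wlen] + fade_in * seg[:wlen]
    out[pos + wlen:] = seg[wlen:]              # SOLAFS replaces PV samples
    return out
```

Because the first SOLAFS window is chosen approximately in phase with the current output, the two signals add coherently inside the overlap region instead of cancelling.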



Classification performance: speech
◦ Signal length = 1 second
◦ Tmax = 0.6
◦ Unvoiced speech is successfully detected, which triggers phase vocoder processing

[Figure: time-scaled speech signal (scaling factor 2, Tmax = 0.6) over 0 to 1 s, with per-frame classification results (phase vocoder vs. SOLAFS)]



Classification performance: music
◦ Signal length = 25 seconds
◦ Tmax = 0.6
◦ Classification results: 91% phase vocoder, 9% SOLAFS

[Figure: time-scaled music signal (scaling factor 2, Tmax = 0.6) over 0 to 25 s, with per-frame classification results (phase vocoder vs. SOLAFS)]






Subjective testing
◦ A/B method
◦ Speech, music and mixed-content (speech over music) samples tested
◦ Hybrid method compared to the stand-alone techniques
◦ Comparisons performed on compressed and expanded signals
◦ Eight listeners took part in the test
◦ Samples evaluated using a 5-step scale
[Figures: subjective A/B test results — percentage of votes on the 5-step scale (H >> X, H > X, H = X, H < X, H << X) comparing the hybrid method (H) to SOLAFS and to the phase vocoder (PV), shown separately for speech, music, and mixed content, on both compressed and expanded signals]

A hybrid TSM method is presented
◦ Uses a frame-by-frame classification stage
◦ Selects the best method based on the input signal’s monophonic/polyphonic/noise character
◦ Mode transitions

High quality results are obtained
◦ Using speech, music and mixed-content signals

Future work
◦ Refine the classification criterion
◦ Use phase flexibility to improve phase coherence, and thereby the phase vocoder to SOLAFS transitions

Contact: [email protected]