Hybrid time-scale modification
Patrick-André Savard, Philippe Gournay and Roch Lefebvre
Université de Sherbrooke, Québec, Canada

Outline
Problem description
Prior art
◦ Synchronized overlap-add with fixed synthesis (SOLAFS)
◦ Improved phase vocoder
Hybrid time-scale modification
◦ High-level algorithm
◦ Classification
◦ Main algorithm
◦ Mode transitions
Performance evaluation
◦ Classification performance
◦ Subjective testing results

Problem description
What is time-scale modification (TSM)?
Subject of interest:
◦ Subjective quality of time-scaled signals
Existing methods:
◦ Time-domain vs. frequency-domain approaches
◦ High-quality results, but only on specific types of signals
TSM is applied to various signal types:
◦ Can be speech, music, or mixed-type signals
There is a need for a more "universal" method.

Prior art: SOLAFS
[Figure: SOLAFS block diagram — the input signal is read at analysis step Sa, and windows of length WLEN are aligned (via short delays) and overlap-added at a fixed synthesis step Ss to form the output signal.]
A minimal code sketch of SOLAFS is given after the Classification slide below.

Prior art: Improved phase vocoder
◦ Based on the block-by-block STFT analysis/synthesis model
◦ STFT phases are updated so as to preserve instantaneous frequencies
◦ STFT amplitudes are preserved
[Figure: phase vocoder block diagram — analysis at hop Ra, N-point FFT, STFT modification stage, IFFT, then overlap-add and gain control at synthesis hop Rs.]
Improvements to the STFT modification stage:
◦ Peak detection
◦ Define regions of influence
◦ Compute instantaneous frequencies for the peaks
◦ Update the peak phases
◦ Apply phase locking to the regions of influence
(This peak-based phase update is also sketched in code after the Classification slide.)

Hybrid TSM: high-level algorithm
◦ Uses a frame-by-frame model
◦ Each frame goes through a classifier
◦ Signals identified as monophonic are processed using SOLAFS
◦ Signals identified as polyphonic or noisy are processed using the phase vocoder
[Flowchart: read input frame → classify signal → monophonic frames go to SOLAFS, polyphonic or noisy frames go to the phase vocoder → write output frame.]

Classification
Goal:
◦ Discriminate monophonic / polyphonic / noise signals
Method used:
◦ Test the maximum of the normalized cross-correlation (C.C.) measure computed in SOLAFS for each analysis window
Typical behaviour:
◦ Music: low to medium C.C.
◦ Unvoiced speech: low and high C.C.
◦ Voiced speech: high C.C.
(See the classifier sketch below.)
[Figures: amplitude and normalized cross-correlation versus time (and versus synthesis window number) for a speech signal and a music signal, with voiced, unvoiced and music regions annotated.]
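As a concrete illustration of the SOLAFS block diagram above, here is a minimal time-domain overlap-add sketch in Python/NumPy. It is not the poster's implementation: the window length, hop sizes and search range (wlen, ss, kmax) are illustrative assumptions, and the gain-control stage is omitted.

```python
import numpy as np

def norm_xcorr_max(target, region, seg_len):
    """Find the offset in `region` whose seg_len-sample segment best
    matches `target` under the normalized cross-correlation, and
    return (best_offset, Rmax)."""
    best_k, best_r = 0, -1.0
    energy_t = np.dot(target, target)
    for k in range(len(region) - seg_len + 1):
        seg = region[k:k + seg_len]
        denom = np.sqrt(np.dot(seg, seg) * energy_t) + 1e-12
        r = float(np.dot(seg, target) / denom)
        if r > best_r:
            best_k, best_r = k, r
    return best_k, best_r

def solafs(x, alpha, wlen=512, ss=256, kmax=256):
    """Minimal SOLAFS sketch: synthesis windows are laid down at a
    fixed step `ss`; for each one, the analysis position (nominally
    advanced by sa = alpha * ss) is refined within a `kmax`-sample
    search range so that the overlap region is maximally correlated
    with the output synthesized so far."""
    win = np.hanning(wlen)
    sa = int(round(alpha * ss))
    ov = wlen - ss                                  # overlap length
    n_win = (len(x) - wlen - kmax) // sa
    y = np.zeros(n_win * ss + wlen)
    y[:wlen] = win * x[:wlen]                       # first window copied as-is
    for m in range(1, n_win):
        t_out, t_in = m * ss, m * sa
        tail = y[t_out:t_out + ov]                  # already-synthesized tail
        region = x[t_in:t_in + kmax + ov]
        k, _rmax = norm_xcorr_max(tail, region, ov)
        y[t_out:t_out + wlen] += win * x[t_in + k:t_in + k + wlen]
    return y
```

Here `alpha` is the ratio of analysis to synthesis step, so the output duration is roughly the input duration divided by alpha; with a Hann window and 50% overlap (ss = wlen/2) the overlap-add gain is approximately constant.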
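The improved STFT-modification stage listed under Prior art (peak detection, regions of influence, instantaneous frequencies, phase locking) can be sketched as below. This follows the well-known peak-based "identity phase locking" approach; the peak-picking and region-splitting rules are simplifying assumptions, not the authors' exact code.

```python
import numpy as np

def update_frame_phases(X_prev, X_cur, prev_syn_phase, Ra, Rs, N):
    """One step of the improved STFT-modification stage: detect
    spectral peaks, advance each peak's phase according to its
    instantaneous frequency, and phase-lock the peak's region of
    influence. X_prev/X_cur are consecutive analysis STFT frames
    (hop Ra), prev_syn_phase holds the previous synthesis phases,
    Rs is the synthesis hop and N the FFT size."""
    mag = np.abs(X_cur)
    K = len(X_cur)
    omega = 2.0 * np.pi * np.arange(K) / N        # bin center frequencies

    # 1) Peak detection: local maxima of the magnitude spectrum.
    peaks = [k for k in range(2, K - 2)
             if mag[k] > 0 and mag[k] >= mag[k - 2:k + 3].max()]

    # 2) Regions of influence: split the spectrum midway between
    #    adjacent peaks (a simplification; valleys could be used).
    bounds = [0] + [(peaks[i] + peaks[i + 1]) // 2
                    for i in range(len(peaks) - 1)] + [K]

    syn_phase = np.array(prev_syn_phase, dtype=float)
    for i, p in enumerate(peaks):
        # 3) Instantaneous frequency at the peak, from the heterodyned
        #    phase increment wrapped to (-pi, pi].
        dphi = np.angle(X_cur[p]) - np.angle(X_prev[p]) - omega[p] * Ra
        dphi -= 2.0 * np.pi * np.round(dphi / (2.0 * np.pi))
        inst_freq = omega[p] + dphi / Ra

        # 4) Update the peak's synthesis phase over the synthesis hop.
        peak_phase = prev_syn_phase[p] + Rs * inst_freq

        # 5) Phase locking: rotate every bin in the region of
        #    influence by the same offset as its peak.
        rot = peak_phase - np.angle(X_cur[p])
        lo, hi = bounds[i], bounds[i + 1]
        syn_phase[lo:hi] = np.angle(X_cur[lo:hi]) + rot

    # Amplitudes are preserved; only the phases are modified.
    return mag * np.exp(1j * syn_phase), syn_phase
```

A full vocoder wraps this in an FFT / STFT-modification / IFFT / overlap-add loop with hops Ra and Rs and a gain-control stage, as in the block diagram above.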
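Since SOLAFS already computes the maximum normalized cross-correlation for every analysis window (Rmax, returned by norm_xcorr_max in the SOLAFS sketch above), the classifier reduces to a threshold test on that value. A sketch, with the 0.6 default mirroring the Tmax used in the experiments reported below:

```python
def classify_window(rmax, t_xcorr=0.6):
    """Classify one analysis window from the Rmax value that SOLAFS
    already computes. High Rmax: monophonic (e.g. voiced speech) ->
    keep SOLAFS. Low Rmax: polyphonic or noise-like (music, unvoiced
    speech) -> use the phase vocoder."""
    return "monophonic" if rmax >= t_xcorr else "polyphonic/noise"
```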
Main algorithm
◦ Default method: SOLAFS
◦ Switches to the phase vocoder when Rmax < Txcorr
◦ Constraint on the minimum length of a SOLAFS synthesis segment
(See the loop sketch at the end of this page.)
[Figure: two example frames — when Rmax < Txcorr occurs inside a frame, pending SOLAFS processing is discarded and the frame is processed with the phase vocoder instead.]

Mode transition: SOLAFS → phase vocoder
◦ The output signal is padded with input samples; the last SOLAFS synthesis window is likewise padded
◦ Phase vocoder initialization is based on matching input/output samples
◦ Gain control: the previously padded synthesis receives more padding from input samples
◦ The synthesis is further padded and windowed so as to reproduce a phase vocoder output
◦ The first phase vocoder synthesis window then overlaps coherently

Mode transition: phase vocoder → SOLAFS
◦ The current frame's first analysis window is out of phase with the current output signal
◦ Assume that the current input frame contains a stationary signal
◦ The first input window is taken one phase vocoder analysis step ahead
◦ The first SOLAFS segment is overlap-added at the last phase vocoder synthesis step
◦ The SOLAFS synthesis samples (after the first overlap-add region) replace the synthesis samples obtained by the phase vocoder
(A rough code sketch follows the Subjective testing slide.)
[Figure: previous and current frames — after the one-step advance, the current frame's first analysis window is approximately in phase with the output; the first SOLAFS synthesis window is overlap-added, and subsequent SOLAFS synthesis windows follow.]

Classification performance: speech
◦ Signal length = 1 second, Tmax = 0.6
◦ Unvoiced speech is successfully detected and triggers phase vocoder processing
[Figure: time-scaled speech signal (time-scale factor 2, Tmax = 0.6) with per-window classification results (phase vocoder vs. SOLAFS) over 0–1 s.]

Classification performance: music
◦ Signal length = 25 seconds, Tmax = 0.6
◦ Classification results: 91% phase vocoder, 9% SOLAFS
[Figure: time-scaled music signal (time-scale factor 2, Tmax = 0.6) with per-window classification results over 0–25 s.]

Subjective testing
◦ A/B method
◦ Speech, music and mixed-content (speech over music) samples tested
◦ The hybrid method is compared to the stand-alone techniques
◦ Comparisons performed on both compressed and expanded signals
◦ Eight listeners took part in the test
◦ Samples evaluated on a 5-step comparison scale
[Figures: histograms of listener judgements — hybrid (H) vs. SOLAFS and hybrid (H) vs. phase vocoder (PV), for speech, music and mixed content, on compressed and expanded signals, over the five categories H >> / H > / H = / H < / H << reference.]
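At the top level, the main algorithm can be summarized by the loop below. This is a schematic sketch at frame granularity: rmax_of, run_solafs and run_pv are stand-in callables, the exact minimum-length rule is an assumption, and the coherent mode transitions described above are elided where the two engines meet.

```python
def hybrid_tsm(frames, rmax_of, run_solafs, run_pv,
               t_xcorr=0.6, min_solafs_run=2):
    """Schematic main loop of the hybrid method. SOLAFS is the
    default; when Rmax < Txcorr, processing switches to the phase
    vocoder, and a pending SOLAFS segment shorter than
    `min_solafs_run` frames is discarded and redone with the phase
    vocoder, mirroring the 'SOLAFS processing discarded' case in the
    diagram above. Transition padding / phase initialization / gain
    control are omitted from this sketch."""
    out, pending = [], []          # `pending`: frames tentatively on SOLAFS

    def flush(as_solafs):
        engine = run_solafs if as_solafs else run_pv
        out.extend(engine(f) for f in pending)
        pending.clear()

    for frame in frames:
        if rmax_of(frame) >= t_xcorr:
            pending.append(frame)
        else:
            # Too-short SOLAFS segments are discarded (redone with PV).
            flush(as_solafs=len(pending) >= min_solafs_run)
            out.append(run_pv(frame))
    flush(as_solafs=len(pending) >= min_solafs_run)
    return out
```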
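The phase vocoder → SOLAFS direction relies on the stationarity assumption stated above. A rough sketch of the splice, assuming a Hann window with 50% overlap; all offsets and names are illustrative, not the poster's implementation:

```python
import numpy as np

def pv_to_solafs_splice(y, x, t_syn, t_ana, Ra, wlen):
    """Rough sketch of the phase vocoder -> SOLAFS transition under a
    local-stationarity assumption: the first SOLAFS window, taken one
    analysis step (Ra) ahead in the input, is overlap-added at the
    last phase vocoder synthesis position t_syn; past the overlap
    region, SOLAFS samples replace the vocoder's synthesis."""
    win = np.hanning(wlen)
    seg = win * x[t_ana + Ra : t_ana + Ra + wlen]
    half = wlen // 2
    y[t_syn : t_syn + half] += seg[:half]          # OLA against the PV tail
    y[t_syn + half : t_syn + wlen] = seg[half:]    # SOLAFS replaces PV beyond it
    return y
```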
Conclusion
A hybrid TSM method is presented:
◦ Uses a frame-by-frame classification stage
◦ Selects the best method based on the monophonic/polyphonic/noise character of the input signal
◦ Handles the transitions between modes
High-quality results are obtained:
◦ Using speech, music and mixed-content signals
Future work:
◦ Refine the classification criterion
◦ Use phase flexibility to improve phase coherence in the phase vocoder → SOLAFS transitions

Contact: [email protected]