ONE SENTENCE VOICE ADAPTATION USING GMM-BASED FREQUENCY-WARPING
AND SHIFT WITH A SUB-BAND BASIS SPECTRUM MODEL
Masatsune Tamura, Masahiro Morita, Takehiko Kagoshima, and Masami Akamine
Knowledge Media Laboratory, Corporate Research and Development Center, Toshiba Corporation
ABSTRACT
This paper presents a rapid voice adaptation algorithm that applies GMM-based frequency warping and shift to the parameters of a sub-band basis spectrum model (SBM) [1]. The SBM parameter represents the shape of the speech spectrum and is calculated by fitting sub-band basis vectors to the log-spectrum. Since the parameter is a frequency-domain representation, frequency warping can be applied to it directly. A frequency warping function that minimizes the distance between source and target SBM parameter pairs in each mixture component of a GMM is derived using a dynamic programming (DP) algorithm. The proposed method is evaluated in a unit-selection-based voice adaptation framework applied to a unit-fusion-based text-to-speech synthesizer. The experimental results show that the proposed method is effective for rapid voice adaptation using just one sentence, compared to conventional GMM-based linear transformation of mel-cepstra.
Index Terms—voice adaptation, frequency warping, sub-band basis spectrum model, unit fusion speech synthesis
1. INTRODUCTION
Voice conversion [2]-[5] is a technique that converts source speech into target speech that sounds as if it were uttered by a target speaker. By applying voice conversion to the speech unit database of a text-to-speech (TTS) synthesizer, the synthesizer can be adapted to a new voice using only a small amount of recorded utterances of the target voice. GMM-based voice conversion [3]-[5] is one of the most widely used methods. A GMM and voice conversion functions for the respective mixture components are trained using paired utterances of a source speaker and a target speaker. For the transformation of spectral features, linear regression of mel-cepstrum parameters is widely used [3][4]. Since the regression matrices have many parameters to estimate (the square of the order of the cepstrum parameter), adaptation is slow, and when the amount of adaptation data is small, estimation of the regression matrices is unreliable. Over-smoothing of the converted spectra is another problem: spectral peaks of a converted spectrum become unclear, so the adapted speech suffers degraded voice quality.
To solve these problems, voice conversion algorithms using frequency warping [4][5] have been proposed. A speaker's vocal tract length is one of the properties that characterize a voice; in the frequency domain, it is reflected in the locations of the formant frequencies. Therefore, applying frequency warping changes the voice characteristics of speech. Formant strengths also reflect the vocal tract shape, so sub-band energy conversion [5] or a filter applied along with frequency warping changes voice characteristics further. One method applies dynamic frequency warping to the STRAIGHT spectrum [4] to reduce the over-smoothing problem. The converted spectrum is calculated by interpolating a frequency-warped source spectrum and a spectrum generated by
the cepstrum transformation. Because the method still relies on GMM-based linear regression of cepstra, adaptation remains slow. Another method uses GMM-based weighted frequency warping [5], with a piecewise-linear frequency warping function and sub-band energy conversion. The warping function is calculated from the formant positions of the mean target and source spectra for each mixture component. Since it uses only a small number of formant positions and is not estimated by a data-optimization process such as mean-squared-error minimization, the conversion function is not precise.
In this paper, we propose a GMM-based voice conversion method that applies frequency warping and shift to the parameters of a sub-band basis spectrum model (SBM) [1]. The SBM parameter represents the shape of a pitch-synchronous log-spectrum and its phase. It uses sub-band basis vectors built from 1-cycle sinusoidal shapes, similar to a sparse-coding basis, and the SBM parameter is calculated by fitting the basis to the log-spectrum. This parameter is well suited to voice adaptation for unit-fusion-based TTS, since synthetic speech generated from an analysis-synthesis database shows no significant degradation compared to the original database.
Because the SBM parameter is a frequency-domain representation, frequency warping can be applied to it directly. A GMM is trained using the SBM parameters of the source and target speakers, and frequency warping functions and shift vectors for the respective mixture components are then estimated. A frequency warping function that minimizes the distance between source and target SBM parameter pairs in each mixture component is derived using a dynamic programming (DP) algorithm: a distance matrix weighted by the GMM posterior probabilities is calculated, and a DP path is searched for on it. The shift vectors, which represent filters, are calculated to minimize the difference between the warped source parameters and the target parameters. Adaptation with the proposed method is rapid because the number of parameters in a conversion function is small (twice the order of an SBM parameter), and since the conversion functions are estimated by minimizing distances over the training data pairs, the source spectrum can be precisely converted to the target.
The proposed method is evaluated in a unit-selection-based voice adaptation framework for plural-unit-selection-and-fusion-based TTS [6][7]. The system does not need a parallel corpus of the source and target speakers; instead, it uses a cost function to pair source and target speech units, which enables non-parallel training of the voice conversion functions. The experimental results show that the proposed method is effective compared to GMM-based linear regression of mel-cepstrum parameters when using a very small number of training utterances, such as a single sentence.
2. THE VOICE ADAPTATION SYSTEM
Figure 1 illustrates the flow of the proposed voice-adaptation-based speech synthesis system. The system consists of a voice conversion module and a TTS module.

Figure 1. Flow diagram of the proposed voice-adaptation-based speech synthesis system: a large source speech unit database and a small amount of target adaptation data enter the voice conversion module (training data preparation, conversion function training, voice conversion), which outputs a large adapted speech unit database used by the TTS module to synthesize speech from input text.

Figure 2. Example of an SBM parameter: (a) sub-band basis (frequency [rad]); (b) log-spectrum using FFT, SBM parameter, and spectrum reconstructed from the SBM parameter (log-magnitude [dB] vs. frequency [rad]).

The voice conversion
module takes as inputs a speech unit database of the source speaker and a small amount of adaptation data from a target speaker, and outputs a voice-adapted speech unit database. It consists of a training data preparation process, a conversion function training process, and a voice conversion process. In the training data preparation process, the speech samples of the adaptation data are segmented into speech units (half-phones), pitch-cycle waveforms are extracted from the speech units by applying a Hanning window, and an SBM parameter is extracted from each pitch-cycle waveform. Then, for each speech unit of the adaptation data, a source speech unit is selected to form a pair with the target speech unit for conversion function training. The unit selection is performed by minimizing a cost function [7], defined by
C(u_t, u_c) = \sum_i w_i C_i(u_t, u_c),   (1)
where u_t and u_c denote the target and source speech units, C_i(u_t, u_c) is a sub-cost function, and w_i is the weight of that sub-cost. F0, duration, phoneme-environment, and boundary-spectral sub-costs are used.
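To make the pairing concrete, here is a minimal sketch of this selection step in Python. The unit fields and sub-cost formulas are hypothetical placeholders: the paper names the sub-cost types but not their definitions, and the weights shown are only illustrative.

```python
import numpy as np

# Hypothetical sub-costs: the paper names F0, duration, phoneme-environment,
# and boundary-spectral sub-costs but does not give their exact definitions.
def f0_cost(t, c):       return abs(t["f0"] - c["f0"])
def duration_cost(t, c): return abs(t["dur"] - c["dur"])
def context_cost(t, c):  return float(t["phone_ctx"] != c["phone_ctx"])
def boundary_cost(t, c): return float(np.linalg.norm(t["bnd_spec"] - c["bnd_spec"]))

SUB_COSTS = [f0_cost, duration_cost, context_cost, boundary_cost]
WEIGHTS = [10.0, 3.0, 1.0, 1.0]  # illustrative values only

def total_cost(target_unit, cand_unit):
    """Weighted sum of sub-costs, equation (1)."""
    return sum(w * c(target_unit, cand_unit) for w, c in zip(WEIGHTS, SUB_COSTS))

def select_source_unit(target_unit, source_db):
    """Pick the source unit that minimizes equation (1) for one target unit."""
    return min(source_db, key=lambda u: total_cost(target_unit, u))
```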
In the conversion function training process, a voice conversion function is trained. It is based on the GMM-based frequency warping and shift algorithm described in Section 3 and consists of frequency warping functions and shift vectors. A GMM and conversion functions for the respective Gaussian components are trained in this process.
In the voice conversion process, the conversion function is applied to the speech units in the source speech unit database, and the converted units are stored in the adapted speech unit database. The SBM parameters of the pitch-cycle waveforms of each source speech unit are converted to those of the target speaker by applying the conversion function. Pitch-cycle waveforms for the converted speech units are generated by inverse FFT of the spectra reconstructed from the converted SBM parameters; the phase spectra reconstructed from the phase parameters of the source units are used in this step. The converted speech units are then generated by overlap-adding the pitch-cycle waveforms, and an LSP post-filter [8] is applied to the pitch-cycle waveforms.
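A minimal sketch of this waveform regeneration step, assuming the log-magnitude spectrum is reconstructed as a linear combination of the sub-band basis vectors and that pitch marks for each unit are available; the LSP post-filter is omitted.

```python
import numpy as np

def pitch_cycle_from_sbm(sbm_param, src_phase, basis, fft_len):
    """One pitch-cycle waveform from a converted SBM parameter, reusing the
    source unit's phase spectrum as described above.
    sbm_param: (N,) basis weights; basis: (N, fft_len // 2 + 1);
    src_phase: (fft_len // 2 + 1,) phase reconstructed from the source unit."""
    log_mag = sbm_param @ basis                      # reconstructed log-spectrum
    half_spec = np.exp(log_mag) * np.exp(1j * src_phase)
    return np.fft.irfft(half_spec, n=fft_len)

def overlap_add(cycles, pitch_marks, out_len):
    """Overlap-add pitch-cycle waveforms at their pitch-mark positions."""
    out = np.zeros(out_len)
    for wav, t0 in zip(cycles, pitch_marks):
        n = min(len(wav), out_len - t0)
        out[t0:t0 + n] += wav[:n]
    return out
```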
In the TTS module, speech is synthesized from input text using
the adapted speech unit database. The TTS module consists of a text analysis part, a prosody generation part, and a speech synthesis part.
A plural unit selection and fusion method [6] is used in the speech
synthesis part.
3. VOICE ADAPTATION USING GMM-BASED
FREQUENCY WARPING AND SHIFT
The proposed method uses SBM parameters [1] for voice adaptation. Figure 2 shows an example of an SBM parameter. Figure 2 (a) illustrates the sub-band basis vectors, which are placed on a mel frequency scale for the lower half band and an equally spaced scale for the upper half band. Each basis vector is generated from a 1-cycle sinusoidal function (a Hanning window shape). Figure 2 (b) shows an example for the log-spectrum of the phoneme "a" from a female speaker: the FFT spectrum, the SBM parameter, and the spectrum reconstructed from the SBM parameter. The SBM parameter is plotted with points and vertical lines at the center frequencies of the respective basis vectors.
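As a rough illustration, the basis construction and fitting might look like the following sketch. The exact center-frequency placement is left to the caller, and a plain least-squares fit stands in for the sparse-coding-style approximation of [1].

```python
import numpy as np

def hanning_basis(center_bins, n_freq):
    """One Hanning-shaped (1-cycle raised-cosine) bump per center frequency,
    spanning from the previous center to the next. center_bins should be
    mel-spaced over the lower half band and equally spaced over the upper."""
    pad = np.concatenate(([0], center_bins, [n_freq - 1]))
    basis = np.zeros((len(center_bins), n_freq))
    for i in range(len(center_bins)):
        lo, hi = pad[i], pad[i + 2]
        k = np.arange(lo, hi + 1)
        basis[i, lo:hi + 1] = 0.5 - 0.5 * np.cos(
            2 * np.pi * (k - lo) / max(hi - lo, 1))
    return basis

def fit_sbm(log_spectrum, basis):
    """Fit basis weights to a log-magnitude spectrum. Plain least squares is
    used here as a stand-in for the fitting procedure of [1]."""
    weights, *_ = np.linalg.lstsq(basis.T, log_spectrum, rcond=None)
    return weights
```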
Voice conversion is performed by the GMM-based frequency warping and shift algorithm in the SBM parameter domain. The conversion function is defined by

y = \sum_{m=1}^{M} P(x, c_x = m \mid \lambda_{gmm}) \{ warp_m(x) + shift_m \},   (2)
where y = \{y(1), \ldots, y(N)\}, x, and M denote the converted parameter, the source parameter, and the number of mixtures in the GMM \lambda_{gmm}, respectively. P(x, c_x = m \mid \lambda_{gmm}) \equiv \gamma_m(x) is the posterior probability that the mixture c_x for a given observation x is m. The warping function warp_m(x) for mixture m is defined by a mapping function \psi_m(k) that maps the coefficients of the source parameter to those of the target parameter. The shift term shift_m is defined by a shift vector s_m = \{s_m(1), \ldots, s_m(N)\}. Since the parameter is in the log-spectrum domain, a shift operation corresponds to a filtering operation in the time domain. For each element of y, equation (2) can be written as

y(k) = \sum_{m=1}^{M} \gamma_m(x) \{ x(\psi_m(k)) + s_m(k) \}.   (3)
In the warping function warp_m(x), a smoothing operation is applied in addition to \psi_m(k). The mapping function \psi_m(k) maps the elements of x to y by skipping or repeating some elements of x; the smoothing operation uses interpolated values for skipped elements and averaged values for repeated elements. This reduces unnatural spectral jumps and flat spectra.
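A minimal sketch of equation (3) including a smoothing step; the moving-average kernel here is an assumed stand-in for the interpolation and averaging described above.

```python
import numpy as np

def convert(x, gammas, psis, shifts):
    """GMM-weighted frequency warping and shift, equation (3).
    x: source SBM parameter (N,); gammas: posteriors gamma_m(x) (M,);
    psis: integer mapping functions psi_m, each (N,); shifts: s_m (M, N)."""
    y = np.zeros(len(x))
    for g, psi, s in zip(gammas, psis, shifts):
        warped = np.asarray(x, dtype=float)[psi]          # x(psi_m(k))
        # Stand-in smoothing for skipped/repeated elements:
        warped = np.convolve(warped, [0.25, 0.5, 0.25], mode="same")
        y += g * (warped + s)
    return y
```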
To train the warping function and the shift parameter, the training
data pairs are used. For each target speech unit, a source speech unit is selected from the source speech unit database to form a training data pair, as described in Section 2. The GMM \lambda_{gmm} is trained using these data pairs. An observation vector for the GMM, o_t = \{o_t^{src\prime}, o_t^{target\prime}\}^{\prime}, consists of a source parameter o_t^{src} and a target parameter o_t^{target}. The combined vector is used for training, and the Gaussian for the o_t^{src} part is used for conversion. The GMM parameters are initialized by the LBG (Linde-Buzo-Gray) algorithm and re-estimated by maximum likelihood estimation.
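A rough sketch of this training step, assuming scikit-learn's GaussianMixture (which uses k-means initialization and EM rather than LBG initialization followed by ML re-estimation as in the paper):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import multivariate_normal

def train_joint_gmm(src, tgt, n_mix):
    """Fit a GMM to joint source-target SBM parameter pairs.
    src, tgt: (T, N) arrays of paired SBM parameters."""
    gmm = GaussianMixture(n_components=n_mix, covariance_type="diag")
    gmm.fit(np.hstack([src, tgt]))                        # (T, 2N) joint vectors
    return gmm

def source_posteriors(gmm, src):
    """gamma_m(x): posteriors computed from the source block of each Gaussian
    (with diagonal covariances the source marginal is just the first N dims)."""
    T, n = src.shape
    log_p = np.empty((T, gmm.n_components))
    for m in range(gmm.n_components):
        log_p[:, m] = np.log(gmm.weights_[m]) + multivariate_normal.logpdf(
            src, mean=gmm.means_[m, :n], cov=np.diag(gmm.covariances_[m, :n]))
    log_p -= log_p.max(axis=1, keepdims=True)             # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)
```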
The conversion function consists of mapping functions \psi_m(k) and shift vectors s_m for the respective mixture components. They are trained iteratively to minimize the error function (a sketch of this loop is given after equation (7) below):
1. Initialization: set s_m to 0.
2. Calculate \psi_m that minimizes the distance between the target data and the source data.
3. Calculate s_m that minimizes the error function.
4. Go to step 2 until the average distance between the converted parameters and the target parameters converges.
The squared error E between the target parameters and the converted parameters is

E = \sum_t \| y_t - \hat{y}_t \|^2 = \sum_t \Big\| \sum_{m=1}^{M} \gamma_m(x_t) \{ y_t - (warp_m(x_t) + shift_m) \} \Big\|^2 \approx \sum_{m=1}^{M} \sum_t \gamma_m(x_t) \| y_t - (warp_m(x_t) + shift_m) \|^2,   (4)
where x_t and y_t denote the source and target of the t-th training pair. Here, we assume that the error distributions of the Gaussian mixtures are independent of each other. The mapping function \psi_m(k) can then be obtained as
\psi_m(k) = \arg\min_{\psi_m(k)} \sum_t \gamma_m(x_t) \{ (y_t(k) - s_m(k)) - x_t(\psi_m(k)) \}^2.   (5)
Let D_m(i, j) = \sum_t \gamma_m(x_t) \{ (y_t(i) - s_m(i)) - x_t(j) \}^2 be the weighted distance matrix. The DP path is then obtained by searching for the path that minimizes
dist_m(i, j) = \min \{ dist_m(i-1, j),\ dist_m(i-1, j-1),\ dist_m(i-1, j-2) \} + D_m(i, j).   (6)
Using equation (6), an optimum DP path for each Gaussian mixture can be obtained from multiple training data pairs. Next, in step 3, the shift vector s_m is calculated as the weighted average difference between the mapped source parameters and the target parameters:

s_m(k) = \frac{\sum_t \gamma_m(x_t) \{ y_t(k) - x_t(\psi_m(k)) \}}{\sum_t \gamma_m(x_t)}.   (7)
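Putting equations (5), (6), and (7) together, the per-mixture training loop might be sketched as follows. The fixed iteration count and the free DP endpoints are simplifications assumed here, not taken from the paper.

```python
import numpy as np

def dp_mapping(D):
    """Minimize equation (5) via the DP recursion of equation (6).
    D: weighted distance matrix D_m(i, j), shape (N, N); allowed steps from
    row i-1 are j (repeat), j-1 (diagonal), and j-2 (skip one element)."""
    N = D.shape[0]
    dist = np.full((N, N), np.inf)
    back = np.zeros((N, N), dtype=int)
    dist[0] = D[0]
    for i in range(1, N):
        for j in range(N):
            prev = [(dist[i - 1, j - d], d) for d in (0, 1, 2) if j - d >= 0]
            best_val, best_d = min(prev)
            dist[i, j] = best_val + D[i, j]
            back[i, j] = best_d
    psi = np.zeros(N, dtype=int)
    j = int(np.argmin(dist[-1]))      # endpoints left free in this sketch
    for i in range(N - 1, -1, -1):
        psi[i] = j
        j -= back[i, j]
    return psi

def train_warp_and_shift(X, Y, gamma_m, n_iter=10):
    """Alternate steps 2 and 3 for one mixture m (fixed iteration count here;
    the paper iterates until the average distance converges).
    X, Y: (T, N) paired source/target SBM parameters; gamma_m: (T,) posteriors."""
    T, N = X.shape
    s = np.zeros(N)                                        # step 1: s_m = 0
    psi = np.arange(N)
    for _ in range(n_iter):
        resid = Y - s                                      # y_t(i) - s_m(i)
        # D_m(i, j) = sum_t gamma_m(x_t) ((y_t(i) - s_m(i)) - x_t(j))^2
        D = np.einsum("t,tij->ij", gamma_m,
                      (resid[:, :, None] - X[:, None, :]) ** 2)
        psi = dp_mapping(D)                                # step 2, eqs. (5)-(6)
        # step 3, eq. (7): weighted average residual after mapping
        s = (gamma_m[:, None] * (Y - X[:, psi])).sum(axis=0) / gamma_m.sum()
    return psi, s
```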
Figure 3. Example of a conversion function: (a) mapping function \psi_m (source frequency vs. target frequency [rad]); (b) shift vector s_m (log-magnitude [dB] vs. frequency [rad]); (c) example of source, target, and converted SBM parameters (log-magnitude [dB] vs. frequency [rad]).
Figure 3 shows an example of (a) the mapping function, (b) the shift vector, and (c) the source, target, and converted parameters. The parameters are plotted at the center frequencies of the respective sub-band basis vectors. The figure shows that the spectral shape moves closer to the target when frequency warping and shift are applied directly to the SBM parameter.
4. EXPERIMENTS
To compare the conventional and proposed methods, a MOS evaluation test was conducted. GMM-based linear regression of mel-cepstrum parameters was used as the baseline. The speech unit databases of one female speaker (624 sentences) and one male speaker (802 sentences) were used as conversion sources, and those of four female speakers (FA, FB, FC, FD) and one male speaker (MA) were used as targets. Adaptation was performed only on voiced speech units. One sentence and 50 sentences were used for adaptation, and a different 50 sentences were used for calculating spectral distance. The sub-cost weights w_i in equation (1) were experimentally set to {10, 3, 1, 1, 3} for the normalized F0 target, duration target, phonetic context, spectrum concatenation, and power concatenation sub-costs, respectively.
Figure 4 shows the objective measure for (a) the proposed method (SBM) and (b) the baseline method (MCEP) for one-sentence adaptation. The log-spectral distance between the reconstructed and target spectra of the test sentences was used as the objective measure; the test data pairs of target parameters and source or adapted parameters were created by unit selection. The x-axis represents the number of mixtures used for conversion, where SOURCE denotes the source speech unit database without conversion, and the y-axis represents the log-spectral distance. The distances for the respective target speakers and their average (indicated as ALL) are plotted. For the proposed method, the distance does not increase rapidly as the number of mixtures increases, whereas for the baseline method it does. Therefore, multiple mixtures can be used with the proposed method even for one-sentence adaptation, meaning that the proposed method reflects the acoustic space of the source speaker efficiently even when little adaptation data is available. One reason is that the number of conversion parameters per mixture is small (100) for SBM conversion, while it is large (2500) for MCEP conversion. Thus, adaptation with the proposed method is rapid.
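For reference, a common definition of the log-spectral distance is sketched below; the paper does not spell out its exact variant, so the dB conversion from natural-log spectra is an assumption.

```python
import numpy as np

def log_spectral_distance(log_spec_a, log_spec_b):
    """RMS difference between two log-magnitude spectra, in dB.
    Inputs are natural-log magnitude spectra of equal length."""
    to_db = 20.0 / np.log(10.0)          # ln-magnitude -> dB conversion factor
    return float(np.sqrt(np.mean((to_db * (log_spec_a - log_spec_b)) ** 2)))
```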
Figure 5 shows the results of the MOS evaluation. The subjects listened to the stimuli together with synthetic speech from a large speech unit database of the target speaker. They gave five-point-scale scores for both speech quality (1: poor, 3: fair, 5: excellent) and similarity (1: different, 3: resembled, 5: same). Seven subjects participated, and four sentences were evaluated for each condition. In Figure 5, (a) shows the comparison between the baseline method (MCEP) and the proposed method (SBM).
Figure 4. Log-spectral distance [dB] versus number of GMM mixtures (1, 2, 4, 8, 16, 32, plus SOURCE) for one-sentence adaptation, per target speaker (FA, FB, FC, FD, MA) and their average (ALL): (a) proposed method (SBM); (b) baseline method (MCEP).

SBM-1 and MCEP-1 use one sentence
for adaptation, and SBM-50 and MCEP-50 use 50 sentences. SOURCE represents synthetic speech from the source speaker's speech unit database without conversion. To synthesize the test samples, the target speaker's prosody generated in the TTS module was used. The figure shows the MOS results for speech quality and similarity, averaged over the target speakers. Based on the results of the objective evaluation, the number of mixtures for the proposed method was set to 2 (FC, MA) or 4 (FA, FB, FD), and that for the baseline was set to 1. For adaptation with 50 sentences, they were set to 64 (FA) or 128 (the others) for the proposed method and 8 for the baseline.
The results show that with one adaptation sentence, the speech quality of the proposed method is higher than that of the baseline system while the similarity stays almost the same. For adaptation with 50 sentences, the scores of the proposed method are close to the baseline for both speech quality and similarity. For the proposed method, the speech quality scores are close to "fair", and the similarity scores are higher than "resembled". The speech quality for SOURCE is higher than the others, but its similarity score is below 2. Consequently, the results show that the proposed method is effective when only a small number of adaptation sentences is available. The proposed method, which synthesizes speech of fair quality, can be used in applications where adaptation data is limited, such as a server-based speech-to-speech translation system or an avatar interface using the user's own voice.
Figure 5 (b) shows the comparison among frequency warping only (DFW), shift only (SHT), and their combination (DFW+SHT). For this evaluation, FB and FC were used as target speakers, and their average scores are shown. The similarity of DFW+SHT is better than that of both DFW and SHT for one-sentence adaptation, and for adaptation with 50 sentences DFW+SHT scored higher in both similarity and speech quality. In summary, the results show that frequency-warping-and-shift-based voice conversion is more effective than conversion by frequency warping only or by shift only.
Figure 5. Mean opinion scores for speech quality (1: poor, 3: fair, 5: excellent) and similarity (1: different, 3: resembled, 5: same): (a) comparison between the baseline (MCEP-1, MCEP-50) and the proposed method (SBM-1, SBM-50), with SOURCE as a reference; (b) comparison among DFW+SHT, DFW, and SHT.

5. CONCLUSION

In this paper, we proposed a voice adaptation method using
GMM-based frequency warping and shift with the parameters of a sub-band basis spectrum model. The proposed method was compared to a baseline GMM-based mel-cepstrum linear regression method. The results showed that the proposed method is effective when only a small number of adaptation utterances is available, and that frequency-warping-and-shift conversion yields higher similarity MOS scores than adaptation by shift only or by frequency warping only. Our future work includes speaking-style adaptation, cross-lingual adaptation, and application of the proposed method to HMM-based speech synthesis.
6. REFERENCES
[1] M. Tamura, T. Kagoshima, and M. Akamine, "Sub-band spectrum parameter for pitch-synchronous log-spectrum and phase based on approximation of sparse coding," Proc. INTERSPEECH, pp. 2406-2409, 2010.
[2] Y. Stylianou, "Voice transformation: a survey," Proc. ICASSP, pp. 3585-3588, Apr. 2009.
[3] Y. Stylianou, O. Cappe, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech & Audio Processing, vol. 6, pp. 131-142, 1998.
[4] T. Toda, H. Saruwatari, and K. Shikano, "Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum," Proc. ICASSP, pp. 841-844, 2001.
[5] D. Erro and A. Moreno, "Weighted frequency warping for voice conversion," Proc. INTERSPEECH, pp. 1965-1968, 2007.
[6] T. Mizutani and T. Kagoshima, "Concatenative speech synthesis based on the plural unit selection and fusion method," IEICE Trans. Inf. & Syst., vol. E88-D, no. 11, pp. 2565-2572, 2005.
[7] M. Tamura and T. Kagoshima, "A study on voice conversion for plural speech unit selection and fusion based speech synthesis," Proc. ASJ 2008, 2-P-5, Sept. 2008 (in Japanese).
[8] Z.-H. Ling, Y.-J. Wu, Y.-P. Wang, L. Qin, and R.-H. Wang, "USTC system for Blizzard Challenge 2006: an improved HMM-based speech synthesis method," Proc. Blizzard Challenge Workshop, 2006.