Improving the Sound Quality of an Electronic Voice Box

Nan Yan*, Manwa L. Ng† and Tan Lee‡
* Laboratory for Ambient Intelligence & Multimodal System, Shenzhen Institutes of Advanced Technology, Chinese Academy of
Sciences/The Chinese University of Hong Kong, Shenzhen, China
E-mail: [email protected] Tel: +86-755-86392174
† Division of Speech and Hearing Sciences, University of Hong Kong, Hong Kong, China
E-mail: [email protected] Tel: +852-39171582
‡ Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, China
E-mail: [email protected] Tel: +852-3943 8267
Abstract— The present project attempted to design the next
generation electronic voice box by modifying the driving signal
of the device. Two steps were involved in the project. The first
step involved designing a working minishaker system, built from an
existing commercial miniature shaker system, that functions
similarly to an electrolarynx but with variable driving signals
(type and frequency). The second step involved carrying out
perceptual experiments to examine listeners’ intelligibility and
acceptability ratings of the voice produced with the minishaker system
under different driving signals by both normal
laryngeal speakers and laryngectomized speakers of Cantonese.
Upon completion of the experiments, the five driving signals and
the five driving frequencies associated with the highest
acceptability scores were obtained. It is anticipated that a next
generation electronic voice box designed based on
these driving signals and frequencies should yield better and
more natural sound quality.
I. INTRODUCTION
Total laryngectomy is a surgery for treating patients with late
stage laryngeal cancer. During the procedure, the entire
voicing mechanism, including all the laryngeal cartilages, the
hyoid bone, the soft tissues of the voice box, and some
adjacent structures, is removed. Due to the loss of voicing
structures, patients lose their ability to speak after the surgery.
Therefore, learning to use an alternative way to speak again
becomes an important part of post-laryngectomy
rehabilitation in these laryngeal cancer survivors. This ability
is closely associated with their quality of life following
laryngectomy.
According to statistics, there are currently over 600,000
laryngectomees worldwide, with over 12,000 new cases every year
in the U.S.A. [1]. A conservative estimate indicates that there are
currently at least 65,000 laryngectomees in China, and over 2,700
active laryngectomees in Hong Kong [2]. Although different types of
post-laryngectomy speech are available, such as esophageal,
tracheoesophageal, pneumatic artificial laryngeal, and
electrolaryngeal speech, almost all laryngectomy patients rely
on a device known as an electronic voice box, or electrolarynx, as their
only “talking device” right after the surgery, before they learn
other forms of post-laryngectomy speech. Many of them
continue to use an electrolarynx as their primary speaking method later on. In fact,
the majority of laryngeal cancer survivors use an electrolarynx
as their artificial speaking device [2].
978-616-361-823-8 © 2014 APSIPA
An electrolarynx (EL) is a hand-held device that functions as a
new sound source. During EL speech, the device is held against the
anterolateral side of speaker’s neck. When the button on the device is
pressed, sound generated by the vibrating plate of the EL is
transmitted to the inside of the speaker’s throat (vocal tract) via neck
tissue [3]. The transmitted signal is resonated inside the vocal tract to
form different speech sounds. Different vocal tract configurations
which are determined by the positioning of various speech structures
inside the vocal tract (e.g., lips, tongue, jaw, soft palate, etc.) possess
different resonance characteristics (transfer functions), and thus
different sounds can be produced. For example, during production of
different vowels, the voicing system (the larynx) is basically doing
the same thing, while the articulatory system of the vocal tract
changes to yield different sounds. In particular, the tongue tip is elevated
and fronted for the production of /i/, as in “see”, whereas the back of
the tongue is elevated and backed when producing the sound /u/,
combined with rounded lips, as in “zoo”. It is the unique vocal tract
configuration associated with each speech sound that
distinguishes it from other speech sounds.
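The idea that one sound source yields different vowels under different vocal tract configurations can be illustrated with a minimal sketch. Here the same pulse train is passed through two sets of two-pole resonators, each set standing in for one vocal tract configuration. The sampling rate, the formant frequencies and bandwidths for /i/ and /u/, and all function names are assumptions chosen for illustration (rough textbook-style values for an adult male voice), not measurements or methods from this study.

```python
import math

FS = 16000  # sampling rate in Hz (assumed for illustration)

def pulse_train(f0, duration, fs=FS):
    """Unit impulses at roughly f0 Hz: a crude stand-in for the sound source."""
    period = int(round(fs / f0))  # period rounded to whole samples
    return [1.0 if i % period == 0 else 0.0 for i in range(int(duration * fs))]

def resonator(x, freq, bandwidth, fs=FS):
    """Two-pole resonator approximating one vocal tract resonance (formant)."""
    r = math.exp(-math.pi * bandwidth / fs)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / fs)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = s + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def shape(source, formants):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    y = source
    for freq, bw in formants:
        y = resonator(y, freq, bw)
    return y

def energy_at(x, freq, fs=FS):
    """Spectral energy of x at a single frequency (one-bin DFT)."""
    w = 2.0 * math.pi * freq / fs
    re = sum(s * math.cos(w * i) for i, s in enumerate(x))
    im = sum(s * math.sin(w * i) for i, s in enumerate(x))
    return re * re + im * im

# Assumed first two formants as (frequency Hz, bandwidth Hz) pairs.
FORMANTS_I = [(270, 80), (2290, 80)]  # /i/ as in "see"
FORMANTS_U = [(300, 80), (870, 80)]   # /u/ as in "zoo"

source = pulse_train(f0=100, duration=0.2)  # identical source for both vowels
vowel_i = shape(source, FORMANTS_I)
vowel_u = shape(source, FORMANTS_U)
```

Comparing `energy_at(vowel_i, 2300)` with `energy_at(vowel_u, 2300)`, and likewise at 900 Hz, shows the two outputs concentrating energy in different regions even though the source is unchanged.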
Despite the popularity of using EL among laryngeal cancer
survivors who have undergone total laryngectomy, EL speech
performance is far from satisfactory due to its poor sound quality.
The literature consistently reports that EL sound is characterized by
low pitch (about one octave below normal laryngeal sound),
monotony (lack of pitch fluctuation), reduced loudness, and a robotic,
unnatural, and nonhuman sound quality [cf. 4-6]. Such poor sound
quality negatively affects the perception of EL speech by the general
public, particularly for female EL speakers as it greatly affects their
gender identity. The reduced intensity associated with EL speech
often significantly lowers EL speech intelligibility and renders EL
speakers unwilling to speak in public, especially in noisy
environments. In addition, laryngeal cancer survivors who use an
EL as their primary speaking method often complain that the
unnatural EL sound draws unwanted attention from others and
creates communication barriers [7]. The verbal communication
and quality of life of laryngeal cancer survivors who use an EL are adversely
affected by the poor sound quality associated with the EL device
[7].
Although current EL technology has been in development for more
than 40 years, little improvement in the sound quality of the EL
device has been seen [8]. Researchers have consistently found that the output
speech sound of EL speakers has diminished energy in the low
frequency region (below 500 Hz) when compared to the speech
sound produced by normal laryngeal speakers [cf. 9-11]. The low-frequency
energy deficit in EL speech sound likely contributes to the
perception of unnatural and nonhuman EL sound quality, and thus
the poor EL speech intelligibility. In addition to the low-frequency
energy deficit, the lack of pitch and amplitude fluctuations (jitter and
shimmer) in EL sound also contributes to the perception of robotic
and nonhuman sound quality. A recent study examined the sound
quality after the low-frequency energy was boosted and reported
positive results. Efforts have also been made to improve EL speech
sound by using an adaptive filtering algorithm [cf. 12] and a subtraction-type
algorithm [9]. However, these are post-processing strategies and
cannot be implemented simultaneously with EL speech production.
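The low-frequency energy deficit described above can be expressed as the fraction of spectral energy lying at or below a cutoff (500 Hz here, following the text). The sketch below is illustrative only: the naive DFT, the function name, and the two synthetic two-tone signals are assumptions and do not reproduce the analyses of the cited studies.

```python
import cmath
import math

def low_freq_energy_fraction(x, fs, cutoff_hz=500.0):
    """Fraction of spectral energy at or below cutoff_hz (naive DFT, O(n^2))."""
    n = len(x)
    total = low = 0.0
    for k in range(1, n // 2):  # positive-frequency bins, DC excluded
        coeff = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(x))
        energy = abs(coeff) ** 2
        total += energy
        if k * fs / n <= cutoff_hz:
            low += energy
    return low / total if total > 0 else 0.0

FS = 8000  # sampling rate in Hz (assumed)
N = 400    # 50 ms analysis window (assumed)

def two_tone(low_amp, high_amp):
    """Synthetic signal: a 200 Hz component plus a 1000 Hz component."""
    return [low_amp * math.sin(2 * math.pi * 200 * i / FS)
            + high_amp * math.sin(2 * math.pi * 1000 * i / FS)
            for i in range(N)]

balanced = two_tone(1.0, 0.5)   # strong low-frequency component
deficient = two_tone(0.1, 0.5)  # weak low-frequency component

frac_balanced = low_freq_energy_fraction(balanced, FS)
frac_deficient = low_freq_energy_fraction(deficient, FS)
```

A signal with a weakened low-frequency component gives a markedly smaller fraction, which is the kind of contrast reported between EL speech and normal laryngeal speech.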
An EL is a vibratory device that serves as an electronic sound
source, in a way similar to the human vocal cords. The working
mechanism of an EL is similar to that of a mechanical shaker, in which the
pattern of vibration of the membrane is governed by the input
driving signal. Yet, the exact driving signals used by
commercially available ELs are not known due to proprietary reasons.
However, it has been revealed that, for example, the Siemens EL has a
driving signal that resembles an impulse train of alternating sign, and
Bell’s EL is driven by a square wave [13]. It is not known why
impulse trains and square waves were chosen as the input driving signals
for ELs. No established studies have reported that using an
impulse or square wave as the EL driving signal yields superior
sound quality or better listeners’ acceptability or intelligibility. In an
attempt to enhance EL sound, researchers made use of broadband
white noise as the driving signal [8,13]. This is obtained by inverse-filtering
the neck frequency response function extracted from
laryngectomees. The researchers claimed that this could yield a
sound quality that is similar to a vocal cord sound. However,
whether an EL sound driven by a broadband white noise signal can
yield a near-normal sound remains questionable. Moreover, how
such EL sound improves intelligibility is yet to be determined. In
view of this knowledge gap, the present study attempted to
examine how the input driving signal of an EL is associated with
listeners’ intelligibility and acceptability. By prototyping an EL with
different driving signals and different vibratory frequencies, and by
assessing the listeners’ acceptability and intelligibility of the
resultant EL sound, a new generation EL device with enhanced sound
quality can be developed. Specifically, the aims of the study
were: (1) to design a simulation of an EL, (2) to find the five
driving signals that yield the highest acceptability scores, and (3)
to find the five driving frequencies that yield the highest
acceptability scores.
II. METHODS
A. Simulation of an electrolarynx - Design of a minishaker system
Little is known about the design of conventional
electrolarynges (ELs) due to proprietary reasons. However,
the mechanism of a conventional EL can generally be
illustrated by Fig. 1. During EL speech production, sound
generated by the vibration of EL is transmitted to the inside of
the vocal tract of EL speaker (see Fig. 1). The EL speaker’s
neck frequency response function determines how much and
what acoustic energy is being transmitted across the neck
(transcervically). The sound is then resonated inside the vocal
tract. The way the transcervical sound is resonated depends on
the vocal tract resonance characteristics (vocal tract transfer
function) which are determined by the vocal tract
configuration (shaped by various speech structures inside).
This helps laryngeal cancer survivors who have undergone
total laryngectomy make sound by substituting the lost voice
box with an electronic artificial voice box - the EL, an
alternative sound source.
In the present project, a minishaker system that functions in a
similar manner as a conventional EL was designed. The schematic of
the minishaker system is shown in Fig. 2. The minishaker system
was constructed by connecting a small vibration exciter (Model
4810, Brüel & Kjær), driven by a 75 VA power amplifier (Model
2718, Brüel & Kjær), to an impedance head (Model 8001, Brüel &
Kjær) and a conditioning amplifier (Model 2692-A-OS2, Brüel &
Kjær). The power amplifier was connected to a computer from
which the variable driving signal was input to the system. By
manipulating the driving signal, the vibratory pattern of the vibratory
exciter could be changed.
Fig. 1 Flowchart of the mechanism of existing electrolarynx.
Fig. 2 Schematic diagram of the minishaker system used in the study.
Use of the minishaker system is similar to using an EL by an
alaryngeal speaker. When in use, the speaker holds the minishaker
device in the hand, and the vibratory plate of the minishaker system
is coupled to the neck of the user, as in using the electrolarynx.
Sound energy is then transmitted transcervically to the vocal tract for
resonance.
Fig. 3 The vibratory exciter and the minishaker system used in the study to
simulate an electrolarynx.
B. Assessment of listeners’ acceptability
To assess listeners’ intelligibility and acceptability
associated with the sound of the minishaker system, two sets
of experiments were carried out: (1) stimulus production, and
(2) perceptual rating experiment. The perceptual rating
experiment was carried out in two steps. The first step
involved production of speech samples using a total of 54
different driving signals by 4 (2 laryngeal and 2
laryngectomized) speakers. The 54 driving signals were
designed by using impulse, square wave, sawtooth wave, and
sinusoidal wave. After the rating experiment, the 10 signals
associated with the best intelligibility were shortlisted and the
second step began. The second step involved production of
speech stimuli by six (3 laryngeal and 3 laryngectomized)
speakers using the shortlisted signals. Upon completion of the
second step, the five signals associated with the best listeners’
acceptability were obtained.
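As an illustration of how such driving signals might be synthesized digitally before being sent to the power amplifier, the sketch below generates the basic waveforms named above and sums selected components, optionally with Gaussian noise. The sampling rate, amplitudes, noise level, peak normalization, and all function names are assumptions for illustration; the paper does not specify how the 54 signals were parameterized.

```python
import math
import random

FS = 20000  # sampling rate in Hz (assumed)

def _phase(i, f0, fs):
    """Position of sample i within the current cycle, in [0, 1)."""
    return (i * f0 / fs) % 1.0

def impulse_train(n, f0, fs=FS):
    period = int(round(fs / f0))  # period rounded to whole samples
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def square_wave(n, f0, fs=FS):
    return [1.0 if _phase(i, f0, fs) < 0.5 else -1.0 for i in range(n)]

def sawtooth_up(n, f0, fs=FS):
    return [2.0 * _phase(i, f0, fs) - 1.0 for i in range(n)]

def sawtooth_down(n, f0, fs=FS):
    return [1.0 - 2.0 * _phase(i, f0, fs) for i in range(n)]

def sine_wave(n, f0, fs=FS):
    return [math.sin(2.0 * math.pi * f0 * i / fs) for i in range(n)]

def gaussian_noise(n, sigma=0.05, seed=0):
    """Zero-mean Gaussian noise; sigma and seed are arbitrary choices."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n)]

def mix(*components):
    """Sum components sample by sample and scale the peak magnitude to 1."""
    total = [sum(samples) for samples in zip(*components)]
    peak = max(abs(s) for s in total)
    return [s / peak for s in total] if peak > 0 else total

n, f0 = FS, 108  # one second of signal at a nominal 108 Hz
impulse_plus_noise = mix(impulse_train(n, f0), gaussian_noise(n))
impulse_plus_saw_down = mix(impulse_train(n, f0), sawtooth_down(n, f0))
```

Summing components and rescaling keeps every candidate signal within the same amplitude range, so that differences between candidates come from waveform shape rather than level. Whether the original study equalized levels this way is not stated.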
To obtain the five best frequencies, the best driving signal
(impulse) was used to produce speech stimuli at 12 different
frequencies (from 90 Hz to 140 Hz, with a 3 Hz increment).
From the 12 different frequencies, the five frequencies
associated with the best listeners’ acceptability were
shortlisted. The experimental setup for stimulus production
and perceptual rating is discussed below.
1) Stimulus production.
To obtain the stimuli for the perceptual rating experiment,
both healthy laryngeal and laryngectomized speakers were
recruited to read aloud a short Chinese passage: “The North
Wind and The Sun”. The speakers were all adult native male
speakers of Cantonese. During the recording, speech samples
were produced by using the minishaker system. The speakers
read the passage by coupling the plate of the minishaker
system against the neck. The acoustic signals were digitized at
20 kHz and 16 bits/sample, and recorded using a high-quality
microphone (SM58, Shure) and an external sound card
(MobilPre USB, M-Audio) for later perceptual experiment.
2) Perceptual rating experiment.
Naïve listeners who had no prior experience with any form
of alaryngeal speech were recruited to participate in the
perceptual rating experiment. All listeners were adult native
speakers of Cantonese. The experimental procedure and
protocol used in the present study were similar to those
reported by Ng, Kwok, and Chow [6], in which listeners’
acceptability and intelligibility associated with different types
of laryngectomized speech were assessed. Assessment of
acceptability and intelligibility of speech produced by using
the minishaker system was based on ratings for the following
six attributes: (1) voice quality, (2) articulation proficiency, (3)
quietness of speech, (4) pitch variation, (5) accuracy in tone
production, and (6) overall speech intelligibility. During the
experiment, the listeners were seated in a sound-treated room
and speech stimuli were presented to the listeners at a
comfortable loudness level (about 56-70 dB SPL). The
listeners were requested to rate the above attributes on a 1-7
equal-interval scale, with a “1” representing the worst and a
“7” the best ratings. The raters were instructed to circle the
rating on an answer sheet provided before the experiment. To
familiarize the raters with the procedure, trials were given
before the experiment began. Speech acceptability and
intelligibility were evaluated based on the ratings obtained.
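One simple way to turn such ratings into a ranking is to average the 1-7 ratings over listeners and attributes for each driving signal and then sort. The sketch below assumes this unweighted mean; the paper does not state how the ratings were aggregated, and the signal labels and ratings in the example are invented placeholders, not data from the study.

```python
from statistics import mean

ATTRIBUTES = (
    "voice quality",
    "articulation proficiency",
    "quietness of speech",
    "pitch variation",
    "accuracy in tone production",
    "overall speech intelligibility",
)

def mean_rating(ratings):
    """Average per-listener rating tuples (one 1-7 value per attribute)."""
    for listener in ratings:
        if len(listener) != len(ATTRIBUTES):
            raise ValueError("expected one rating per attribute")
        if any(not 1 <= r <= 7 for r in listener):
            raise ValueError("ratings must lie on the 1-7 scale")
    return mean(r for listener in ratings for r in listener)

def rank_signals(ratings_by_signal, top_n=5):
    """Return signal names sorted by mean rating, best first."""
    scores = {name: mean_rating(r) for name, r in ratings_by_signal.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical ratings from two listeners for three placeholder signals.
example = {
    "signal A": [(5, 5, 4, 3, 4, 5), (6, 5, 4, 3, 4, 5)],
    "signal B": [(3, 4, 3, 2, 3, 4), (4, 4, 3, 2, 3, 3)],
    "signal C": [(4, 5, 4, 3, 4, 4), (5, 4, 4, 3, 4, 5)],
}
top_two = rank_signals(example, top_n=2)
```

Other aggregation choices, such as weighting overall intelligibility more heavily or ranking on a single attribute, would fit the same structure by changing `mean_rating`.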
III. RESULTS AND DISCUSSION
A. Five Best Driving Signals
After the first step of the perceptual experiment, ten driving
signals associated with the best acceptability were shortlisted.
They were: (1) impulse + Gaussian noise, (2) upward
sawtooth wave + sinusoidal wave, (3) impulse only, (4)
upward sawtooth wave + sinusoidal wave + Gaussian noise,
(5) impulse + upward sawtooth wave, (6) impulse +
downward sawtooth wave, (7) impulse + upward sawtooth
wave + Gaussian noise, (8) impulse + square wave, (9)
downward sawtooth wave only, and (10) impulse +
downward sawtooth wave + Gaussian noise. After the second
step of the perceptual experiment, the five best signals
obtained are shown in Table 1.
TABLE I. THE FIVE BEST DRIVING SIGNALS
Rank    Driving signals
1       Impulse only
2       Square wave + pulse train
3       Impulse + Gaussian noise
4       Downward sawtooth wave + pulse train
5       Impulse + downward sawtooth wave
From Table 1, it is observed that impulse, Gaussian noise,
and pulse train are common components of the EL sounds that
listeners found more acceptable. These in fact resemble some
of the defining characteristics of human voice [14]. According
to the myoelastic-aerodynamic theory of sound production
[14], human speech sound is generated by the periodic opening
and closing of the vocal cords. The successive, yet imperfect,
opening and closing of the vocal cords result in trains of
acoustic energy being generated and propagated along the
supralaryngeal vocal tract. Such a sound production mechanism
yields a sound that is pulsatile in nature, but the imperfect
glottal vibration inevitably adds noise to the sound being
produced. This might be the reason why impulse, Gaussian
noise, and pulse train components are found in the five best driving
signals, as they should sound more similar to human sounds.
B. Best Five Signal Frequencies
Using the best driving signal (impulse only), speech
samples were produced by the speakers at different
frequencies. After the perceptual experiment was completed,
the five frequencies associated with the best listeners’
acceptability were obtained. The procedure was similar to the
previous one, and the results of this experiment are shown in
Table 2.
TABLE II. THE FIVE BEST SIGNAL FREQUENCIES
Rank    Signal frequencies (using impulse as driving signal)
1       108 Hz
2       105 Hz
3       111 Hz
4       120 Hz
5       140 Hz
As can be seen in Table 2, the best frequency obtained was
108 Hz. This value is similar to the values reported for normal male
speakers in the literature [cf. 15, 16]. According to
Baken [15], this frequency resembles the average voice
fundamental frequency of age-matched normal adult males
phonating with vocal cords. This may be the reason why it
received the highest listeners’ acceptability. However,
as only male speakers were recruited, our data apply only to
male EL speakers. The most favorable frequency or frequencies for
female EL speakers are not known. Further studies should
explore the use of EL by female laryngectomized speakers,
with special focus on gender identity.
C. Differences between the minishaker system and the existing electrolarynx
Although the present minishaker system was used as an
alternative sound source in a way similar to conventional EL,
marked differences are observed between the existing
electrolarynx and the minishaker system. The most significant
difference lies in the way sound is produced. For an
electrolarynx, sound waves are created by the vibrating
plunger continuously hammering the back of the vibratory
membrane of the device, creating an impulse-type signal. The
current minishaker system, however, creates sounds by
simply vibrating the plate of the exciter, with no impact
involved. This allows the easy manipulation of the vibratory
characteristics of the minishaker through controlling the input
driving signal. However, the intensity associated with the
minishaker system is unavoidably lower than that of an
electrolarynx. The reduced intensity in the sound generated by
the minishaker might have negatively affected the
intelligibility and acceptability scores obtained in the study.
Clearly, more studies are needed to investigate the
different sounding mechanisms of the EL and the
minishaker system. A preliminary way of compensating for the
reduced intensity of the current minishaker system may be
to use a larger vibratory plate, as prototyped in Fig. 5.
However, coupling a minishaker with a large vibratory
plate to the neck of the user may be another challenge.
IV. CONCLUSIONS
A minishaker system was designed to simulate the working
of an electrolarynx. The flexibility in manipulating the input
driving signal allowed us to change the driving signal and find
out the best driving signals - the ones associated with the best
listeners’ acceptability.
The present study found that the best five driving signals
are: (1) impulse only, (2) impulse + Gaussian noise, (3) pulse
train + square wave, (4) downward sawtooth wave + pulse
train, and (5) impulse + downward sawtooth wave. The best
five frequencies are: (1) 108 Hz, (2) 105 Hz, (3) 111 Hz, (4)
120 Hz, and (5) 140 Hz.
Fig. 4 A schematic diagram showing how sound is generated by an
electrolarynx.
Fig. 5 A schematic diagram showing how sound is generated by an
electrolarynx.
ACKNOWLEDGMENT
The project was partially funded by the Innovation and
Technology Commission, Hong Kong (ITS/517/09), General
Research Funds, Hong Kong (CUHK 414010) and National
Natural Science Foundation of China (NSFC 61135003,
NSFC 90920002). We would like to thank all the participating
laryngectomy patients.
REFERENCES
[1] International Association of Laryngectomees, Laryngectomized Speakers Source Book.
International Association of Laryngectomees, 2000.
[2] New Voice Club of Hong Kong. The 13th Executive Committee Report
(2004-2007). The New Voice Club of Hong Kong, Hong Kong, 2007.
[3] J. W. Lerman, “The artificial larynx” In Alaryngeal Speech
Rehabilitation - For Clinicians by Clinicians, S. J. Salmon and K. H.
Mount, Eds. Austin, TX: Pro-Ed, Inc, 1991, pp. 27-46.
[4] S. Bennett and B. Weinberg, “Acceptability ratings of normal,
esophageal, and artificial larynx speech”, J Speech Lang Hear Res.,
vol.16 pp.608-615, 1973.
[5] G. Meltzner, R. E. Hillman, et al., “Electrolaryngeal speech - the state of
the art and future directions for development”, in Contemporary
Considerations in the Treatment and Rehabilitation of Head and Neck
Cancer – Voice, Speech, and Swallowing. P. C. Doyle & R. L. Keith,
Eds. Austin, TX: Pro-Ed, 2005.
[6] M. L. Ng, C.-L. I. Kwok, and S.-F. W. Chow, “Speech performance of
adult Cantonese-speaking laryngectomees using different types of
alaryngeal phonation,” J. Voice, vol. 11, pp. 338-344, 1997.
[7] G. S. Meltzner and R. E. Hillman, “Impact of abnormal acoustic
properties on the perceived quality of electrolaryngeal speech,” in ISCA
Tutorial and Research Workshop on Voice Quality: Functions, Analysis
and Synthesis, 2003.
[8] G. S. Meltzner, J. B. Kobler, and R. E. Hillman, “Measuring the neck
frequency response function of laryngectomy patients: Implications for
the design of electrolarynx devices,” J. Acoust. Soc. Am., vol. 114, pp.
1035-1047, 2003.
[9] H. Liu and M. L. Ng, “Electrolarynx in voice rehabilitation,” Auris
Nasus Larynx, vol. 34, pp. 327-332, 2007.
[10] Y. Qi and B. Weinberg, “Low-frequency energy deficit in
electrolaryngeal speech,” J Speech Lang Hear Res., vol. 34, pp. 1250-1256, 1991.
[11] M. S. Weiss, G. H. Yeni‐Komshian, and J. M. Heinz, “Acoustical and
perceptual characteristics of speech produced with an electronic
artificial larynx,” J. Acoust. Soc. Am., vol. 65, pp. 1298-1308, 1979.
[12] C. Y. Espy-Wilson, V. R. Chari, et al., “Enhancement of
electrolaryngeal speech by adaptive filtering,” J Speech Lang Hear Res.,
vol. 41, pp. 1253-1264, 1998.
[13] R. L. Norton and R. S. Bernstein, “Improved laboratory prototype
electrolarynx (LAPEL): Using inverse filtering of the frequency
response function of the human throat,” Ann. Biomed. Eng., vol. 21, pp.
163-174, 1993.
[14] W. R. Zemlin, Speech and Hearing Science - Anatomy and Physiology,
Allyn & Bacon, Needham Heights, MA, 1997.
[15] R. J. Baken, Clinical Measurement of Speech and Voice, San Diego,
CA: Singular Publishing Group, 1996.