Improving the Sound Quality of an Electronic Voice Box

Nan Yan*, Manwa L. Ng† and Tan Lee‡

* Laboratory for Ambient Intelligence & Multimodal System, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences/The Chinese University of Hong Kong, Shenzhen, China
E-mail: [email protected] Tel: +86-755-86392174
† Division of Speech and Hearing Sciences, University of Hong Kong, Hong Kong, China
E-mail: [email protected] Tel: +852-39171582
‡ Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, China
E-mail: [email protected] Tel: +852-3943 8267

Abstract— The present project attempted to design the next-generation electronic voice box by modifying the driving signal of the device. The project involved two steps. The first step was to design a working minishaker system, built from an existing commercial miniature shaker system, that functions similarly to an electrolarynx but accepts variable driving signals (type and frequency). The second step was to carry out perceptual experiments examining the intelligibility and acceptability, as rated by listeners, of the voice produced with the minishaker system under different driving signals by both normal laryngeal speakers and laryngectomized speakers of Cantonese. Upon completion of the experiments, the five driving signals and the five driving frequencies associated with the highest acceptability scores were obtained. It is anticipated that a next-generation electronic voice box designed around these driving signals and frequencies should yield better and more natural sound quality.

I. INTRODUCTION

Total laryngectomy is a surgery for treating patients with late-stage laryngeal cancer. During the procedure, the entire voicing mechanism, including all the laryngeal cartilages, the hyoid bone, the soft tissues of the voice box, and some adjacent structures, is removed. Due to the loss of the voicing structures, patients lose their ability to speak after the surgery.
Therefore, learning an alternative way to speak again becomes an important part of post-laryngectomy rehabilitation for these laryngeal cancer survivors. This ability is closely associated with their quality of life following laryngectomy. According to statistics, there are currently over 600,000 laryngectomees worldwide, with over 12,000 new cases every year in the U.S.A. [1]. A conservative estimate indicates that there are currently at least 65,000 laryngectomees in China, and over 2,700 active laryngectomees in Hong Kong [2]. Although different types of post-laryngectomy speech are available, such as esophageal, tracheoesophageal, pneumatic artificial laryngeal, and electrolaryngeal speech, almost all laryngectomy patients rely on a device known as an electronic voice box, or electrolarynx, as their only “talking device” right after the surgery, before they learn other forms of post-laryngectomy speech. Many of them continue to use the electrolarynx as their primary speaking method later on. In fact, the majority of laryngeal cancer survivors use an electrolarynx as their artificial speaking device [2].

978-616-361-823-8 © 2014 APSIPA

An electrolarynx (EL) is a hand-held device that functions as a new sound source. During EL speech, the device is held against the anterolateral side of the speaker’s neck. When the button on the device is pressed, sound generated by the vibrating plate of the EL is transmitted to the inside of the speaker’s throat (vocal tract) via the neck tissue [3]. The transmitted signal resonates inside the vocal tract to form different speech sounds. Different vocal tract configurations, which are determined by the positioning of the various speech structures inside the vocal tract (e.g., lips, tongue, jaw, soft palate), possess different resonance characteristics (transfer functions), and thus different sounds can be produced.
For example, during the production of different vowels, the voicing system (the larynx) is basically doing the same thing, while the articulatory system of the vocal tract changes to yield different sounds. In particular, the tongue tip is elevated and fronted for the production of /i/, as in “see”, whereas the back of the tongue is elevated and backed, combined with rounded lips, when producing /u/, as in “zoo”. It is the unique vocal tract configuration associated with each speech sound that distinguishes it from other speech sounds.

Despite the popularity of the EL among laryngeal cancer survivors who have undergone total laryngectomy, EL speech performance is far from satisfactory due to its poor sound quality. The literature consistently reports that EL sound is characterized by low pitch (above one octave below normal laryngeal sound), monotony (lack of pitch fluctuation), reduced loudness, and a robotic, unnatural, nonhuman sound quality [cf. 4-6]. Such poor sound quality negatively affects the perception of EL speech by the general public, particularly for female EL speakers, as it greatly affects their gender identity. The reduced intensity associated with EL speech often significantly lowers EL speech intelligibility and makes EL speakers unwilling to speak in public, especially in noisy environments. In addition, laryngeal cancer survivors who use the EL as their primary speaking method often complain that the unnatural EL sound draws unwanted attention from others and creates communication barriers [7]. The verbal communication and quality of life of EL users are adversely affected by the poor sound quality of the EL device [7]. Although EL technology has been in development for more than 40 years, little improvement in the sound quality of the EL device has been seen [8].
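The source-filter account given earlier in this section (one sound source, different vocal tract shapes) also applies to EL speech, where the device rather than the larynx supplies the source. A minimal numerical sketch follows. Everything in it is an illustrative assumption rather than part of the study: the 20 kHz sampling rate, the idealized impulse-train source, the 80 Hz resonance bandwidth, and the formant pairs, which are approximate textbook values for adult male /i/ and /u/.

```python
import math

FS = 20000  # sampling rate in Hz (assumed for this sketch)

def impulse_train(f0, n, fs=FS):
    """n samples of unit impulses one pitch period apart (idealized source)."""
    period = round(fs / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(x, freq, bandwidth, fs=FS):
    """Two-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth / fs)          # pole radius from bandwidth
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / fs)
    a2 = -r * r
    y, y1, y2 = [], 0.0, 0.0
    for s in x:
        out = s + a1 * y1 + a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y

def vocal_tract(x, formants, bandwidth=80.0):
    """Cascade one resonator per formant; stands in for the transfer function."""
    for f in formants:
        x = resonator(x, f, bandwidth)
    return x

def band_energy(x, lo, hi, fs=FS, step=100):
    """Sum of squared DFT magnitudes at lo, lo+step, ..., hi (Hz)."""
    total = 0.0
    for f in range(lo, hi + 1, step):
        w = 2.0 * math.pi * f / fs
        re = sum(s * math.cos(w * i) for i, s in enumerate(x))
        im = sum(s * math.sin(w * i) for i, s in enumerate(x))
        total += re * re + im * im
    return total

# Same source, two hypothetical vocal tract configurations.
source = impulse_train(100, 2000)                 # 0.1 s at 100 Hz
vowel_i = vocal_tract(source, (270, 2290))        # assumed F1, F2 for /i/
vowel_u = vocal_tract(source, (300, 870))         # assumed F1, F2 for /u/
```

Comparing `band_energy(vowel_i, 2000, 2600)` with `band_energy(vowel_u, 2000, 2600)` shows where the two outputs differ, even though `source` is identical in both cases.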
Researchers have consistently found that the speech of EL speakers has diminished energy in the low-frequency region (below 500 Hz) compared with the speech produced by normal laryngeal speakers [cf. 9-11]. The low-frequency energy deficit in EL speech likely contributes to the perception of unnatural and nonhuman EL sound quality, and thus to the poor EL speech intelligibility. In addition to the low-frequency energy deficit, the lack of pitch and amplitude fluctuations (jitter and shimmer) in EL sound also contributes to the perception of robotic and nonhuman sound quality. A recent study examined the sound quality after the low-frequency energy was boosted and reported positive results. Efforts have since been made to improve EL speech by using an adaptive filtering algorithm [cf. 12] and a subtraction-type algorithm [9]. However, these are post-processing strategies and cannot be implemented simultaneously with EL speech production.

An EL is a vibratory device that produces an electronic sound source, in a way similar to the human vocal cords. The working mechanism of an EL is similar to that of a mechanical shaker, in which the pattern of vibration of the membrane is governed by the input driving signal. Yet the exact driving signals used by commercially available ELs are not known, for proprietary reasons. However, it has been revealed that, for example, the Siemens EL has a driving signal that resembles an impulse train of alternating sign, and Bell’s EL is driven by a square wave [13]. It is not known why an impulse train and a square wave were chosen as the input driving signals for ELs. No established studies report that using an impulse or square wave as the EL driving signal yields superior sound quality or better listener acceptability or intelligibility. In an attempt to enhance EL sound, researchers made use of broadband white noise as the driving signal [8,13].
This is obtained by inverse filtering the neck frequency response function extracted from laryngectomees. The researchers claimed that this could yield a sound quality similar to that of a vocal cord sound. However, whether an EL sound driven by a broadband white noise signal can yield a near-normal sound remains doubtful. Moreover, how such an EL sound improves intelligibility is yet to be determined.

In view of this knowledge gap, the present study attempted to examine how the input driving signal of an EL is associated with listeners’ ratings of intelligibility and acceptability. By prototyping an EL with different driving signals and different vibratory frequencies, and by assessing the acceptability and intelligibility of the resultant EL sound, a new-generation EL device with enhanced sound quality can be developed. Specifically, the purposes of the study were: (1) to design a simulation of an EL, (2) to find the five driving signals that yield the highest acceptability scores, and (3) to find the five driving frequencies that yield the highest acceptability scores.

II. METHODS

A. Simulation of an electrolarynx - Design of a minishaker system

Little is known about the design of conventional electrolarynges (ELs) due to proprietary reasons. However, the mechanism of a conventional EL can generally be illustrated by Fig. 1. During EL speech production, sound generated by the vibration of the EL is transmitted to the inside of the vocal tract of the EL speaker (see Fig. 1). The EL speaker’s neck frequency response function determines how much and what acoustic energy is transmitted across the neck (transcervically). The sound then resonates inside the vocal tract. The way the transcervical sound resonates depends on the vocal tract resonance characteristics (vocal tract transfer function), which are determined by the vocal tract configuration (shaped by the various speech structures inside). This helps laryngeal cancer survivors who have undergone total laryngectomy make sound by substituting the lost voice box with an electronic artificial voice box - the EL, an alternative sound source.

Fig. 1 Flowchart of the mechanism of existing electrolarynx.

In the present project, a minishaker system that functions in a similar manner to a conventional EL was designed. The schematic of the minishaker system is shown in Fig. 2. The minishaker system was constructed by connecting a small vibration exciter (Model 4810, Brüel & Kjær), driven by a 75 VA power amplifier (Model 2718, Brüel & Kjær), to an impedance head (Model 8001, Brüel & Kjær) and a conditioning amplifier (Model 2692-A-OS2, Brüel & Kjær). The power amplifier was connected to a computer from which the variable driving signal was input to the system. By manipulating the driving signal, the vibratory pattern of the vibration exciter could be changed.

Fig. 2 Schematic diagram of the minishaker system used in the study.

Use of the minishaker system is similar to the use of an EL by an alaryngeal speaker. When in use, the speaker holds the minishaker device in the hand, and the vibratory plate of the minishaker system is coupled to the neck of the user, as with an electrolarynx. Sound energy is then transmitted transcervically to the vocal tract for resonance.

Fig. 3 The vibratory exciter and the minishaker system used in the study to simulate an electrolarynx.

B. Assessment of listeners’ acceptability

To assess the intelligibility and acceptability, as rated by listeners, of the sound of the minishaker system, two sets of experiments were carried out: (1) stimulus production, and (2) perceptual rating experiment. The perceptual rating experiment was carried out in two steps.
The first step involved production of speech samples using a total of 54 different driving signals by four speakers (two laryngeal and two laryngectomized). The 54 driving signals were designed using impulse, square wave, sawtooth wave, and sinusoidal wave. After the rating experiment, the 10 signals associated with the best intelligibility were shortlisted and the second step began. The second step involved production of speech stimuli by six speakers (three laryngeal and three laryngectomized) using the shortlisted signals. Upon completion of the second step, the five signals associated with the best listeners’ acceptability were obtained. To obtain the five best frequencies, the best driving signal (impulse) was used to produce speech stimuli at 12 different frequencies (from 90 Hz to 140 Hz, with a 3 Hz increment). From the 12 different frequencies, the five frequencies associated with the best listeners’ acceptability were shortlisted. The experimental setup for stimulus production and perceptual rating is discussed below.

1) Stimulus production. To obtain the stimuli for the perceptual rating experiment, both healthy laryngeal and laryngectomized speakers were recruited to read aloud a short Chinese passage: “The North Wind and The Sun”. The speakers were all adult male native speakers of Cantonese. During the recording, speech samples were produced using the minishaker system. The speakers read the passage with the plate of the minishaker system coupled against the neck. The acoustic signals were recorded using a high-quality microphone (SM58, Shure) and an external sound card (MobilePre USB, M-Audio), digitized at 20 kHz with 16 bits/sample, and stored for the later perceptual experiment.

2) Perceptual rating experiment. Naïve listeners who had no prior experience with any form of alaryngeal speech were recruited to participate in the perceptual rating experiment. All listeners were adult native speakers of Cantonese.
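The waveform families named in the stimulus design above (impulse, square wave, sawtooth wave, sinusoidal wave) can be generated on a computer as sketched below. The sketch rests on stated assumptions and is not the authors' implementation: the source does not give the mixing weights, the noise level, or the output sampling rate, so equal weights, a noise standard deviation of 0.1, and a 20 kHz rate are used, and the names `waveform`, `gaussian_noise`, and `mix` are hypothetical.

```python
import math
import random

FS = 20000  # samples per second (assumed; equal to the stated recording rate)

def waveform(kind, f0, n, fs=FS):
    """Return n samples of one base waveform at integer frequency f0 (Hz)."""
    out = []
    for i in range(n):
        ticks = (i * f0) % fs      # exact integer phase accumulator
        p = ticks / fs             # position within the period, in [0, 1)
        if kind == "impulse":
            out.append(1.0 if ticks < f0 else 0.0)   # first sample of each period
        elif kind == "square":
            out.append(1.0 if p < 0.5 else -1.0)
        elif kind == "saw_up":
            out.append(2.0 * p - 1.0)                # rising ramp
        elif kind == "saw_down":
            out.append(1.0 - 2.0 * p)                # falling ramp
        elif kind == "sine":
            out.append(math.sin(2.0 * math.pi * p))
        else:
            raise ValueError(f"unknown waveform: {kind}")
    return out

def gaussian_noise(n, sigma=0.1, seed=0):
    """Zero-mean Gaussian noise; sigma and seed are illustrative choices."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n)]

def mix(*components):
    """Sum components sample by sample (equal weights) and scale the peak to 1."""
    summed = [sum(samples) for samples in zip(*components)]
    peak = max(abs(v) for v in summed)
    return [v / peak for v in summed] if peak > 0 else summed

# Two example combinations, 0.1 s each, at an example frequency of 108 Hz.
n = FS // 10
impulse_plus_noise = mix(waveform("impulse", 108, n), gaussian_noise(n))
impulse_plus_saw_down = mix(waveform("impulse", 108, n), waveform("saw_down", 108, n))
```

The integer phase accumulator keeps the impulse spacing exact for frequencies that do not divide the sampling rate evenly, which matters when stepping the driving frequency in small increments.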
The experimental procedure and protocol used in the present study were similar to those reported by Ng, Kwok, and Chow [6], in which listeners’ acceptability and intelligibility associated with different types of laryngectomized speech were assessed. Assessment of the acceptability and intelligibility of speech produced using the minishaker system was based on ratings for the following six attributes: (1) voice quality, (2) articulation proficiency, (3) quietness of speech, (4) pitch variation, (5) accuracy in tone production, and (6) overall speech intelligibility. During the experiment, the listeners were seated in a sound-treated room and speech stimuli were presented to them at a comfortable loudness level (about 56-70 dB SPL). The listeners were requested to rate the above attributes on a 1-7 equal-interval scale, with “1” representing the worst and “7” the best rating. The raters were instructed to circle the rating on an answer sheet provided before the experiment. To familiarize the raters with the procedure, practice trials were given before the experiment began. Speech acceptability and intelligibility were evaluated based on the ratings obtained.

III. RESULTS AND DISCUSSION

A. Five Best Driving Signals

After the first step of the perceptual experiment, ten driving signals associated with the best acceptability were shortlisted. They were: (1) impulse + Gaussian noise, (2) upward sawtooth wave + sinusoidal wave, (3) impulse only, (4) upward sawtooth wave + sinusoidal wave + Gaussian noise, (5) impulse + upward sawtooth wave, (6) impulse + downward sawtooth wave, (7) impulse + upward sawtooth wave + Gaussian noise, (8) impulse + square wave, (9) downward sawtooth wave only, and (10) impulse + downward sawtooth wave + Gaussian noise. After the second step of the perceptual experiment, the five best signals were obtained; they are shown in Table I.

TABLE I.
THE FIVE BEST DRIVING SIGNALS

Rank    Driving signal
1       Impulse only
2       Square wave + pulse train
3       Impulse + Gaussian noise
4       Downward sawtooth wave + pulse train
5       Impulse + downward sawtooth wave

From Table I, it is observed that impulse, Gaussian noise, and pulse train are common components of the EL sounds that listeners found more acceptable. These in fact resemble some of the defining characteristics of the human voice [14]. According to the myoelastic-aerodynamic theory of sound production [14], human speech sound is generated by the periodic opening and closing of the vocal cords. The successive, yet imperfect, opening and closing of the vocal cords results in trains of acoustic energy being generated and propagated along the supralaryngeal vocal tract. Such a sound production mechanism yields a sound that is pulsatile in nature, but the imperfect glottal vibration inevitably adds noise to the sound being produced. This might be why impulse, Gaussian noise, and pulse components are found in the five best driving signals, as they should sound more similar to the human voice.

B. Best Five Signal Frequencies

Using the best driving signal (impulse only), speech samples were produced by the speakers at different frequencies. After the perceptual experiment was completed, the five frequencies associated with the best listeners’ acceptability were obtained. The procedure was similar to the previous one, and the results of this experiment are shown in Table II.

TABLE II. THE FIVE BEST SIGNAL FREQUENCIES

Rank    Signal frequency (using impulse as driving signal)
1       108 Hz
2       105 Hz
3       111 Hz
4       120 Hz
5       140 Hz

As can be seen in Table II, the best frequency obtained was 108 Hz. This value appears similar to the values reported for normal male speakers in the literature [cf. 15, 16]. According to Baken [15], this frequency resembles the average voice fundamental frequency of age-matched normal adult males phonating with vocal cords.
This may be the reason why it was perceived with the highest listeners’ acceptability. However, as only male speakers were recruited, our data apply only to male EL speakers. The most favorable frequencies for female EL speakers are not known. Further studies should explore the use of the EL by female laryngectomized speakers, with a special focus on gender identity.

C. Differences between the minishaker system and the existing electrolarynx

Although the present minishaker system was used as an alternative sound source in a way similar to a conventional EL, marked differences are observed between the existing electrolarynx and the minishaker system. The most significant difference lies in the way sound is produced. In an electrolarynx, sound waves are created by the vibrating plunger continuously hammering the back of the vibratory membrane of the device, creating an impulse-type signal. The current minishaker system, however, creates sound by simply vibrating the plate of the exciter, with no impact involved. This allows easy manipulation of the vibratory characteristics of the minishaker through control of the input driving signal. However, the intensity associated with the minishaker system is unavoidably lower than that of an electrolarynx. The reduced intensity of the sound generated by the minishaker might have negatively affected the intelligibility and acceptability scores obtained in the study. More studies are needed to investigate the different sounding mechanisms of the EL and the minishaker system. A preliminary way of compensating for the reduced intensity of the current minishaker system may be to use a larger vibratory plate, as prototyped in Fig. 5. However, coupling a minishaker with a large vibratory plate to the neck of the user may be another challenge.

IV. CONCLUSIONS

A minishaker system was designed to simulate the working of an electrolarynx.
The flexibility in manipulating the input driving signal allowed us to vary the driving signal and identify the best driving signals, that is, the ones associated with the best listeners’ acceptability. The present study found that the best five driving signals are: (1) impulse only, (2) impulse + Gaussian noise, (3) pulse train + square wave, (4) downward sawtooth wave + pulse train, and (5) impulse + downward sawtooth wave. The best five frequencies are: (1) 108 Hz, (2) 105 Hz, (3) 111 Hz, (4) 120 Hz, and (5) 140 Hz.

Fig. 4 A schematic diagram showing how sound is generated by an electrolarynx.

Fig. 5 A schematic diagram showing how sound is generated by an electrolarynx.

ACKNOWLEDGMENT

The project was partially funded by the Innovation and Technology Commission, Hong Kong (ITS/517/09), General Research Funds, Hong Kong (CUHK 414010) and National Natural Science Foundation of China (NSFC 61135003, NSFC 90920002). We would like to thank all the participating laryngectomy patients.

REFERENCES

[1] International Association of Laryngectomees, Laryngectomized Speakers Source Book. International Association of Laryngectomees, 2000.
[2] New Voice Club of Hong Kong, The 13th Executive Committee Report (2004-2007). Hong Kong: The New Voice Club of Hong Kong, 2007.
[3] J. W. Lerman, “The artificial larynx,” in Alaryngeal Speech Rehabilitation - For Clinicians by Clinicians, S. J. Salmon and K. H. Mount, Eds. Austin, TX: Pro-Ed, Inc., 1991, pp. 27-46.
[4] S. Bennett and B. Weinberg, “Acceptability ratings of normal, esophageal, and artificial larynx speech,” J. Speech Lang. Hear. Res., vol. 16, pp. 608-615, 1973.
[5] G. Meltzner, R. E. Hillman, et al., “Electrolaryngeal speech - the state of the art and future directions for development,” in Contemporary Considerations in the Treatment and Rehabilitation of Head and Neck Cancer – Voice, Speech, and Swallowing, P. C. Doyle and R. L. Keith, Eds. Austin, TX: Pro-Ed, 2005.
[6] M. L. Ng, C.-L. I. Kwok, and S.-F. W. Chow, “Speech performance of adult Cantonese-speaking laryngectomees using different types of alaryngeal phonation,” J. Voice, vol. 11, pp. 338-344, 1997.
[7] G. S. Meltzner and R. E. Hillman, “Impact of abnormal acoustic properties on the perceived quality of electrolaryngeal speech,” in ISCA Tutorial and Research Workshop on Voice Quality: Functions, Analysis and Synthesis, 2003.
[8] G. S. Meltzner, J. B. Kobler, and R. E. Hillman, “Measuring the neck frequency response function of laryngectomy patients: Implications for the design of electrolarynx devices,” J. Acoust. Soc. Am., vol. 114, pp. 1035-1047, 2003.
[9] H. Liu and M. L. Ng, “Electrolarynx in voice rehabilitation,” Auris Nasus Larynx, vol. 34, pp. 327-332, 2007.
[10] Y. Qi and B. Weinberg, “Low-frequency energy deficit in electrolaryngeal speech,” J. Speech Lang. Hear. Res., vol. 34, pp. 1250-1256, 1991.
[11] M. S. Weiss, G. H. Yeni-Komshian, and J. M. Heinz, “Acoustical and perceptual characteristics of speech produced with an electronic artificial larynx,” J. Acoust. Soc. Am., vol. 65, pp. 1298-1308, 1979.
[12] C. Y. Espy-Wilson, V. R. Chari, et al., “Enhancement of electrolaryngeal speech by adaptive filtering,” J. Speech Lang. Hear. Res., vol. 41, pp. 1253-1264, 1998.
[13] R. L. Norton and R. S. Bernstein, “Improved laboratory prototype electrolarynx (LAPEL): Using inverse filtering of the frequency response function of the human throat,” Ann. Biomed. Eng., vol. 21, pp. 163-174, 1993.
[14] W. R. Zemlin, Speech and Hearing Science - Anatomy and Physiology. Needham Heights, MA: Allyn & Bacon, 1997.
[15] R. J. Baken, Clinical Measurement of Speech and Voice. San Diego, CA: Singular Publishing Group, 1996.