TECHNICAL ANNEX

COST Action 277 "Non-linear Speech Processing"

A. BACKGROUND

Source-filter models are almost always part of speech processing applications such as speech coding, synthesis, speech recognition, and speaker recognition technology. Usually, the filter is linear and based on linear prediction; the excitation for the linear filter is either left undefined, modelled as noise, described by a simple pulse train, or described by an entry from a large codebook. While this approach has led to great advances in the last 30 years, it neglects structure known to be present in the speech signal. In practical applications, this neglect manifests itself as an increased bit rate, less natural speech synthesis, and an inferior ability to discriminate speech sounds. Replacing the linear filter (or parts thereof) with non-linear operators (models) should enable us to obtain an accurate description of the speech signal with fewer parameters. This in turn should lead to better performance in practical speech processing applications.

From a physics and mathematics viewpoint, the traditional linear approach to speech modelling approximates the true non-linear physics of speech production via the standard assumptions of linear acoustics and 1D plane-wave propagation of sound in the vocal tract. Despite the technological success of the linear model in several applications, there is strong theoretical and experimental evidence for the existence of important non-linear 3D fluid dynamics phenomena during speech production that cannot be accounted for by the linear model. Examples of such phenomena include modulations of the speech airflow and turbulence. For the reasons mentioned above, there has been growing interest in the use of non-linear models in speech processing. Several published works clearly show that the potential for performance improvement through non-linear techniques is large.
Motivated by the high potential benefits of this technology, US researchers at well-known university and industrial laboratories are very active in this field. In Europe, the field has also attracted a significant number of researchers. However, no significant European collaborative efforts currently exist and the exchange of information among European laboratories in the field is negligible. European work will clearly benefit from an infrastructure which facilitates information exchange and collaboration and which, as a COST Action, will improve the quality of European research and ultimately strengthen the European telecommunications industry in the general area of speech processing and voice communications. The COST Action will improve information exchange and foster collaboration between researchers working on non-linear modelling techniques for the speech signal at both universities and industrial laboratories. This COST Action will also potentially underpin subsequent proposals for European research projects in this area.

The Action builds on results obtained in three COST Actions focussing on voice communications: COST 249 "Continuous Speech Recognition Over the Telephone", COST 250 "Speaker Recognition in Telephony", and COST 258 "The Naturalness of Synthetic Speech". These Actions concentrated on different application fields of speech processing technology, whereas the new Action concentrates on solving common methodological problems identified in the three previous Actions (all of which will have expired by the year 2001). This re-orientation of the research focus will benefit a broad range of practical speech-processing applications produced by European industry.

A.1 Brief review of non-linear techniques in speech processing

Non-linear speech processing is a rapidly growing area of research.
Naturally, it is difficult to define a precise date for the origin of the field, but it is not unreasonable to state that this rapid growth started in the mid-1980s. Since that time, numerous techniques ultimately aimed at engineering applications have been described. An excellent, fairly recent overview of these techniques is given in [Kubin95]. Inherent in the broad scope of non-linear methods is the large variety of methods found in the literature and the difficulty of classifying the techniques. Moreover, it is difficult to predict which techniques will ultimately be more successful. However, various forms of oscillators and non-linear predictors are commonly observed in the speech processing literature, the latter belonging to the more general class of non-linear autoregressive methods. The oscillator and autoregressive techniques are also closely related, since a non-linear autoregressive model in its synthesis form becomes a non-linear oscillator if no input is applied. For this reason we focus here on non-linear autoregressive models. For the practical design of non-linear autoregressive models, various approximations have been proposed [Kubin95]. They can be split into two main categories: parametric and nonparametric methods.

Parametric methods: Parametric methods are perhaps best exemplified by polynomial approximation (truncated Volterra series, with the special case of quadratic filters), locally linear models (including threshold autoregressive models), and state-dependent models. Another important group of parametric methods is based on neural networks: radial basis function approximations, multi-layer perceptrons, and recurrent neural networks.

Nonparametric methods: Nonparametric non-linear autoregressive methods also play an important role in non-linear speech processing.
Examples are Lorenz's method of analogues, perhaps the simplest of the various nearest-neighbour methods, a class which also includes non-linear predictive vector quantisation (codebook prediction). Another non-parametric approach is based on kernel-density estimates of the conditional expectation.

Speech fluid dynamics, modulation and fractal methods: Another class of non-linear speech processing methods comprises the various models and DSP algorithms proposed to analyse non-linear phenomena of the fluid dynamics type in the speech airflow during speech production. Such non-linear phenomena are described in [Teager89] and [Kaiser83]. The investigation of speech airflow non-linearities can proceed in at least two directions: (i) numerical simulation of the non-linear differential (Navier-Stokes) equations governing the 3D dynamics of the speech airflow in the vocal tract, and (ii) development of non-linear signal processing systems suitable for detecting such phenomena and extracting related information. The second direction has been followed by Maragos and his co-workers to model and detect modulations of the AM-FM type in speech resonances [Maragos93], to model and measure the degree of turbulence in speech sounds using fractals, and to apply related non-linear speech features to problems of speech recognition and speech vocoders.

B. OBJECTIVES AND BENEFITS

The ultimate objective of this Action is to improve voice services in telecommunication systems through the development of new non-linear speech processing techniques. The new technologies developed within the Action are to provide higher-quality speech synthesis, more efficient speech coding, improved speech recognition, and improved speaker identification and verification. The methods are expected to contribute significantly to the acceptance of voice interfaces for information systems such as the mobile Internet (through improved synthesis and recognition).
Furthermore, these methods are expected to lead to improved efficiency in future generations of speech coders used in wireless networks, including packet-based wireless networks. The Action intends to accomplish the stated goals by developing techniques based on non-linear processing. Since a large number of research groups have declared their interest in participating, the Action is expected to be able to cover the wide spectrum of speech processing applications described above. The COST Action will strengthen the cooperation between European researchers working on non-linear speech processing, thus increasing the efficiency of their research. The Action will achieve the stated advances with the following research strategies:

1. Speech Coding: The bit rate available for speech signals must be strictly limited in order to accommodate the constraints of the channel resource. For example, new low-rate speech coding algorithms are needed for interactive multimedia services on packet-switched networks such as the evolving mobile radio networks or the Internet, and non-linear speech processing offers a good alternative to conventional techniques. Voice transmission will have to compete with other services such as data/image/video transmission for the limited bandwidth resources allocated to an ever-growing, mobile network user base, and very low bit rate coding at consumer quality will see increasing demand in future systems.

2. Speech Synthesis: New telecommunication services include the capability of a machine to speak with a human in a "natural way"; to this end, much work remains to be done to improve the voice quality of current text-to-speech and concept-to-speech systems. The richness of the output signals of self-excited non-linear feedback oscillators will allow synthetic voices to be matched better to human voices.
In this area, the COST Action will build on results obtained on signal generation by COST Action 258 "The Naturalness of Synthetic Speech".

3. Speaker Identification and Verification: Security in transactions, information access, etc. is another important question to be addressed in the future, and speaker identification/verification is perhaps one of the most important biometric techniques, because of its feasibility for remote (telephone-based) recognition without additional hardware requirements. This line of work will build on results from COST Action 250 "Speaker Recognition in Telephony".

4. Speech Recognition: Speech recognition plays an increasingly important role in modern society. Non-linear techniques allow us to merge the feature extraction and classification problems and to include the dynamics of the speech signal in the model. This is likely to lead to significant improvements over current methods, which are inherently static.

The above four fields will receive major attention in the Action and will result in methods and systems which can be directly exploited in telecommunications applications. There are additional areas of interest which will receive partial coverage through the Action. They include Voice Analysis and Conversion, where the quality of the human voice is analysed (including clinical phonetics applications) and where techniques for manipulating the voice character of an utterance are developed, and Speech Enhancement, for the improvement of signal quality prior to further transmission and/or processing by man or machine. This line of work will build on results from COST Action 249 "Continuous Speech Recognition Over the Telephone". A special case is increasing Robustness to Background Noise and Channel Errors by joint source-channel coding schemes.

C. SCIENTIFIC PROGRAMME

The work of the Action will be to: 1.
Increase knowledge of the technologies currently available in the different participating countries, in order to pool efforts. 2. Define a set of problems and conditions to be addressed, and a set of experiments to compare current methods based on linear processing with non-linear techniques and to evaluate their potential. 3. Create liaisons with appropriate groups within COST, European and national projects, as well as ETSI and ITU-T committees, etc. 4. Improve the dissemination of research results and encourage communication between active European researchers in telecommunications and multimedia, using available electronic media (e-mail, web) as well as written reports. 5. Use workshops as the main form of cooperation, planning each of them as an important milestone on the way to achieving the main objectives.

Working groups

There are many problems in communications and signal processing which linear techniques have failed to address satisfactorily. It is a generally held belief that these problems may nevertheless have solutions in the growing field of non-linear signal processing. The recent rise of neural network concepts, for example, is largely fuelled by this promise. In addition, the past decade has seen a remarkable growth in the theory of the dynamics of non-linear systems. One cause of this interest has been the realisation that deterministic mathematical models with few degrees of freedom can generate extremely complex behaviour. Thus complicated physical systems may be well modelled by relatively simple non-linear models. This Action aims to advance non-linear signal processing techniques for speech in telecommunication systems by investigating non-linear models for solving speech processing problems.
For this purpose, the following working groups (WG) will be set up:

WG1: Speech Coding

In speech coding, it is possible to obtain good results using models based on linear predictive coding, since the residual can be coded with sufficient accuracy, given a high enough bit rate. However, it is also evident that some of the best results in terms of optimising both quality and bit rate are obtained from codec structures that contain some form of non-linearity. Analysis-by-synthesis coders fall into this category. For example, in CELP coders the closed-loop selection of the vector from the codebook can be seen as a data-dependent, non-linear mechanism. With a non-linear predictor, it may be possible to improve the long-term pitch predictor in analysis-by-synthesis coders. In a recent paper, it was reported that this long-term prediction contributed around 75% of the overall SNR of a typical CELP coder. It is therefore reasonable to expect that a non-linear predictor will contribute to improving this long-term prediction and hence the performance of the coder.

WG2: Speech Synthesis

Speech synthesis technology plays an important role in many aspects of man-machine interaction, particularly in telephony applications. The COST Action will focus on new techniques for the speech signal generation stage of a speech synthesiser, based on concepts from non-linear dynamical systems theory. To model the non-linear dynamics of speech, the one-dimensional speech time-domain signal is embedded into an appropriate higher-dimensional space. This reconstructed state-space representation has approximately the same dynamical properties as the original speech-generating system and is thus an effective model for speech synthesis. The Action will focus on systems that reproduce the natural dynamics of speech. This will involve constructing models which operate in the state-space domain, such as neural network architectures.
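The delay embedding described above can be sketched in a few lines of Python. The dimension and delay values, and the synthetic test signal, are illustrative assumptions rather than choices prescribed by the Action:

```python
import numpy as np

def delay_embed(x, dim=3, delay=10):
    """Reconstruct a state-space trajectory from a scalar signal x
    by the method of delays: each state vector collects `dim`
    samples of x spaced `delay` samples apart."""
    n = len(x) - (dim - 1) * delay
    if n <= 0:
        raise ValueError("signal too short for this embedding")
    return np.column_stack([x[i * delay : i * delay + n] for i in range(dim)])

# Example: embed a synthetic "voiced" signal (an amplitude-modulated sine).
t = np.arange(2000)
x = np.sin(2 * np.pi * 0.01 * t) * (1.0 + 0.2 * np.sin(2 * np.pi * 0.001 * t))
states = delay_embed(x, dim=3, delay=25)
print(states.shape)  # one 3-D state vector per usable sample
```

In practice the embedding delay and dimension are estimated from the data itself, for example with mutual-information and false-nearest-neighbour criteria, and a non-linear map fitted in the reconstructed space is then iterated to resynthesise speech.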
The speech synthesised by these methods is expected to be more natural-sounding than that produced by linear concatenation techniques, because the low-dimensional dynamics of the original signal are learnt, which means that phenomena such as inter-pitch jitter are automatically included in the model. In addition to generating high-quality speech, other associated tasks will also be addressed. The most important of these is to examine techniques for natural pitch modification which can be linked into the non-linear model.

WG3: Speech and speaker recognition

Many problems remain in continuous speech recognition. Much of this may be due to the static nature of the hidden Markov models (HMMs) used: they are unable to follow the dynamics of the speech between individual states. In addition, it is typically a series of mel-frequency cepstral coefficients which forms the acoustic feature vector used for classification, with first- and second-order differentials included to try to provide some continuity between frames. However, the cepstrum itself is fundamentally based on a linear speech model, and the inclusion of the differential terms is an unsatisfactory way to incorporate speech dynamics into the inherently static HMMs. The non-linear predictor may be able to perform the tasks of front-end feature extraction and low-level classification simultaneously. A simple recognition system can be envisaged in which each class (of phones, diphones, etc.) is characterised by a non-linear model. Then, given an input frame of speech, the sum of the prediction error residual over the frame can be used to decide which class the input speech belongs to. Thus the feature extraction and classification problems are merged and solved by one unit. Further, the dynamics of the speech signal may be included in the non-linear model, unlike in the HMM structure, which is inherently static.
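A toy version of this residual-based classification scheme can be sketched as follows. The per-class models here are quadratic (truncated second-order Volterra) predictors fitted by least squares, standing in for whichever non-linear model a real system would use, and the two synthetic "classes" are invented for illustration:

```python
import numpy as np

def quad_features(x, p=2):
    """Build second-order (Volterra-style) regressors from p past samples."""
    rows = []
    for n in range(p, len(x)):
        past = x[n - p : n]
        quad = np.outer(past, past)[np.triu_indices(p)]  # pairwise products
        rows.append(np.concatenate(([1.0], past, quad)))
    return np.array(rows), x[p:]

def fit_predictor(x, p=2):
    """Least-squares fit of a quadratic (truncated Volterra) predictor."""
    A, y = quad_features(x, p)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def residual_energy(x, w, p=2):
    """Summed squared one-step prediction error of model w on signal x."""
    A, y = quad_features(x, p)
    return float(np.sum((y - A @ w) ** 2))

# Two toy "classes" with different dynamics: a sine and a square wave.
rng = np.random.default_rng(0)
t = np.arange(1000)
class_a = np.sin(2 * np.pi * 0.02 * t) + 0.01 * rng.standard_normal(1000)
class_b = np.sign(np.sin(2 * np.pi * 0.02 * t)) + 0.01 * rng.standard_normal(1000)
models = {"a": fit_predictor(class_a), "b": fit_predictor(class_b)}

# Classify an unseen frame by the smallest prediction-residual energy.
frame = np.sin(2 * np.pi * 0.02 * (t[:200] + 5)) + 0.01 * rng.standard_normal(200)
label = min(models, key=lambda k: residual_energy(frame, models[k]))
print(label)
```

The frame is assigned to the class whose predictor explains it best, so feature extraction and classification are indeed performed by a single unit.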
This has been highlighted previously as a promising area to pursue for continuous speech recognition. For speaker recognition applications, it has been shown that the residual signal of a linear analysis contains enough information to enable human listeners to identify speakers. Thus, there is relevant information that is ignored by a linear analysis. Several papers have shown that it is possible to improve identification rates with a combination of linear and non-linear predictive models. Further, for both speech and speaker recognition there is growing experimental evidence from several research groups that using non-linear aeroacoustic features of the modulation or fractal type as input to HMM-based classifiers (in addition to the standard linear cepstrum features) leads to better recognition performance than using linear features alone. Thus, work on detecting such features and using them in recognition systems is very promising.

WG4: Voice Analysis and Enhancement

Speech application systems such as recognisers, synthesisers, and coders require some parametric representation of the signal. These parameters should reflect our understanding of speech production and speech perception mechanisms. For example, current recognition systems are not robust to noise or to variations of voice quality and speaker. Synthesisers suffer from poor and unnatural speech quality, and from a lack of flexibility in terms of changes in voice gender and the reflection of emotional states. Finally, coders at low bit rates perform poorly in terms of subjective quality. Moreover, at medium bit rates, an improvement in coding algorithms could be obtained if the effects of noise introduced by the transmission channel on the acoustic features of speech segments were better understood. One way to gain insight into how speech sounds are structured is to analyse the acoustic signal in a very detailed manner.
This procedure allows the definition of a number of acoustic attributes which characterise, in a significant way, the speech units of a language. However, as is well known, the acoustic attributes of a given sound vary as a function of many factors, among which the context in which the sound is embedded and the speaker have the strongest effects. Therefore, a careful analysis of the speech signal requires sophisticated experiments in which a large number of tokens of a given sound are recorded to form the experimental database. Accurate measurements of the acoustic parameters must be performed on all collected data, and their significance must be evaluated. However, even when doing so, a variety of factors such as speaking rate, emotional state, and suprasegmental features may be neglected. It should be noted, in addition, that the acoustic attributes also depend upon the language under examination, and thus findings obtained for one language can hardly be extended to other languages. All these aspects make the analysis complex and time-consuming, a feature which is often not compatible with the timescales of speech application system development.

The properties which are found to be significant at the acoustic level may or may not be so at the perceptual level. For example, an acoustic parameter such as the third formant (F3) may vary significantly from one vowel to another, and therefore exhibit values which are characteristic of a vowel, while perceptually the F3 information might be integrated with the information contained in F2. Therefore, the acoustic analysis must be supported and complemented by a perceptual analysis in order to evaluate the perceptual relevance of the acoustic attributes. The aim of this part of the project is to perform a set of experiments designed along these lines for a variety of speech sounds. Acoustic and perceptual analyses will be carried out on both consonants and vowels.
The output of these analyses will be the acoustic and perceptual attributes of classes of sounds. The work can be organised as follows: a database of vowels and consonants (extracted from running speech) from several languages will be analysed with the aim of finding a combination of acoustic properties which prove to be both speaker independent (the normalisation problem) and context independent (the coarticulation problem). These two problems are directly related to the more general issue of finding properties of the acoustic waveform which are invariant with respect to speaker, language, and phonetic-context variations. Normalisation, i.e. the problem of finding acoustic properties of vowels and consonants which prove to be speaker independent, can be based either on no a priori knowledge of the speaker's characteristics (intrinsic normalisation) or on the use of some a priori knowledge of each speaker's production system (extrinsic normalisation). In the first category of methods, the perception of a vowel or a consonant is supposed to depend only upon the signal itself and, in particular, upon a few parameters characterising it. In particular, the formant frequencies and the fundamental frequency are crucial factors in identifying a vowel, whereas formant transitions are crucial factors in identifying a consonant. Normalisation consists in finding a function of these parameters which proves to be invariant with respect to the speaker. Several functions with a normalising effect that have been proposed in the literature can be used. In the second category of methods (extrinsic normalisation), the perception of a vowel or a consonant is supposed to be largely influenced by the context in which it is included. In this case, an adaptation time for the specific speaker should be modelled.
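To make the intrinsic category concrete, one simple normalising function of the kind referred to above maps a token's formant and fundamental frequencies to log ratios, which are unchanged by a uniform frequency scaling between speakers. Both the transform and the frequency values below are illustrative assumptions, not the specific functions the Action would adopt:

```python
import math

def intrinsic_normalise(f0, f1, f2, f3):
    """Map raw frequencies (Hz) of one token to log ratios.
    A uniform scale factor applied to all four frequencies cancels,
    so the output is (approximately) speaker independent; for real
    speakers F0 does not scale exactly with the formants, so the
    cancellation is only approximate."""
    return (math.log(f1 / f0), math.log(f2 / f1), math.log(f3 / f2))

# Invented values for the "same" vowel from two speakers whose
# frequencies differ by a uniform 10% scale factor.
vowel_a = intrinsic_normalise(f0=120.0, f1=700.0, f2=1200.0, f3=2500.0)
vowel_b = intrinsic_normalise(f0=132.0, f1=770.0, f2=1320.0, f3=2750.0)
print(vowel_a)
print(vowel_b)
```

Under an exact uniform scaling the two tokens map to identical coordinates, which is precisely the speaker invariance sought in the normalisation problem.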
These two methods can be compared by testing the parameters selected for representing vowels and consonants on several classifiers and evaluating their performance. Coarticulation, i.e. the problem of finding acoustic properties of vowels and consonants which prove to be context independent, can be addressed by taking whole formant trajectories into account. The experiments suggested constitute a first step toward the definition of acoustic and perceptual features of vowel and consonant segments, and such features can play a fundamental role in the design of preprocessing algorithms, classification methods, and system structures for speech applications.

WG5: Dissemination

The research of each group will be very significant and helpful for the other groups, even where the latter are not experts in the specific subject of that group, so it is very important to arrange common activities and forums, as well as a website open to the whole scientific community, periodic electronic publications, etc. The dissemination group will be responsible for workshops, conferences, web pages, short-term staff exchanges between participating institutions, etc.

D. ORGANISATION AND TIMETABLE

D.1 Organisation, Management and Responsibilities

The work will be organised in working groups with regular reporting at workshops, which will allow for discussion in smaller groups. Working groups will be set up, each with responsibility for one of the applications described in the scientific programme, including a special working group responsible for dissemination of the results. The Management Committee will usually hold two meetings per year. The delegates of the Management Committee are the liaison officers for contacting national groups in the participating countries. Members of the Management Committee will form project groups working on specific issues as they arise.
Important elements in the dissemination process are workshops, seminars, and the production of reports and reference documents on non-linear speech processing, as well as the establishment of WWW pages and databases.

D.2 Timetable

The total duration of the Action will be 4 years. The activities and their durations are:

1. Election of chair, vice-chair, and working group coordinators, and initial planning: 1 meeting (MC+WG)
2. Action management: 8 half-day meetings (2 per year) for preparation and follow-up on the work plan (MC)
3. Exposition of current technologies and applications (seminar): 1 meeting (MC+WG)
4. Establishment of liaisons and experts network: 1 meeting (MC+WG)
5. Coordination of research, explanation of results, etc.: 8 meetings (MC+WG) (2 per year)
6. Dissemination of results; arrangement of seminars, development of printed and multimedia information: 7 meetings (WG5 only) (2 per year, except 1 in the first year)
7. Reviews of the Action: 2 meetings (MC+WG)

MC: Management Committee. WG: Working Groups.

[Chart: the numbered activities above scheduled half-year by half-year over the four years, showing the MC, MC+WG, and WG5 meetings.]

D.3 Dissemination

A special working group will be set up with responsibility for dissemination of the results, including the planning and organisation of workshops and seminars for the scientific community, such as:
▪ EUSIPCO (European Signal Processing Conference)
▪ EUROSPEECH (European Conference on Speech Communication and Technology)
▪ IWANN (International Workshop on Artificial and Natural Neural Networks)
▪ etc.
Thus, special sessions on non-linear speech processing are planned within these conferences, instead of setting up a new workshop series.
The number of these sessions will be at least one every two years. International experts will also be invited to these open workshops. This approach will achieve wider dissemination of the results than a specific new workshop series and, in order to minimise the effort of local arrangements for hotels, travel, etc., co-location with well-established conferences will be pursued. Another channel of dissemination addressed to the scientific community will be publications in IEEE, EURASIP, and similar journals, and possibly the publication, at the end of the Action, of a book entitled "Non-linear Speech Processing" with a publisher such as Springer-Verlag or Kluwer.

Economic progress depends largely on the degree of innovation applied to the business sector. The development of new processes, new products and new services leads to greater business competition on an international scale. The source of innovative solutions to today's needs lies in applying research results to the business sector, so this will be the second main target of our dissemination. Participants in the Action will attend events such as the international fair of emergent technologies, business and science (http://www.fitec.org). All working groups are responsible for the realisation of item 3 in the scientific programme (section C), and the results of evaluations of the Action will be taken into consideration in updating the dissemination plan during the course of the Action. The envisaged research work is complementary to other European efforts and COST Actions, and the group is able to work in parallel with other groups on related topics thanks to extensive experience of COST cooperation.
Members of the Action also participate in other European and national projects, and the focus on non-linear speech processing will complement very effectively research undertaken in COST Actions 249 and 250 and their continuations 276 and 278, so active collaboration will be explored and implemented.

E. ECONOMIC DIMENSION

The following countries have actively participated in the preparation of the Action or otherwise indicated their interest: Austria, Belgium, Denmark, Finland, France, Greece, Hungary, Ireland, Italy, Portugal, Spain, Sweden, Switzerland, and the United Kingdom.

Estimated number of signatories: 14
Estimated number of person-years per year and signatory involved in the Action: 2
Estimated cost per person-year (average of engineer/student): EUR 65 thousand

Economic dimension: assuming 14 signatories and a 4-year duration, the total economic dimension amounts to 4 × 14 × EUR 65 thousand ~ EUR 3.6 million.

This estimate is valid under the assumption that all the countries mentioned will participate in the Action. Any departure from this will change the total cost accordingly.

Appendix: REFERENCES

[Kaiser83] J. F. Kaiser, "Some Observations on Vocal Tract Operation from a Fluid Flow Point of View", in Vocal Fold Physiology: Biomechanics, Acoustics, and Phonatory Control, I. R. Titze and R. C. Scherer (Eds.), The Denver Center for the Performing Arts, Denver, CO, pp. 358-386, 1983.

[Kubin95] G. Kubin, "Non-linear Processing of Speech", Chapter 16 in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal (Eds.), Elsevier, 1995.

[Maragos93] P. Maragos, J. F. Kaiser, and T. F. Quatieri, "Energy Separation in Signal Modulations with Application to Speech Analysis", IEEE Trans. Signal Processing, vol. 41, pp. 3024-3051, Oct. 1993.

[Teager89] H. M. Teager and S. M. Teager, "Evidence for Non-linear Sound Production Mechanisms in the Vocal Tract", in Speech Production and Speech Modelling, W. J. Hardcastle and A.
Marchal (Eds.), NATO Advanced Study Institute Series D, vol. 55, Bonas, France, July 1989.