Non-linear Speech Processing

TECHNICAL ANNEX
COST Action 277
"Non-linear Speech Processing"
A. BACKGROUND
Source-filter models are almost always part of speech processing applications such as speech
coding, synthesis, speech recognition, and speaker recognition technology. Usually, the filter
is linear and based on linear prediction; the excitation for the linear filter is either left
undefined, modelled as noise, described by a simple pulse train, or described by an entry from
a large codebook. While this approach has led to great advances in the last 30 years, it
neglects structure known to be present in the speech signal. In practical applications, this
neglect manifests itself as an increase in bit rate, a less natural speech synthesis, and an
inferior discriminating ability in speech sounds. The replacement of the linear filter (or parts
thereof) with non-linear operators (models) should enable us to obtain an accurate description
of the speech signal with a lower number of parameters. This in turn should lead to better
performance of practical speech processing applications.
From a physics and mathematics viewpoint, in the traditional linear approach to speech
modelling the true non-linear physics of speech production are approximated via the standard
assumptions of linear acoustics and 1D plane wave propagation of the sound in the vocal
tract. Despite the technological success of the linear model in several applications,
there is strong theoretical and experimental evidence for the existence of important non-linear
3D fluid dynamics phenomena during speech production that cannot be accounted for by
the linear model. Examples of such phenomena include modulations of the speech airflow
and turbulence.
For the reasons mentioned above, there has been a growing interest in the usage of non-linear
models in speech processing. Several works have been published that clearly show that the
potential for performance improvement through the usage of non-linear techniques is large.
Motivated by the high potential benefits of this technology, US researchers at well-known
university and industrial laboratories are very active in this field. In Europe, the field has also
attracted a significant number of researchers. However, no significant European collaborative
efforts currently exist and the information exchange among European laboratories in the field
is negligible. The European work will clearly benefit from an infrastructure which facilitates
information exchange and collaboration and, as a COST Action, will improve the quality of
European research and ultimately result in a strengthening of the European
COST 277/Annex/en 1
telecommunications industry in the general area of speech processing and voice
communications.
The COST Action will improve the information exchange and foster collaboration between
researchers working on non-linear modelling techniques for the speech signal at both
universities and industrial laboratories. This COST Action will also potentially underpin
subsequent proposals for European research projects in this area. The Action builds on
results obtained in three COST Actions focussing on voice communications, i.e. COST 249
"Continuous Speech Recognition Over the Telephone", COST 250 "Speaker Recognition in
Telephony", and COST 258 "The Naturalness of Synthetic Speech". These actions
concentrated on different application fields of speech processing technology whereas the new
action concentrates on solving common methodological problems identified in the three
previous actions (which will all have expired by the year 2001). This re-orientation of the
research focus will benefit a broad range of practical speech-processing applications produced
by European industry.
A.1 Brief review of non-linear techniques in speech processing
Non-linear methods for speech processing constitute a rapidly growing area of research. Naturally, it
is difficult to define a precise date for the origin of the field, but it is not unreasonable to state
that this rapid growth started in the mid nineteen eighties. Since that time, numerous
techniques which are ultimately aimed at engineering applications have been described. An
excellent, fairly recent overview of these techniques is given in [Kubin95].
Inherent in the broad scope of non-linear methods is the large variety of methods found in the
literature and a difficulty in classifying the techniques. Moreover, it is difficult to predict
which techniques ultimately will be more successful. However, commonly observed in the
speech processing literature are various forms of oscillators and non-linear predictors, the
latter being part of the more general class of non-linear autoregressive methods. The
oscillator and autoregressive techniques themselves are also closely related, since a non-linear
autoregressive model in its synthesis form becomes a non-linear oscillator if no input is applied.
For this reason we focus here on non-linear autoregressive models. For the practical design of
non-linear autoregressive models, various approximations have been proposed [Kubin95].
They can be split into two main categories: parametric and nonparametric methods.
Parametric methods:
Parametric methods are perhaps best exemplified by the polynomial approximation (truncated
Volterra series with the special case of quadratic filters), locally linear models, including
threshold autoregressive models, and state dependent models. Another important group of
parametric methods is based on neural nets: radial basis functions approximations, multi-layer
perceptrons and recurrent neural nets.
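To make the neural-network branch concrete, the following is a minimal sketch of a radial basis function non-linear predictor, one of the parametric approximations listed above. The function names and the parameter choices (model order, number of centres, kernel width, random selection of centres) are illustrative assumptions, not a prescribed design:

```python
import numpy as np

def rbf_predict_train(x, order=4, n_centres=16, width=0.5, seed=0):
    """Fit an RBF non-linear predictor x(n) ~ sum_k w_k exp(-||v(n)-c_k||^2/(2 width^2)),
    where v(n) = [x(n-1), ..., x(n-order)] is the lag vector."""
    rng = np.random.default_rng(seed)
    # Build lagged regressor vectors and their targets.
    V = np.array([x[i - order:i][::-1] for i in range(order, len(x))])
    y = x[order:]
    # Pick centres as randomly chosen training vectors (a common simple choice).
    centres = V[rng.choice(len(V), n_centres, replace=False)]
    # Design matrix of Gaussian basis activations.
    d2 = ((V[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2 * width ** 2))
    # The model is linear in the weights, so least squares solves the output layer.
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return centres, w

def rbf_predict(x, centres, w, order=4, width=0.5):
    """One-step-ahead predictions for all samples with a full lag vector."""
    V = np.array([x[i - order:i][::-1] for i in range(order, len(x))])
    d2 = ((V[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * width ** 2)) @ w
```

Run in synthesis form (feeding predictions back as inputs), the same model acts as the non-linear oscillator mentioned earlier.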
Nonparametric methods:
Nonparametric non-linear autoregressive methods also play an important role in non-linear
speech processing. Examples are Lorenz's method of analogues, which is perhaps the simplest
of the various nearest-neighbour methods, a class which also includes non-linear predictive
vector quantisation or codebook prediction. Another non-parametric approach is based on
kernel-density estimates of the conditional expectation.
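As an illustration of the nearest-neighbour family, the method of analogues can be sketched in a few lines; the function name, the model order, and the choice k=1 are hypothetical:

```python
import numpy as np

def analogue_predict(x, order=3, k=1):
    """Predict the next sample as the (averaged) successor of the k past
    lag-vectors closest to the current one (Lorenz's method of analogues)."""
    x = np.asarray(x, dtype=float)
    V = np.array([x[i - order:i] for i in range(order, len(x))])  # past histories
    y = x[order:]                                                 # their successors
    query = x[-order:]                                            # current history
    dist = np.linalg.norm(V - query, axis=1)
    nearest = np.argsort(dist)[:k]
    return float(y[nearest].mean())
```

For a periodic signal the nearest analogue matches exactly and the prediction is perfect; for speech the quality depends on the embedding order and the amount of past data.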
Speech Fluid Dynamics, Modulation and Fractal Methods:
Another class of non-linear speech processing methods includes the various models and DSP
algorithms proposed to analyse non-linear phenomena of the fluid dynamics type in the
speech airflow during speech production. Such non-linear phenomena are described in
[Teager89] and [Kaiser83]. The investigation of the speech airflow non-linearities can
proceed in at least two directions: (i) numerical simulations of the non-linear differential
(Navier-Stokes) equations governing the 3D dynamics of the speech airflow in the vocal
tract, and (ii) development of non-linear signal processing systems suitable to detect such
phenomena and extract related information. The second direction has been followed by
Maragos and his coworkers to model and detect modulations in speech resonances of the
AM-FM type [Maragos93], to model and measure the degree of turbulence in speech sounds
using fractals, and to apply related non-linear speech features to problems of speech
recognition and speech vocoders.
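The modulation analyses cited above are built on the Teager-Kaiser energy operator, Psi[x](n) = x(n)^2 - x(n-1)x(n+1), and on energy separation algorithms derived from it. A minimal sketch of the operator and of a DESA-2-style energy separation follows; the index alignment and the assumption of a clean, genuinely oscillating narrowband input are ours:

```python
import numpy as np

def teager(x):
    """Teager-Kaiser energy operator Psi[x](n) = x(n)^2 - x(n-1)x(n+1)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x):
    """DESA-2-style energy separation: instantaneous frequency (rad/sample)
    and amplitude envelope of a narrowband AM-FM signal.
    Assumes Psi[x] and Psi[y] stay strictly positive (no zero crossings of energy)."""
    x = np.asarray(x, dtype=float)
    psi_x = teager(x)                 # Psi[x], aligned to n = 1..N-2
    y = x[2:] - x[:-2]                # symmetric difference, aligned to n = 1..N-2
    psi_y = teager(y)                 # Psi[y], aligned to n = 2..N-3
    psi_x = psi_x[1:-1]               # align Psi[x] with Psi[y]
    omega = 0.5 * np.arccos(1.0 - psi_y / (2.0 * psi_x))
    amp = 2.0 * psi_x / np.sqrt(psi_y)
    return omega, amp
```

For a pure tone A cos(Omega n) the operator yields the constant A^2 sin^2(Omega), from which the separation recovers Omega and A exactly.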
B. OBJECTIVES AND BENEFITS
The ultimate objective of this Action is to improve the voice services in telecommunication
systems through the development of new non-linear speech processing techniques.
The new technologies developed within the Action are to provide higher quality speech
synthesis, more efficient speech coding, improved speech recognition, and improved speaker
identification and verification. The methods are expected to contribute significantly to the
acceptance of voice interfaces for information systems such as the mobile Internet (by
improved synthesis and recognition). Furthermore, these methods are expected to lead to
improved efficiency in future generations of speech coders used in wireless networks,
including packet-based wireless networks.
The Action intends to accomplish the stated goals by developing techniques based on
non-linear processing. Since a large number of research groups have declared their interest in
participating, the Action is expected to be able to cover the wide spectrum of speech
processing applications described above. The COST Action will strengthen the cooperation
between European researchers working on non-linear speech processing, thus increasing the
efficiency of their research.
The Action will achieve the stated advances with the following research strategies:
1. Speech Coding: The bit rate available for speech signals must be strictly limited in order
to accommodate the constraints of the channel resource. For example, new low-rate
speech coding algorithms are needed for interactive multimedia services on
packet-switched networks such as the evolving mobile radio networks or the Internet,
and non-linear speech processing offers a good alternative to conventional techniques.
Voice transmission will have to compete with other services such as data/image/video
transmission for the limited bandwidth resources allocated to an ever growing, mobile
network user base, and very low bit rate coding at consumer quality will see increasing
demand in future systems.
2. Speech Synthesis: New telecommunication services include the capability of a machine
to speak with a human in a "natural way"; to this end, much work must be done in
order to improve the current voice quality of text-to-speech and concept-to-speech
systems. The richness of the output signals of self-excited non-linear feedback
oscillators will make it possible to match synthetic voices better to human voices. In this area, the
COST Action will build on results obtained on signal generation by COST Action 258
"The Naturalness of Synthetic Speech".
3. Speaker Identification and Verification: Security in transactions, information access,
etc. is another important question to be addressed in the future, and speaker
identification/verification is perhaps one of the most important biometric technologies,
because of its feasibility for remote (telephonic) recognition without additional
hardware requirements. This line of work will build on results from COST Action 250
"Speaker Recognition in Telephony".
4. Speech Recognition: Speech recognition plays an increasingly important role in modern
society. Non-linear techniques allow us to merge the feature extraction and classification
problems and to include the dynamics of the speech signal in the model. This is likely to
lead to significant improvements over current methods which are inherently static.
The above four fields will receive major attention in the Action and will result in methods and
systems which can directly be exploited in telecommunications applications. There are
additional areas of interest which will receive partial coverage through the Action. They
include Voice Analysis and Conversion where the quality of the human voice is analysed
(including clinical phonetics applications) and where techniques for the manipulation of the
voice character of an utterance are developed.
They also include Speech Enhancement, for the improvement of signal quality prior to further transmission
and/or processing by man or machine. This line of work will build on results from COST
Action 249 "Continuous Speech Recognition Over the Telephone". A special case is the
increase of Robustness to Background Noise and Channel Errors by joint source-channel
coding schemes.
C. SCIENTIFIC PROGRAMME
The work of the Action will be to:
1. Increase the knowledge of the current technologies available in the different participating
countries, in order to bundle efforts.
2. Define a set of problems and conditions to be addressed, and a set of experiments in
order to compare current linear-processing based methods to non-linear techniques, and
to evaluate their potential.
3. Create liaisons to appropriate groups within COST, European and national projects, as
well as ETSI and ITU-T committees etc.
4. Improve the dissemination of research results and encourage communication
between active European researchers in telecommunication and multimedia, using
available electronic media (e-mail, web) as well as written reports.
5. Use workshops as the main form of cooperation, planning each one as an important
milestone in the progress towards the main objectives.
Working groups
There are many difficulties in communications and signal processing algorithms which linear
techniques have failed to address satisfactorily. It is a generally held belief that these
problems may however have solutions in the growing field of non-linear signal processing.
The recent rise in neural network concepts is, for example, largely fuelled by this promise. In
addition the past decade has seen a remarkable growth in the theory of the dynamics of
non-linear systems. One cause of this interest has been the realisation that deterministic
mathematical models with few degrees of freedom can generate extremely complex
behaviour. Thus complicated physical systems may be well-modelled by relatively simple
non-linear models. This Action aims to advance non-linear signal processing techniques
for speech in telecommunication systems by investigating non-linear models for solving
speech processing problems. For this purpose, the following working groups (WG) will be
set up:
WG1: Speech Coding
In speech coding, it is possible to obtain good results using models based on linear predictive
coding, since the residual can be coded with sufficient accuracy, given a high enough bit rate.
However, it is also evident that some of the best results in terms of optimising both quality
and bit rate are obtained from codec structures that contain some form of non-linearity.
Analysis-by-synthesis coders fall into this category. For example, in CELP coders the
closed-loop selection of the vector from the codebook can be seen as a data-dependent,
non-linear mechanism.
With a non-linear predictor, it may be possible to improve the long-term pitch predictor in
analysis-by-synthesis coders. In a recent paper, it was reported that this long-term prediction
contributed around 75% to the overall SNR of a typical CELP coder. Therefore it is
reasonable to expect that the non-linear predictor will contribute to improving this long-term
prediction and hence the performance of the coder.
WG2: Speech Synthesis
Speech synthesis technology plays an important role in many aspects of man-machine
interaction, particularly in telephony applications. The COST Action will focus on new
techniques for the speech signal generation stage in a speech synthesiser based on concepts
from non-linear dynamical theory.
To model the non-linear dynamics of speech, the one-dimensional time-domain speech signal
is embedded into an appropriate higher-dimensional space. This reconstructed state space
representation has approximately the same dynamical properties as the original speech
generating system and is thus an effective model for speech synthesis.
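The embedding step described here is the standard delay-coordinate (Takens) reconstruction. A minimal sketch follows; the embedding dimension and delay are illustrative values that in practice are estimated from the data (e.g. via false nearest neighbours and mutual information):

```python
import numpy as np

def delay_embed(x, dim=3, tau=7):
    """Reconstruct a state-space trajectory from a scalar signal by delay
    embedding: v(n) = [x(n), x(n-tau), ..., x(n-(dim-1)*tau)]."""
    x = np.asarray(x, dtype=float)
    start = (dim - 1) * tau
    # Each column is the signal shifted by a multiple of tau.
    return np.stack([x[start - k * tau: len(x) - k * tau] for k in range(dim)],
                    axis=1)
```

A model fitted to these reconstructed state vectors, rather than to the raw waveform, is what allows the synthesiser to reproduce the low-dimensional dynamics of the original speech.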
The Action will focus on systems that will reproduce the natural dynamics of speech. This
will involve constructing models which operate in the state space domain, such as neural
network architectures. The speech synthesised by these methods will be more
natural-sounding than that of linear concatenation techniques, because the low-dimensional
dynamics of the original signal are learnt, which means that phenomena such as inter-pitch
jitter are automatically included in the model. In addition to generating high quality speech, other
associated tasks will also be addressed. The most important of these is to examine techniques
for natural pitch modification which can be linked into the non-linear model.
WG3: Speech and Speaker Recognition
Many problems remain in continuous speech recognition. Much of this may
be due to the static nature of the hidden Markov models (HMM) used: they are unable to
follow the dynamics of the speech between individual states. In addition, it is typically a
series of mel-frequency cepstral coefficients which form the acoustic feature vector used for
classification, with the inclusion of first- and second-order differentials to try to provide some
continuity between frames. However, the cepstrum itself is fundamentally based on a linear
speech model, and the inclusion of the differential terms is an unsatisfactory method to
attempt to include the speech dynamics into the inherently static HMMs.
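For reference, the first- and second-order differentials mentioned here are usually computed by linear regression over a window of neighbouring frames; the half-width N=2 and the edge padding below are common conventions, assumed here rather than prescribed by the text:

```python
import numpy as np

def delta(c, N=2):
    """Regression-based delta coefficients over a window of +-N frames,
    as commonly appended to cepstral feature vectors. `c` has shape
    (frames, coefficients); edges are handled by repeating the boundary frame."""
    c = np.asarray(c, dtype=float)
    pad = np.pad(c, ((N, N), (0, 0)), mode='edge')
    num = sum(n * (pad[N + n: len(c) + N + n] - pad[N - n: len(c) + N - n])
              for n in range(1, N + 1))
    return num / (2 * sum(n * n for n in range(1, N + 1)))
```

The sketch makes the criticism above tangible: the deltas are a fixed linear filter over a short window, so they capture only a crude, frame-rate approximation of the signal dynamics.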
The non-linear predictor may be able to perform the tasks of front-end feature extractor and
low-level classifier simultaneously. A simple recognition system can be envisaged where
each class (which could be of phones, diphones, etc) can be characterised by a non-linear
model. Then, given an input frame of speech, it will be possible to use the sum of the error
residual from the predictor over the frame to decide which class the input speech belongs to.
Thus the feature extraction and the classification problem are merged together and solved by
one unit. Further, the dynamics of the speech signal may be included in the non-linear model,
unlike the HMM structure which is inherently static. This has been highlighted previously as
a promising area to pursue for continuous speech recognition.
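The scheme sketched in this paragraph can be illustrated with a truncated-Volterra (quadratic) predictor per class and a minimum-residual decision; the function names, the second-order model, and the toy training setup are our assumptions:

```python
import numpy as np

def volterra_features(x):
    """Quadratic (second-order Volterra) regressors built from two past samples."""
    x1, x2 = x[1:-1], x[:-2]
    return np.stack([x1, x2, x1 * x1, x1 * x2, x2 * x2], axis=1)

def train_class_model(frames):
    """Least-squares quadratic predictor for one class from its training frames."""
    Phi = np.concatenate([volterra_features(f) for f in frames])
    y = np.concatenate([f[2:] for f in frames])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def classify(frame, models):
    """Assign the frame to the class whose predictor leaves the smallest
    residual energy, merging feature extraction and classification."""
    errs = [np.sum((frame[2:] - volterra_features(frame) @ w) ** 2)
            for w in models]
    return int(np.argmin(errs))
```

Each class model doubles as a low-level classifier: the frame is explained best, i.e. with the smallest prediction residual, by the predictor trained on its own class.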
For speaker recognition applications it has been shown that the residual signal of a linear
analysis contains enough information to enable human listeners to identify speakers. Thus,
there is relevant information that is ignored with a linear analysis. Several papers have shown
that it is possible to improve the identification rates with a combination of linear and
non-linear predictive models.
Further, for both speech and speaker recognition there is growing experimental evidence by
several research groups that using non-linear aeroacoustic features of the modulation or
fractal type as input to HMM-based classifiers (in addition to the standard cepstrum linear
features) leads to better recognition performance than using only linear features. Thus, work
on detecting such features and using them in recognition systems is very promising.
WG4: Voice Analysis and Enhancement
Speech application systems such as recognisers, synthesisers, and coders require some
parametric representation of the signal. These parameters should reflect our understanding of
speech production and speech perception mechanisms. For example, current recognition
systems are not robust to noise and to variations of voice quality and speaker.
Synthesisers suffer from poor and unnatural speech quality, and from a
lack of flexibility in terms of changes in voice gender and reflection of emotional states.
Finally, coders at low bit rates perform poorly in terms of subjective quality. Moreover, at
medium bit rates, an improvement in coding algorithms could be obtained if the effects of
noise introduced by the transmission channel on the acoustic features of speech segments
were better understood.
One way of gaining insight into how speech sounds are structured is to analyse the acoustic
signal in a very detailed manner. This procedure allows the definition of a number of acoustic
attributes, which characterise, in a significant way, the speech units of a language. However,
as is well known, the acoustic attributes of a given sound vary as a function of many factors,
among which the context in which the sound is embedded, and the speaker, have the strongest
effects. Therefore, a careful analysis of the speech signal requires sophisticated experiments
in which a large number of tokens of a given sound are recorded to form the experimental
data base. Accurate measurements of the acoustic parameters must be performed on all
collected data, and their significance must be evaluated. However, even when doing so, a
variety of factors such as speaking rate, emotional state, and suprasegmental features may be
neglected. It should be noted, in addition, that the acoustic attributes also depend upon the
language under examination, and thus the findings obtained on one language can hardly be
extended to other languages. All these aspects make the analysis complex and
time-consuming, a feature which is often not compatible with the timing of speech application
systems development.
The properties which are found to be significant at the acoustic level may or may not be so at
the perceptual level. For example, an acoustic parameter such as the third formant (F3) may
significantly vary from one vowel to another, and therefore exhibit values which are peculiar
to a vowel, while perceptually the F3 information might be integrated with the information
contained in F2. Therefore, the acoustic analysis must be supported and integrated by a
perceptual analysis in order to evaluate the perceptual relevance of the acoustic attributes.
The aim of this part of the project is to perform a set of experiments designed in the above
view for a variety of speech sounds. Acoustic and perceptual analyses will be carried out
on both consonants and vowels. The output of these analyses is the set of
acoustic and perceptual attributes of classes of sounds.
The work can be organised as follows: a database of vowels and consonants (extracted from
running speech) from several languages will be analysed with the aim of finding a combination
of acoustic properties which prove to be both speaker independent (normalisation problem)
and context independent (coarticulation problem). It is clear that these two problems are
directly related to a similar issue that consists in finding properties in the acoustic waveform
which are invariant with respect to speaker, language, and phonetic context variations.
Normalisation, i.e. the problem of finding acoustic properties of vowels and consonants which
prove to be speaker independent, can be based on either no a-priori knowledge on speaker's
characteristics (intrinsic normalisation), or on the use of some a-priori knowledge on each
speaker's production system (extrinsic normalisation). In the first category of methods, the
perception of a vowel or a consonant is supposed to depend only upon the signal itself
and, in particular, upon a few parameters characterising it. The formant
frequencies and the fundamental frequency are crucial factors in identifying a vowel, whereas
formant transitions are crucial factors in identifying a consonant. The normalisation consists
in finding a function of these parameters which proves to be invariant with respect to the
speaker. Several functions, having a normalisation effect, which have been proposed in
literature can be used. In the second category of methods (extrinsic normalisation), the
perception of a vowel or a consonant is supposed to be largely influenced by the context in
which it is included. In this case an adaptation time for the specific speaker should be
modelled. These two methods can be compared by testing the parameters selected for
representing vowels and consonants on several classifiers and evaluating their performance.
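One simple member of the first (intrinsic) family of normalisation functions, assumed here purely for illustration, uses log formant ratios, which are invariant under a uniform frequency scaling of a speaker's formant pattern:

```python
import numpy as np

def intrinsic_normalise(f0, f1, f2, f3):
    """Log formant-ratio features (an illustrative intrinsic normalisation):
    a uniform scaling of all frequencies, here applied to F0-F3 alike as a
    simplifying assumption, leaves the feature vector unchanged."""
    return np.array([np.log(f1 / f0), np.log(f2 / f1), np.log(f3 / f2)])
```

Real normalisation functions proposed in the literature are more elaborate (e.g. auditory-scale differences), but they share this goal of cancelling speaker-dependent scale factors.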
Co-articulation, i.e. the problem of finding acoustic properties of vowels and consonants
which prove to be context independent, can be addressed by taking the whole formant
trajectories into account.
The experiments suggested constitute a first step toward the definition of acoustic and
perceptual features of vowel and consonant segments and such features can play a
fundamental role in the design of preprocessing algorithms, classification methods, and
system structure for speech applications.
WG5: Dissemination
The research of each group will be significant and helpful for the other groups, even when
the latter are not experts in that group's specific subject. It is therefore very important to
arrange common activities and forums, as well as a website open to the whole scientific
community, periodic electronic publications, etc.
The dissemination group will be responsible for workshops, conferences, WWW pages,
short-term staff exchange between participating institutions, etc.
D. ORGANISATION AND TIMETABLE
D.1 Organisation, Management and Responsibilities
The work will be organised in working groups with regular reporting at workshops, which
will allow for discussion in smaller groups. Working groups will be set up, each with
responsibility for one of the applications described in the scientific programme, including a
special working group responsible for dissemination of the results.
The Management Committee will usually have 2 annual meetings. The delegates of the
Management Committee are the liaison officers for contacting national groups in the
participating countries. Members of the Management Committee will form project groups
working on specific issues as they arise.
Important elements in the dissemination process are workshops, seminars, and the production
of reports and reference documents on Non-linear Speech Processing, as well as the
establishment of WWW pages and databases.
D.2 Timetable
The total duration of the Action will be 4 years.
ACTIVITY AND DURATION
1. Election of chair, vice-chair, and working group coordinators and initial planning:
   1 meeting (MC+WG)
2. Action Management: 8 half-day meetings (2 per year) for preparation and follow-up
   on the work plan (MC)
3. Exposition of current technologies and applications (Seminar): 1 meeting (MC+WG)
4. Establishment of liaisons and experts network: 1 meeting (MC+WG)
5. Coordination of research, explanation of results, etc.: 8 meetings (MC+WG) (2 per year)
6. Dissemination of results; arrangement of seminars, development of printed and
   multimedia information: 7 meetings (WG5 only) (2 per year, except 1 in the first year)
7. Reviews of the Action: 2 meetings (MC+WG)
MC: Management Committee
WG: Working Groups
COST 277/Annex/en 11
The organisation over time is as follows (activity numbers refer to the table above):
1st year: activities 1, 3, and 2 in the first six months; activities 5, 2, and 6 in the
second six months.
2nd year: activities 5, 4, and 2 in the first six months; activities 5, 2, and 6 in the
second six months.
3rd and 4th years: activities 5 and 2 in the first six months; activities 6, 5, 2, and 6 in
the second six months, together with activity 7 (reviews of the Action).
D.3 Dissemination
A special working group will be set up with responsibility for dissemination of the results
including the planning and organisation of workshops and seminars for the scientific
community, such as:
▪ EUSIPCO (European Signal Processing Conference)
▪ EUROSPEECH (European Conference on Speech Communication and Technology)
▪ IWANN (International Workshop on Artificial and Natural Neural Networks)
▪ etc.
Thus, special sessions on non-linear speech processing are planned within these
conferences, instead of setting up a new workshop series. There will be at least one such
session every two years, and international experts will also be invited to these open
sessions. This approach will make it possible to achieve a wider dissemination of the
results than a dedicated new workshop series; moreover, in order to minimise the effort
of local arrangements for hotels, travel, etc., co-location with well-established
conferences will be pursued.
Another channel of dissemination addressed to the scientific community will be
publications in IEEE, EURASIP, etc. magazines, and possibly the publication of a book
entitled "Non-linear Speech Processing", published by Springer Verlag, Kluwer, or a
similar publisher, at the end of the Action.
Economic progress depends largely on the degree of innovation in the business sector.
The development of new processes, new products, and new services leads to greater business
competition on an international scale. Innovative solutions to today's needs require
applying research results to the business sector, so this will be the second main target
of our dissemination. Participants in the Action will attend events such as the international
fair of emergent technologies, business and science (http://www.fitec.org).
All working groups are responsible for the realisation of item 3 in the scientific programme
(section C) and the results of evaluations of the Action will be taken into consideration in
updating the dissemination plan during the course of the Action.
The envisaged research work is quite complementary to other European efforts and COST
Actions, and the group is able to work in parallel with other groups on related topics due to
extensive experience in COST cooperation. Members of the Action also participate in other
European and national projects and the focus on non-linear speech processing will
complement very effectively the research undertaken in COST Actions 249 and 250 and their
continuations 276 and 278, so active collaboration will be explored and implemented.
E. ECONOMIC DIMENSION
The following countries have actively participated in the preparation of the Action or
otherwise indicated their interest: Austria, Belgium, Denmark, Finland, France, Greece,
Hungary, Ireland, Italy, Portugal, Spain, Sweden, Switzerland, and the United Kingdom.
Estimated number of signatories: 14
Estimated number of person-years per year and signatory involved in the Action: 2
Estimated cost per person-year (average of engineer/student): EUR 65 thousand
Economic dimension:
Assuming 14 signatories and 4 years duration, the total economic dimension amounts to
4 × 14 × EUR 65 thousand ≈ EUR 3.6 million
This estimate is valid under the assumption that all the countries mentioned will participate in
the Action. Any departure from this will change the total cost accordingly.
Appendix: REFERENCES
[Kaiser83] J. F. Kaiser, "Some Observations on Vocal Tract Operation from a Fluid Flow Point of
View", in Vocal Fold Physiology: Biomechanics, Acoustics, and Phonatory Control, I. R. Titze and
R. C. Scherer (Eds.), The Denver Center for the Performing Arts, Denver, CO, pp.358-386, 1983.
[Kubin95] G. Kubin, "Nonlinear Processing of Speech", Chapter 16 in Speech Coding and Synthesis,
W. B. Kleijn and K. K. Paliwal (Eds.), Elsevier, 1995.
[Maragos93] P. Maragos, J. F. Kaiser, and T. F. Quatieri, "Energy Separation in Signal
Modulations with Application to Speech Analysis", IEEE Trans. Signal Processing, vol.41,
pp. 3024-3051, Oct. 1993.
[Teager89] H. M. Teager and S. M. Teager, "Evidence for Non-linear Sound Production
Mechanisms in the Vocal Tract", in Speech Production and Speech Modelling, W.J. Hardcastle and
A. Marchal, Eds., NATO Advanced Study Institute Series D, vol.55, Bonas, France, July 1989.