
Summary of the Doctoral Dissertation
Contributions in Speech Analysis and Text-to-Speech Synthesis
for Romanian
1. Introduction
The purpose of the doctoral dissertation is to study speech processing methods and to support research in the field of voice synthesis, with the following goals:
a) the development of automated voice signal analysis methods;
b) the development of a speech synthesis method specifically adapted to Romanian;
c) the development of a working methodology for building an automated voice synthesis system;
d) the implementation of a voice synthesis system prototype for Romanian.
The author of the dissertation pursued the creation of a voice synthesis that matches the quality parameters of natural speech. A speech synthesis method for Romanian was designed for this purpose, and a working methodology was proposed for building an automated speech synthesis system.
Using syllables as linguistic units, the designed synthesis method belongs to the category of high quality concatenation-based methods. The method is specifically adapted to Romanian and relies on a rule-based approach both in the text processing phase, for the extraction of linguistic units and prosodic information, and in the vocal database construction phase, for the extraction of acoustic units from the spoken signal.
2. Digital processing of the voice signal
Digital processing and analysis of the voice signal are the first two steps in voice synthesis and voice recognition. Digital processing of the signal refers to all methods operating directly on the acoustic signal: capturing the signal, filtering, coding, compressing and storing it on magnetic or optical media. Voice signal analysis involves determining signal parameters from the recorded speech and then comparing these parameters with the expected values.
This chapter contains a synthetic study performed by the author on voice signal processing, coding and compression. Standard voice signal coding methods as well as voice signal compression methods were presented here.
2.1. Contributions to voice signal processing
A dedicated application called SPEA (Sound Processing and Enhancement Application) was designed for the study of voice signal properties. In its current design stage, the SPEA application offers the following facilities:
(1) loading and viewing recorded voice signals in different Wave file formats;
(2) increasing the display resolution for viewing the waveform and the signal patterns at different scales;
(3) determining the main voice signal parameters;
(4) selecting the workspace from a Wave file;
(5) computing the Fourier transform and viewing the amplitude and phase spectra of the signal;
(6) interactively modifying the components of the amplitude and phase spectra in order to improve the acoustics of the voice signal.
The application automatically detects the formants, i.e. the maxima of the spectral envelope that exceed certain threshold values. For each formant, it calculates the amplitude, the central frequency and the bandwidth, which are important voice synthesis parameters.
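SPEA's actual implementation is not reproduced in this summary; the sketch below only illustrates threshold-based spectral peak picking with NumPy/SciPy, where the Hanning window and the -30 dB threshold are assumptions, not values from the thesis.

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths

def detect_formants(frame, fs, threshold_db=-30.0):
    """Return (amplitude, central frequency, bandwidth) of spectral peaks
    exceeding the threshold -- the three formant parameters named above."""
    windowed = frame * np.hanning(len(frame))
    mag = np.abs(np.fft.rfft(windowed))
    mag_db = 20 * np.log10(mag / mag.max() + 1e-12)   # normalized dB spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    bin_hz = freqs[1] - freqs[0]
    peaks, props = find_peaks(mag_db, height=threshold_db)
    widths = peak_widths(mag_db, peaks, rel_height=0.5)[0]  # width in bins
    return [(props["peak_heights"][i], freqs[p], widths[i] * bin_hz)
            for i, p in enumerate(peaks)]
```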
The SPEA application allows selective filtering of the FFT spectrum, a very important aspect of speech analysis for improving voice signal quality. The FILTER command offers interactive filtering of the frequencies and graphic editing of the formants and of the spectral components of the voice signal. By moving the mouse in the FFT spectrum area, the user can eliminate the frequency bands corresponding to noise or increase the signal energy in the desired bands.
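The FILTER command itself is interactive and not published here; the band-elimination step it performs can, however, be illustrated non-interactively as follows (a minimal sketch, not the SPEA code):

```python
import numpy as np

def eliminate_band(signal, fs, f_low, f_high):
    """Zero the FFT bins of an unwanted band and resynthesize the signal."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum[(freqs >= f_low) & (freqs <= f_high)] = 0.0  # drop the noise band
    return np.fft.irfft(spectrum, n=len(signal))
```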
The user can modify the amplitude of formants and spectral components in order to improve the timbre of the sound. The experiments performed show that a quality voice implies a rich set of spectral components; high-frequency spectral components, in particular, are decisive for speech quality. This is helpful in the process of creating the vocal database used for speech synthesis, where some of the recorded voice segments can be improved by adding high-frequency spectral components.
The results of experiments performed with the SPEA application on real signal patterns were presented. The purpose of these experiments was to determine the specific characteristics of the voice signal when different Romanian sounds are pronounced by several speakers, under different conditions.
The experiments studied the specific properties of the signal that ensure a superior quality of the emitted sound. Several categories of spectral analysis were carried out: spectral analysis of voices emitted by different speakers, spectral analysis of consonants, spectral analysis of multi-tonally emitted sounds, highlighting the influence of the choice of analysis window on the spectral analysis result, and the behavior of modulated signals. Acoustic perceptual analyses were also performed: the perceptual analysis of sounds emitted in different phases and the relationship between the timbre of the sound and its auditory perception.
The factors that significantly determine voice quality were also studied, showing the influence of the sampling frequency and of the recording environment on the quality of the recorded voice, as well as the factors required for high quality voice synthesis.
The analysis of vowel sound behavior is also of great importance for creating quality speech synthesis. The characteristics of vowel sounds in different prosodic contexts were studied, and comparative diagrams corresponding to vowel pronunciation in these contexts were created.
3. Voice signal analysis
After processing the signal, the analysis of the voice signal is the next step in voice synthesis. Voice signal analysis involves:
1) determining signal parameters and characteristics based on speech patterns recorded from the user;
2) decomposing the signal into segments or regions with common properties (signal segmentation);
3) highlighting the significant segments and relating them to the known information (information extraction).
Considering that the voice signal is quasi-stationary over short time segments, meaning it keeps its properties unchanged during the entire interval, current voice signal analysis methods use a so-called short-term analysis. For this analysis, the voice signal is divided into segments of 10-30 ms, a range over which the signal is considered stationary.
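As an illustration of short-term analysis, the sketch below splits a signal into 25 ms frames with a 10 ms hop; these particular values are common defaults, not figures taken from the thesis.

```python
import numpy as np

def frame_signal(signal, fs, frame_ms=25.0, hop_ms=10.0):
    """Split the signal into short overlapping frames treated as stationary."""
    frame_len = int(fs * frame_ms / 1000)
    hop_len = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])
```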
There are two classes of voice signal analysis methods: time domain analysis and frequency domain analysis.
1) Frequency domain analysis decomposes the signal into components of known frequency, as in Fourier analysis, or into components with a known frequency behavior, as in filter-based analysis. Because the signal is decomposed into components, the parameters obtained differ from those of time domain analysis, the two approaches being complementary. The main methods used in this type of analysis were described: filter bank analysis, Fourier analysis, LPC analysis, cepstral analysis and perceptual analysis.
2) Time domain analysis computes voice signal parameters by directly analysing the samples of the waveform. The following parameters can be extracted: maximum and average amplitude, the energy of the voice signal, the number of zero crossings and the fundamental frequency. Some usual methods of determining the fundamental frequency of the voice signal were described in the study, as follows: the autocorrelation method, the average magnitude difference function method, and the center clipping method (a sketch of these time-domain parameters is given below).
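To make the time domain parameters concrete, here is a minimal sketch computing them for a single frame, together with an autocorrelation pitch estimate; the 50-500 Hz search range is an assumption, and the frame is assumed longer than one pitch period.

```python
import numpy as np

def time_domain_params(frame, fs, f0_min=50.0, f0_max=500.0):
    """Maximum/average amplitude, energy, zero crossings and an
    autocorrelation-based fundamental frequency estimate for one frame."""
    frame = frame.astype(float)
    energy = float(np.sum(frame ** 2))
    # count sign changes between consecutive samples (zero crossings)
    zcr = int(np.count_nonzero(np.signbit(frame[:-1]) != np.signbit(frame[1:])))
    # autocorrelation method: the strongest peak inside the plausible
    # pitch-lag range gives the fundamental period
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return {"max_amp": float(np.max(np.abs(frame))),
            "avg_amp": float(np.mean(np.abs(frame))),  # mean absolute amplitude
            "energy": energy, "zcr": zcr, "f0_hz": fs / lag}
```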
3.1. The author’s contributions in determining the signal periods
In this chapter, the author presented an original algorithm for determining the signal periods. The algorithm is based on a calculation method working exclusively in the time domain, which makes it extremely fast and efficient. Other advantages derived from this approach are: the exact detection of the end points of each period interval, the precise determination of each period in a quasi-periodic acoustic segment with variable frequency, and the fast determination of the period maxima. The algorithm is presented in figure 3.1.
Figure 3.1. The signal period determination algorithm: pivot determination -> period estimation -> maximum point detection -> hiatus point detection -> period determination
As can be observed in figure 3.1, the algorithm has four successive steps: determining the start point (the pivot point), estimating the period, detecting the maximum and hiatus points for each period, and then designating the period intervals.
The pivot point must be determined in order to know the position of the first period maximum. To determine it, after a median filtering of the signal, the zero crossings and the minima and maxima of the acoustic signal are computed. Then the sample with the highest amplitude among all the maximum points, within a distance D from the beginning of the considered segment, is chosen. This is the pivot point.
Next comes the stage of estimating the signal period around the pivot point. For this, the points to the left and right of the pivot point whose amplitude is comparable to that of the pivot are determined. The initial period estimate is obtained by averaging the distances between these two points and the central pivot point.
In the third stage, all the period maxima are determined, starting from the pivot point and moving to the left and to the right, respectively. A period maximum is obtained in the following way: knowing that the distance from the previous maximum should equal the estimated period, the local maximum point closest to this predicted position is selected.
If, at a certain iteration, no maximum point is found near the predicted position, either because the established period is exceeded or because of low signal amplitude, the next maximum point is marked as a period hiatus in the first case and as an amplitude hiatus in the second.
Finally, in the fourth stage, after all the period maximum points have been determined, the end points of each period interval are computed. The starting point of each interval is considered to be the first zero crossing preceding the corresponding period maximum. Thus, each period interval starts from its initial zero crossing and ends at the initial point of the next interval.
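As a rough illustration of stages one to three, the sketch below follows the pivot-and-maxima idea in plain NumPy. It deliberately omits the median filtering, the hiatus tests and the final zero-crossing boundaries, and the 20 ms search window and 25% tolerance are assumptions, not the thesis values.

```python
import numpy as np

def estimate_periods(x, fs, win_ms=20.0, tol=0.25):
    """Pivot-and-maxima sketch: pick the pivot, estimate the period from the
    neighboring maxima, then walk rightwards locating each period maximum
    near its predicted position (assumes a voiced, quasi-periodic segment)."""
    win = int(fs * win_ms / 1000)
    pivot = int(np.argmax(x[:win]))              # pivot: highest early maximum
    start = max(0, pivot - win)
    left = start + int(np.argmax(x[start:pivot])) if pivot > start else pivot
    right = pivot + 1 + int(np.argmax(x[pivot + 1:pivot + 1 + win]))
    period = ((pivot - left) + (right - pivot)) / 2.0   # initial estimate
    maxima, pos = [pivot], pivot
    while pos + (1 + tol) * period < len(x):
        lo = max(int(pos + (1 - tol) * period), pos + 1)  # predicted +/- tol
        hi = max(int(pos + (1 + tol) * period), lo + 1)
        pos = lo + int(np.argmax(x[lo:hi]))
        maxima.append(pos)
    return period, maxima
```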
Results obtained with the author's period interval determination algorithm
The algorithm gives good results both for male and for female voices. For voices of normal timbre, the results are exact. If the voice timbre is very rich in overtones, because the signal crosses the zero line very quickly, there can sometimes be variations of 1-5% in determining the end points of some periods, but these variations are compensated in the neighboring periods.
The algorithm proposed here is much more precise than methods that analyse the signal in the frequency domain, because in the proposed method the signal samples are analysed directly, with no windowing necessary; windowing always introduces approximations.
In figure 3.2 one can see the final result of the period interval determination for a vowel pronounced by a male speaker.
Figure 3.2. The final result of the period interval determination
3.2. Voice signal segmentation
After extracting the characteristics of the spoken signal, the next step of the analysis is the voice signal segmentation.
The segmentation means detecting the different signal categories and categorizing them according to the properties of the
signal.
The complexity of the segmentation algorithms depends on the type of categories we want to detect. For example, algorithms that separate the signal into regions according to certain physical parameters are less complex than those that determine the phonetic category (vowel or consonant). In turn, these are less complex than those that determine not only the category but also the identity of the phonemes. Likewise, algorithms that determine all the allophonic variants of a particular phoneme can be even more complex, due to the variation of that phoneme during speech.
The detection of the categories and the classification of the voice signal are done in three steps:
1) detecting the basic S/U/V segments;
2) identifying the phonetic categories;
3) accurately identifying the phonemes.
The first step of this algorithm divides the signal into three basic segment categories: silence (S), unvoiced (U) and voiced (V).
The second step associates each speech segment with a certain phonetic category. The phoneme types and the phonetic categories differ from language to language. For example, for English we can define 9 phoneme categories: vowels, voiced consonants, nasals, semivowels, voiced fricatives, unvoiced fricatives, voiced stops, unvoiced stops and silence.
The third step, a more complex one, is the accurate identification of the phonemes in the input stream. Here the purpose is to match the analyzed segment with one of the phonemes of the language.
In general, the number of categories into which the signal is segmented is chosen as a compromise between the complexity of the algorithms and the resolution of the resulting speech segments. If the recognition of individual phonemes is not necessary, the complexity of the segment recognition algorithms is reduced, because the number of choices needed in the matching process drops from the number of phonemes to the number of phonetic classes (for English, for example, from 41 phonemes to 9 phonetic classes). In addition, the differences between two phonetic categories are recognized more easily than the difference between two phonemes belonging to the same category.
3.2.1. Contributions to the voice signal automated segmentation process
This chapter presented the method designed by the author, capable of automatically detecting the S/U/V components of the signal (Silence, Unvoiced, Voiced), dividing these components into regions with certain properties, and then relating these regions to a known sequence of phonemes (figure 3.3).
The algorithm proposed by the author performs automated voice signal segmentation into 10 region classes. The voice signal is first divided into 4 basic categories: silence, voiced vowel, unvoiced consonant and transition; these are then classified into 10 distinct region classes: silence, unvoiced consonant, voiced vowel, unvoiced silence, jump region, irregularity, transitory, dense transitory, type R discontinuity and type G discontinuity.
The author performed experiments that led to the association of Romanian phonemes with region classes; these associations are presented in the corresponding chapter of the thesis.
The proposed algorithm uses time domain signal analysis. After a low-pass filtering of the signal, the zero-crossing points Zi of the waveform are detected. Then the minimum value points mi and the maximum value points Mi between two zero crossings are computed.
The silence/voiced separation is done using a threshold Ts applied to the voice signal amplitude: in the silence segments, all the mi and Mi points must be lower than Ts.
The distance Di between two adjacent zero crossings is then computed for each segment of the voice signal. A segment is decided to be voiced if this distance is greater than a threshold U.
Transitory segments are also defined, namely the segments for which the above conditions are not fulfilled.
Figure 3.3. The automated segmentation method proposed by the author: S/U/V segmentation, region detection, sub-region detection and connection to the phonemic segmentation
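A compact sketch of the S/U/V decision described in figure 3.3 follows (the low-pass filtering and the region/sub-region stages are omitted). The numerical thresholds Ts and U are placeholders, since the values used by the author depend on the speaker and the recording.

```python
import numpy as np

def suv_label(frame, fs, ts=0.02, u_ms=2.0):
    """Classify one frame as silence (S), unvoiced (U) or voiced (V) using an
    amplitude threshold Ts and a zero-crossing distance threshold U."""
    if np.max(np.abs(frame)) < ts:     # every mi and Mi below Ts -> silence
        return "S"
    zc = np.flatnonzero(np.signbit(frame[:-1]) != np.signbit(frame[1:]))
    if len(zc) < 2:                    # almost no zero crossings -> voiced
        return "V"
    mean_dist_ms = float(np.mean(np.diff(zc))) * 1000.0 / fs
    return "V" if mean_dist_ms > u_ms else "U"  # long zero-to-zero -> voiced
```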
The S/U/V segmentation (detection of the silence/unvoiced/voiced segments) is followed by a division of the voice signal into distinct region classes, in order to determine the signal properties of the regions and to connect them to the input phoneme set.
After a first application of the above algorithm, a great number of regions is generated. While the voiced regions are correctly determined from the start, the unvoiced regions are fragmented by a series of silence regions, due to the fact that these unvoiced regions usually have low amplitude. All these fragments are compacted into a single unvoiced region by a second application of the algorithm.
After segmentation, the detected regions are connected to the input phoneme sequence, based on rules established according to the acoustic properties of each phoneme as pronounced in Romanian. This process of connecting the distinct regions of the voice signal to the phonemes plays a very important role in the automated generation of the vocal database and has many other applications, including in the speech recognition domain.
The final result of signal segmentation into regions
In figure 3.4 one can observe the final result of the segmentation into region classes for the phrase << Evidenţierea unui cadru general >>. The following classes can be observed: voiced vowels (orange), unvoiced consonants (red line), transitory regions (red bold line), silence (no line) and unvoiced silence (blue).
The advantage of this algorithm over other approaches is its speed, derived from performing the calculations in the time domain and from detecting the basic categories in a single pass over the signal samples. Moreover, the different types of regions are detected mainly based on the parameters obtained in the first stage of the algorithm.
Figure 3.4. The final result of the segmentation into region classes for the phrase << Evidenţierea unui cadru general >>
3.3. Phonetic segmentation
Phonetic segmentation is the process of associating phonetic symbols from the input text with the spoken signal. After the segmentation process, the acoustic units are extracted. These acoustic units can be letters (phonemes), syllables, letter groups or whole words, depending on the chosen method. After the segments have been separated in the recorded signal, the acoustic units are parameterized, labeled and stored in the database used for synthesis.
Since the phonetic transcription of the text is not a difficult task, the hardest job in processing the speech corpus and creating the vocal database is the segmentation. This is because automated segmentation methods are not yet reliable enough, so manual checking of the segmentation remains mandatory, an extremely expensive process both in terms of time and of development costs.
This need for manual intervention is considered a limiting factor for building new corpora used in synthesis. Taking into consideration the market trend towards diversifying speech synthesizers, improving the precision and the degree of automation of the segmentation and corpus annotation process is a must.
3.3.1. Author’s contributions to the automated phonetic segmentation of the voice signal
The author proposed a phonetic segmentation method based on association rules, which creates a correspondence between groups of letters in the input and the distinct regions of the voice signal. The segmentation algorithm follows the input text and tries to find the best match between each letter group and one or more regions of the voice signal.
The input text is first rewritten in a phonetic transcription using a simple look-up table. The transcribed text is then split into a sequence of phonetic groups, and a correspondence with the segmented regions of the voice signal is determined based on association rules.
The presented method follows three distinct steps:
1. The phonetic transcription of the input text;
2. The segmentation of the voice signal into regions;
3. Writing the association rules for each phonetic group.
An automated generator of text stream parsers, called LEX, was used for associating phonetic groups with the corresponding region sequences. LEX generates a lexical analyzer (scanner) for text according to a set of rules given in Backus-Naur Form notation.
Each rule in the rule set contains a character pattern specification, which has to be matched against the current phonetic group from the input, and a corresponding action that is executed. In our case, the executed action tests a condition on the region sequence that can be put into correspondence with the phonetic group.
The generated scanner takes as input the character string resulting from the phonetic transcription and, based on the stored rules, executes the following actions:
1) it takes the current character sequence from the input string;
2) it finds the corresponding rule through pattern matching;
3) it tries to find a region sequence in the signal that matches the condition specified by the rule (illustrated schematically below).
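The author's actual LEX grammar is not reproduced in the summary; the toy Python analogue below only illustrates the mechanism, pairing a pattern over the phonetic transcription with a condition on the region sequence. The phoneme classes, region labels and conditions are invented for the example.

```python
import re

RULES = [
    # unvoiced consonant -> expects one unvoiced-consonant region next
    (re.compile(r"[sStTjf]"), lambda regs: bool(regs) and regs[0] == "U"),
    # vowel, optionally followed by a glide -> expects one voiced region
    (re.compile(r"[aeiou][lmnr]?"), lambda regs: bool(regs) and regs[0] == "V"),
]

def associate(transcription, regions):
    """Greedily match phonetic groups against region labels via the rules."""
    pos, out = 0, []
    while pos < len(transcription):
        for pattern, cond in RULES:
            m = pattern.match(transcription, pos)
            if m and cond(regions):
                out.append((m.group(), regions.pop(0)))  # consume one region
                pos = m.end()
                break
        else:
            pos += 1  # no rule matched: skip one symbol
    return out
```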
For a given speaker, the results of the association between the input phoneme string and the region set from the voice signal depend on two factors: (1) the way the voice signal is segmented and (2) the association rules corresponding to that speaker.
The signal segmentation using the method proposed by the author divides the signal into 10 distinct region classes, with well-defined borders between the regions. The most delicate problem is building the association rule set for the considered speaker. The rule set is designed using a voice corpus recorded from that speaker, based on which the rules for each phoneme group are written.
Once a rule set has been written for one speaker, adapting it to another only requires modifying the duration conditions and the types of regions associated with each phoneme group. The author first designed a rule set for a male speaker, which was afterwards easily adapted for a female speaker. After applying this phoneme-region association method, the conclusion was that these duration and waveform constraints are sufficient for a correct association.
Experiments regarding the segmentation into phonetic sub-regions
After the process of associating phonetic groups with the regions of the voice signal, two distinct association cases occur:
1) a certain phoneme is uniquely associated with a region or a set of regions;
2) a group of several phonemes is associated with a region or a set of regions.
The first case usually occurs for an unvoiced consonant (/s/, /ş/, /t/, /ţ/, /j/, /f/, /č/, /ğ/) or a singular vowel (one that does not appear in a vowel group). In this case the segmentation is precisely determined.
The second case occurs when a group of phonemes consisting of vowels and glide consonants (/l/, /m/, /n/, /r/) is encountered. Most of the time, such a group is associated with a single voiced region.
This particular case does not affect the linguistic unit detection process used in building the annotated voice corpus, especially when the linguistic units are built out of phoneme groups such as syllables.
However, if we want to separate the phonemes within such a region corresponding to a phoneme group, specific methods based on detecting the inherent characteristics of each phoneme have to be used.
The author experimented with two methods:
1. a method based on detecting sudden transitions within the region;
2. a method based on phonetic modeling.
Both methods require computing the characteristic signal coefficients of each phoneme (Fourier coefficients were used), as well as computing a distance between the two coefficient sets being compared. The results were presented explicitly in the corresponding section of the dissertation.
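An illustrative version of the first method (sudden-transition detection via coefficient distances) is sketched below: Fourier magnitudes are computed per frame and a jump in the Euclidean distance between consecutive frames is flagged. The relative threshold is an assumption; the actual results are in the thesis.

```python
import numpy as np

def sudden_transitions(frames, threshold=2.0):
    """Flag frame indices where the spectral distance to the previous frame
    jumps above `threshold` times the mean distance (possible transitions)."""
    coeffs = [np.abs(np.fft.rfft(f * np.hanning(len(f)))) for f in frames]
    dists = [np.linalg.norm(coeffs[i] - coeffs[i - 1])
             for i in range(1, len(coeffs))]
    mean = np.mean(dists)
    return [i for i, d in enumerate(dists, start=1) if d > threshold * mean]
```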
These phonetic segmentation methods were applied for the segmentation and annotation of the spoken corpus in order to build an acoustic unit database. The vocal database is built out of basic units which, when concatenated, can generate a sound signal corresponding to any text.
These basic units can be chosen from words, sentences, morphemes, syllables, phonemes, etc., depending on the application's requirements. The use of words and sentences (recording them as waveforms, including intonation, articulation etc.) leads to high quality speech, but only for a restricted linguistic domain.
For the implementation, the author chose the version that uses syllables as basic units. This choice has the advantage of using speech segments large enough to hold the intra-segmental prosody elements (such as accents), but small enough to ensure a reasonable size of the database. Another advantage of using syllables is that concatenation produces no acoustic artifacts, as happens with diphones, for example, where a spectral interpolation at the concatenation points is necessary.
4. Voice synthesis methods
A classification of the voice synthesis methods was presented at the beginning of this chapter:
a) Depending on the approach level, voice synthesis methods fall into two categories: methods that approach low level synthesis and methods that approach high level synthesis, respectively;
b) Depending on the analysis domain, speech synthesis methods are divided into synthesis methods in the time domain and synthesis methods in the frequency domain.
Time domain synthesis methods directly concatenate waveforms previously stored in the vocal database. The simplest synthesizers based on these methods do not parameterize the acoustic units, using the waveform of the time domain signal directly.
The major advantage of these time domain concatenation synthesis methods is the almost natural quality of the synthesized voice. Among the disadvantages are the considerable resources used for waveform storage and the difficulty of modifying the speech prosody.
Frequency domain speech synthesis methods perform voice synthesis based on acoustic parameters generated by approximating spectral characteristics in the frequency domain. Therefore, in order to synthesize a text, the acoustic parameters corresponding to the speech are generated first, and then the voice signal waveforms are generated.
In the study, a few methods that give good synthesis results were presented: linear prediction and formant synthesis in the frequency domain, the TD-PSOLA method in the time domain, and the corpus-based method.
4.1. Contributions to the design of the speech synthesis methods
As a specific development of the concatenation speech synthesis methods, the author designed and implemented a text-to-speech synthesis method based on syllable concatenation. Implementing the method required defining linguistic rules for the text analysis phase and waveform blending rules based on the prosodic characteristics.
From the point of view of text-to-speech system classification, the developed method is a mixed one: it combines the waveform concatenation approach with the rule-based approach. Speech synthesis based on this method is done in two steps: text analysis and speech synthesis, respectively (figure 4.1).
Figure 4.1. The syllable concatenation based synthesis method: the text analysis phase (pre-processing, syntax analysis, determination of the linguistic units as syllables, determination of the local prosody as accents) followed by the speech synthesis phase (acoustic unit retrieval from the voice database, unit concatenation and synthesis)
In the text analysis phase, a pre-processing stage is first necessary for the phonetic transcription of the numbers and abbreviations in the text. The syntax analysis highlights possible errors made while typing the text to be synthesized. Next comes the determination of the basic linguistic units, which in the current approach are syllables. In the last stage of the text analysis, the intra-segmental prosody is determined, in correlation with the word emphasis. A specific rule set was created for each stage of the text analysis.
In the synthesis stage, the acoustic units corresponding to the syllable units of the input text are first found using a vocal database search algorithm. The acoustic units are concatenated and then converted into sound in the last stage of the speech synthesis.
The database used with the synthesis method proposed by the author contains a subset of the Romanian syllables. After being recorded, the syllables must undergo a normalization process to align the speech tonality and intensity parameters. The syllables were recorded in different contexts and pronunciation modes, thereby capturing the prosody corresponding to the text to be synthesized.
The vocal database contains syllables built out of two, three or four letters, denoted by S2, S3 and S4.
The strategy was to record as large a number of syllables from each category as possible, in decreasing order of their frequency of appearance in Romanian. For this purpose, given that an automated method of dividing words into syllables had been designed, a statistics of Romanian syllables was created, to be used in creating the reference syllables and the acoustic database.
The aim of the statistics was to determine the frequency of appearance of Romanian syllables; it was created from texts in different domains: belles-lettres of different kinds, religion, economics, politics, science and technology, and journalism. The texts totaled approximately 342000 words, meaning over 600 pages in A4 format. Only syllables of types S2, S3 and S4 were counted, i.e. those having two, three or four component phonemes.
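Such a frequency statistic can be gathered along the following lines, assuming a syllabify() function implementing the author's rule-based hyphenation is available (that function is not shown in the summary, so it is passed in here as a parameter):

```python
from collections import Counter

def syllable_statistics(words, syllabify):
    """Count S2/S3/S4 syllables across a word list, most frequent first."""
    counts = Counter()
    for word in words:
        for syl in syllabify(word):
            if 2 <= len(syl) <= 4:    # keep only S2, S3 and S4 syllables
                counts[syl] += 1
    return counts.most_common()
```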
The integration of sub-segmental local prosody inside words was taken into consideration when creating the syllable database. The prosody was therefore included by recording both the unstressed and the stressed syllables for each S2, S3 and S4 category. Also, since a syllable is pronounced differently depending on its position inside the word (at the beginning, in the middle or at the end), the intention was to record the syllables in these different contexts. Initially, a distinction was made between the final syllables on one hand, and the medial and initial ones (also integrated in the median syllables category) on the other hand.
The syllables were introduced into the database according to the above characteristics. The database is organized as a tree structure: the nodes of the tree represent the syllable characteristics and the leaf nodes correspond to the actual syllables. The hierarchical structure of the database has four levels:
1. The Category level: two, three or four phoneme syllables (S2, S3, S4);
2. The Context level: median (Med) or final (Fin) segment, related to the position inside the word;
3. The Accent level: stressed syllables (A) or unstressed syllables (N) inside the word;
4. The Syllable level: the acoustic units recorded in WAVE format.
This hierarchical structure, [Category] -> [Context] -> [Accent] -> [Syllable], has the advantage of substantially reducing the database search time in the phase of matching the phonetic units of the text with the recorded acoustic units. Approximately 600 acoustic units, phonemes and syllables, were stored in the database, covering both median and final segment syllables, as well as stressed and unstressed syllables.
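The four-level hierarchy can be pictured as a nested mapping; the keys mirror the levels above, while the syllables and file paths shown are placeholders, not entries from the actual database.

```python
# Nested-mapping sketch of the [Category] -> [Context] -> [Accent] -> [Syllable]
# hierarchy; "ca" and the .wav paths are illustrative placeholders.
voice_db = {
    "S2": {                                        # two-phoneme syllables
        "Med": {"A": {"ca": "s2/med/a/ca.wav"},    # stressed, median
                "N": {"ca": "s2/med/n/ca.wav"}},   # unstressed, median
        "Fin": {"A": {}, "N": {}},
    },
    "S3": {"Med": {"A": {}, "N": {}}, "Fin": {"A": {}, "N": {}}},
    "S4": {"Med": {"A": {}, "N": {}}, "Fin": {"A": {}, "N": {}}},
}

def lookup(category, context, accent, syllable):
    """Descend the four levels and return the stored unit, or None."""
    return voice_db[category][context][accent].get(syllable)
```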
In the phase of retrieving the acoustic units from the database, the previously presented syllable characteristics are considered. The following situations are possible:
a) The required syllable is found identically in the database from a phonetic point of view (the component phonemes), from a contextual point of view (median or final) and from a prosodic point of view (the emphasis). In this case the syllable is used as stored, to be incorporated into the synthesized word.
b) The syllable is found phonetically, but not prosodically or contextually. In this case it is preferred to build the syllable out of subunits (separate phonemes and shorter syllables) which primarily follow the required prosody (stressed or unstressed syllable) and then, if possible, the specified context.
c) The syllable is not found phonetically in the database. In this case, as in the previous one, the syllable is built out of the component subunits found in the database.
The acoustic units found in the database through the above algorithm are concatenated in order to generate the output signal. The concatenation takes into account the pauses between words, which are adjusted depending on the required speech rhythm. The last phase is the actual synthesis, in which the waveforms corresponding to the input text are played using the sound card of the computer.
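A hedged sketch of this retrieval logic follows: case (a) is an exact hit; cases (b) and (c) fall back to decomposition into subunits, honoring the prosody first. The find_subunits() helper is hypothetical, standing in for the author's decomposition step.

```python
def retrieve_unit(syl, context, accent, lookup, find_subunits):
    """Return the list of acoustic units realizing one syllable."""
    hit = lookup(f"S{len(syl)}", context, accent, syl)
    if hit:
        return [hit]      # case (a): phonetic, contextual and prosodic match
    # cases (b) and (c): build the syllable from shorter subunits, matching
    # the required prosody first and then, if possible, the context
    return find_subunits(syl, context, accent)
```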
Results obtained using the synthesis method developed by the author
a) The automated syntax analyzer created within the method is based on a dictionary holding over 30000 inflexional forms of Romanian and on a set of 550 inflexion rules. The analyzer was tested on a series of Romanian texts of different genres, from literature to technical documents, summing over 200000 words. The tests proved an accuracy of over 98% correctly recognized words, the unrecognized words being exceptions that had not yet been introduced in the rule set. These results show the completeness of the designed rule set, as well as the viability of the proposed method.
b) The lexical analyzer used for determining the syllables contains a set of over 180 rules for decomposing words into syllables. The performance obtained was 98% correctly decomposed words, a percentage computed on a set of 50000 words extracted from different text genres (literature, economics, politics, science and technology, philosophy, religion). This performance is better than that reported by other Romanian researchers who used lexical rules.
c) The lexical analyzer for determining accents holds a set of 250 rules for detecting the stressed syllable inside words. A 94% success rate in detecting the stressed syllable was obtained, computed on the same set of 50000 words used for decomposing the words into syllables.
d) In the speech synthesis phase, the method generated good results, due to the use of acoustic units of medium and large length, namely syllables. The direct concatenation of the units, without any other signal processing, allows the synthesized speech to keep its naturalness and the prosodic aspects characteristic of the voice in which the acoustic units were recorded.
The advantages of the method
The syllable concatenation based synthesis method presented in this chapter has the following advantages:
a) It has a unitary approach in all the design phases, being based on rules in the most important stages.
b) It uses rules organized in a LEX type grammar, allowing the separation of the linguistic analysis mode from the data processing flow.
c) It ensures an enhanced extensibility and adaptability capacity, due to the fact that the rules are accessible and can
be edited by the user.
d) It ensures a significant decrease of the costs and the time needed for the design process, due to using rules (at
most a few hundred) as opposed to the methods that use dictionaries or lexicons (containing tens or hundreds of thousands of
definitions).
e) It has a higher degree of versatility, due to using LEX specific regular grammars, unlike other methods that use internal rule representations or even the XML format. Using rule sets based on regular expressions allows specifying patterns for the linguistic units and the contexts in which they appear, resulting in higher accuracy in the final text analysis.
f) It requires a smaller effort for building and maintaining the vocal database than the corpus based method: in the syllable based method, the number of acoustic units is at least two orders of magnitude smaller than in the corpus based method.
g) It maintains the efficiency and quality of the concatenation synthesis methods as opposed to the parametric synthesis methods: with concatenation methods, the synthesized signal maintains the quality of the units recorded in the vocal database, while with parametric methods the output signal is approximated.
h) It offers a higher synthesis quality than the phoneme or diphone based methods, due to a reduced number of
concatenation points at the syllable level.
5. Creating the LIGHTVOX speech synthesis system for Romanian
As a contribution to the domain of designing and creating interactive voice systems, the author's aim was to design and implement a voice synthesis system adapted for Romanian that uses syllables as phonetic units, called LIGHTVOX. The system was conceived as a text-to-speech system in which the speech synthesis starts from a Romanian text, using the syllable based synthesis method presented in the previous chapter.
The creation of the LIGHTVOX system followed two work directions:
1. Building the acoustic database (an off-line process), which included the following steps: recording the speech patterns, normalizing the signal, splitting the signal into regions, phonetic segmentation, separating the acoustic units and the actual building of the database.
2. The text-to-voice conversion (an on-line process), which includes the following steps: text pre-processing, spelling correction, linguistic unit detection, local prosody determination, acoustic unit retrieval, and unit combination with voice synthesis.
All these steps were detailed in the study; a high-level sketch of the on-line chain is given below.
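The sketch only mirrors the order of the on-line conversion steps listed above; the six stage functions are passed in as callables, and their names here are placeholders, not LIGHTVOX's actual module interfaces.

```python
def text_to_voice(text, preprocess, correct, to_syllables,
                  find_accents, retrieve_units, synthesize):
    """On-line text-to-voice chain: each argument is a stage callable."""
    text = preprocess(text)             # phonetic transcription of numbers etc.
    text = correct(text)                # spelling/syntax correction
    syllables = to_syllables(text)      # linguistic unit detection
    accents = find_accents(syllables)   # local prosody: stressed syllables
    units = retrieve_units(syllables, accents)  # acoustic unit retrieval
    return synthesize(units)            # concatenation and playback
```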
The author created a prototype of the LIGHTVOX voice synthesis system within the Faculty of Electronics, Telecommunications and Information Technology in Cluj-Napoca. The prototype was implemented based on a five component structure: the linguistic analysis module, the prosodic analysis module, the vocal database management module, the phonetic unit matching module and the actual speech synthesis module.
Regarding the experimental results, the synthesized text was found to have a fluent, natural audition that follows the segmental prosody (word emphasis) of Romanian.
The system can be directly used by persons with visual disabilities or by blind persons for automated text reading using simple keyboard commands.
The system can easily be extended to other applications for blind persons, such as computer voice-assisted text typing, electronic mail applications, reading Web pages, and electronic library applications for the blind (where, through an interactive voice menu, the user could choose an author, a book in electronic format and a chapter of that book, and the system would read it in the synthesized voice).
6. Conclusions
The research done within the thesis had as its final purpose the development of a voice synthesis method specifically adapted to Romanian, as well as the development of a work methodology for building an automated voice synthesis system.
Using syllables as linguistic units, the designed synthesis method belongs to the category of high quality concatenation-based methods. The method is specifically adapted to Romanian and proposes a new, rule-based approach.
The developed design methodology makes it possible to build a voice synthesis method using both procedures specific to signal processing and methods specific to artificial intelligence and computational linguistics, i.e. methods based on rules and knowledge sets.
Specific processing rules were developed in the most important design stages of the voice synthesis system: in the text analysis and processing stage, for linguistic unit detection inside the text, and in the vocal database construction stage, for extracting the voice units from the voice signal.
The developed voice synthesis system prototype proves the viability of the method designed by the author; it offers the possibility to develop applications of great importance in the man-machine communication domain and is also of great utility for persons with special needs.
The results obtained were presented at national and international conferences and published in specialty journals and books: 20 articles and a book in the domain of the thesis.
The main accomplishments and contributions of the thesis refer to:
1. Carrying out a study on the sound production and perception models.
2. Carrying out a synthetic study on voice signal processing, coding and compression methods.
3. Developing a digital voice signal processing application called SPEA (Sound Processing and Enhancement Application).
4. Performing experiments on real audio and voice signal patterns, with the purpose of determining the parameters that directly influence the acoustic quality of the signal.
5. Carrying out a synthetic study on voice signal analysis methods in the time and frequency domains.
6. Carrying out a study on voice signal segmentation and classification modalities.
7. Developing an original method of segmenting the voice signal into regions, capable of detecting 4 fundamental signal categories and 10 region classes.
8. Developing a method for determining the periods in the voice signal waveform.
9. Developing three distinct methods for phonetic segmentation of the voice signal by analyzing the regions detected in the signal.
10. Carrying out a study on text-to-speech voice synthesis methods.
11. Carrying out a study on the existing voice synthesis methods.
12. Developing a syllable based voice synthesis method for Romanian, in which linguistic rules were established for the text analysis phase and waveform combining rules for the synthesis phase.
13. Structuring a new text-to-speech synthesis system design methodology and developing the LIGHTVOX synthesis system for Romanian.
14. Creating a vocal database that uses syllables as acoustic units; the database holds 600 syllables recorded in different contexts and pronunciation modes.