
Focus Detection and NLP
Jonas Lindh and Jessica Villing
PhD Students
GSLT
and
Department of Linguistics
Göteborg University
{jonas,jessica}@ling.gu.se
Abstract
This paper describes one of the advantages that different spoken NLP applications (especially dialogue systems) can gain from implementing focus detection. It gives an overview of some implementations and their results, as well as a brief description of a possible implementation in GoDiS (Gothenburg Dialogue System). A pilot experiment was performed to see how two acoustic correlates performed in detecting focus. The automatic detection was very successful, and focus detection is therefore suggested as an addition to the system for resolving ambiguous utterances in dialogue.
1 Introduction
As soon as there is spoken input and/or output in an NLP system, we believe that phonetic knowledge, including automatic focus detection, will improve the system. The kind of NLP system we discuss in this paper is the spoken dialogue system, where the user interacts with the system in natural language.
Spoken dialogue systems aim to provide the user with answers to questions, or to control various devices. It can be a system where the user can, e.g., ask for tourist information or book theatre tickets, or a system where the user can control, e.g., a cell phone or a navigation system just by talking to it.
So-called conversational dialogue systems make it possible for the user to interact with the system in a way that feels natural with regard to choice of words and word order, and to negotiate with the system, thereby performing more advanced dialogues (Allen et al., 2001).
Detection of the focus word in a phrase would help the system interpret the user's intentions, and make the dialogue feel more natural for the user. Humans intuitively know how to interpret an ambiguous utterance based on the speaker's intonation, and a dialogue system that interprets user utterances in the same way would probably be more human-like and less tedious to use.
1.1 Hypothesis
In this paper, we explore how detection of focus in a user utterance can provide a dialogue system with additional information, and thereby make it possible to disambiguate the utterance. Our hypothesis is that detecting the focused word gives enough information to decide what additional information is suitable to give to the user of a dialogue system.
2 Background
Is it possible to utter exactly the same sentence, performing the same speech act, and still mean something completely different depending on the occasion? When uttering a sentence like

(1) Do I have a meeting with Jonas on Thursday?

what does the speaker really mean?
Just looking at a transcription like this one does not provide enough information, since the same sequence of words can have different meanings depending on prosody (Shriberg et al., 1998). Marking the stressed word (here with asterisks) gives a hint about what the speaker really wants to know:

(2) Do I have a *meeting* with Jonas on Thursday?

(3) Do I have a meeting with *Jonas* on Thursday?

(4) Do I have a meeting with Jonas on *Thursday*?
The speaker might know that she has scheduled, e.g., a lunch, but want to know if there might be a meeting scheduled too (2); be unsure whether it was Jonas or maybe someone else that she has scheduled to meet (3); or be unsure whether it was on Thursday or some other day that the meeting was supposed to take place (4).
The fact that it is possible to make these three different interpretations just by attending to the speaker's intonation is due to humans' intuitions about how utterances should be realised in order for a dialogue to be coherent. Engdahl et al. (2000) believe that the speaker stresses the part of an utterance that she believes the hearer should update her information state with. The background material is generally deaccented or left out (at least in English and Swedish; other languages might use other techniques, such as morphology or word order). This can be illustrated by the following example:
(5) Do I have a meeting with *Jonas* on Thursday, not *David*?

Jonas is focal, which is confirmed by the parallel item David in the continuation. The continuation must be parallel; otherwise the discourse would be perceived as incoherent:
(6) Do I have a meeting with *Jonas* on Thursday, not *Friday*?
Allwood (1974) investigated the acoustic correlates of the semantic phenomenon of focus. He could see that the element in an utterance that is asserted is the element that is in focus; everything else is background information. To acoustically determine which word is the focus word, he investigated duration, intensity and fundamental frequency, and found that focus is primarily marked through increased duration. There are also changes in the F0 and intensity curves, but those changes are less clear and regular.
There have been several studies on how to use prosodic information in different kinds of NLP systems.

Tür et al. (2001) use prosodic cues for automatic topic segmentation to aid various language understanding systems, such as information extraction and retrieval and text summarization. The paradigm that they use works in two phases. The first phase is the chopping phase: the input is divided into contiguous strings of words that are assumed to belong to the same topic (e.g., sentences in textual input, since it is assumed that topics do not change in the middle of a sentence). In the second phase, the sentence boundaries are classified into topic boundaries and non-topic boundaries. A prosodic model is used to estimate the posterior probability of a topic change at a given word boundary based on prosodic features extracted from the data. The prosodic features are duration (duration of pauses, duration of final vowels and final rhymes) and pitch (fundamental frequency (F0) patterns preceding and following the boundary, F0 patterns across the boundary, and pitch range relative to the speaker's baseline). Their results indicate that prosodic information, especially pause duration, provides an excellent source of information for automatic topic segmentation.
Edlund and Heldner (2005) looked at intonation patterns to determine relevant places in a dialogue where a dialogue system can begin to talk without interrupting the user. Finding these places makes it possible to speed up the dialogue. The most common turn-taking principle in dialogue systems is to wait until the user has paused for about two seconds; the system then assumes that the user has yielded the turn and starts to speak. The problem is that if the user is just hesitating, maybe wondering how to formulate a question, the system will interrupt the user. If, on the other hand, the user really wants to yield the turn, two seconds is quite a long time to wait. Finding better ways to handle turn-taking is therefore important to make the system feel natural to the user. Edlund and Heldner found that rising intonation before a silent pause can be associated with turn-yielding as well as turn-keeping, so prosodic analysis does not give enough information on its own, but they found it to be a valuable complement.
Strom et al. (1997) use prosody to improve a speech-to-speech translator called INTARC, which incrementally translates spoken phrases from German to English. Prosodic cues make it possible to find phrase boundaries and focus. Finding the phrase boundaries reduces the search space during the syntactic parsing, and it also rules out analysis trees during the semantic parsing. In the INTARC architecture, the prosodic boundaries support the stochastic and the semantic parser, and the prosodic focus supports the semantic evaluation.
2.1 Why focus detection in a dialogue system?
It is up to the designer of the dialogue system to decide how the system should respond to different user questions. Maybe the most obvious (and definitely the easiest) answer to (1) is a simple yes or no, but it would be nice to be able to give additional information, and in that case to make sure that it is the wanted additional information. One way to disambiguate the utterance and find out which one of the three possible interpretations is the right one would be to ask clarifying questions. That might, however, be both tiring and even annoying to the user, since the intention of the question is obvious to her. In a human-human conversation, the prosody gives valuable information on how to disambiguate an utterance. The most important information in an utterance is often stressed, so depending on whether the stressed word in the sample utterance above is meeting, Jonas or Thursday, it is possible to know which one of the three possible interpretations is the right one.
2.2 Acoustic correlates of focus
It is well known that the main acoustic correlates of stress are duration, F0 movements and intensity, where an F0 peak and relative duration are the strongest cues. This has been shown in several studies for different languages, such as Dutch (Sluijter and van Heuven, 1995), American English (Sluijter and van Heuven, 1996) and Swedish (Allwood, 1974; House and Bruce, 1990; Sautermeister and Lyberg, 1996). In applications, some of these correlates are easier to implement than others. Intensity, for example, is an easily measured parameter, but at the same time technical issues such as recording equipment and distance to the microphone will affect it. Spectral correlates have also been shown to indicate stress or focus (Sluijter and van Heuven, 1996), as increased duration also allows more time for the articulators to reach a target, producing more extreme formant values. But since spectral measures are difficult to interpret automatically, they are also difficult to implement in an existing system. This leaves two parameters, F0 peak and duration, to experiment with, and they have also been shown to be the best predictors of focus (Heldner et al., 1999). These two parameters are also not very difficult to implement in a system, or at least not in a system containing a speech recognizer, since the recognizer will also produce some kind of segmented output. If there is no segmented output, one can always use an external module and feed the sound through an aligner to get, for example, syllable-level feedback (which is the level presumed in the experiment in this paper). This way, the duration and F0 maximum per syllable can easily be calculated and fed back to the system.
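As a rough sketch of how this could be computed (a minimal example assuming syllable boundaries from an aligner and an F0 track represented as plain Python lists; all names and formats here are our own illustration, not part of GoDiS):

```python
import math

# Assumed input formats (our own illustration):
#   syllables: (label, start_time, end_time) triples from an aligner, in seconds
#   f0_track:  (time, f0_in_hz) pairs; unvoiced/creaky frames simply omitted
syllables = [("har", 0.00, 0.06), ("ja", 0.06, 0.11), ("moe", 0.11, 0.49)]
f0_track = [(0.02, 110.0), (0.08, 115.0), (0.20, 160.0), (0.40, 140.0)]

def semitones(f0_hz, ref_hz=100.0):
    """Frequency in semitones relative to a reference (100 Hz, as in the figures)."""
    return 12.0 * math.log2(f0_hz / ref_hz)

def syllable_features(syllables, f0_track):
    """Per syllable: percentage of total duration and F0 maximum in semitones."""
    total = sum(end - start for _, start, end in syllables)
    feats = []
    for label, start, end in syllables:
        dur_pct = 100.0 * (end - start) / total
        voiced = [f0 for t, f0 in f0_track if start <= t < end]
        f0_max = semitones(max(voiced)) if voiced else None  # None: no voicing
        feats.append((label, dur_pct, f0_max))
    return feats

for label, dur_pct, f0_max in syllable_features(syllables, f0_track):
    print(label, round(dur_pct, 1), f0_max)
```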
2.3 GoDiS and Praat
GoDiS (Gothenburg Dialogue System) is a dialogue system based on an issue-based approach to dialogue management (Larsson, 2002). The idea behind issue-based dialogue management is that issues (semantically modeled as questions) essentially are entities specifying certain pieces of as-yet-unavailable information. The conversational goals can thereby to a large extent be modeled as questions. Every user utterance is seen as an answer to a question, and to interpret the utterance the system tries to find a question that matches the given answer. Each utterance is interpreted as a dialogue move, corresponding to speech acts (Austin, 1962) such as requesting, answering or confirming. Apart from the obvious issues that arise from the activity in which the dialogue takes place (e.g., issues about which contact the user wants to call with her cell phone), there are also meta-issues that arise from the dialogue itself, e.g., when the ASR (Automatic Speech Recognizer) fails to hear the name of the contact and the system needs to ask for the name again.
A GoDiS application consists of one dialogue plan for each task that the application is capable of performing. The dialogue plan specifies which questions the system has to find answers to, and what the system needs to do when all questions have been answered, in order to perform the requested task. To give three different answers to examples (2), (3) and (4) in section 2, it would be necessary to implement three different dialogue plans.
The software Praat (freely available at www.praat.org; Boersma and Weenink, 2006) can be run as an external module from other programs (such as GoDiS) via the sendpraat subroutine.
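As a sketch only: the sendpraat binary and its argument conventions are documented at praat.org, while the script name analyze.praat and the exact Praat commands below are our assumptions and depend on the Praat version.

```python
import subprocess

# Hypothetical sketch: ask a running Praat instance to execute script lines.
# Each extra argument to sendpraat is sent as one line of Praat script code;
# the exact command names depend on the Praat version (see the Praat docs).
subprocess.run(
    ["sendpraat", "praat",
     'Read from file: "utterance.wav"',   # load the recorded utterance
     'runScript: "analyze.praat"'],       # assumed focus-analysis script
    check=True,
)
```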
2.3.1 Focus detection and GoDiS
GoDiS is a keyword-spotting system, meaning that one or more keywords trigger each dialogue plan. This means that there either have to be different keywords that trigger each one of the dialogue plans, or the disambiguation between the plans must be solved by asking clarifying questions. A question like (1) in a GoDiS application of today would trigger three plans, forcing the system to ask a clarifying question to find out what information the user wants. Detecting which word is in focus in the utterance can give a hint about what the user thinks is the most important word. Giving a word a special feature (e.g., [+FOC]) would be a way of telling the system that the word is a focus word. The system could then be told to choose the dialogue plan that corresponds to the focus word, giving the user the correct additional information. If there is no focus word, the system could either give a neutral simple answer (e.g., yes or no), or ask clarifying questions.
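A minimal sketch of this idea (the plan names and the representation of [+FOC] below are our own illustration, not the actual GoDiS implementation):

```python
# Hypothetical sketch of focus-based plan selection; plan names are invented.
PLANS = {
    "meeting": "confirm_event_type",   # interpretation (2)
    "jonas": "confirm_participant",    # interpretation (3)
    "thursday": "confirm_day",         # interpretation (4)
}

def choose_plan(words):
    """words: (token, focused) pairs, where focused marks a detected [+FOC] word."""
    for token, focused in words:
        if focused and token in PLANS:
            return PLANS[token]
    return "ask_clarifying_question"   # no focus detected: fall back

utterance = [("do", False), ("i", False), ("have", False), ("a", False),
             ("meeting", False), ("with", False), ("jonas", True),
             ("on", False), ("thursday", False)]
print(choose_plan(utterance))  # -> confirm_participant
```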
3 Method
To get an understanding of how well automatic detection would perform, a pilot experiment was conducted and evaluated. The hypothesis for the experiment was that a focused word (i.e., the stressed syllable in the focused word) would be prominent in an otherwise ambiguous sentence. It was also presumed that the system (or an external module) would output a syllabically segmented and time-aligned label file.
3.1 Participants and material
Four subjects were recorded (two male and two female) using an M-Audio Nova microphone connected to an M-Audio FireWire Solo external mobile audio interface. The recordings were sampled at 44.1 kHz with 16-bit resolution and analyzed automatically using a high-level language script that runs inside the software Praat (Boersma and Weenink, 2006). A pitch analysis range of 90-300 Hz for male speakers and 110-450 Hz for female speakers was applied. This means that frequencies below those floor values (for example when creakiness occurs) are ignored and appear as empty, unvoiced parts in some of the figures in the results. The subjects reported no hearing deficiencies and were all between 25 and 37 years old.
3.2 Experimental design
Each subject was given a sheet of paper with an ambiguous question and three alternative interpretations. The subject was then recorded uttering the same sentence three times, with each one of the interpretations in mind. The label files were then created semi-automatically and manually corrected. After being recorded, the subjects evaluated the recordings by the other speakers to see whether they could all make the same interpretation, i.e. detect focus, based on the others' recordings. All except one recording were correctly interpreted, and since the misinterpretation was unanimous, that specific recording was later excluded, as the automatic detection cannot be expected to outperform human capabilities of detecting focus. Before each recording, the subject was given the information that he/she was using a dialogue system handling his/her calendar. They were then presented with the sentence and told that they were to ask the system the question "Har jag ett möte med Jonas på torsdag?" (Do I have a meeting with Jonas on Thursday?). They were then given the following alternative interpretations of the question:
1. Is it a meeting (or practice) you have with Jonas on Thursday?
2. Is it with Jonas (or perhaps Martin) that you have a meeting on Thursday?
3. Is it on Thursday (or Friday) that you have a meeting with Jonas?
These questions were later also used by the subjects when evaluating the recordings; they could then match each recording to a reference question. Each recording, with its syllabically labeled file, was then run through the script, and the feedback was presented as the syllable with the highest F0 peak (in semitones) and the syllable with the greatest duration. Variation in fundamental frequency has been shown to be best expressed in semitones, which were therefore chosen as the logarithmic measure of comparison in this experiment.
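Concretely, with the standard definition, an F0 value f is expressed in semitones relative to the 100 Hz reference used in the figures as

\[ \mathrm{ST}(f) = 12 \log_{2}\!\left(\frac{f}{100\,\mathrm{Hz}}\right) \]

so that, for example, a 160 Hz peak lies about 8.1 ST above the reference.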
4 Results and discussion
First, each recorded sentence will be presented with the results of the duration and F0 measurements, followed by a short discussion of each speaker's result and interpretation. Finally, the results are followed by a summary of all speakers' results and an overall interpretation of the success rate.
4.1 Male speaker 1 (MS1)
The first recording clearly produced a focused <moete>, considering both fundamental frequency and duration, as can be seen in Figure 1 and Table 1 below.

Figure 1: Time-aligned F0 curve in semitones (ST re 100 Hz) for the MS1 recording of interpretation 1.
Syll    har   ja    ett   moe    te    med    jo    nas   po    tors   dag    Total dur (sec)
% dur   3.1   2.6   5.5   21.4   8.1   11.5   8.0   7.8   6.2   10.8   14.9   1.78 (100%)

Table 1: Total duration and percentage duration for each syllable for the MS1 recording of interpretation 1.
The second recording was more problematic, as can be seen in Figure 2 and Table 2 below.

Figure 2: Time-aligned F0 curve in semitones (ST re 100 Hz) for the MS1 recording of interpretation 2.

As can be seen, the last part of the utterance is not analyzed due to creak in the signal, which shows up as low-frequency outliers in the figure. A much worse problem is that also in this case the syllable <moe> has a higher F0 peak than the target <jo>. All subjects evaluated this recording correctly, pinpointing <jonas> as the focused word. One option is to use relative F0 peaks, since a general decay over an utterance is well known. Some proposals have been made on how to handle this by recalculating the relative frequency depending on the decay of the F0 minima ('t Hart et al., 1990). However, the minima here do not indicate a decay except for the last two syllables (containing creaky voice).
Syll    har   ja    ett   moe    te    med   jo     nas    po    tors   dag    Total dur (sec)
% dur   4.3   4.7   5.4   12.6   5.4   8.0   12.1   13.5   6.1   13.4   14.5   1.69 (100%)

Table 2: Total duration and percentage duration for each syllable for the MS1 recording of interpretation 2.
The duration measures do not solve the problem of indicating focus, as there are five syllables of similar length. The length of the last two syllables can theoretically be explained by the well-known final lengthening; in practice, it is more difficult to decide what is final lengthening and what is focus in an application. However, maybe in this case we also have to look at relative measures of duration compared to the surroundings.

Some of the problems remain in the third recording, which can be studied in Figure 3 and Table 3 below.

Figure 3: Time-aligned F0 curve in semitones (ST re 100 Hz) for the MS1 recording of interpretation 3.

The peak on the syllable <moe> remains, as does a small one on <jo>. Considering the mentioned decay, the highest peak would probably be on <tors>, though, which is the intention.
Syll    har   ja    ett   moe    te    med   jo    nas   po    tors   dag    Total dur (sec)
% dur   6.2   3.8   4.7   10.7   7.7   6.1   7.7   9.8   7.0   18.4   17.8   1.53 (100%)

Table 3: Total duration and percentage duration for each syllable for the MS1 recording of interpretation 3.
The combination of final lengthening and focus assigns 36.2% of the total duration to the focused word <torsdag>, which, at least in this case, supports duration as a main cue for focus.
4.2 Male speaker 2 (MS2)
The duration and F0 for the first recording and interpretation of MS2 can be seen in Figure 4 and Table 4 below.

Figure 4: Time-aligned F0 curve in semitones (ST re 100 Hz) for the MS2 recording of interpretation 1.

Even though there is a clear peak on the intended <moe>, there is an even higher one on the following two syllables. This can possibly be explained by a less rapid decay related to the focus peak, and could also be solved by feeding cues from the system about possible focused words/syllables.
Syll    har    ja    ett   moe    te    med   jo    nas   po    tors   dag    Total dur (sec)
% dur   13.8   4.9   8.5   17.6   4.9   6.5   6.1   7.8   7.2   9.7    13.0   2.01 (100%)

Table 4: Total duration and percentage duration for each syllable for the MS2 recording of interpretation 1.
Using duration in this case would solve the F0 issue, since the intended syllable clearly has the longest duration.

The second recording and interpretation by MS2 indicates a pattern of delayed F0 decay after the focus peak (see Figure 5 and Table 5 below).

Figure 5: Time-aligned F0 curve in semitones (ST re 100 Hz) for the MS2 recording of interpretation 2.

As the focused word has a grave accent in the Swedish dialect spoken by MS2, the second syllable also has a peak in F0, which accounts for the second high peak.
Syll    har   ja    ett   moe   te    med   jo    nas    po    tors   dag    Total dur (sec)
% dur   7.9   4.6   9.9   8.8   5.0   8.4   7.8   13.5   6.1   13.4   14.4   1.98 (100%)

Table 5: Total duration and percentage duration for each syllable for the MS2 recording of interpretation 2.
Final lengthening creates a problem for using duration in this case as well. However, in combination with F0 the result could still be used, as there are no peaks at all in the last syllables.

In the last recording for MS2, all the evaluations classified focus as being on the second interpretation, <jonas>, which led us to exclude it, since this study mainly focuses on the similarities between the perception of focus and its acoustic correlates.
4.3 Female speaker 1 (FS1)
The characteristics of focus for FS1 can be studied in Figure 6 and Table 6 below.

Figure 6: Time-aligned F0 curve in semitones (ST re 100 Hz) for the FS1 recording of interpretation 1.

There is a clear peak for the focused <moe>. The competing following peak could easily be disregarded with information from the system (for example, by only regarding parsed NPs as focus candidates) or simply by using duration as well, as can be seen in Table 6 below.
Syll    har    ja    ett   moe    te    med   jo    nas    po    tors   dag    Total dur (sec)
% dur   10.3   5.4   5.4   17.2   9.2   7.4   5.5   12.2   3.7   12.5   11.2   1.93 (100%)

Table 6: Total duration and percentage duration for each syllable for the FS1 recording of interpretation 1.
Despite final lengthening, the focused syllable is clearly the longest.
Figure 7: Time-aligned F0 curve in semitones (ST re 100 Hz) for the FS1 recording of interpretation 2.

In this case the peak is two-fold because of the Swedish grave accent. Interestingly enough, the second peak is higher than the first, which supports the hypothesis that the second peak in the Swedish grave accent is a phrase accent (Bruce, 1977).
Syll    har   ja    ett   moe    te    med   jo     nas    po    tors   dag    Total dur (sec)
% dur   9.0   3.7   6.8   12.0   4.7   6.1   11.9   12.9   5.8   14.0   13.0   1.76 (100%)

Table 7: Total duration and percentage duration for each syllable for the FS1 recording of interpretation 2.
The problem with final lengthening would in this case be easy to solve by combining duration with the peak information from the F0 analysis.
Figure 8: Time-aligned F0 curve in semitones (ST re 100 Hz) for the FS1 recording of interpretation 3.

The last peak, on the focused word <torsdag>, is high, and the highest if a general decay is taken into account. Including duration gives an even stronger indication of focus.
Syll    har   ja    ett   moe   te    med   jo    nas   po     tors   dag    Total dur (sec)
% dur   7.4   7.0   8.8   7.9   5.8   6.6   8.5   7.8   10.4   15.4   10.4   1.75 (100%)

Table 8: Total duration and percentage duration for each syllable for the FS1 recording of interpretation 3.
The recordings of the second female speaker (FS2) are presented in the same way below.

Figure 9: Time-aligned F0 curve in semitones (ST re 100 Hz) for the FS2 recording of interpretation 1.

FS2 produces a clear peak for the intended focus <moe>. Duration gives the same result (see Table 9 below).
Syll    har   ja    ett   moe    te    med   jo    nas   po    tors   dag    Total dur (sec)
% dur   5.6   3.2   8.8   20.6   7.2   8.0   7.3   8.0   7.6   11.8   11.9   1.97 (100%)

Table 9: Total duration and percentage duration for each syllable for the FS2 recording of interpretation 1.
Approximately a fifth of the total duration is here assigned to the focused syllable.
Figure 10: Time-aligned F0 curve in semitones (ST re 100 Hz) for the FS2 recording of interpretation 2.

The F0 peak information in the second recording does not identify the focus automatically.
Syll    har   ja    ett   moe   te    med   jo     nas    po    tors   dag    Total dur (sec)
% dur   5.6   3.8   7.6   9.6   6.1   6.4   14.9   11.5   7.4   13.9   13.1   1.79 (100%)

Table 10: Total duration and percentage duration for each syllable for the FS2 recording of interpretation 2.
In spite of the final lengthening, the greatest duration was assigned to the intended focus <jo>.
Figure 11: Time-aligned F0 curve in semitones (ST re 100 Hz) for the FS2 recording of interpretation 3.

In this case the F0 peaks are not very helpful at all, since there are several peaks higher than the one for the focused word <torsdag>.
Syll    har   ja    ett   moe    te    med   jo    nas   po    tors   dag    Total dur (sec)
% dur   6.5   3.6   7.7   10.2   5.7   9.3   7.2   8.4   8.1   20.4   12.8   1.64 (100%)

Table 11: Total duration and percentage duration for each syllable for the FS2 recording of interpretation 3.
The duration of the last word is more than 30% of the total, indicating more than final lengthening.
4.4 General discussion on the results
Automatically assigning focus using solely F0 peak information and duration handles most of the cases (92% here). Several problematic issues were discovered, though. First of all, the grave accent creates a problem, as the focus correlates seem to end up mainly in the second syllable. This has to be taken into account, for example by using words as the basis for analysis. That would not be possible with raw duration, though, since words have different inherent lengths (syllables too, but not to the same extent). However, calculating the mean duration per syllable in a word and assigning focus to the word with the longest mean could solve this, as sketched below. This also indicates that the acoustic correlates of focus are certainly language-specific.
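A minimal sketch of that idea, using the percentage durations from Table 1 (focus on <moete>) as illustrative values; the grouping of syllables into words is our own assumption:

```python
# Hypothetical sketch: score words by their mean duration per syllable, so that
# two-syllable (grave-accent) words are compared fairly with monosyllables.
words = {            # word -> per-syllable duration percentages (cf. Table 1)
    "har": [3.1], "ja": [2.6], "ett": [5.5],
    "moete": [21.4, 8.1], "med": [11.5],
    "jonas": [8.0, 7.8], "po": [6.2],
    "torsdag": [10.8, 14.9],
}

def focus_by_mean_duration(words):
    """Return the word whose syllables have the longest mean duration."""
    return max(words, key=lambda w: sum(words[w]) / len(words[w]))

print(focus_by_mean_duration(words))  # -> 'moete' (mean 14.75)
```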
Secondly, the general decay of the fundamental frequency over a phrase has to be compensated for in some way in order to assign focus based on relative peak heights. Maybe an implementation of the suggested recalculation of the relative frequency depending on the decay of the F0 minima is a solution to this ('t Hart et al., 1990).
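One way such a compensation could look: a sketch under the assumption that a declination line fitted through the F0 minima, in the spirit of 't Hart et al. (1990), is subtracted before peaks are compared; the least-squares fit and all values are our own choices.

```python
# Hypothetical sketch: measure F0 peaks relative to a declination line
# fitted through the per-syllable F0 minima.
def fit_line(points):
    """Ordinary least-squares fit of y = a*t + b through (t, y) points."""
    n = len(points)
    mt = sum(t for t, _ in points) / n
    my = sum(y for _, y in points) / n
    var = sum((t - mt) ** 2 for t, _ in points)
    a = sum((t - mt) * (y - my) for t, y in points) / var if var else 0.0
    return a, my - a * mt

def relative_peaks(syllables):
    """syllables: (label, time, f0_min_st, f0_max_st); peaks re the minima line."""
    a, b = fit_line([(t, lo) for _, t, lo, _ in syllables])
    return [(label, hi - (a * t + b)) for label, t, _, hi in syllables]

# Illustrative values only (semitones re 100 Hz): a late peak that is lower
# in absolute terms becomes the highest once declination is compensated for.
sylls = [("moe", 0.2, 6.0, 10.0), ("jo", 0.8, 5.0, 9.5), ("tors", 1.4, 4.0, 9.2)]
print(max(relative_peaks(sylls), key=lambda p: p[1]))  # -> ('tors', 5.2)
```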
Thirdly, duration and F0 are both obvious correlates of focus, but they seem to operate differently depending on the position in the phrase: F0 is boosted more phrase-initially, while duration is more evident in final position (final lengthening). This has to be taken into account in an implementation. Generally, an implementation would have to assign some kind of weight to each parameter, and the weights probably have to differ depending on phrase position, as in the sketch below.
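For instance, one could combine the two cues with position-dependent weights (a sketch only; the weight values are entirely our own placeholders, not derived from the experiment):

```python
# Hypothetical sketch: combine normalized duration and relative F0 peak with
# position-dependent weights (placeholder values, to be tuned on real data).
def focus_score(rel_position, dur_norm, f0_norm):
    """rel_position in [0, 1]: 0 = phrase-initial, 1 = phrase-final."""
    w_f0 = 0.7 - 0.4 * rel_position   # trust F0 more early in the phrase
    w_dur = 1.0 - w_f0                # trust duration more late (final lengthening)
    return w_f0 * f0_norm + w_dur * dur_norm

# Example: a high-pitched initial syllable vs. a long final one.
print(focus_score(0.0, dur_norm=0.4, f0_norm=0.9))  # initial: F0-dominated
print(focus_score(1.0, dur_norm=0.9, f0_norm=0.4))  # final: duration-dominated
```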
5 Conclusions
The success rate for automatic focus detection seems sufficient for it to serve as a complement for solving ambiguity problems in NLP systems. Our experiment shows that F0 peak and duration are phonetic correlates that can provide enough feedback to detect focus. How to interpret the feedback is not fully investigated here, nor is the process of an actual implementation or the cooperation between the two acoustic parameters. It is clear that the parameters have to be interpreted differently depending on phrase position, and that the general decay of F0 has to be taken into account when calculating peak height. It was also discovered that language-specific cues, which operate differently when focused (such as the two word accents in Swedish), might create problems.
6 Future work
An implementation of the current idea ought to be tested to see how it can be integrated in a system. With an implementation it will be possible to see how it might affect such a sensitive parameter as, for example, speed. Some preliminary tests on real-life dialogue system interactions were done with successful results, but a much larger number of tests has to be performed. More experiments with different acoustic measures for the two parameters, in different contexts and positions, have to be done as well.

Since our theories have not been implemented in a dialogue system, we have not been able to investigate user satisfaction. It is important to make an evaluation to measure to what extent the system succeeds in interpreting the user's desire.
7 Acknowledgements
Thank you Håkan, Leif, Sofie and Ellen for letting us record you and dissect your intention and focus. Also thanks to Staffan Larsson and Stina Eriksson for providing us with material and valuable tips.
References

Allen, J. F., Byron, D. K., Dzikovska, M., Ferguson, G., Galescu, L., and Stent, A. (2001). Towards conversational human-computer interaction. AI Magazine, 22(4):27-37.

Allwood, J. (1974). Intensity, pitch, duration and focus. In Logical Grammar Reports, volume 11. Göteborg University, Dept. of Linguistics.

Austin, J. L. (1962). How To Do Things With Words. Oxford University Press.

Boersma, P. and Weenink, D. (2006). Praat: Doing Phonetics by Computer (Version 4.4.20) [Computer program].

Bruce, G. (1977). Swedish Word Accents in Sentence Perspective. PhD thesis, Travaux de l'Institut de Linguistique de Lund 12.

Edlund, J. and Heldner, M. (2005). Exploring prosody in interaction control. Phonetica, 62(2-4):215-226.

Engdahl, E., Larsson, S., and Ericsson, S. (2000). Focus-ground articulation and parallelism in a dynamic model of dialogue. TRINDI Deliverable D4.2.

Heldner, M., Strangert, E., and Deschamps, T. (1999). Focus detection using overall intensity and high frequency emphasis. In Proceedings of ICPhS-99, volume 81, pages 1823-1826.

House, D. and Bruce, G. (1990). Word and focal accents in Swedish from a recognition perspective. In Wiik, K. and Raimo, I., editors, Nordic Prosody V, pages 156-173. Turku University.

Larsson, S. (2002). Issue-based Dialogue Management. PhD thesis, Göteborg University.

Sautermeister, P. and Lyberg, B. (1996). Detection of sentence accents in a speech recognition system. Journal of the Acoustical Society of America, 99:4, pt 2.

Shriberg, E., Bates, R., Stolcke, A., Taylor, P., Jurafsky, D., Ries, K., Coccaro, N., Martin, R., Meteer, M., and Van Ess-Dykema, C. (1998). Can prosody aid the automatic classification of dialog acts in conversational speech? Language and Speech, 41(3-4):439-487.

Sluijter, A. and van Heuven, V. (1995). Effects of focus distribution, pitch accent and lexical stress on the temporal organization of syllables in Dutch. Phonetica, 52:71-89.

Sluijter, A. and van Heuven, V. (1996). Acoustic correlates of linguistic stress and accent in Dutch and American English. In Proceedings of ICSLP 96, pages 630-633.

Strom, V., Elsner, A., Hess, W., Kasper, W., Klein, A., Krieger, H. U., Spilker, J., Weber, H., and Görz, G. (1997). On the use of prosody in a speech-to-speech translator. In Proc. Eurospeech '97, pages 1479-1482, Rhodes, Greece.

't Hart, J., Collier, R., and Cohen, A. (1990). A Perceptual Study of Intonation: An Experimental-Phonetic Approach to Speech Melody. Cambridge University Press.

Tür, G., Stolcke, A., Hakkani-Tür, D., and Shriberg, E. (2001). Integrating prosodic and lexical cues for automatic topic segmentation. Computational Linguistics, 27(1):31-57.