Journal of Memory and Language 53 (2005) 81–94
www.elsevier.com/locate/jml

Audiovisual prosody and feeling of knowing

Marc Swerts, Emiel Krahmer *
Communication and Cognition, Faculty of Arts, Tilburg University, P.O. Box 90153, NL-5000 LE Tilburg, The Netherlands
Received 9 July 2004; revision received 2 February 2005
Available online 16 March 2005

* Corresponding author. Fax: +31 13 4663110. E-mail addresses: [email protected] (M. Swerts), [email protected] (E. Krahmer).

Abstract

This paper describes two experiments on the role of audiovisual prosody for signalling and detecting meta-cognitive information in question answering. The first study consists of an experiment, in which participants are asked factual questions in a conversational setting, while they are being filmed. Statistical analyses bring to light that the speakers' Feeling of Knowing (FOK) is cued by a number of visual and verbal properties. It appears that answers tend to have a higher number of marked auditory and visual cues, including divergences from the neutral facial expression, when the FOK score is low, while the reverse is true for non-answers. The second study is a perception experiment, in which a selection of the utterances from the first study is presented to participants in one of three conditions: vision only, sound only, or vision + sound. Results reveal that human observers can reliably distinguish high FOK responses from low FOK responses in all three conditions, but that answers are easier than non-answers, and that a bimodal presentation of the stimuli is easier than the unimodal counterparts.
© 2005 Elsevier Inc. All rights reserved.

Keywords: Question answering; Feeling of knowing; Feeling of another's knowing; Tip of the tongue; Audiovisual prosody; Facial expressions; Speech production; Speech perception

Speakers are not always equally confident about or committed to what they are saying. When asked a question, for instance, they can be certain or rather doubtful about the correctness of their answer, and they may be unable to respond at all, even though in some cases it might feel as if the answer lies on the tip of the tongue. It has been shown that such meta-cognitive aspects of question answering are reflected in the way speakers 'package' their utterances prosodically (Brennan & Williams, 1995; Smith & Clark, 1993), where prosody can broadly be defined as the whole gamut of features that
do not determine what speakers say, but rather how they say it (Bolinger, 1985; Cruttenden, 1986; Ladd, 1996). The prosodic cues may be helpful to addressees, because they can use them to determine whether or not they should interpret incoming sentences with a grain of salt (cf. Sabbagh & Baldwin, 2001). So far, research has focussed primarily on analyses of verbal features, such as particular intonation patterns, that are encoded in the speech signal itself. Visual cues, such as a wide range of gaze patterns, gestures, and facial expressions, are a natural and important ingredient of communication as well. It seems a reasonable hypothesis that these might be informative just as verbal features are, though this issue is largely unexplored. The current paper tries to fill this gap by looking at possible signals of meta-cognitive speaker states on the basis of combinations of verbal and visual features, for which we will use the term audiovisual prosody (following, e.g., Barkhuysen, Krahmer, & Swerts, 2005; Krahmer & Swerts, 2004; Munhall, Jones, Callan, Kuratate, & Vatikiotis-Bateson, 2004; Srinivasan & Massaro, 2003). Since our work builds on insights of past studies of memory processes in question answering situations (e.g., Brennan & Williams, 1995; Smith & Clark, 1993), we first discuss such prior work in more detail.

In traditional models of memory, answering factual questions (e.g., "Who wrote Hamlet?") proceeds as follows: when asked a question, a person searches for the answer in memory. If the answer is found, it will be produced; otherwise the person signals search failure by providing a non-answer such as "I don't know." A major deficit of such simple models is that they leave unspecified for how long a person should search for an answer and what triggers the decision to produce a non-answer. For these reasons current models of memory often supplement the searching component with a meta-cognitive monitoring component, which continuously assesses a person's knowledge. For each potential answer that the searching component retrieves, the monitoring component estimates the confidence in the answer's correctness. If the confidence is high enough, the answer can be produced. If not, the search component can look for more confident answers, provided the meta-memory assessment indicates that a proper answer may still be found (see, e.g., Andersen, 1999; Koriat, 1993; Nelson, 1996).

Smith and Clark (1993) emphasize that in a conversational, everyday setting, question answering is a social process, which involves information exchange as well as self-presentation. To account for this, Smith and Clark point out that a further component should be added to a model for question answering, namely a responding component, which uses the output of both search and monitoring to provide feedback to the questioner. For instance, when a person responds to a question with an answer that has a relatively low confidence score, the person can signal this uncertainty in his or her response; if the answer later turns out to be incorrect, the speaker might save face and look less foolish since it was suggested that there was little confidence in the response (Goffman, 1967).
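The interplay of the searching, monitoring, and responding components outlined above can be pictured in a compact sketch. The following Python fragment is purely a toy illustration of this three-component flow and not an implementation from any of the cited studies; the function names, the confidence threshold, and the bound on search attempts are illustrative assumptions.

```python
def answer_question(question, search, assess_confidence, may_still_find_better,
                    threshold=0.8, max_attempts=3):
    """Toy sketch of the search/monitor/respond flow (illustrative only)."""
    best = None
    for _ in range(max_attempts):
        candidate = search(question)                        # searching component
        if candidate is None:
            break                                           # nothing (more) retrieved
        confidence = assess_confidence(question, candidate) # monitoring component
        if confidence >= threshold:
            return candidate                                # confident, unmarked answer
        best = candidate
        if not may_still_find_better(question):
            break                                           # give up on further search
        # otherwise keep searching; the delay might be bridged with a filler ("uh")
    if best is None:
        return "I don't know."                              # non-answer: search failure
    return f"uh ... {best}?"                                # hedged low-confidence answer


if __name__ == "__main__":
    # Tiny usage example with stub components.
    print(answer_question(
        "Who wrote Hamlet?",
        search=lambda q: "Shakespeare",
        assess_confidence=lambda q, a: 0.95,
        may_still_find_better=lambda q: False))
```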
Smith and Clark (1993) note that signalling uncertainty of an answer can be done, for instance, via explicit commentary, using linguistic hedges such as "I guess" or "perhaps," or, more implicitly, by using a rising, question-like intonation (e.g., Bolinger, 1985; Morgan, 1953). Similarly, speakers may want to account for delays using fillers such as "uh" or "uhm" (e.g., Clark & Fox Tree, 2002), when the search has not resulted in an answer yet, even though the meta-memory assessment indicates that an acceptable answer may still be found after a prolonged search. The resulting 3-fold model of question answering (searching, monitoring, and responding) yields a number of testable predictions: if a speaker finds a question difficult, this will result in a longer delay and correspondingly more fillers; and this will also result in more 'marked' answer behavior (hedges, and paralinguistic cues such as rising intonation and fillers).

To test these predictions, Smith and Clark (1993) use Hart's (1965) experimental method, a common technique to elicit meta-memory judgements. In this method participants are first asked a series of factual questions in a conversational setting. Then they are asked to rate for each question their feeling that they would recognize the correct answer [i.e., their Feeling of Knowing (FOK)]. Finally participants take the actual recognition test, i.e., a multiple-choice test in which all the original questions are presented once more. The interesting thing is that this method provides a handle on the Tip of the Tongue (TOT) phenomenon, where a speaker is unable to produce an answer, but has the "feeling of knowing" it, perhaps because there was partial access to related information or because the question (cue) seemed familiar (e.g., Koriat, 1993; Reder & Ritter, 1992). Using this paradigm, Smith and Clark (1993) found that there was a significant difference between two types of responses, answers and non-answers: the lower a speaker's FOK, the slower the answers, but the faster the non-answers. In addition, the lower the FOK, the more often people answered with rising intonation, and added fillers, explicit linguistic hedges and other face-saving comments. It is worth noting that high FOK answers and low FOK non-answers are similar in that both express some form of speaker certainty: in the case of a high FOK answer the speaker is relatively certain of the answer, while a low FOK non-answer indicates that the speaker is relatively certain that he or she does not know the answer. In both cases, the speaker will not embark on a longer memory search, and the reply will tend to contain few paralinguistic cues.

The question naturally arises whether other people can reliably determine the FOK of a speaker. To test this, Brennan and Williams (1995) first performed an experiment which replicated Smith and Clark's earlier findings, and subsequently made a selection of the speakers' responses which were presented to listeners. The listeners were tested on their Feeling of Another's Knowing (FOAK), to see if meta-cognitive information was reliably conveyed by the surface form of responses. The results for answers showed that rising intonation and longer latencies led to lower FOAK scores, whereas for non-answers longer latencies led to higher FOAK scores. Brennan and Williams (1995) conclude that listeners can indeed interpret the meta-cognitive information that speakers display about their states of knowledge in question answering.
Even though they stress the interactive and social nature of question answering, both Smith and Clark (1993) and Brennan and Williams (1995) are only concerned with linguistic and prosodic cues to meta-cognition. Arguably, the speaker cues for meta-cognition that they consider in their respective studies are obvious ones, but they are not fully representative of the richness of face-to-face communication, in which gestures and facial expressions play important roles as well. The only study we are aware of in which FOAK judgements were made by people who could actually see respondents answering questions is Jameson, Nelson, Leonesio, and Narens (1993), but in that study the respondents had to type their answers and were not allowed to speak.

There is ample evidence that visual information is a crucial element at various levels of face-to-face conversation. A number of studies have shown that visual cues are beneficial for the understandability of speech (e.g., Jordan & Sergeant, 2000; Sumby & Pollack, 1954). During face-to-face communication, people pay a lot of attention to gestures (e.g., Kendon, 1994; Krauss, Chen, & Chawla, 1996; McNeill, 1992) and facial expressions such as nodding, smiling, and gazing (e.g., Clark & Krych, 2004): such visual cues help to structure the interaction, for instance, by facilitating turn taking, or they support/supplement the information a speaker wants to convey. A number of recent studies also link gestures and other non-verbal behaviors to meta-cognitive processes (e.g., Frick-Horbury, 2002; Lawrence, Myerson, Oonk, & Abrams, 2001; Morsella & Krauss, 2004). Morsella and Krauss (2004), for instance, explicitly relate the use of gestures to the TOT phenomenon. They point out that mental representations are transient and that retrieving them is effortful (Farah, 2000; cf. also Kikyo, Ohki, & Miyashita, 2002). As a result, holding them in mind for lengthy intervals (as can occur during question answering) may be difficult; gestures, they argue, facilitate speech production more directly, since they continuously activate the prelinguistic sensorimotor features that are part of semantic representations of target words (Morsella and Krauss, 2004, refer to this as the Gestural Feedback Model). Besides gestures, facial cues can be indicative of meta-cognitive processes as well. Goodwin and Goodwin (1986) (see also Clark, 1996) already discussed the communicative relevance of the so-called "thinking face": it often happens that, during search, a respondent turns away from the questioner with a distant look in the eyes in a stereotypical facial gesture of someone thinking hard. In a similar vein, it has been argued that gaze may be indicative of memory processes (e.g., Glenberg, Schroeder, & Robertson, 1998). Given such earlier work which suggests that visual information may be a cue for meta-cognition, it is interesting to see whether speakers produce visual cues correlating with their feeling of knowing and whether such cues can be perceived by others and used to estimate the feeling of another's knowing. Additionally, an audiovisual study of meta-cognitive information is relevant from the viewpoint of multisensory perception. In this field of research, one open question is what the relative contributions are of visual and auditory cues in audiovisual perception and how the two groups of cues interact.
On the basis of earlier studies, it can be conjectured that the relative importance of auditory and visual cues differs for different aspects of communication. Information presentation, for instance, is traditionally studied from an auditory perspective. At the basic level of understandability the acoustic signal usually provides sufficient information (see, e.g., Kohlrausch & van de Par, 2005). Visual cues tend to influence understandability in, for instance, adverse acoustical conditions (e.g., Benoît, Guiard-Marigny, Le Goff, & Adjoudani, 1996; Jordan & Sergeant, 2000; Sumby & Pollack, 1954) or in the case of incongruencies (as in the well-known McGurk effect, where an auditory /ba/ combined with a visual /ga/ is perceived as /da/ by most subjects, McGurk & MacDonald, 1976). At the level of prosody a similar dominance of auditory cues can be observed. For instance, a number of prosodic studies that focussed on audiovisual signalling of prominence (indicating which words in an utterance are important) found that, even though facial cues such as head nods and eyebrow movements contribute to perceived prominence, the auditory cues (i.e., pitch accents) dominate (e.g., Keating et al., 2003; Krahmer & Swerts, 2004; Swerts & Krahmer, 2004). In general, the vast majority of studies have concentrated on verbal prosody, while an interest in audiovisual prosody has only recently begun to emerge. When looking at emotion, on the other hand, it appears that visual cues (in particular facial expressions) are judged to be more important than auditory cues such as voice information (e.g., Bugenthal, Kaswan, Love, & Fox, 1970; Hess, Kappas, & Scherer, 1988; Mehrabian & Ferris, 1967; Walker & Grolnick, 1983), and it may be noted that emotion research initially focussed on facial perception (see, for instance, the work of Paul Ekman and colleagues, e.g., Ekman, 1999). Interestingly, various more recent studies have shown that both modalities influence emotion perception of incongruent stimuli, where the visual channel may offer one emotion ('happy') and the auditory channel another ('sad') (e.g., de Gelder & Vroomen, 2000; Massaro & Egan, 1996). Meta-cognitive information during question answering is an interesting candidate to investigate from this audiovisual perspective, since it shares some characteristics with both emotion (in that it reflects a speaker's internal state) and with information presentation (in the meta-cognitive signalling in the actual response).

In this paper, we describe two experiments which study the role of audiovisual prosody for the production and perception of meta-cognitive information in question answering. We will be particularly interested in combinations of auditory cues (rising intonation, fillers, and unfilled pauses) with visual cues (such as gaze, eyebrow movements, and smiles). For Experiment 1, we use Hart's (1965) paradigm to elicit answers and non-answers with different FOK scores, which we then analyze in terms of a number of audiovisual prosodic features. The study by Brennan and Williams (1995) focused on auditory cues alone, while the goal of Experiment 2 is to explore whether observers of speakers' responses (collected in Experiment 1) are able to guess these speakers' FOK scores on the basis of visual cues as well. In particular, we are interested in whether a bimodal presentation of stimuli leads to better FOK predictions than the unimodal components in isolation.
While we expect the best performance for bimodal stimuli, it is an interesting empirical question whether the auditory or the visual features from the unimodal stimuli are more informative for FOK predictions.

Experiment 1

Method

Participants

Twenty participants (11 male, 9 female), colleagues and students from Tilburg University, between 20 and 50 years old, participated as speakers in the experiment on a voluntary basis. None of the participants objected to being recorded, and all granted usage of their data for research and educational purposes.

Materials

The materials consisted of a series of factual questions, taken in part from a Dutch version of the "Wechsler Adult Intelligence Scale" (WAIS), an intelligence test for adults. We selected only those questions which would trigger a one-word response (e.g., Who wrote Hamlet? What is the capital of Switzerland?), and added a supplementary list of questions from the game Trivial Pursuit. The 40 questions in total covered a wide range of topics, like literature, sports, history, etc. The list of questions was presented to participants in one of two random orders (see Appendix A).

Procedure

Following Hart (1965) and Smith and Clark (1993), the experimental procedure consisted of three stages. In the first stage, the 40 questions were posed by the experimenter, and the responses by the participant were filmed (front view of head). As in Smith and Clark (1993), participants could not see the experimenter, and they were told that this was to prevent them from picking up any potential cues the experimenter might give about the correctness of their answers. Participants were told that the experimenter could see them via the digital camera, which was connected to a monitor behind a screen. The experimenter asked the series of 40 questions one by one, and the pace of the experiment was determined by the participant. As an example, here are five responses (translated from Dutch) to the question about the name of the person who drew the pictures in "Jip en Janneke," a famous Dutch book for children written by Annie M.G. Schmidt:

(a) Fiep Westendorp
(b) uh Fiep Westen-(short pause)-dorp
(c) (short pause) Isn't that Annie M.G. Schmidt?
(d) no idea
(e) uh the writer is Fiep Westendorp, but the drawings? No, I don't know

The example shows cases of correct answers, which could be fluent (a) or hesitant (b), an incorrect answer (c), and a simple (d) and complex (e) case of a non-answer.

In the second stage, the same sequence of questions was again presented to the same participants, but now they had to express on a 7-point scale how sure they were that they would recognize the correct answer if they had to find it in a multiple-choice test, with 7 meaning "definitely yes" and 1 "absolutely not." The third stage was a paper-and-pencil test in which the same sequence of questions was now presented in a multiple-choice format in which the correct answer was mixed with three plausible alternatives. For instance, the question "What is the capital of Switzerland?" listed Bern (correct) with three other large Swiss cities: Zürich, Genève, and Basel. Participants were unaware of the real purpose of the study, but were told that its goal was to learn more about the average complexity of a series of questions which could be used in future psycholinguistic research. They were warned beforehand that questions were expected to vary in degree of difficulty.
To encourage them to do their best and guess the correct answer in case of uncertainty, the experiment incorporated an element of competition, where the 'winner' (i.e., the person with the highest number of correct answers in the first test) got a small reward (a book token).

Labelling, annotation

All utterances from the first test (800 in total) were transcribed orthographically and manually labelled regarding a number of auditory and visual features by four independent transcribers. The labelling was based on perceptual judgements and features were only marked when clearly present. Regarding verbal cues, we labelled the presence or absence of the following features:

Delay: Whether a speaker responded immediately, or took some time to respond.
High intonation: Whether a speaker's utterance ended in a high or a low boundary tone.
Fillers: Whether the utterance contained one or more fillers, or whether these were absent.

For simplicity's sake, all three verbal features were treated in a binary way. A clearly perceivable pause before speaking was labelled as a delay while an immediate response was not. We did not measure precise delay durations as Smith and Clark (1993) and Brennan and Williams (1995) did. Concerning high intonation, note that we did not attempt to isolate question intonation, as it turned out to be difficult to consistently differentiate 'real' question intonation from list intonation. Finally, for the labelling of fillers we did not differentiate between 'uh,' 'uhm' or 'mm,' since the results of the perception study of Brennan and Williams (1995) only revealed minor differences between them. In addition to these categorical variables, we counted the number of words spoken in the utterance, where family names, like Elvis Presley, were considered to be one word. This count provides a rough measure for the amount and length of linguistic hedges (such as "I guess") used by speakers, where we did not differentiate between the kind of explicit commentary used. As to the visual cues, we labelled the presence or absence of the following features:

Eyebrow movement: If one or more eyebrows departed from neutral position during the utterance.
Smile: If the speaker smiled (even silently) during the response.
Low gaze: Whether a speaker looked downward during the response.
High gaze: Whether a speaker looked upward during the response.
Diverted gaze: Whether a speaker looked away from the camera (to the left or the right) during the response.
Funny face: Whether the speaker produced a marked facial expression, which diverted from a neutral expression, during the response.

Note that an utterance may contain multiple marked features. In addition to the isolated features mentioned above, we also counted the number of different gaze acts, i.e., combinations of high, low or diverted gaze. In the literature on facial expressions, the features gaze (Argyle & Cook, 1976; Gale & Monk, 2000; Griffin, 2001; Glenberg et al., 1998), smiling (Kraut & Johnson, 1979), and brow movements (Ekman et al., 1979; Krahmer & Swerts, 2004) are mentioned as informative cues in spoken interaction, and a first inspection of the recordings revealed that they occur relatively frequently in the current data set. The expression we, for want of a better term, called "funny face" can be seen as a specific instance of Goodwin and Goodwin's (1986) thinking face, explicitly suggesting uncertainty. Fig. 1 contains some representative stills for each of the visual features.
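To make the annotation scheme concrete, the information kept for a single response can be pictured as a small record, as in the sketch below. The class and field names are hypothetical illustrations of the binary features and counts described above; they are not part of the original labelling protocol.

```python
from dataclasses import dataclass

@dataclass
class ResponseLabels:
    """Hypothetical record for one labelled response (illustration only)."""
    # Auditory features (binary: marked setting present or absent)
    delay: bool
    high_intonation: bool
    filler: bool
    # Visual features (binary)
    eyebrow_movement: bool
    smile: bool
    low_gaze: bool
    high_gaze: bool
    diverted_gaze: bool
    funny_face: bool
    # Counts
    word_count: int      # family names such as "Elvis Presley" count as one word
    gaze_act_count: int  # number of gaze acts (low, high, or diverted) in the utterance

# Example: a hesitant answer with a delay, a filler, and a diverted gaze.
example = ResponseLabels(
    delay=True, high_intonation=False, filler=True,
    eyebrow_movement=False, smile=False,
    low_gaze=False, high_gaze=False, diverted_gaze=True,
    funny_face=False, word_count=3, gaze_act_count=1)
```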
The visual features are roughly comparable with Action Units (AUs) described by Ekman and Friesen (1978), though there is not necessarily a one-to-one mapping to these Action Units. The action units form the building blocks of Ekman and Friesen's Facial Action Coding System, which builds on the assumption that facial actions can be described in terms of muscular action. The basic action units correspond with single muscle activities, and more complex facial expressions can be described using these atomic building blocks. Fig. 1, in particular, displays examples of marked settings of smile (AU 12, AU 13), gaze (AU 61–AU 64), and eyebrow raising (AU 1, AU 2). Funny faces typically consist of a combination of AUs such as lip corner depression (AU 15), lip stretching (AU 20) or lip pressing (AU 24), possibly combined with eye widening (AU 5), and eyebrow movement as well.

Fig. 1. Stills illustrating a number of annotated visual features; the description and examples represent the marked settings for each feature.

The labelling procedure was as follows. On the basis of a preliminary labelling protocol, utterances from two speakers were labelled collectively. For most cues the labelling was unproblematic, and for the few more difficult cases, mostly involving gaze, a consensus labelling was reached after discussion. This first phase resulted in an explicit labelling protocol, after which the labellers proceeded individually by labelling a selection of the cues in the recordings from the remaining 18 speakers. Only the feature gaze was annotated by two persons, who labelled the different gaze acts in an utterance (low, high, and diverted) in isolation and resolved mismatches via discussion. All features were annotated independently from the FOK score to avoid circularity. Similarly, annotators only looked at answers without taking the question context into account, so that the answer's (in)correctness could not bias the annotator.

Statistical analyses

Our statistical procedure closely follows the ones proposed by Smith and Clark (1993) and Brennan and Williams (1995). We obtained 800 responses, 40 from each of the 20 participants. These responses are not independent, so that analyses across the entire data set are inappropriate. Therefore, the following tests are always based on individual analyses per speaker. We computed correlation coefficients for each subject individually, transformed the correlations into Fisher's z_r scores, and tested the average z_r score against zero. The average z_r scores were then transformed back into correlations for reporting in this article. Similarly, when comparing means, we computed a mean for each speaker, and used these composite scores in our analyses. In any individual analysis, we did not include any participant for whom we could not compute an individual correlation or mean, so some of our statistics, as in Smith and Clark (1993), are based on a total n of less than 20. The ANOVA tests reported below compare means both for subjects and for items.
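As an illustration of the per-speaker procedure described above (individual correlations, transformation to Fisher's z_r, a test of the average z_r score against zero, and back-transformation for reporting), a minimal sketch could look as follows. The helper names and the toy data are assumptions, scipy is assumed to be available, and the one-sample t test is only one plausible reading of "tested the average z_r score against zero"; this is not the authors' analysis script.

```python
import math
from scipy import stats  # assumed available; any one-sample t test routine would do

def fisher_z(r):
    """Fisher's z transformation: z_r = 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

def inverse_fisher_z(z):
    """Back-transform a z value to a correlation: r = tanh(z)."""
    return math.tanh(z)

def summarize_per_speaker_correlations(per_speaker_r):
    """Per-speaker correlations -> mean r (via z_r) plus a one-sample t test vs. zero."""
    z_scores = [fisher_z(r) for r in per_speaker_r if abs(r) < 1.0]
    t, p = stats.ttest_1samp(z_scores, 0.0)          # test the average z_r against zero
    mean_r = inverse_fisher_z(sum(z_scores) / len(z_scores))
    return mean_r, t, p

# Toy data: one correlation (e.g., FOK score vs. number of words) per speaker.
example_r = [-0.30, -0.25, -0.40, -0.10, -0.35]
print(summarize_per_speaker_correlations(example_r))
```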
Results

It appeared that the participants found a majority of the questions very easy, as they gave a maximum FOK score of 7 to 61.1% of the questions, a score of 6 to 13.4% of the questions, and lower scores to the remaining 25.5%. In addition, 71.9% of the questions of the first task were indeed answered correctly, and 89.6% of the same list of questions in the multiple-choice test.

Table 1 lists the average FOK scores as a function of Question Type (open question versus multiple choice), and the response categories (correct answers, incorrect answers, and non-answers). Speakers' mean FOK ratings were higher when they were able to produce an answer than when they were not (with subjects as random factor, F1(1,17) = 71.821, p < .001; with items as random factor, F2(1,23) = 59.028, p < .001). The mean FOK ratings were higher for correctly recalled answers than for the incorrect ones (F1(1,19) = 149.233, p < .001; F2(1,35) = 38.086, p < .001). Speakers also gave higher FOK ratings on average for responses that they later recognized in the multiple-choice test (F1(1,18) = 55.018, p < .001; F2(1,27) = 19.714, p < .001). These data thus show that there is a close correspondence between the FOK scores and the correctness or incorrectness of a response in both the open test and the multiple-choice test. The results are similar to those of Smith and Clark (1993) and Brennan and Williams (1995). In general, it can be observed that participants differ in the extent to which they use different audiovisual features, but these differences do not appear to relate to age or sex of the participants.

Table 1
Average FOK scores (and standard deviations) for different response categories

Open question:
  All answers (n = 704): 6.32 (1.27)
  Correct answers (n = 575): 6.55 (1.00)
  Incorrect answers (n = 129): 5.29 (1.72)
  All non-answers (n = 96): 3.03 (2.12)
Multiple choice:
  Correct answers (n = 717): 6.17 (1.53)
  Incorrect answers (n = 83): 3.84 (2.18)

Table 2 lists the correlation coefficients between the FOK scores and number of words, gaze acts (defined as the sum of low, high, and diverted gaze) and marked features (defined as the sum of all features minus words and gaze). It can be seen that there are negative correlations between the FOK scores and these variables for answers, and positive correlations for non-answers. In other words, for answers, higher FOK scores correspond with a lower number of words, gaze acts and marked features, and the opposite relation holds for non-answers, where higher FOK scores correspond with more words, more gaze acts, and more marked features.

Table 2
Pearson correlation coefficients of FOK scores with words (number of words), gaze acts (sum of low, high, and diverted gaze), and marked features (sum of filler, delay, high intonation, eyebrow, smile, and funny face)

  Answers (n = 20): words -.265***, gaze acts -.325***, marked features -.410***
  Non-answers (n = 13): words .695***, gaze acts .630***, marked features .690***

The n indicates for how many participants an individual correlation could be computed. *** p < .001.

An analogous picture emerges from Tables 3 and 4, which give the average FOK scores for presence versus absence of audiovisual features for answers and non-answers, respectively. Table 3 shows that the presence of a verbal feature (filler, delay or high intonation) coincides with a significantly lower FOK score. And similarly for the visual features, where the presence of an eyebrow movement, a smile, a gaze act (high, low, and diverted) or a funny face generally is associated with a lower FOK score, which is significant in all cases except for the smile. As expected, the results in Table 4 display the opposite pattern. When a verbal feature is present in a non-answer this corresponds to a significantly higher FOK score for filler and delay.
Looking at the visual cues, it can be seen that the presence of a marked feature setting is associated with a higher FOK score as well, albeit that the differences are only significant for low and high gaze, partly because of the limited number of data points here (reflected by the lower n figures for non-answers). Moreover, non-answers are perhaps inherently less likely to be associated with a high intonation, presumably because speakers do not need to question their own internal state (see, e.g., Geluykens, 1987), which is reflected in a difference score of only 0.01.

Table 3
Average FOK scores (and standard deviations) for answers as a function of presence or absence of audiovisual features; columns give Present (1), Absent (2), and Diff. (1) - (2)

Auditory:
  Filler (n = 19): 5.70 (1.00), 6.53 (0.32), -0.83***
  Delay (n = 20): 5.23 (0.80), 6.53 (0.27), -1.30***
  High intonation (n = 19): 5.91 (0.80), 6.40 (0.27), -0.49*
Visual:
  Eyebrow (n = 19): 5.78 (0.87), 6.46 (0.26), -0.69***
  Smile (n = 17): 5.69 (1.36), 6.36 (0.35), -0.67
  Low gaze (n = 20): 6.05 (0.61), 6.47 (0.33), -0.42**
  High gaze (n = 19): 5.72 (0.98), 6.52 (0.34), -0.80***
  Diverted gaze (n = 20): 5.96 (0.57), 6.64 (0.38), -0.68***
  Funny face (n = 10): 4.78 (1.19), 6.38 (0.43), -1.60**

Statistics are based on paired t test analyses. The n indicates the number of participants for which individual means could be computed. The Diff. score is obtained by subtracting the average Absent FOK score from the average Present FOK score. * p < .05. ** p < .01. *** p < .001.

Table 4
Average FOK scores (and standard deviations) for non-answers as a function of presence or absence of audiovisual features; columns give Present (1), Absent (2), and Diff. (1) - (2)

Auditory:
  Filler (n = 9): 4.98 (2.05), 2.60 (1.20), +2.38**
  Delay (n = 12): 4.21 (2.22), 2.20 (1.16), +2.01*
  High intonation (n = 5): 3.67 (1.70), 3.66 (1.21), +0.01
Visual:
  Eyebrow (n = 8): 3.76 (1.99), 2.41 (0.92), +1.35
  Smile (n = 12): 3.69 (2.33), 2.70 (1.13), +0.99
  Low gaze (n = 14): 3.96 (2.05), 2.49 (1.21), +1.47**
  High gaze (n = 10): 4.00 (1.75), 2.25 (1.50), +1.75*
  Diverted gaze (n = 9): 3.43 (1.44), 2.79 (2.14), +0.64
  Funny face (n = 4): 3.75 (2.63), 2.85 (1.44), +0.90

Statistics are based on paired t test analyses. The n indicates the number of participants for which individual means could be computed. The Diff. score is obtained by subtracting the average Absent FOK score from the average Present FOK score. * p < .05. ** p < .01.

To learn more about the cue value of combinations of features, we also calculated, for answers and non-answers separately, the average FOK scores for responses that differ regarding the number of marked feature settings (minimum: 0, maximum: 7). Note that the theoretical possibility of an utterance containing 8 or 9 marked feature settings did not occur in our data set. The results of this are visualized in Fig. 2, which again illustrates opposite trends for the two response categories: for answers, the average FOK score decreases with an increasing number of marked features, while the opposite is true for non-answers (cf. also the last row of Table 2). Table 5 lists the distribution (in percentages) of cases with a marked setting for our different auditory and visual features as a function of the relative number of marked features present (1–7). First, looking at the n values in the left column, this table reveals that there are considerable differences in relative frequency between the different features: the different gaze acts (low, high, and
especially diverted) and high intonation occur very often, as opposed to the features smile and funny face, which are more rarely used. But, in addition, the table also makes it clear that some features can relatively more often be used in isolation or with only one other feature present (like low gaze, smile, and high intonation), whereas features like funny face and especially delay tend to co-occur with a high number of marked settings for other features.

Table 5
Percentage of marked settings for different audiovisual features as a function of the relative number of marked prosodic features; columns give the number of marked feature settings (1–7) and the total

Auditory:
  Filler (n = 191): 6.3, 15.7, 18.8, 17.8, 18.8, 13.6, 8.9 (total 100)
  Delay (n = 161): 0.6, 5.0, 17.4, 24.8, 23.0, 17.4, 11.2 (total 100)
  High intonation (n = 232): 13.8, 20.7, 25.0, 15.5, 10.8, 8.2, 5.6 (total 100)
Visual:
  Eyebrow (n = 154): 11.0, 15.6, 14.3, 22.7, 14.9, 12.3, 8.4 (total 100)
  Smile (n = 98): 15.3, 17.3, 10.2, 17.3, 17.3, 14.3, 7.1 (total 100)
  Low gaze (n = 280): 22.1, 22.5, 18.2, 10.0, 12.1, 8.6, 6.1 (total 100)
  High gaze (n = 232): 3.9, 21.1, 21.1, 20.7, 13.4, 12.5, 6.9 (total 100)
  Diverted gaze (n = 409): 12.5, 24.9, 20.5, 17.4, 12.0, 8.1, 4.4 (total 100)
  Funny face (n = 36): 0, 8.3, 11.1, 19.4, 22.2, 16.7, 19.4 (total 100)

Minimum = 1, maximum = 7; combinations of 8 or more features do not occur.

Fig. 2. Average FOK scores for answers and non-answers as a function of the relative number of marked prosodic features (minimum = 0, maximum = 7; combinations of 8 or more features do not occur).

Discussion

This first study has replicated some of the findings of the research by Smith and Clark (1993), and extended them into the visual domain. It was found that our participants' FOK scores correspond with their performance in the open question and multiple-choice test, and, in addition, particular audiovisual surface forms of the utterances produced by our speakers are indicative of the amount of confidence speakers have about the correctness of their response. For answers, lower scores correlate with occurrences of long delay, fillers and high intonation, as well as a number of gaze features, eyebrow movements, funny faces, and smile. In addition, speakers tend to use more words and more gaze acts when they have a lower FOK. For non-answers, the relationship between FOK scores and the different audiovisual features is the mirror image of the outcome with answers, which is as expected given that both high FOK answers and low FOK non-answers correspond to speaker certainty. If we compare the average FOK scores with the frequency of occurrence for the various marked settings of audiovisual features, it can be seen that the presence or absence of some highly frequent features such as diverted gaze and high intonation has a relatively small impact on the average FOK score, while this effect is substantial for an infrequent feature such as funny face (especially for answers).

The first study was a speaker-oriented approach to gain insight into audiovisual correlates of FOK. While our analyses revealed that there was a statistical relationship between the two, this in itself does not prove that the audiovisual properties also have communicative relevance.
To prove this, we performed a perception study, for which we used earlier work by Brennan and Williams (1995) as our main source of inspiration.

Experiment 2

Method

Participants

Participants were 120 native speakers of Dutch, students from the University of Tilburg, none of whom had participated as speaker in the first experiment.

Materials

From the original 800 responses, we selected 60 utterances, of which 30 were answers and 30 non-answers. Of both the answers and the non-answers, 15 had high FOK and 15 low FOK scores. Following Brennan and Williams (1995), only the answer of a question–answer pair was presented to participants, to prevent participants from unconsciously using their own estimation of the question's difficulty in their perception of the answer. The selection was based on written transcriptions of the responses by someone who had not heard or seen the original responses. Given the individual differences in the use of the FOK scale, we chose to use, per speaker, the two highest scores as instantiations of high FOK scores and the two lowest as low FOK scores. The (in)correctness of the answer was not taken into account when selecting the stimuli, hence both high FOK and low FOK answers could be incorrect. We initially made a random selection of stimuli meeting these restrictions, but utterances were iteratively replaced until the following criteria were met:

1. The original question posed by the experimenter should not appear again in the participants' response.
2. All the answers should be lexically distinct, and should thus not occur twice. This criterion was not applied to the non-answers as they were very similar.
3. The responses should be maximally distributed across speakers. There should be maximally two answers and two non-answers per speaker.

Having applied this procedure on the basis of written transcriptions of the data, we finally replaced a couple of stimuli by others if the signal-to-noise ratio made them unsuitable for the perception experiment.

Procedure

The selected stimuli were presented to participants in three different conditions as a group experiment: one third of the participants saw the original clips as they were recorded (Vision + Sound), another third saw the same clips but then without the sound (Vision), whereas the last third could only hear the utterances without seeing the image (Sound). In all three conditions, stimuli were presented on a large cinema-like screen with loudspeakers to the right and left, over which the sound was played in the Vision + Sound and Sound conditions. Both the visual and the auditory information were clearly perceptible for participants. Participants in each condition were first presented with the stimulus ID (1–30), followed by the actual stimulus. In the case of the sound-only stimuli, participants saw a black screen instead of the original video recording. The motivation to present sound-only stimuli also visually was to make sure that participants were aware of the start of the utterance, in case there was a silent pause at the beginning of the utterance. The interstimulus interval was 3 s. Within a condition, participants had to judge recordings in two separate sessions, one with answers as stimuli and one with non-answers. The question to the participants about the answers was whether a person appeared "very uncertain" (1) or "very certain" (7) in his/her response. The question for the non-answer stimuli was whether participants thought the person would recognize the correct answer in a multiple-choice test, with 1 meaning "definitely not" and 7 "definitely yes." Below, the scores are referred to as the Feeling of Another's Knowing (FOAK) scores.
Each part of the experiment was preceded by a short exercise session with two answers and two non-answers, respectively, to make participants acquainted with the kinds of stimulus materials and the procedure.

Statistical analyses

The participants' responses were statistically tested with a repeated measures ANOVA with the FOAK scores as dependent variable, with original FOK scores and response type as within-subjects factors, and cue modality (Vision, Sound, and Vision + Sound) as between-subjects factor. Pairwise comparisons were made with the Tukey HSD method.

Results

Table 6 shows that there were significant effects on the participants' FOAK scores of original FOK status of the utterance (high FOK utterances receive higher FOAK scores than low FOK utterances) and of response category (answers receive higher FOAK scores than non-answers), while there was no main effect of the cue modality (Vision, Sound, and Vision + Sound). However, there were significant 2-way interactions between FOK and cue modality (F(2,117) = 54.451, p < .001) and between response and cue modality (F(1,117) = 241.597, p < .001), and a significant 3-way interaction between FOK, cue modality and response (F(2,117) = 3.291, p < .05). Split analyses showed that these main effects and interactions also hold when looking at the three cue modalities separately. For all cue modalities the difference in FOAK score between high FOK and low FOK utterances was significant (Vision: F(1,39) = 270.070, p < .001; Sound: F(1,39) = 957.515, p < .001; Vision + Sound: F(1,39) = 1516.617, p < .001), and similarly for response category (Vision: F(1,39) = 16.629, p < .001; Sound: F(1,39) = 29.090, p < .001; Vision + Sound: F(1,39) = 54.977, p < .001). There was always a significant interaction between FOK status and response category (Vision: F(1,39) = 79.179, p < .001; Sound: F(1,39) = 41.307, p < .001; and Vision + Sound: F(1,39) = 161.703, p < .001).

Table 6
ANOVA results (main effects) for perception experiment: average FOAK scores (SE) as a function of FOK, response category, and experimental condition

Within subjects:
  FOK: High 4.792 (0.035), Low 2.646 (0.035); F(1,117) = 2229.886, p < .0001
  Response: Answer 3.922 (0.030), Non-answer 3.516 (0.038); F(1,117) = 90.477, p < .0001
Between subjects:
  Cue modality: Vision 3.779 (0.047), Sound 3.669 (0.047), Vision + Sound 3.709 (0.047); F(2,117) = 1.424, p = .245

Table 7
FOAK scores (and SE) for high FOK and low FOK stimuli in different experimental conditions; columns give High FOK (1), Low FOK (2), and Diff. (1) - (2)

  Vision: 4.434 (0.061), 2.903 (0.061), 1.531
  Sound: 4.890 (0.061), 2.668 (0.061), 2.222
  Vision + Sound: 5.052 (0.061), 2.367 (0.061), 2.685

The 2-way interactions can easily be understood when we look at the average scores for combinations of FOK and cue modality, and response and cue modality (see Tables 7 and 8, respectively). The first table shows that the difference in scores for low and high FOK scores
Notice that this explains why no main eVect of experimental condition was found: the diVerences in FOAK scores between high FOK and low FOK stimuli change, while the overall FOAK averages stay the same (see Table 6). The second table shows that the diVerence between high and low FOK scores is—as expected—easier to perceive in answers than in nonanswers. A separate analysis of variance revealed that the diVerences between FOAK scores for high FOK and low FOK stimuli were signiWcantly diVerent for the diVerent cue modalities (F(2,237) D 46.248, p < .001). Post hoc pairwise comparisons using Tukey HSD tests showed that the diVerences in FOAK ratings between high FOK and low FOK stimuli for a given cue modality was signiWcantly diVerent from the diVerences in ratings for any of the other two cue modalities. Discussion The results of the second perception test are consistent with the Wndings of the Wrst analysis of speaker’s expression of uncertainty. It appears that participants are able to diVerentiate low FOK responses from high FOK responses in the unimodal experimental conditions, but they clearly performed most accurately in a bimodal condition. This suggests that the addition of visual information, which was not present in the aforementioned FOK and FOAK studies, is beneWcial for detecting uncertainty. While we had seen that answers and non-answers exhibit opposite audiovisual features, human participants are able to adapt their judgments: they are able to tell the diVerence between low and high FOK for both response categories, be it that the performance for non-answers drops compared to answers, in line with previous observations of Brennan and Williams (1995). In conclusion, this study brought to light that the audiovisual features of our original utterances have communicative relevance as they can be interpreted by listeners as cues of a speakers’ level of conWdence. Since only high and low FOK stimuli were used in Experiment 2, it would be interesting to see whether the FOAK eVect persists for middle range FOK stimuli. We hypothesize that participants would indeed note that speakers are relatively uncertain when uttering a medium FOK answer, since, generally, a lower FOK 91 answer corresponds to more audiovisual marked cues (cf. Fig. 2), but how uncertain speakers are judged remains an open question. Additionally, the current experiment is a judgement study in which participants are speciWcally asked to rate the perceived FOK of a speaker’s utterance. It would be interesting to see if and how people actually use audiovisual prosody as a cue for FOK during natural conversations. We conjecture that such cues will also be picked up and interpreted correctly during ‘on line’ communication, and hope to address this in future research. General discussion In response to a question, speakers are not always equally conWdent about the correctness of their answer. It has been shown that speakers may signal this using verbal prosody (Brennan & Williams, 1995; Smith & Clark, 1993). For instance, a response to the question “Who wrote Hamlet?” might be “[pause] uh Shakespeare?” where the delay signals a prolonged memory search, and the Wller and the high (question-like) intonation both signal relatively low conWdence from the metacognitive monitoring component. 
In this paper, we have described two experiments which extend these observations into the visual domain, showing that there are clear visual cues for a speaker's feeling of knowing and that listeners are better able to estimate their feeling of another's knowing on the basis of combined auditory and visual information than on the basis of auditory information alone.

In the first experiment we used Hart's (1965) paradigm to elicit speaker utterances plus their accompanying FOK scores. The data collected in this way were analyzed in terms of a number of audiovisual cues. This analysis confirmed earlier work by Smith and Clark (1993) and Brennan and Williams (1995) in that low FOK answers are typically longer than high FOK answers, and tend to contain more marked auditory cues (fillers, longer delay, and high intonation). Interestingly, we found that the visual features show a comparable pattern. Focussing first on answers, we found that marked facial expressions (diverging from the 'neutral' face) such as smiles, gaze acts, eyebrow movements, and 'funny faces' are more likely to occur in low than in high FOK answers. In general, when a speaker gives a high FOK answer to a question, this response is produced in a straightforward way, and marked audiovisual cues are not produced. When a speaker is uncertain of the answer or has difficulty retrieving it, however, various things may happen in parallel; low FOK answers tend to require a longer search, which may trigger a filler, but may also be accompanied by various gaze acts. Both fillers and gaze acts tend to occur during memory search, before the actual answer is uttered. The gazing might partly be explainable from the common observation that it is easier to 'think' when not looking at a communication partner. We found that gazing typically occurred in various directions, both along the left-right and up-down axis; in these data no single gaze act appeared to fulfil a designated function. That a speaker is uncertain about the correctness of a retrieved answer can be indicated auditorily, via a rising, question-like intonation, but also visually, for instance, by raising the eyebrows (which has been interpreted as the visual counterpart to a rising intonation, see, e.g., Bolinger, 1985). Similarly, the presence of a 'funny face' (often, but not always, timed immediately after the answer has been uttered) was also typically associated with a lower FOK score, clearly functioning as a cue for speaker uncertainty. Smiles, however, appeared to be more ambiguous in this respect, as some of our speakers smiled when they were asked an 'easy' question for which they had no difficulty retrieving the answer, while others smiled in response to questions they considered 'extremely' difficult or for which they felt they should know the answer.

The earlier studies by Smith and Clark (1993) and Brennan and Williams (1995) revealed an asymmetry between answers and non-answers in that low FOK answers look similar to high FOK non-answers (in both cases the speaker is uncertain on a meta-cognitive level) and that high FOK answers look similar to low FOK non-answers, both in prosodic terms (in both cases the speaker is certain). The current experiment confirmed this asymmetry, and extended it into the visual realm. Thus, the presence of smiles, brow movements, gaze acts, and a funny face is less likely for high FOK answers and for low FOK non-answers.
Data sparseness presumably resulted in less significant effects for the non-answers than for the answers (in addition, non-answers are lexically more similar than answers and are perhaps inherently less likely to end on a high boundary tone).

In the second experiment, participants watched speaker utterances in one of three conditions (Sound only, Vision only, and Vision + Sound) and had to judge the speaker's FOK. It turned out that overall participants are good at estimating the FOK of other speakers, where the resulting FOAK scores are somewhat more accurate for answers (high FOK answers receive high FOAK scores and low FOK answers receive low FOAK scores) than for non-answers. Interestingly, FOAK judgments are significantly more accurate in the bimodal (Vision + Sound) condition than in the respective unimodal ones. This indicates that the presence of visual information is actually beneficial for FOAK judges. If we compare the results for the Sound only and the Vision only conditions, it appears that overall subjects made better use of auditory than of visual cues for the perception of uncertainty. In terms of multisensory perception, the relative dominance of auditory cues for FOK perception appears to be similar to the role of auditory cues for information presentation. However, it should be noted that visual expressions such as funny faces and eyebrow movements occurred relatively infrequently in the second experiment, but when they did occur they seemed to offer a very strong cue for FOAK estimations. Based on this observation we recently performed an additional perception experiment, consisting only of incongruent stimuli combining low FOK facial expressions (funny faces) with high FOK speech prosody and vice versa. Participants were asked how certain the speaker appeared in his or her response. The results indicate that for these incongruent stimuli the facial expression in most cases has the largest impact on the FOAK score, which is comparable to the aforementioned multisensory perception results for emotion.

Acknowledgments

This research was conducted as part of the VIDI project "Functions of Audiovisual Prosody (FOAP)," sponsored by the Netherlands Organisation for Scientific Research (NWO), see foap.uvt.nl. Swerts is also affiliated with the Flemish Fund for Scientific Research (FWO-Flanders) and Antwerp University. Krahmer's work was carried out within the IMIX project "Interactive Multimodal Output Generation" (IMOGEN), sponsored by the Netherlands Organisation for Scientific Research (NWO). Many thanks to Judith Schrier (Antwerp) and Jorien Scholze (Tilburg) for their help in carrying out the experiments. Many thanks also to Pashiera Barkhuysen and Lennard van de Laar for help with the annotation and for technical assistance, to Annemarie Krahmer-Borduin, Per van der Wijst, and Carel van Wijk for help with the statistics and the test questions, and to the three anonymous reviewers for their useful comments on a previous version of the manuscript.

Appendix A

English translations of the Dutch questions used in the first experiment, as they were presented to participants in one of the two random orders.

1. What do we call the sticks used in golf?
2. Who made the drawings for "Jip and Janneke"?
3. The Sahara lies in which continent?
4. Which novel about a knight is the most reprinted book after the Bible?
5. How many months does it take the moon to circle the earth?
6. What does the abbreviation 'Fl' for the Dutch guilder stand for?
7. What is the largest mammal?
8. What is the name of the gang of robbers that terrorized Limburg in the 18th century?
9. Who, according to legend, was the bishop of Myra?
10. In which Dutch quiz show are the contestants awarded with a toy monkey for each good answer?
11. What is the highest mountain of the Alps?
12. Who wrote Faust?
13. What is the chemical symbol of water?
14. What does the word 'Jihad' mean?
15. What color of light is used on the starboard side of a boat?
16. What is Rembrandt's last name?
17. Which television series is about the Forrester and Spectra families?
18. Guido Gezelle was a famous man. What was his occupation?
19. What is the boiling temperature for water?
20. In which wind-direction does one travel from Amsterdam to Brussel?
21. What is the name of the cartoon character who owns the dog Pluto?
22. Egypt lies in which continent?
23. Who is the head of state of the Vatican?
24. Who wrote "The discovery of heaven"?
25. What is a "Friese doorloper"?
26. Brazil lies in which continent?
27. What is the pseudonym of the Mexican Don Diego de la Vega?
28. Who wrote Hamlet?
29. Which rocker is also known as "The King"?
30. How many darts is a player allowed to throw in one turn?
31. In which wind-direction does one travel from London to Berlin?
32. Which disease was known during the Middle Ages as "The Black Death"?
33. What is the capital of Switzerland?
34. Supporters of which football club sing "Geen woorden maar daden"?
35. How many degrees are in a circle?
36. Approximately, how many people live in the Netherlands and Belgium?
37. In which country did the Incas live?
38. Which person from the Bible went to look for mustard?
39. Which Dutch soap series has been running on television for the longest period?
40. Who wrote the Iliad?

References

Andersen, J. (1999). Cognitive psychology and its implications (5th ed.). London: Worth Publishing.
Argyle, M., & Cook, M. (1976). Gaze and mutual gaze. Cambridge: Cambridge University Press.
Barkhuysen, P., Krahmer, E., & Swerts, M. (2005). Problem detection in human-machine interactions based on facial expressions of users. Speech Communication, 45, 343–359.
Benoît, C., Guiard-Marigny, T., Le Goff, B., & Adjoudani, A. (1996). Which components of the face do humans and machines best speechread? In D. Stork & M. Hennecke (Eds.), Speechreading by humans and machines: Models, systems, and applications (pp. 315–328). New York: Springer-Verlag.
Bolinger, D. (1985). Intonation and its parts. London: Edward Arnold.
Brennan, S. E., & Williams, M. (1995). The feeling of another's knowing: Prosody and filled pauses as cues to listeners about the metacognitive states of speakers. Journal of Memory and Language, 34, 383–398.
Bugenthal, D., Kaswan, J., Love, L., & Fox, M. (1970). Child versus adult perception of evaluative messages in verbal, vocal, and visual channels. Developmental Psychology, 2, 367–375.
Clark, H. (1996). Using language. Cambridge: Cambridge University Press.
Clark, H., & Fox Tree, J. (2002). Using uh and um in spontaneous speech. Cognition, 84, 73–111.
Clark, H., & Krych, M. (2004). Speaking while monitoring addressees for understanding. Journal of Memory and Language, 50, 62–81.
Cruttenden, A. (1986). Intonation. Cambridge: Cambridge University Press.
de Gelder, B., & Vroomen, J. (2000). The perception of emotions by ear and by eye. Cognition and Emotion, 14(3), 289–311.
Ekman, P. (1999). Facial expressions. In T. Dalgleish & T. Power (Eds.), The handbook of cognition and emotion (pp. 301–320). Sussex, UK: John Wiley & Sons Ltd.
Ekman, P., & Friesen, W. V. (1978). The facial action coding system. Palo Alto: Consulting Psychologists' Press.
Farah, M. J. (2000). The neural bases of mental imagery. In M. Gazzaniga (Ed.), The cognitive neurosciences (2nd ed., pp. 965–974). Cambridge: MIT Press.
Frick-Horbury, D. (2002). The use of hand gestures as self-generated cues for recall of verbally associated targets. American Journal of Psychology, 115, 1–20.
Gale, C., & Monk, A. (2000). Where am I looking? The accuracy of video-mediated gaze awareness. Perception & Psychophysics, 62, 586–595.
Geluykens, R. (1987). Intonation and speech act type. An experimental approach to rising intonation in declaratives. Journal of Pragmatics, 11, 483–494.
Glenberg, A., Schroeder, J., & Robertson, D. (1998). Averting gaze disengages the environment and facilitates remembering. Memory & Cognition, 26, 651–658.
Goffman, E. (1967). Interaction ritual: Essays on face to face behavior. Garden City, NY: Doubleday.
Goodwin, M. H., & Goodwin, C. (1986). Gesture and coparticipation in the activity of searching for a word. Semiotica, 62, 51–75.
Griffin, Z. (2001). Gaze durations during speech reflect word selection and phonological encoding. Cognition, 82, 1–14.
Hart, J. T. (1965). Memory and the feeling-of-knowing experience. Journal of Educational Psychology, 56, 208–216.
Hess, U., Kappas, A., & Scherer, K. (1988). Multichannel communication of emotion: Synthetic signal production. In K. Scherer (Ed.), Facets of emotion: Recent research (pp. 161–182). Hillsdale, NJ: Erlbaum.
Jameson, A., Nelson, T. O., Leonesio, R. J., & Narens, L. (1993). The feeling of another person's knowing. Journal of Memory and Language, 32, 320–335.
Jordan, T., & Sergeant, P. (2000). Effects of distance on visual and audio-visual speech recognition. Language and Speech, 43(1), 107–124.
Keating, P., Baroni, M., Mattys, S., Scarborough, R., Alwan, A., Auer, E., et al. (2003). Optical phonetics and visual perception of lexical and phrasal stress in English. In Proceedings of the International Conference of Phonetic Sciences (ICPhS) (pp. 2071–2074), Barcelona, Spain.
Kendon, A. (1994). Do gestures communicate? A review. Research on Language and Social Interaction, 27, 175–200.
Kikyo, H., Ohki, K., & Miyashita, Y. (2002). Neural correlates for feeling-of-knowing: An fMRI parametric analysis. Neuron, 36(1), 177–186.
Kohlrausch, A., & van de Par, S. (2005). Audio-visual interaction. In J. Blauert (Ed.), Communication acoustics. Heidelberg: Springer (in press).
Koriat, A. (1993). How do we know what we know? The accessibility account of the feeling of knowing. Psychological Review, 100, 609–637.
Krahmer, E., & Swerts, M. (2004). More about brows. In Zs. Ruttkay & C. Pelachaud (Eds.), From brows to trust: Evaluating embodied conversational agents (pp. 191–216). Dordrecht: Kluwer Academic Press.
Krauss, R., Chen, Y., & Chawla, P. (1996). Nonverbal behavior and nonverbal communication: What do conversational hand gestures tell us? Advances in Experimental Social Psychology, 28, 389–450.
Kraut, R., & Johnson, R. (1979). Social and emotional messages of smiling: An ethological approach. Journal of Personality and Social Psychology, 37, 1539–1553.
Ladd, D. R. (1996). Intonational phonology. Cambridge: Cambridge University Press.
Lawrence, B., Myerson, J., Oonk, H., & Abrams, R. (2001). The effects of eye and limb movements on working memory. Memory, 9, 433–444.
Massaro, D., & Egan, P. (1996). Perceiving affect from the voice and the face. Psychonomic Bulletin and Review, 3, 215–221.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
McNeill, D. (1992). Hand and mind: What gestures reveal about thought. Chicago: Chicago University Press.
Mehrabian, A., & Ferris, S. (1967). Inference of attitudes from nonverbal communication in two channels. Journal of Consulting Psychology, 31, 248–252.
Morgan, B. (1953). Question melodies in American English. American Speech, 2, 181–191.
Morsella, E., & Krauss, R. (2004). The role of gestures in spatial working memory and speech. The American Journal of Psychology, 117, 411–424.
Munhall, K., Jones, J., Callan, D., Kuratate, T., & Vatikiotis-Bateson, E. (2004). Visual prosody and speech intelligibility. Psychological Science, 15, 133–137.
Nelson, T. (1996). Consciousness and metacognition. American Psychologist, 51, 102–116.
Reder, L., & Ritter, F. (1992). What determines initial feeling of knowing? Familiarity with question terms, not with the answer. Journal of Experimental Psychology: Learning, Memory and Cognition, 18, 435–452.
Sabbagh, M., & Baldwin, D. (2001). Learning words from knowledgeable versus ignorant speakers: Links between preschoolers' theory of mind and semantic development. Child Development, 72(4), 1054–1070.
Smith, V. L., & Clark, H. H. (1993). On the course of answering questions. Journal of Memory and Language, 32, 25–38.
Srinivasan, R., & Massaro, D. (2003). Perceiving prosody from the face and voice: Distinguishing statements from echoic questions in English. Language and Speech, 46, 1–22.
Sumby, W., & Pollack, I. (1954). Visual contribution to speech intelligibility. Journal of the Acoustical Society of America, 26, 212–215.
Swerts, M., & Krahmer, E. (2004). Congruent and incongruent audiovisual cues to prominence. In Proceedings of Speech Prosody (pp. 271–274), Nara, Japan.
Walker, A., & Grolnick, W. (1983). Discrimination of vocal expressions by young infants. Infant Behavior and Development, 6, 491–498.