AUTEUR: The Application of Video Semantics and Theme Representation for Automated Film Editing

Submitted for the Degree of Ph.D.
August 1996
by Frank-Michael Nack (M.Sc.)

Abstract

This thesis presents a planner-based approach to the application of video semantics and theme representation to the automated editing of visual stories at the level of events. The research draws on film theory (to define methods for automated editing, and to model the fundamental units of the image and the conceptual relationships between image, shot and sequence); on narrative and humour theory (in an attempt to automatically generate emotion-provoking and credible film narrative); and on Artificial Intelligence (planning, story generation, and knowledge representation). The aim of the research is to define techniques to automatically assemble video sequences that realise a given theme, ultimately to be used to assist in the presentation or interpretation of video. This thesis introduces representational structures for the semantic, temporal and relational features of video; representations and strategies to support automated editing, in combination with a simplified model of the film editing process; and representations for narrative structures such as actions, events, and emotional codes. These representations and strategies form the basis of an intelligent prototype system, AUTEUR, which generates humorous, non-verbal video sequences. AUTEUR is also described in this thesis.

For my little rascals Annika, Merlin and Nuredin

Acknowledgements

First, and above all, I would like to thank my supervisor Alan Parkes, who allowed me to follow my research ideas while supporting me with insightful discussions and suggestions. Alan, you taught me much more than how to become a researcher, and I am thankful that your financial support made it possible for me to attend conferences. I am also profoundly grateful for your humour, which cheered me up when Teutonic gloom darkened into Götterdämmerung. It was a pleasure working with you.

Many thanks to my examiners, Philippe Aigrain and Paul Brna, who both agreed to take this extra work on despite their loaded schedules. An extra sprinkling of chocolate on top for Paul, for sharing his deep insight into Prolog with me, and for many, many hours of enlightened discussions, patient advice and encouragement.

I wish to thank Lancaster University and SECAMS (School of Engineering, Computing and Mathematical Sciences) for supporting my research through their sponsorship. I am very grateful to the WDR for supporting my work by offering access to their practical editing sessions. I wish also to express thanks to everyone from EB- und Filmbearbeitung for the friendly atmosphere I enjoyed there. Some people deserve special mention: Mrs. Edith Perlaki and Mrs. Britta Sörensen, who took more time out of their busy schedules to discuss film and comment on their work than I dared expect, and Mrs. Gailenheuser and Mrs. Zeusch-Fren, for giving me an insight into the editing of news, trailers and smaller features. Finally, many thanks to Mrs. Unverdroß, who used her administrative talents to allow me to work free of worries.

I owe a lot to Séverine Menu and Philippe Robin, who helped me to translate Gilles Bloch's thesis into English. Many thanks to my office mate, Sean, a faithful friend and intellectual companion.
There are no words enough to express my thanks to my friends for their love and devotion along this journey: Yasemin, Sakis, Dankmute, Angeliki, Raquel, Séverine, Mark, Tamsin, Paola, Stuart, Marco, Sarah, John, Michael, Nora, Lars, Judith and Jürgen. You have given me the best years of my life.

Most of all, I thank my parents, who have never ceased to encourage and support me since the beginning. I know their love and dedication is the key to what I have become, and I am most grateful to have them as role models for optimism and integrity.

Contents

Chapter 1  Introduction
  1.1 Automated video editing: a scenario
  1.2 Methodologies
    1.2.1 Artificial Intelligence (AI)
    1.2.2 Film theory
    1.2.3 Narrative theory
    1.2.4 Humour theory
  1.3 The thesis: an overview
Chapter 2  Narrativity
  2.1 Narrative principles
    2.1.1 Theme
    2.1.2 Order
      2.1.2.1 The event
      2.1.2.2 The relation between events
    2.1.3 Time
    2.1.4 Space
  2.2 From structure versus content to structure and content
Chapter 3  Humour
  3.1 Humour - assumptions and definitions
    3.1.1 Cognitive-Perceptual Class
    3.1.2 Social-Behavioural Class
    3.1.3 Psychoanalytical Class
  3.2 Humour primitives
    3.2.1 Readiness
    3.2.2 Timing
    3.2.3 Exaggeration
    3.2.4 Incongruity
    3.2.5 Derision
  3.3 Humour strategies
  3.4 Evaluation of comedy
  3.5 Conclusion
Chapter 4  Film
  4.1 Cinematic meaning
  4.2 Phenomenological Approaches to film
    4.2.1 Film Image
      4.2.1.1 The sign
      4.2.1.2 Sign and idea
    4.2.2 Film Movement
      4.2.2.1 From frame to shot
      4.2.2.2 Montage, or the semantics of fragmentation
      4.2.2.3 The sequence - film relationship
  4.3 A model of film editing
  4.4 Conclusion
Chapter 5  The representation of video content
  5.1 Related work
    5.1.1 Bloch and his machine for audio-visual editing
    5.1.2 Parkes and CLORIS
    5.1.3 Aguierre-Smith and the Stratification System
    5.1.4 Semantic and conceptual indexing for video
    5.1.5 Davis and Media Streams
  5.2 An ontology for the representation of film content
    5.2.1 General concepts and assumptions
    5.2.2 Cinematographic devices
    5.2.3 Denotative aspects
      5.2.3.1 Character, object and action
      5.2.3.2 Settings: space, time and lighting
    5.2.4 Conclusion
  5.3 Technical environments for content annotation
Chapter 6  The representation of knowledge for automated editing
  6.1 Shot editing: Mixage and Cut
  6.2 Spatial and temporal continuity in editing: the 180° system
  6.3 Related work
    6.3.1 Splicer
    6.3.2 Bloch's machine for audio-visual editing
  6.4 A novel approach to automated video editing
    6.4.1 Plot requirements
    6.4.2 Shot intention and the shape of the awareness space
    6.4.3 Automated establishment and maintenance of content space over several shots
    6.4.4 The influence of action on continuity editing
    6.4.5 The comparison of surrounding content space and graphical pattern
    6.4.6 Temporal and rhythmical relations between shot A and shot B
      6.4.6.1 Preliminary remarks
      6.4.6.2 Temporal clipping for action expansion
      6.4.6.3 Temporal clipping for the temporal equivalence of actions
      6.4.6.4 Temporal clipping for action contraction
      6.4.6.5 Rhythmical shaping of a sequence
    6.4.7 Conclusion
Chapter 7  The representation of narrative and thematic knowledge
  7.1 Approaches to knowledge representation
    7.1.1 Quillian and semantic networks
    7.1.2 Miller, Bateman, Lenat and large databases of semantic relations
    7.1.3 Haase's approach to memory-based representations
    7.1.4 Schank's conceptual dependencies and dynamic memory
  7.2 Knowledge representation to support the creation of emotion provoking narrative sequences
    7.2.1 Actions
      7.2.1.1 Conceptual structure
      7.2.1.2 Semantic relations
    7.2.2 Abstract concepts
      7.2.2.1 Emotions
      7.2.2.2 Visualisations
    7.2.3 Events
    7.2.4 Conclusion
Chapter 8  AUTEUR: An architecture for automated video story generation
  8.1 Related work
    8.1.1 Sack & Davis' video generator IDIC
    8.1.2 Bloch's machine for audio-visual editing
  8.2 A proposed architecture for the editing of theme oriented video stories
    8.2.1 Overview
    8.2.2 The video database
    8.2.3 The video representation
    8.2.4 The Knowledge Base
    8.2.5 The Editor - a controller module
      8.2.5.1 The Structure Planner
      8.2.5.2 The Content Planner
      8.2.5.3 The Visual Designer
      8.2.5.4 The Visual Constructor
    8.2.6 The Retrieval System and Interface
  8.3 Conclusion
Chapter 9  The operation of AUTEUR: Show me a joke
  9.1 The banana skin joke
    9.1.1 Preparation phase
    9.1.2 Motivation phase
    9.1.3 Realisation phase
    9.1.4 Resolution phase
  9.2 The lamp post joke
    9.2.1 Preparation phase
    9.2.2 Realisation phase
    9.2.3 Resolution phase
  9.3 The bus joke
    9.3.1 Preparation phase
    9.3.2 Realisation phase
    9.3.3 Resolution phase
  9.4 Conclusion
Chapter 10  Achievements and conclusions
  10.1 Achievements
  10.2 Conclusions
  10.3 Postscript
Bibliography
Filmography
Appendix

Pictures, Figures, Tables

Pictures
Picture 4.1 Ruth Gordon in Polanski's Rosemary's Baby (1968)
Picture 4.2 An image taken from Spike Lee's Do the Right Thing (1989)
Picture 4.3 Liv Ullmann in Bergman's Shame (1968)
Picture 4.4 Image from Bertolucci's The Last Emperor (1987)

Figures
Figure 2.1 The structure of communication (based on Tudor (1974, p. 31))
Figure 2.2 The narrational process (based on Bordwell (1985, p. 50))
Figure 2.3 Relationships between plot structures
Figure 2.4 Relationship between narrative elements (adapted from Chatman (1978, p. 26))
Figure 3.1 Relationship between incongruity and narrative structures
Figure 3.2 Relationship between derision and narrative structures
Figure 4.1 Syntagmatic and paradigmatic structures of clothing (Monaco, 1981, p. 341)
Figure 4.2 The compositional and interpretational structures that make up the image (based on Monaco (1981, pp. 144-145))
Figure 4.3 Syntagmatic categories of visual material (based on Monaco (1981, p. 145))
Figure 4.4 Influential communication factors (based on Tudor (1974, p. 31))
Figure 4.5 Simplified model of the film editing process
Figure 5.1 Bloch's shot representation (Bloch, 1986, p. 149)
Figure 5.2 Layers of annotations for a 100 frame shot
Figure 5.3 FRAMER structure for Fido the Wonder Dog's legs (taken from Davis (1995, p. 137))
Figure 5.4 Actions annotated in layers in a 100 frame shot
Figure 5.5 Relevant shot segment for a query for all three actions
Figure 6.1 The 180° system (based on Bordwell & Thompson (1993))
Figure 6.2 Schematic description of the POV shot (based on Bordwell & Thompson (1993, p. 273))
Figure 6.3 Plot requirements for the editing process
Figure 6.4 Conceptual relationship between the space of visual awareness and narrative functionality
Figure 6.5 Memory structure for spatial relationships between subjects over a number of shots
Figure 6.6 Influence of sequence decomposition on the number and order of shots
Figure 6.7 Trimming of a shot from 140 to 108 frames
Figure 7.1 Semantic subnet for the action "walk"
Figure 7.2 The relation between the narrative logic and the choice of visual material used to represent it
Figure 7.3 Semantic subnet
Figure 8.1 GPS operator "threaten renewed violence" in IDIC (Sack & Davis, 1994, p. 5)
Figure 8.2 Bloch's editing model (Bloch, 1986, p. 133)
Figure 8.3 Tasks performed by AUTEUR
Figure 8.4 Proposed architecture for the creation of a visual story of emotional impact
Figure 9.1 Startshot for the banana skin joke
Figure 9.2 Result of startshot analysis for the banana skin joke
Figure 9.3 Motivation Sequence-Structure for the banana skin joke
Figures 9.4 - 9.6 Three possible motivation shots for the banana skin joke
Figure 9.7 Event shot for the banana skin joke
Figure 9.8 Status of the banana skin joke after the end of the motivation phase
Figure 9.9 Realisation Sequence-Structure for the banana skin joke
Figure 9.10 Realisation part for the banana skin joke, generated out of two shots
Figure 9.11 Status of the realisation phase of the banana skin joke
Figure 9.12 Resolution Sequence-Structure for the banana skin joke
Figure 9.13 Retrieved shot for the realisation phase of the banana skin joke
Figure 9.14 The banana skin joke generated by AUTEUR
Figure 9.15 Startshot for the lamp post joke
Figure 9.16 Result of startshot analysis for the lamp post joke
Figure 9.17 Realisation Sequence-Structure for the lamp post joke
Figure 9.18 Realisation part for the lamp post joke
Figure 9.19 The lamp post joke generated by AUTEUR
Figure 9.20 Startshot sequence for the bus example
Figure 9.21 Realisation Sequence-Structure for the bus joke
Figure 9.22 Realisation sequence for the bus joke
Figure 9.23 The bus joke as generated by AUTEUR

Tables
Table 3.1 Classification of humour strategies
Table 4.1 Tudor's paradigm of cinematic meaning (Tudor, 1974, p. 128)
Table 5.1 Representational structure for cinematographic devices
Table 5.2 Substructure "character appearance"
Table 5.3 Substructure "actor action"
Table 5.4 Substructure "object"
Table 5.5 Substructure "relations"
Table 5.6 Substructure "deep-space composition"
Table 5.7 Substructure "setting"
Table 6.1 Relationship between camera distance and size of presented content space
Table 6.2 Spatial relationships between shots A and B in terms of camera distance
Table 6.3 Relationship between camera distance and hierarchical representation level of subjects
Table 7.1 Conceptual structure for a representation of the action "slip"
Table 7.2 Simplified conceptual structure for the abstract object "time"
Table 7.3 Representation of an emotional doublet
Table 7.4 Simplified representation of the emotion class "pleasure"
Table 7.5 Structure of an event, i.e. "meeting"
Table 7.6 Actions in the event "getting coffee"
Table 9.1 Conceptual structure for the event "meeting"
Table 9.2 Conceptual structure for a representation of the action "slip"
Table 9.3 Structure of an event "using_transport"
Chapter I
Introduction

One of the strongest memories of my childhood is of my first film experience, Disney's animated film Snow White and the Seven Dwarfs. Since then, I have been fascinated by the mystique of this artificial world of light and imagination, and consequently hours upon hours of my youth were taken up by watching TV, or sitting in the darkness of a cinema to follow the action on the screen. Astonishingly enough, film has never been part of the educational system as I know it, though the medium is one of the strongest intellectual and emotional influences in our culture. Curricula were designed around the traditional subjects of arts and science, and I remember that only during the last year of school did we see some films in our German course. We discussed the content of these films in comparison with the novels from which they had been adapted, and that was as far as film analysis went. Unfortunately, skills for creating, manipulating and analysing film were not developed.

Today the situation could be different. Access to tools for creating, manipulating and playing with moving images is much more widespread. Due to rapid developments in entertainment technology (e.g. camcorders, TVs, video recorders) and computer hardware, it is now affordable to produce videos at home, in the office or at school. Moreover, research has produced computer-based environments to assist with the editing, annotation and retrieval of video (see Aigrain & Joly (1994); Aigrain, Joly & Longueville (1995); Bloch (1986); Chakravarthy (1994); Davis (1995); Gordon & Domeshek (1995); Parkes (1989a); Pentland et al. (1994); Picard & Minka (1995); Sack & Davis (1994); Sack & Don (1993); Tonomura et al. (1994); Ueda et al. (1993); Yeung et al. (1995)). Further developments in environments such as the Internet will, in the near future, provide users with even more possibilities to create films, as users will be able to access large video databases, which offer them the kind of visual material they cannot shoot or synthesise themselves.

Visual communication is a creative process, like writing a letter or telling a story. Whereas for written or spoken language we are usually experienced in both producing and receiving information, for visual statements we are consumers rather than producers. Nevertheless, while we usually understand what a film is showing us, we do not understand it on the basis of our knowledge of the filmic mechanisms used, but rather because we understand the film as a unified structure that results from the composition of all its elements (Metz, 1974). It is not uncommon, therefore, that after seeing a film, people wonder why it affected them in the way it did, and they wish that someone could tell them. Thus, we face the ironic situation that while there are more possibilities than ever to become creative in a visual sense, most people still lack the necessary skills, i.e. the "selection, timing and arrangement of given shots into a film continuity" (Reisz & Millar, 1969, p. 15), to make video part of their daily communication. The aim of my research is to change this situation through the development of tools to support the process of transforming ideas into appropriate visual presentations, or to assist in the understanding of video documents, whether for professional or personal purposes.
1.1 Automated video editing: a scenario

Let us look at a hypothetical and idealised intelligent video desktop publishing system of the future, in order to understand the motivation behind my research. Suppose a user wishes to create a multimedia document that describes the work of a particular director. The document is to provide an understanding of the director's work and its specific filmic expression. Though the user has access to a database of digitised video material of work by the director, and about the director (e.g. film essays), there will be the need to summarise some of the material visually. Such summaries can be automatically generated by the system, based on specifications made by the user. For example, the user may specify the sequences to be combined and the time span of the summary.

In cases where users edit sequences themselves, there is supporting software available which, analogous to spell checkers and grammar checkers, offers the user feedback regarding whether the created sequence is visually acceptable, and if that is not the case, how it can be improved. In the latter case, the system may itself generate a sequence, so that the user can compare it with his or her own, thus providing feedback which may influence the nature of the overall work.

Sometimes the user will have problems in understanding a particular filmic technique or presentational style, e.g. there may be a problem with a particular cut, or uncertainty may arise as to how the emotional effect of a particular sequence should be achieved. In such cases, the system must be able to dynamically generate appropriate visual examples in order to clarify the problem. Additionally, there may be options for the user to experiment with the particular filmic technique or style, which requires that the system can analyse and criticise the results achieved by the user.

To deal adequately with the above situations, it is essential that the system is equipped with editing techniques and content-based representations of video material. The latter are particularly important, as editing requires content-based retrieval of video material. Our research focuses on the definition of methods to support automated editing of video, based on representations of the semantics and syntax of video, and on the conceptual representation of thematic goals and narrative. To reduce the complexity of this open-ended research programme, we have set ourselves the initial goal of producing short sequences of video that realise the theme of humour. Our approach draws on diverse research disciplines, each contributing ideas and/or technologies, and has resulted in a prototype system, AUTEUR, that has achieved a limited degree of success in producing humorous film sequences. The system is far from complete, and it should be appreciated that what is being presented in this thesis is but a small step toward the goal of intelligent video editing systems. Nevertheless, one real contribution of this research is the conceptual representation of the many types of knowledge and skills required by video editing.

1.2 Methodologies

This thesis concerns the application of video semantics and theme representation to the automated editing of visual stories at the level of events. Such an endeavour is, of necessity, of a multidisciplinary nature, and draws on research results from four distinct fields: Artificial Intelligence (AI), film theory, narrative theory and humour theory, as is now briefly described.
1.2.1 Artificial Intelligence (AI)

As stated above, the goal of this thesis is to provide a machine with the ability to autonomously generate emotionally stimulating visual events. Since our aim is the computer modelling of a hitherto human task, many of the problems we face are those addressed by AI, which attempts to study 'mental faculties through the use of computational models' (Charniak & McDermott, 1985, p. 6). In order to understand the functions of human memory, perception and emotion, research in AI has developed representational techniques and mechanisms enabling a computer to perform tasks such as story understanding, deduction, information retrieval and planning. Our aim is to provide a machine with appropriate knowledge and inference mechanisms so that it can plan events with the intention of making them humorous, and then present the events in the most visually credible way. Therefore, we discuss the relevant AI techniques for search, planning and story generation. In particular, the work of Parkes (1989a) on the representation of video content, Bloch (1986) on automated video editing, and Schank and his students (Lehnert (1983); Lehnert et al. (1983); Riesbeck & Schank (1989); Schank & Abelson (1977); Schank (1982); Wilensky (1983a, 1983b)) on story understanding and memory-based representation has contributed to the thesis in this respect.

1.2.2 Film theory

As our work is devoted to the definition of methods to perform automated editing of film, we use the theoretical constructs provided by film theory, and the practical experience of film makers and editors, to identify and model the fundamental units of the image, and the conceptual relationships between image, shot and sequence. In keeping with the initial goal of this thesis, i.e. the automatic production of short sequences of video that realise the theme of humour, we focus our investigation of film theory on narrative film making, which also influences our point of view on editing, in that we concentrate on the "continuity editing" style. Though we touch on the "is film a language" debate in this thesis, we concentrate on a semiotically oriented approach to the description of film content, mainly influenced by the work of Eco (1977, 1985), Jakobson & Halle (1980) and Peirce (1960). The most pervasive influence of film theory on our work, with regard to film editing and its influence on the description of film structure, has been that of the formative movement, represented by Eisenstein (1988, 1991), Kuleshov (1974) and Vertov (as described in Petric (1987)), and the work on context and order by the cognitive psychologist Gregory (1961).

1.2.3 Narrative theory

In attempting to automatically generate filmic narrative, we provide an abstract computational model of the causal and temporal relationships between events, people and objects, drawing particularly on the ideas of the Russian Formalist movement (see Lemon & Reis (1965)), which aimed at classifying the artistic skills applied in the process of creating a piece of prose, literature or film. Though directed at a receiver, narration is first of all a formal communication system for the dynamic interaction between narrator, content and audience, influenced by social, cultural and medium-specific entities. As our aim is to generate film sequences that provoke an emotional reaction, we are particularly interested in the influence the narrative has on the ways in which the viewer understands the film's portrayal of events.
The theoretical analyses of Bordwell (1985, 1989; Bordwell & Thompson, 1993), Chatman (1978) and Tudor (1974) have contributed to this thesis in this respect.

1.2.4 Humour theory

As stated above, the goal is to automatically create emotion-provoking video sequences. The emotional reaction we have in mind is laughter. The decision to use humour as our target was based on two factors. Firstly, comedy is appreciated by most people, and thus it seems reasonable to use humour as the reason for telling a story. Secondly, research on the problem of automatic generation of visual humour would help in the evaluation of the many theories of humour that have arisen in such diverse research fields as psychology, philosophy and linguistics.

1.3 The thesis: an overview

We have presented the overall aim of our research, and briefly described the diverse influences on it. We now give an overview of the structure of this thesis. Note that, due to the multidisciplinary nature of our work, the relevant "literature review" for each domain has been integrated into the relevant chapter.

Chapters 2, 3 and 4 provide the theoretical analyses of narrativity, humour and film. These specify the essential elements of each domain that are required to support the process of automatically editing visual stories.

Chapter 2 identifies and analyses the key narrative principles, i.e. theme, order, event, time and space, and the relationships between them. The results of the analysis lead to a critique of the merely structural approach to computer-based story understanding and generation, as proposed by AI research in the form of story grammars. In place of this approach, a multi-dimensional planning approach to visual story generation is suggested, which supports the two main representation layers of a story, i.e. structure and content.

Chapter 3 provides a brief introduction to humour theory, grouping humour into three classes, i.e. cognitive-perceptual, social-behavioural and psychoanalytical. The functional humour primitives identified are then combined with the logic of narrative structures to form a set of strategies that automatically generate humorous versions of event sequences. The chapter also describes a novel mechanism to evaluate jokes.

Chapter 4 discusses formal and textual aspects of film, paying particular attention to the information carried by the image, its formal permutations, and its capacity to communicate expressive meaning. We investigate the semantic relationships between images within the shot, and the semantic relationships between shots. The chapter concludes with our simplified model of the editing process.

Chapters 5, 6 and 7 describe our computational structures to support the retrieval, rearrangement and presentation of video for visual story generation. These structures derive their theoretical basis from the preceding three chapters.

Chapter 5 is concerned with the representation of video content and is thus mainly related to chapter 4. We critically examine related work from AI. We then describe our representational structures for the semantic, temporal and relational features of video.

Chapter 6 discusses the background knowledge necessary for automated video editing. We provide a brief introduction to the system of continuity editing, a style of editing that provides temporal and spatial continuity between juxtaposed shots.
We then discuss and evaluate research into automated editing, and finally discuss in detail our own representations and strategies for automated editing. The structures and strategies developed establish a link between video content (chapter 5) and narrative specifications (chapter 7), so this chapter draws heavily on chapters 2, 3 and 4.

Chapter 7 describes our representations for narrative structures such as actions, events and emotional codes. We discuss and evaluate the major relevant AI approaches to knowledge representation. Drawing on the theoretical foundation provided in chapters 2 and 3, we then describe our approach to the shallow representation of the physical world and of abstract mental and cultural concepts.

Chapter 8 demonstrates the practical applicability of the work presented in this thesis by discussing the architecture and function of our prototype system AUTEUR (Artificial Intelligence Utilities for Thematic Film Editing using Context Understanding and Editing Rules).

Chapter 9 analyses three example humorous films that were actually produced by AUTEUR.

Chapter 10, the conclusion, discusses the achievements of the research, and points the way to future research in this area.

Chapter II
Narrativity

The telling of stories is a pervasive aspect of our lives, because it helps to shape our experience by structuring the events of our encounters with reality. Narrating means making a comment about a certain event, following an idea about the medium and form of presentation which is grounded in one's own motivational and psychological attributes. Narrating is a targeted phenomenon: there is a receiver, and the narrator's perception of him or her will doubtless have an impact on the outcome of the story. Moreover, both narrator and receiver do not exist in a vacuum but share a social environment which adds extra structures to the narrational process. Thus, narration is a dynamic process of interaction in a partly given social context, where 'the interaction encompasses the communicator, the content, the audience and the situation' (Janowitz & Street, 1966, p. 209). Figure 2.1 diagrammatically describes the structure of communication.

[Figure 2.1 The structure of communication (based on Tudor (1974, p. 31)): narrator, medium, receiver and effects, shaped by personality attributes, task-specific knowledge, organismic attributes (e.g. male, adult), outside cultural and social attributes, and shared cultural and social structures.]

It should be noted, though, that the influences on the communicator are essentially the same as those on the receiver, but that the former uses them for construction while the latter uses them for interpretational purposes. Since this thesis deals with the creation of thematic film sequences, it focuses naturally on the narrator. However, the receiver cannot be neglected, because it is his or her potential intellectual and emotional response which is to be predicted. Within the above model of the communication process (Figure 2.1), the structural analysis of narration proposed in this chapter is best located in the part labelled task-specific knowledge.
The intention of the analysis is not to present an in-depth account of narration in its entirety, but rather to lay one part of the foundation for our approach to automated film editing by reflecting on the process of narration.[1] The problem is to identify the devices and syntactic structures that must be considered when modelling the dynamics of the narrational process. The aims of narration are essentially the same for film, literature, plays or pantomime, even though each medium uses its specific features to shape the process into an appropriate form. A more detailed analysis of the structures relevant to the understanding of film will be given in chapter 4.

[1] Detailed analyses of narrative can be found in the literature referenced in this chapter.

2.1 Narrative principles

Identifying the elements involved in narrative, and the relationships between them, was an enterprise of the Russian Formalist movement (1918-30), a group of critics who developed the notion of an opus as the sum of all applied artistic skills. Their aim was to understand and describe these skills in their respective functionality (practical, theoretical, symbolic and aesthetic) within prose, metre, literary evolution, genre and film theory. The philosophical context of Russian Formalism exerted great influence on the development of early film theory, for example on Eisenstein, Arnheim and Balázs, and, in the 1960s, influenced the development of film semiotics. The resulting analysis, discussed below, is strongly related to the ideas of the Formalist approach, especially to Tomashevsky's "Thematics" (Lemon & Reis, 1965).

Within narratology, a distinction is drawn between the story being told (the Fabula) and the form in which the story is presented (the plot).[2] The Fabula is understood to be the entire structure of causal-chronological joint events within a given time and space. The Fabula is the result of a dynamic process based on assumptions and inferences. The basis of these intellectual mechanisms are three types of schemata, as described in Bordwell (1985):

- prototypes, which organise the identification of types of persons, actions, localities, etc.;
- templates, which articulate common story formats, where each formal element represents a structural story movement, realising the stages through which the agent of the story must pass, such as Orientation, Complication, Evaluation, Resolution and Coda;[3]
- procedures, which organise the search for appropriate motivations and relations of causality, time and space.

[2] It should be noted that this distinction has been made since Aristotle's Poetics, where the distinction is between mimesis, the creative imitation of human behaviour, and muthos, the organisation of the events (Aristotle, 1968; Golden & Hardison, 1968; Ricoeur, 1985). In particular, Boris Tomashevsky's 'Thematics' has much in common with the Poetics.

[3] This model of a story format is from Labov and Waletzky, described in Segre (1979, p. 46). For contrasting models see, among others, Bremond as described in Segre (1979), Greimas (1983) and Propp (1968).

Since all these processes are subjective, so is the Fabula. However, the Fabula is not the story as the receiver learns it from the actual chronological order presented, but rather the action, event or episode itself. The Fabula is thus to be seen as an elaboration. The plot, on the other hand, shares the same events with the Fabula, but in the plot '...events are arranged and connected according to the orderly sequence in which they are presented...' (Tomashevsky, quoted in Lemon & Reis, 1965, p. 67).
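The Fabula/plot distinction can be made concrete computationally. The following is a minimal, purely illustrative sketch in Python (the names StoryEvent, fabula_order and plot_order are our own and do not correspond to any structure used by AUTEUR): the Fabula is the causal-chronological ordering of a set of events, while the plot is a presentation ordering of the same events chosen by the narrator.

    from dataclasses import dataclass

    @dataclass
    class StoryEvent:
        label: str
        story_time: int  # position in the chronology of the story world

    # The same events underlie both Fabula and plot.
    events = [StoryEvent("investigation", 2),
              StoryEvent("murder", 1),
              StoryEvent("arrest", 3)]

    def fabula_order(events):
        # Fabula: the causal-chronological order of the events.
        return [e.label for e in sorted(events, key=lambda e: e.story_time)]

    def plot_order(events, presentation):
        # Plot: the order in which the narrator presents the events,
        # which may dislocate story time (e.g. a flashback).
        known = {e.label for e in events}
        return [label for label in presentation if label in known]

    print(fabula_order(events))
    # ['murder', 'investigation', 'arrest']
    print(plot_order(events, ["investigation", "murder", "arrest"]))
    # ['investigation', 'murder', 'arrest'] -- the murder shown as a flashback

The point of the sketch is only that one underlying event structure admits many presentation orders; it is exactly such dislocations that the following paragraphs discuss.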
The plot, therefore, accepts all of the temporal and spatial dislocations which the creator, having an intention-directed end in mind, has put into effect. The plot is therefore '...an expression of an I external to it.' (Barthes, 1977, p. 110). In film in particular, the impression can arise that the "I" of the agent which puts forward the story is identical with the receiver (Metz, 1982). Since the plot is the actual arrangement of the Fabula, it can be concluded that narration is precisely manifested within the dramaturgical principles of the plot. However, the structural conception behind the plot (the dramaturgical process) is not sufficient for the required presentational task, and the additional aspect of style is needed, where style stands for the systematic use of media-specific devices. In film this may concern, among other things, camera work and editing. The relationship between plot, style and Fabula is described in Figure 2.2.

[Figure 2.2 The narrational process (based on Bordwell (1985, p. 50)): narration arises from the plot, which combines the Fabula with style.]

From the above analysis it should be clear that the plot is responsible for the perceptional design and for attracting and holding attention, whereas the Fabula is '... a theoretical elaboration of fundamental importance for describing the plot by contrast, for it constitutes a touchstone, a means of measuring dislocations there realized.' (Segre, 1979, p. 15). It is now appropriate to characterise the underlying principles of the plot.

2.1.1 Theme

There is initially a purpose to a story, which unites the story's separate elements. This purpose, which can be called an external reason (Wilensky, 1983a, 1983b) or idea, is the theme of the story. To make a story coherent, at least one theme must be represented. However, several overall themes are often represented, and, at the same time, each part of the story may reflect its own theme.

A theme actually performs two concurrent tasks. First, it arouses the interest of the receiver. In this respect, the underlying selection process for a theme deals essentially with general human emotions, such as love, grief, lust, guilt, and so forth, which must be elaborated within particular and well-formed material.[4] The choice of a theme has, therefore, a significant influence on the material forming the characteristic body of the presented story, or, in other words, the reality on which the story is based. Thus, there is a link between theme and genre.[5] A genre can be seen as an abstract network of features, such as a set of possible narrative objects (Aristotle's Poetics provides a set for tragedy; chapter 3 of this thesis provides a set for comedy), and characteristic objects and actions from the real world, upon which the individual plot defines its structure.

[4] This material can be seen as the internal reason for a story (Wilensky, 1983a, 1983b), such as the reason a character behaves in particular ways. The internal and external reasons must both be taken into account to communicate the intended idea.

[5] For film in particular, genre is to be understood as 'a world common to a range of films which may vary considerably in the detail of their narrative' (Tudor, 1974, p. 111).
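Purely as an illustration of this reading of genre as a feature network (the feature sets below are invented for the example and are not drawn from AUTEUR's knowledge base), such a network can be sketched as follows; combining genres, discussed next, then amounts to merging feature sets:

    # A genre sketched as an abstract network of features: narrative
    # objects plus characteristic real-world objects and actions.
    # All feature values here are illustrative assumptions.
    western = {
        "narrative_objects": {"duel", "revenge", "frontier justice"},
        "objects": {"horse", "revolver", "saloon"},
        "actions": {"ride", "draw", "chase"},
    }
    comedy = {
        "narrative_objects": {"incongruity", "derision"},
        "objects": {"banana skin", "custard pie"},
        "actions": {"slip", "collide"},
    }

    def combine(*genres):
        # Combining genres unions their feature sets, supplying aspects
        # of a theme that a single genre alone lacks.
        merged = {}
        for genre in genres:
            for feature, values in genre.items():
                merged.setdefault(feature, set()).update(values)
        return merged

    comic_western = combine(western, comedy)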
Examples of genres are the Russian folktales described by Propp (1968) or, for film, the screwball comedy as described in Brunovska Karnick (1995) and Gehring (1986). Thus, genre can be seen as the macro structure of a plot. However, genre implies generality, which leads to the reduction of reality into stereotypes, as described by Lippmann:

'A pattern of stereotypes is not neutral. It is not merely a way of substituting order for the great blooming, buzzing confusion of reality. It is not merely a short cut. It is all of these things and something more. It is the guarantee of our self-respect; it is the projection upon the world of our own sense of our own values, our own position and our own rights. The stereotypes are, therefore, highly charged with the feelings that are attached to them. They are the fortress of our tradition and behind its defences we can continue to feel ourselves safe in the position we occupy.' (Lippmann, 1934, p. 96).

It is justified, we believe, to argue that the individual plot often reflects many genres. The reason for this is that a particular genre might be limiting for the development of a theme, a problem that can be solved by combining it with other genres, whose stereotypical representations of reality feature the missing aspects of the theme. An example of the combination of genres is Zemeckis' film Back to the Future Part III, which combines the genres of western, adventure film, romantic love story, science fiction and comedy.

The influence of the theme on a narrative is that it determines the choice of human, social and natural material and the order of its presentation. Hence, the theme is crucial to the creation of meaning, which happens on a semantic and semiotic level. However, a discussion of the semantic and semiotic aspects of film will be deferred until chapter 4.

The second role of a theme within a narrative is to stimulate and maintain interest. The effect of a theme depends, in this case, on the intended emotion the theme should evoke, since emotions are a powerful medium for maintaining attention. For example, the aim might be to provoke anger, joy or disturbance, which must be developed from within the story. Thus, a theme evokes and develops feelings of hostility or sympathy according to a system of emotional strategies. The structure of these emotion-related strategies, for the example of humour, will be discussed in chapter 3.

2.1.2 Order

Narration is a structure-oriented activity that begins in the mind of the narrator but is completed in the mind of the receiver.[6] The plot must be ordered, therefore, in such a way that the relationships between the portrayed events, and the meaning being derived from these events and episodes (a series of events), encourage the receiver to perform causal-chronological inferencing. It should be mentioned here that the current author does not believe in film as a hermetic system of obvious logic which can be completely provided by its creator. The decoding of the different layers of a film by the receiver might result in a different meaning from that intended by the creator. Thus, we agree with reader-response theory.[7] This theory describes the process of creating meaning as a communication process between the end-product and the receiver. However, we do not follow its extension, that the origin of the piece of work, i.e.
its creation process, is of no importance, as expressed by Barthes, who wrote: '...a text is made of multiple writings, drawn from many cultures and entering into mutual relations of dialogue, parody, contestation, but there is one place where this multiplicity is focused and that place is the reader, not, as was hitherto said, the author. The reader is the space on which all the quotations that make up a writing are inscribed without any of them being lost; a text's unity lies not in its origin but in its destination.' (Barthes, 1977, p. 148).

[6] See also Ricoeur (1985, p. 46) and his interesting division of mimesis into mimesis1, the representation of human action in its semantics, its symbolism and its temporality; mimesis2, the mimesis of creation; and mimesis3, the interpretation of the mimesis, which connects the process of writing with reading and thus provides a theory of the relationship between narrative and time.

[7] Represented in film theory by Bordwell (1985, 1986, 1989, 1993).

The significance of narration is to establish the relevance of an action, event or episode by its position within a logical bond, since each (i.e. action, event, episode) is the consequence of the other. For example, an obscure relationship between events might lead to the idea of superfluousness, while an unsuitable link might give the impression of incoherence.

When discussing the theme, we argued that the plot is based on conventionalised event and object structures.[8] This conventionalisation is essential to the process of plot construction, since conventions transform an event into a self-regulated, i.e. closed and self-maintained, structure, or as described by Piaget:

'... what they [closure and self-maintenance] add up to is that the transformations inherent in a structure never lead beyond the system but always engender elements that belong to it and preserve its laws. Again, an example will help to clarify: In adding or subtracting any two whole numbers, another whole number is obtained, and one which satisfies the laws of the "additive group" of whole numbers. It is in this sense that a structure is "closed", a notion perfectly compatible with structures being considered a substructure of a larger one; but in being treated as a substructure, a structure does not lose its own boundaries, the larger structure does not "annex" the substructure; if anything, we have a confederation, so that the laws of the substructure are not altered but conserved and the intervening change is an enrichment rather than an impoverishment.' (Piaget, 1970, p. 14; comments in [] added by the current author).

[8] "Object" means here either a character or a physical object.

The above has two implications for the construction of a narrative, one concerning single events, and the other the relationships between events.

2.1.2.1 The event

To us, an event is a closed, self-maintained structure composed of pre-conditions, main-conditions and post-conditions. The pre-conditions perform a type of introduction of the characteristic objects, and of the locations, activities or moods of characters, necessary for the main part of the event. We refer later to this structural element as the motivation. Once an object or activity is perceived, the object, the action or the perceived mood of the character suggests certain possible events that can occur, and expectations that need to be realised. The appropriate plot structure for the main-condition of the event will be described as the realisation. The particular realisation may lead to certain reactions being expected, which provide additional clarifying
The particular realisation may lead to certain reactions being expected, which provide additional clarifying information. These post-conditions are part of a phase which will henceforth be referred to as the resolution.

For example, imagine the event of a character attempting to obtain coffee from a coffee machine. During the motivation phase, the character approaches the machine and searches for change. In the realisation phase of the event, the actor inserts the change. While the machine is operating, the character waits, and finally the machine provides the cup and the coffee. In the resolution phase of the event, the character looks for change, takes the coffee, and finally drinks the coffee while walking away.

In defining an event as a single and discrete structure that establishes archetypical behaviour patterns and cultural stereotypes, it is clear that certain functional elements - we actually understand actions as being the core functional elements of events - are more relevant to the event, or to a particular stage within it, than others.9 The indispensable functional elements are the dominant or bound functional elements of the event, whereas those that are not vital to the chronological causality are called free functional elements. Though free functional elements are not essential for the story, they can serve as the icing on the cake, since their digressive nature can enrich the presentation and thus support the well-formedness of the plot. In the above example of the coffee machine, the searching for change by the character in the motivation phase is an example of a free action, and the change itself is a free object, whereas the drinking is a dominant action and the coffee machine a dominant object.

9 We understand actions as being the core functional elements within an event, since they define other functional elements, such as the intentions or moods of characters, or the importance of the objects involved in the actions.

2.1.2.2 The relation between events

The relationship between events can serve to support resolution by answering the question 'What will happen next?'.10 In this respect, the relationship between events within a narrative is based on a hermeneutic code, as described by Barthes:

'Let us designate as hermeneutic code ... all the units whose function is to articulate in various ways a question, its response, and the variety of chance events which can either formulate the question or delay its answer; or even constitute an enigma and lead to its solution.' (Barthes, 1974, p. 17).

10 The author is aware of the fact that a narrative can serve other purposes apart from development, for example to display events as a state of affairs, as Chatman (1978) points out, suggesting Virginia Woolf's novel Mrs. Dalloway as an example of this second kind of narrative. However, this thesis is not focused on such open narrative structures.

The three essential elements for causality within an event, i.e. motivation, realisation and resolution, also provide the basis for the correlative, logical relations between events. Additionally, we can identify major and minor events. Major events are the driving force of the plot: they raise questions and answer them, and thus function as selective branching points for the possible paths through the narrative.
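The event structure just described is directly formalisable. As a minimal sketch (in Python; the class and field names are our own illustration, not a representation prescribed by this chapter), an event can be captured as a closed structure of motivation, realisation and resolution phases, whose actions are marked as bound or free:

    # Sketch of an event as a closed structure of three phases; 'bound'
    # marks dominant functional elements, free ones are dispensable.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Action:
        name: str
        bound: bool  # True: bound/dominant element; False: free element

    @dataclass
    class Event:
        motivation: List[Action] = field(default_factory=list)
        realisation: List[Action] = field(default_factory=list)
        resolution: List[Action] = field(default_factory=list)

        def skeleton(self) -> List[str]:
            """The causally indispensable chain: bound actions only."""
            phases = (self.motivation, self.realisation, self.resolution)
            return [a.name for phase in phases for a in phase if a.bound]

    # The coffee-machine event of the text:
    coffee = Event(
        motivation=[Action("approach machine", True),
                    Action("search for change", False)],   # free action
        realisation=[Action("insert change", True),
                     Action("wait", True),
                     Action("machine delivers cup and coffee", True)],
        resolution=[Action("look for change", False),      # free action
                    Action("take coffee", True),
                    Action("drink coffee while walking away", True)])

    print(coffee.skeleton())   # the bound chain that carries the causality

Removing a free action from such a structure leaves the skeleton, and hence the chronological causality, intact, which is precisely the property that distinguishes free from bound elements above.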
It is obvious, for example, that Hitchcock's film Rear Window depends upon Mrs Thorwald's murder and the main character's injury, which forces him to stay at home and thus results in him investigating the neighbourhood with his camera. Minor events serve the purpose of aesthetic enrichment, since they can easily be removed from the causal chain without affecting the overall understanding of the plot. Examples of minor events in Rear Window are the accompanying courtyard stories.

Figure 2.3 illustrates the relationship between the logical stages motivation, realisation and resolution and plot structure.

[Figure 2.3: Relationships between plot structures. A plot decomposes into episodes, an episode into events, an event into actions, and an action into subactions; at each level the elements are grouped into motivation, realisation and resolution stages.]

The analysis thus far has demonstrated that, although the structure of a plot is restricted by the choice of theme and the self-maintenance and closure of event structures, a plot can be distinguished within its particular presentation from any other plot which might share the same internal structure. The reason for this is that the narrator can organise the events in a great many ways. Depending on the theme(s) and the related genres, the stylistic aim of the narration may be to emphasise or conceal certain plot-events, to highlight events by providing additional detail, to omit events, to order events sequentially, to present them out of chronological sequence (e.g. flashback), and so forth. The following serve as stylistic devices that alter the way information in a narrative is perceived:

Exposition: provides crucial information in the plot to encourage curiosity.
Retardation: increases suspense by delaying information, as often used in detective stories.
Digression: of which commentary is an example.
Omission: of which incompleteness is an example.
Redundancy: intensifies significant information through repetition or opposition.11

11 For an exhaustive description of these devices and their perceptual and cognitive consequences, see (Bordwell, 1985). For a discussion of retardation in particular, see Sternberg (1978), and for a discussion of redundancy, see Suleiman (1983).

Finally, since the plot order encourages the perception of information, it is, of course, important to determine how much knowledge a plot can establish (objectivity versus subjectivity), how much recognition can be produced in the perceiver, and how communicative the plot is, i.e. '... how willingly the narration shares the information to which its degree of knowledge entitles it.' (Bordwell, 1985, p. 59). These categories are very important for the manipulation of temporal, spatial and causal-chronological factors to cue and guide the perceiver's activity. Thus, the combinatorial possibilities for plot ordering are enormous, but are nevertheless limited by the semantics and semiotics of the available material.

2.1.3 Time

As described above, the plot is a dynamic process. Thus, there is mediation between time and narration, which is succinctly expressed by Paul Ricoeur:

'Time becomes human to the extent that it is articulated through a narrative mode, and narrative attains its full meaning when it becomes a condition of temporal existence.' (Ricoeur, 1985, p. 52).

The three key elements of the representation of time within a narrative are duration, frequency and order. The relevance of order and frequency (e.g.
repetition) to the composition of the plot was discussed above. The following discussion therefore concentrates on duration, which is particularly important in film.

A narrative has a beginning and an end, between which it relates a sum of events, where each event can be either with or without a conclusion. The narrative itself is always a closed sequence (i.e. the last image describes the definite end of the film). A narrative is, then, a temporal sequence comprising two temporal schemata, one being the time of the story that is told and the other the time over which the story is told.12 The relation established in a narrative is, therefore, a time-in-time relation, whereas any description creates a space-in-time relation and an image forms a space-within-space relation (Metz, 1974). A simple example should illustrate this. Imagine a sequence showing successive shots of the sea followed by a shot of a boat sailing across a stretch of water. The first motionless shot of the sea is simply an image of the sea (space-in-space), whereas the successive shots form a description of the area (space-in-time) and the crossing boat establishes the narrative (time-in-time).

12 There are actually three time levels if we consider the time when the story was created. However, this level is more important for the interpretation of a story. For the understanding of Thomas Mann's novel Dr. Faustus (Mann, 1980) it is important to know that it was actually written during the time when the internal narrator is writing his report on the life of his friend Adrian Leverkühn. A further example is the film Trainspotting by Danny Boyle. Boyle creates a realistic view of the use of heroin and, due to its use of similar structures and stylistic means, a direct contradiction of Tarantino's artistic view presented in Pulp Fiction.

Thus, a narrative can be described as a system of temporal transformations. The system covers the time relations between fabula (TF), plot (TP) and performance (TPE) time and contains three transformational classes: equivalence, contraction and expansion (Bordwell, 1985). Equivalence can be described as the relation TF = TP = TPE. Contraction, where TF is reduced, can be divided into two subsystems:

Ellipsis: TF > TP and TP = TPE; a plot discontinuity marks the omitted portion of TF.
Compression: TF = TP and TP > TPE; no plot discontinuity, but a condensed duration of performance time.

Expansion, where TF is expanded, can also be considered as consisting of two subsystems:

Insertion: TF < TP and TP = TPE; a plot discontinuity marks the added data.
Dilation: TF = TP and TP < TPE; no plot discontinuity, but TF and TP are elongated.

The importance of the temporal transformational classes for film is discussed in chapter 4.

2.1.4 Space

Objects and subjects and their actions, i.e. events, appear in a spatial-referential frame which provides the perceiver with relevant information about the current surroundings, positions and assumed directions. However, the articulation of space is not really bound to specific narrational principles but is rather a cognitive-perceptive process based on lighting, sound, image composition and, especially in film, editing. Thus, space is discussed in detail in chapter 4, in which the perceptual and cognitive influences on the understanding of film are considered.
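The transformational classes of section 2.1.3 reduce to simple comparisons between the three durations. The following sketch (Python; the function name and the use of plain numbers for durations are our own illustration) makes the classification explicit:

    # Classify the relation between fabula (tf), plot (tp) and performance
    # (tpe) durations into the transformational classes described above.
    def temporal_class(tf: float, tp: float, tpe: float) -> str:
        if tf == tp == tpe:
            return "equivalence"
        if tf > tp and tp == tpe:
            return "contraction: ellipsis"     # discontinuity marks omitted TF
        if tf == tp and tp > tpe:
            return "contraction: compression"  # condensed performance time
        if tf < tp and tp == tpe:
            return "expansion: insertion"      # discontinuity marks added data
        if tf == tp and tp < tpe:
            return "expansion: dilation"       # elongated presentation
        return "mixed"                         # combinations of the above

    # Ten minutes of fabula shown in a one-minute sequence with cuts:
    print(temporal_class(tf=600, tp=60, tpe=60))   # contraction: ellipsis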
2.2 From structure versus content to structure and content

The underlying assumption of the above analyses is that a plot is a dynamic psychological entity that refers to mental or conceptual objects such as themes, goals, events or actions. In fact, the dynamics within plot construction are twofold. On the one hand, the intentions of the narrator must be achieved, i.e. to present the material as plausibly and succinctly as possible. The impact of this relies on articulational techniques, i.e. communication strategies between narrator and receiver. On the other hand, the dynamics within the material must be considered, since these form the bases of the plot, i.e. thematic structures.

Referring to the plot structure of Figure 2.3, we see that episodes, events, actions and subactions are organised hierarchically. It is, therefore, suitable to characterise a plot in terms of states and transformational rules, an idea which was propagated by the structuralist movement. Structuralism formulated a rationalised and deductive approach to narration, which considered narrative structure as analogous to language structure and thus linked structure with the determination of content (see Barthes (1967, 1977); Chomsky (1965); Lévi-Strauss (1968, 1977); Metz (1974); Price (1973); Propp (1968); Striedter (1971); van Dijk (1972)).13

13 This position was adopted by Barthes, who understood language as the master system for all other communication systems: 'Now it is far from certain that in the social life of today there are to be found any extensive systems of signs outside human language' (Barthes, 1967, p. 9). 'It is true that objects, images and patterns of behaviour can signify, and do so on a large scale, but never autonomously; every semiological system has a linguistic admixture' (Barthes, 1967, p. 10).

An analogous approach to the representation of narrative structure, in the field of Artificial Intelligence, considered the applicability of story grammars to text understanding, where the main influences came from Propp's work on Russian folktales and Chomsky's transformational grammar (see Colby (1973); Kintsch & van Dijk (1978); Lakoff (1972); Mandler (1977); Rumelhart (1975, 1977); Stein (1979); Thorndyke (1977)).

The main arguments against this approach are advanced by Black & Wilensky (1979).14 They show not only that the formal properties of the grammars are insufficient, but also that the computational costs of the representation are too high: the number of deletion and reordering transformations in the proposed grammars becomes extremely large, and yet the grammars are unable to produce a sufficiently varied set of stories. Finally, Black and Wilensky show that the semantic interpretation which should be supplied by the grammars is actually needed in order to apply the grammars. Black & Bower (1980) establish experimentally that the structures of story grammars do not reflect the human memory structures that are related to story parts. Graham (1983) argues that the finite set of lexical items within the story grammars reflects the assumption that the propositions of stories are, analogously to sentence structures, also finite. However, story propositions are defined recursively and thus their depth is arbitrary, from which it can be concluded that their propositional space is infinite. Hence, the analogy with sentence structures does not hold. Black & Wilensky, Black & Bower, and Graham arrive essentially at the same result: a story is a mental process based on different aspects of people's knowledge, of which structure is but one (see also Wilensky (1983a, 1983b, 1983c)).

14 For a summary of this debate see Andersen & Slator (1990).
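Graham's recursion argument is easy to reproduce with a toy grammar. In the following sketch (Python; the productions are our own illustration, loosely in the spirit of Rumelhart (1975), and not quoted from any of the cited grammars), an episode may embed a further episode, so derivations can nest to arbitrary depth and the space of derivable stories is unbounded:

    import random

    # Toy story grammar; terminals are quoted. The second EPISODE production
    # is recursive, so story propositions can nest to arbitrary depth.
    GRAMMAR = {
        "STORY":    [["SETTING", "EPISODE"]],
        "SETTING":  [["'characters and location are introduced'"]],
        "EPISODE":  [["EVENT", "REACTION"],
                     ["EVENT", "EPISODE"]],        # recursive production
        "EVENT":    [["'something happens'"]],
        "REACTION": [["'a character responds'"]],
    }

    def expand(symbol):
        if symbol.startswith("'"):                 # terminal symbol
            return [symbol.strip("'")]
        production = random.choice(GRAMMAR[symbol])
        return [leaf for s in production for leaf in expand(s)]

    print(expand("STORY"))    # one of infinitely many derivable stories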
Referring to Figure 2.1, we understand the plot to be a communicational process which is organised around surface structures (expression) and deep structures (content). However, as described in the above discussions of theme and genre, and of the relationship between events, it becomes clear that the communicational process needs further distinctions than content and expression. These additional differentiations concern substance and form, where substance represents the natural material for content and expression, while form represents the abstract structure of relationships which a particular medium demands (Chatman, 1978). Figure 2.4 shows the relationships between the differing structures found in narratives.

[Figure 2.4: Relationship between narrative elements (adopted from Chatman (1978, p. 26)). Content (the plot) divides into the form of content, i.e. events (actions and happenings) and existents (characters and settings), and the substance of content, i.e. people, things, etc. as preprocessed by the author's cultural codes. Expression divides into the form of expression, i.e. the structure of narrative transmission, and the substance of expression, i.e. its manifestation (verbal, cinematic, pantomimic, etc.).]

The theory of narration and the history of story grammars in AI show us that a story is a representational system based on two main layers, structure and content, each serving two distinct purposes simultaneously, these being form and substance. Hence, a primarily structure-oriented approach to plot generation is not appropriate. A planning approach seems more promising, as then the different levels can be separated, as they should be, while maintaining the interaction between the structure and content layers. Furthermore, we have seen that it is, in actuality, the substance of content and expression which is responsible for the well-formedness of a plot and the distinctions between plots. Chapters 3 (substance of content, form of expression) and 4 (substance of expression) will discuss these matters for two particular examples, humour for the former and film for the latter.

Chapter III
Humour

As described in the previous chapter, the theme is the link between plot construction and plot understanding. This chapter discusses, in detail, the emotional implications of our chosen theme of humour, and presents a system of strategies for evoking and developing a humorous reaction.

Emotions are subjective physical and psychological phenomena. Examples are excitement, relief, pleasure and aversion. Every feeling, such as hate, anger or joy, has a specific affiliated emotional response, which might vary in intensity. However, there is one particular emotional reaction that is associated with a wide range of feelings, and that is laughter. A cause of laughter might be when we see a bride drop a piece of chocolate cake on her wedding dress, when we draw a winner, or when we lose our mind. Thus, laughter can be associated with happy or bitter feelings; it can be sarcastic, ironic, nervous or insane - but, more importantly, it is never rationalised at the time. Laughter arises spontaneously from the unconscious (Jordan, 1975). The ability to appreciate the comic seems to be shared by all people, a factor which can be expected to help in the evaluation of any results.
Hence, humour was chosen as the example emotional theme for the film sequences to be constructed. Given the aim of this thesis, it is necessary to investigate laughter, or rather the processes which give rise in the mind to that pleasing sentiment of which laughter is the physical sign.

3.1 Humour - assumptions and definitions

It is commonly assumed that, regardless of actual content, jokes have the same underlying features.1 However, humour is obviously particularly conditioned by the sociocultural background of creator and perceiver. The underlying themes and concepts on which humour is based vary according to culture, social class, role, sex, and so forth. Furthermore, cultures establish role relationships and other rules relating their elements and, therefore, offer contextual clues signifying that specific occasions are appropriate for the creation of humour.

1 There is, for example, little shared content or plot structure in such diverse films as Dr. Strangelove, Modern Times, Monty Python's The Meaning of Life, Airplane! and Take the Money and Run.

Several disciplines feature research on humour, among them psychology, philosophy, semiotics, sociology and linguistics. In order to provide a framework for our discussion of the major theories of comedy, we adopt the approach of Raskin (1985) in grouping humour into three main classes: cognitive-perceptual, social-behavioural and psychoanalytical. Since this thesis deals with the automatic generation of film sequences, a further limitation is that the discussion concentrates mainly on visual humour.

3.1.1 Cognitive-Perceptual Class

The theories represented in the Cognitive-Perceptual class consider forms of humour based on incongruity or exaggeration. The first theoretical definition relevant to this class was given by Aristotle in his Poetics, where he describes the ridiculous as:

'... some error or ugliness that is painless and has no harmful effects. The example that comes immediately to mind is the comic mask, which is ugly and distorted but causes no pain.' (Aristotle, 1968, p. 9).

The three essential comic principles recognised by Aristotle are exaggeration, readiness and incongruity.

The simplest way to create humour is to exaggerate salient traits. The actors in Greek plays, for example, always performed with stylised masks, portraying a type of, rather than an actual, character. Exaggeration was used as a poignant device to emphasise a personal shortcoming, which for comic purposes was often the grotesque. This means that an exaggerated facial expression, or perhaps an overly large nose, or the extravagant appearance of an actor is likely to be funny in itself. The same concept can readily be found in silent movies, where appearance almost immediately implies character.2 A villain, for example, had to be 'big, fat, expensively-dressed and moustachioed' (Jordan, 1975, p. 6). In this view, character is seen in terms of appearance rather than personality, since personality is subordinated to a defective and eccentric notion of people.

2 To some extent this also applies to contemporary films.

A further way to create humour is to imitate, or parody, existing works. The exaggerated imitation of a work or genre requires more ingenuity than character parody, though exaggeration is common to both. Examples of parody plots are Abbott and Costello Meet the Mummy (as in many of their films, they parodied the horror genre), Bananas (a series of parodies of specific films) or The Naked Gun (a parody of detective stories).

Lightheartedness is one of the basic tenets of much comedy.
An action or event is only considered to be funny if it takes place in a comic climate, a term coined by Gerald Mast (1979). It is obvious that, even in a tragedy, there is potential for some comic touches. However, the overall aim of tragedy is to promote painful emotions, which then become a catalyst for the intended catharsis of the audience. Comedy, on the other hand, avoids pain by creating a sphere of sympathy and reduced importance or worthlessness.3 The aim is to suspend audience disbelief, and establish a frame of mind in which comic events are expected. This can be achieved by introducing insignificant subject matter or by reducing a serious subject to trivia, as, for example, in the film Dr. Strangelove. In such a context, it is also helpful to provide a character with either positive attractiveness (grace, poetic charm, wit, or good looks) or grotesque, stylised wish fulfilment (Jordan, 1975), which enables the audience to feel empathy, or even identify, with the character.

3 See Olson (1968, pp. 36-37), who describes comedy as 'the imitation of a worthless or valueless action.'

The fundamental notion behind the principle of incongruity is that laughter is based on the recognition of the dissimilarity between the way things are and the way they should be, a relation which is concisely described by Schopenhauer:

'In every case, laughter results from nothing but the suddenly perceived incongruity between a concept and the real objects that had been thought through it in some relation; and the laughter itself is just the expression of this incongruity. ... All laughter therefore is occasioned by a paradoxical, and hence unexpected, subsumption, it matters not whether this is expressed in words or in deeds.' (Schopenhauer, 1966, p. 59).

Incongruity can be based on different rhetorical strategies, such as inappropriateness, paradox, or dissimilarity, depending on the intended type of comedy, i.e. ludicrous or derisive humour, satire, irony, pun or wit.4 Additional principles for incongruity are introduced by Kant, who states:

'In everything that is to excite a lively convulsive laughter, there must be something absurd (in which the understanding, therefore, can find no satisfaction). Laughter is an affection arising from sudden transformation of a strained expectation into nothing.' (Kant, 1951, p. 177).

If a spectator is confronted with a character in a particular environment, a certain train of thought develops in the viewer, who expects the character to behave predictably. However, if the character acts in an unexpected way, this surprises the viewer. The important aspect is that the expectation within the viewer should be carefully developed, i.e. strained, so that the resulting and sudden surprise can be even stronger. Suddenness is important, as jokes based on surprise are either immediately funny or are not funny at all.

There are arguments suggesting that pure incongruity does not entirely explain the structure of humour (Beattie, 1776; Freud, 1960; Rothbart, 1976; Shultz, 1976; Suls, 1972). Such arguments propose a two-stage model in which incongruity becomes meaningful and appropriate through resolution. It is this second stage which conveys information that explains the sense of the joke in the light of the material that presented the joke.

4 Since the last two concern mainly language-oriented humour, they will not be considered further, as in this thesis sound is excluded.
However, the author is aware of the fact that inferences drawn on the basis of visual clues are often confirmed or disconfirmed by verbal statements, which leads to a dramatic increase in explanatory possibilities. Work concerning the automated generation of puns can be found in Binsted & Ritchie (1994a, 1994b).

Within this incongruity-resolution model it is possible to distinguish humour from sheer nonsense, as Shultz points out: 'Whereas nonsense can be characterized as pure or unresolvable incongruity, humour can be characterized as resolvable or meaningful incongruity.' (Shultz, 1976, p. 13). See also Rothbart & Pien (1977), who present an elaborate model of two types of incongruity and two types of resolution:

• Impossible incongruity: events that are unexpected and also impossible given one's current knowledge of the world. Film examples can be found in Chaplin's The Gold Rush.
• Possible incongruity: events that are unexpected or improbable but possible. A film example is the comedy Bringing Up Baby, where a series of misadventures befall a couple, including their adoption of a leopard called Baby.
• Complete resolution: the initial incongruity is completely solved by the given resolution information. An example is the emotional reaction of a person who slips on an ice-covered path.
• Incomplete resolution: the initial incongruity is solved by the given resolution information in some way, but is not made completely meaningful because the situation remains impossible.

For the enterprise of creating humour, the theory of Monro (1951) is important. Monro describes humour as consisting of different types of nonsense which serve to create a new reality, such as attitude-mixing or universe-changing, by twisting familiar material using argument, rhetoric, or exaggeration, so as to obtain an absurd conclusion.

Two examples from Chaplin movies should further demonstrate incongruity. The first is taken from the Chaplin film The Immigrant. The scene is set on a rolling boat. Chaplin is leaning over the rail and his legs are waving in the air. We automatically assume he is suffering from a heavy attack of seasickness. However, when he turns around we realise that he is actually fishing. The second example is from The Idle Class. A woman refuses to forgive her husband (Chaplin) and leaves the room. Chaplin turns his back to the camera. His shoulders heave and shake. Then he turns and is seen to be shaking a cocktail. His face is one big smile.

3.1.2 Social-Behavioural Class

The main principle of theories in the Social-Behavioural Class is detraction, which derives from Hobbes, who wrote:

'The passion of laughter is nothing else than sudden glory arising from some sudden conception of some eminence of ourselves, by comparison with the infirmity of others, or with our own formerly: for men laugh at the follies of themselves past, when they come suddenly to remembrance, except they bring with them any present dishonour.' (Hobbes, 1650, p. 45).

Although there are many social-behavioural theories of humour, e.g. disparagement theory (Suls, 1977), superiority theory (La Fave, 1972), vicarious superiority theory (La Fave, Haddad, & Maesen, 1976) or dispositional theory (Zillmann & Cantor, 1976), all describe humour as laughter arising from a sense of superiority of the perceiver to the object, person or idea being presented. Laughter, according to this view, is laughter at someone.
Thus, such theories refer to the relationship between, or the attitude held by, two parties. To distinguish comedy from spite or cruelty, it is necessary that the laughter does not cause offence, which means it must not be based on the absurdity or infirmity of the other.5

5 See also Beattie (1776), who stated that strong moral beliefs (indignation), feelings (disgust, pity or fear) or common sense (disappointment) of the perceiver can lead to unfunniness.

A feeling of superiority can also be related to a feeling of triumph. The perceiver, managing successfully to create a logical pattern from the presented material, suddenly realises that the portrayed event has twisted into an unexpected direction and starts laughing. Hence, it is not only the perception of incongruity which makes people laugh, but also the realisation which establishes a feeling of superiority to the situation (J.B. Baillie, as presented in Jordan (1975, pp. 17-18)).

An example from Mike Leigh's film Naked should exemplify the idea of superiority and triumph as the basis of a joke. A man pastes a poster on a wall (advertising a concert), while the main character constantly speaks to him. Both leave the place in the van of the posterer. They drive through the city and arrive at another place, where the concert poster is already on the wall. The posterer puts some other posters on the wall and then suddenly pastes a strip marked "cancelled" over the concert poster. We see the posterer's behaviour as stupid, and this makes us laugh.

A feeling of triumph can, of course, be turned against the perceiver. Imagine a scene where a man approaches a freshly painted bench but, shortly before sitting, realises the state of the bench and straightens up, only to turn and immediately fall over a litter bin. This example shows that it is not necessarily only the other at whom we laugh (the eventual misfortune), but that the joke can also be on us, the perceivers (the expected misfortune did not take place).

A further relevant theory in the Social-Behavioural Class is the Mechanical Theory of Bergson (1956). This theory suggests that a comic phrase, action, character or plot within a comedy is reduced to a mechanical process - a functional principle especially useful for slapstick, where the key element is actually the rapid transformation of states. In this context, people are seen as mechanical toys, because no human being could possibly bear the physical torture of the slapstick world. However, the perceiver is not at all concerned with the health or safety of these 'puppets' because they are thought of '...as machines - which might break but which can always be fixed or replaced' (Mast, 1979, p. 50). Such situations are found regularly in cartoon films. An example, taken from the film The Naked Gun, should clarify this idea. The main character, Lieutenant Drebin, has to find a brainwashed baseball player and thus takes part in a baseball game as a referee. His behaviour suggests that he is not familiar with the role of a referee, as he stands too close to the batsman. As a result, he is hit by the bat, and from this moment on he behaves like a baseball - propelled backwards into the arms of the catcher.

However, the above example also reveals a weakness of the theory, i.e. Bergson's denial of emotions, which, in his opinion, ruin laughter. Who is Lieutenant Drebin, with all his shortcomings, if not a figure of sympathy? Nevertheless, Bergson's theory applies to certain restricted cases.
3.1.3 Psychoanalytical Class

Humour theory from the Psychoanalytical Class deals with suppression and repression. Here, laughter is seen to provide relief from the great number of constraints - such as to be logical, to think clearly, to talk sense - that a human being has to obey (Freud, 1960).6 Thus, the comic allows us to return to a kind of childishness by providing an anarchic freedom from existing conventions.7

6 See also Mindess (1971, p. 28), who wrote: "In its early stages our sense of humour frees us from the chains of our perceptual, conventional, logical, linguistic, and moral system."
7 Examples in this vein are: Brats, Ferris Bueller's Day Off, Big, Bill and Ted's Excellent Adventure, Home Alone and Wayne's World.

Humour emanating from the unconscious liberation from authority finds its strongest expression in comedy related to aggressive or sexual drives (tendentious humour), two instincts that are strongly repressed by society. The controls of society mean that aggressive instincts, for example, are softened and dispelled by the intellect. Nevertheless, a deviation from social norms, such as that represented by violence, can be used to provide the audience with an emotional release for its antisocial urges, especially if the antagonism displayed exposes a higher morality. An example of tolerable violence of this type can be found in Frank Capra's Mr. Deeds Goes to Town. The film ends in a courtroom climax, in which Deeds hits several characters while simultaneously stating the superiority of his values, with the assent of the judge. However, a similar result was observed at a showing of Quentin Tarantino's Pulp Fiction to a student audience, where the accidental killing of a character by one of the philosophical hitmen produced a great deal of laughter from the audience.8 It is clear that such humour is connected with the abuse of social norms.

8 Despite its high level of violence, Pulp Fiction is often described as one of the most humorous movies of its year.

The above discussion of the sources of laughter leads us to conclude that comedy is a context-related experience of intellectual complexity, influenced by the norms, feelings and ideas of a particular social group.9 It is now time to consider the narrative structures discussed in the previous chapter, in conjunction with the approaches to humour discussed above. The aim of the remainder of this chapter is to isolate the key primitives of humour, i.e. those found in most humour theories, and to demonstrate how a formalism based on these primitives might be used to generate a visual statement that effectively provokes a humorous reaction.

9 Nevertheless, each of us has a sense of humour which is, in part, uniquely ours (see, for example, Leventhal & Safer (1977)).

In section 3.2 we present the humour primitives. In section 3.3 we describe our scheme of humour formalisms. Finally, section 3.4 defines a method to evaluate jokes.

3.2 Humour primitives

The crucial primitives of humour creation are incongruity, derision, timing, readiness and exaggeration. These primitives can be classified into two types: first, those which support the making of a joke, i.e. exaggeration, timing and readiness, and, second, those which have a constructive or intentional drive, i.e. incongruity and derision. The following discussion emphasises the functionality of each primitive.

3.2.1 Readiness

Readiness is the establishing of a frame of mind, or a comic climate.
To provide the audience with a credible situation or character requires material that is plausible in nature. Imagine a character in a hurry. He jumps into the sidecar of a motorcycle, the engine starts, and the driver disappears on the motorcycle, leaving the character and sidecar behind.10 For the best humorous result, the audience should appreciate that the character is hurrying. In other words, the emotional continuity of the impression depends on the narrated logic. Thus, plausibility applies to the feelings and attitudes of the audience as well as to the meaning of the presented material. The role of readiness is, therefore, to combine the logical and emotional elements of a joke, by applying the values of serious life (represented by stereotypical behaviour and patterns of events as well as by standards of morality) to achieve the ironies, ambiguities and inconsistencies presented, and maintain the equilibrium between the comic elements involved.

10 This example is taken from Eastman (1937, p. 343).

3.2.2 Timing

In humour, timing is crucial and serves two purposes. Firstly, much humour relies on the element of surprise (as discussed earlier), without which it becomes merely a logical juxtaposition of emphasis. Comedy must be portrayed quickly and arise from the unknown, or it loses strength, as described by Eastman:

'Therefore, not only must the mind be genuinely on the way (plausibility), and the not-getting-there a genuine surprise (suddenness), but the surprise must come at the instant when the on-the-wayness is most complete, and the surprise most unexpected.' (Eastman, 1937, p. 354).

A second temporal aspect of humour is repetition. We never laugh as hard at a joke we have seen before, no matter how delighted we were on first viewing. Hence, the same joke should not be immediately repeated. However, if the joke is paraphrased through new material, humour can be prolonged, but only until no further escalation is possible. A wonderful example of the exploitation of one particular idea and comic mechanism is the stateroom scene in the film A Night at the Opera. During a voyage, the Marx Brothers call many different service people (room service, manicurist, etc.) into their cabin. Each of them performs their duty, while the cabin gradually fills up. The scene ends with all of the occupants of the cabin tumbling out through the door towards the last person for whom the door is opened.

3.2.3 Exaggeration

The basic idea behind exaggeration is an emphasis of conflict, based on two principles:

Hyperbole: a way of describing something in order to make it seem bigger, better, worse, etc.
Miosis: a way of describing something in order to make it seem smaller, milder, etc.

The aim of humorous exaggeration is to work on the quality of something with which the perceiver is not deeply concerned, in such a way that the logical extension is beyond what can be seriously accepted. Hence, it is not the event or action itself that matters for the gravity or levity of the situation, but rather the intentional view of these.

3.2.4 Incongruity

The term incongruity is not particularly helpful in itself, since it merely defines the conflict between the expected and what actually occurs. It is more useful to look at the various concepts implied by the term:

Ambiguity: the possibility of interpreting an expression in two or more distinct ways.
Irony: the use of situation, speech, and so on, to imply the opposite of what is apparently meant.
Variations in grade are understatement, satire and sarcasm.
Absurdity: at variance with reason. Variations in grade are the ridiculous and the ludicrous.

In considering variations in the possible level of incongruity, it is apparent that incongruity can easily lead into derision. However, the essence of incongruity is an understanding of the relationship between two situations or thoughts - one reason why different people laugh at different things. It is, therefore, important that the creator of the comedy is aware of the apparently dominant information and the pattern-breaking finale, and takes all possible steps to reinforce the deception, so that the final collapse into incongruity is complete. Thus, all types of incongruous humour depend mainly on an appropriate temporal presentation of the final stages of a plot. Figure 3.1 shows the emphasis-based relationships between incongruity and the logical elements of plot construction described in Chapter 2.

[Figure 3.1: Relationship between incongruity and the narrative stages motivation, realisation and resolution.]

3.2.5 Derision

There are actually two important features of derision. First, the explicit transformation of moods or states of the portrayed character:

Mischief: behaviour that results in trouble (downgrade of mood) and possible damage (transformation of state) to the subject, but no serious harm.

A second feature of derision, which is more difficult to measure, is the implicit upgrading of the mood of the viewer:

Schadenfreude: a delight in another's misfortune, where that misfortune is unexpected by the other.
Superiority: a delight in another's misfortune, where the misfortune results from inappropriate (e.g. stupid) behaviour.

As temporality supports incongruity (discussed earlier), so it supports derision, though to a less critical extent. For derision, sympathy is far more important. If the unpleasantness is not appropriately presented, it might be that the situation is too violent, banal or pathetic; derision then leads to a lack of interest in being amused, and the material is seen as unpleasant rather than funny. Thus, the adequate motivation of the viewer and the presentation of the humorous event are particularly important for derisive jokes. The position of derision in narrative logic is described graphically in Figure 3.2.

[Figure 3.2: Relationship between derision and the narrative stages motivation, realisation and resolution.]

3.3 Humour strategies

Humour strategies combine the logic of narrative structures with the functional operators of comic primitives. The goal of the ensuing discussion is to show that the combination of the above primitives in strategies can achieve emotional reactions from viewers of automatically generated meaningful visual sequences.11 The strategies to be presented were derived from the humour theories described above, from discussions with humour theorists and from intensive analysis of the visual humour found in film comedies.

11 The author is aware of the fact that the strategies cannot be seen in isolation but require a global understanding of the video images and their wide range of possible micro- and macroscopic connotations. However, discussion of these problems is deferred until chapter 4.

As discussed in Chapter 2, every narrative begins by establishing a set of spectator expectations, based on a logical order of events, where these expectations are eventually foiled. A comic event 'can exist only within a narrative context - as a consequence of the existence of characters and a plot', in contrast to gags, which 'constitute digression within a story or story-based action' (Neale & Krutnik, 1990, pp. 44, 57).
It is thus clear that comic strategies are applicable at the level of events, as well as episodes, or even the plot, since all reflect the same logical pattern of motivation, realisation and resolution. The strategies defined below apply to the episode, an approach which can be compared with early silent movies, where a plot was replaced by a basic situation, perhaps focusing on a place (shopping centre, school), an event (a marriage), or an object (a hose), and a series of jokes was based around this central situation (see, for example, Shoulder Arms by Charles Chaplin).

Recalling Figure 2.3, which illustrates the relationship between the three logical elements and the structure of a plot, it is clear that action establishes plot. The character's actions express his or her mood or intentions to the audience. Thus, actions define the character. Moreover, actions take place in an environment, which may feature other characters and objects, and this creates a context. This leads to the following strategy:

H-Strategy 1: An action forms the most suitable subject for a joke, then an actor, then an object, and finally a location.

An action expresses an intention or mood. Given the assumption that an actor focuses his performance on one action, the following strategies may transform the performance into comedy. The comments in brackets contain the comic primitive and the expected reaction of the viewer.

H-Strategy 2: If the action portrays an intention [goal], interrupt the action in a way that is unexpected by the character, so that the goal cannot be fulfilled and the character's mood is downgraded or he or she suffers in some way. (Mischief + Schadenfreude)

An example joke of this type is when we see someone eating a cake (goal) and a piece drops onto the shirt (interruption). The narrative element here is the following of one action by an oppositional event. A variation in the level of Schadenfreude can be achieved by changing the motivation of the derisive joke. It is, for example, more harmful to the character if the situation takes place in a public setting in which he or she plays a major role, e.g. if she is the bride at a wedding, than if the same event takes place, say, in the privacy of her home.

H-Strategy 3: If the intention of the joke is derisive, a public setting is preferable to a private setting, but in both cases the character should play a major part. (enhanced Schadenfreude)

However, if the character realises that there is too much cake on the fork and yet still tries to get the cake into his or her mouth, with the same humiliating result, an equivalent narrative structure then provides a qualitatively superior joke, since the changed emphasis in the motivation leads to a stronger involvement of the viewer. From this the following two rules can be derived:

H-Strategy 4: If the action portrays an intention (goal), interrupt the action in a way that is expected by the character, so that the goal is unfulfilled and the character's mood is downgraded or he suffers in some way. (Mischief + Schadenfreude + Superiority)

H-Strategy 5: If the intention of the joke is derisive, reveal the point in advance.
(enhanced Schadenfreude)

Referring back to the last cake example, we now see that it is covered by H-Strategy 4, where eating the cake represents the goal, the awareness of the character fulfils his or her required expectation, and dropping the cake represents the interruption and the damage. Nevertheless, the intention behind, or the goal of, an action is not always obvious. The fishing joke from the Chaplin film The Immigrant (described earlier) cannot be established by the above strategies. Hence, additional strategies must be introduced for such cases:

H-Strategy 6: If the action portrays an ambiguous mood, encourage the context-related positive attitude but then lapse into a mood that portrays the inferior attitude. (Ambiguity + Superiority)

H-Strategy 7: If the action portrays an ambiguous intention [goal], encourage the context-related obvious action but then lapse into the alternative. (Ambiguity + Superiority)

H-Strategy 8: If the intention of the joke is incongruous, do not reveal the outcome until the realisation or resolution. (Superiority or enhanced Superiority)

Consider again the Chaplin "fishing joke". This can be created by H-Strategy 7. The ambiguous intention (fishing, feeling sick) is represented by the action "leaning over the rail". The impression of "feeling sick" is encouraged by the environment, a rolling boat, while the lapse into the alternative is achieved by Chaplin turning around and showing a fishing rod. Reordering the joke by first showing Chaplin with the fishing rod and then showing him leaning over the rail would not be funny; thus, this kind of order should be avoided (see H-Strategy 8).

The combination of the above strategies can result in relatively powerful jokes. However, first we introduce four more strategies to support such combinations.

H-Strategy 9: The structure of a joke can only be repeated if different content combinations are possible. The continuation should be of higher emotional intensity. (Timing + Exaggeration)

H-Strategy 10: The structure of a joke can only be repeated while the outcome of the action or event portrayed is still warranted. Progression ad absurdum. (Exaggeration)

H-Strategy 11: If the exaggeration is part of a repetition, it is easier to exaggerate attributes of objects than attributes of actions. (Exaggeration)

H-Strategy 12: Given that strategies 9 and 10 are fulfilled, a particular strategy should not be repeated more than four times, or boredom may result. (Timing)

A sequence from the film The Naked Gun 2 1/2 serves as an example of strategies 9 to 12. In this sequence, Dr. Albert Meinheimer, a kidnapped researcher, and Lieutenant Frank Drebin are held prisoner in a storehouse. Drebin, who is tied to an office chair, attempts to free himself by moving his tied hands along the corner of a steel-framed stack of shelves. In front of the same stack sits Meinheimer, tied to his wheelchair. Drebin's rhythmic movements release the following objects from the top shelf onto Meinheimer's head: a baseball bat, a number of balls (H-9), a set of pool balls (H-9 + H-11), four horseshoes (H-10 + H-12), a skittle and then a skittle-ball (H-9 + H-12), oil spilled from a can (H-10 + H-12), white flakes of polystyrene from an overturned box (H-10) and finally a small anvil (H-10). The last object, in particular, presents problems of plausibility. However, the film provides the audience with a large number of exaggerated absurdities, so this one is readily accepted.12

12 The films in this series make repeated use of H-Strategies 9-12, either in a simple way, such as the given example, or through more advanced variations, such as the repeated parodying of other films (e.g. Casablanca, Ghost, and Psycho, in The Naked Gun 2 1/2), which actually strengthens the underlying structure of the film series itself, this being the parodying of detective stories.
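Taken together, H-Strategies 9 to 12 behave like a small control loop over repetition: repeat only while a new content combination is available, escalate the emotional intensity, and stop after at most four repetitions. A minimal sketch (Python; the names and the numeric intensity scale are our own illustration, and the plausibility test demanded by H-Strategy 10 is reduced to a caller-supplied predicate):

    # Sketch of the repetition control implied by H-Strategies 9-12.
    def plan_repetitions(variants, still_warranted=lambda v: True,
                         max_repeats=4):
        """variants: (description, intensity) pairs, i.e. different content
        combinations for the same joke structure (H-9)."""
        chosen, last_intensity = [], float("-inf")
        for description, intensity in variants:
            if len(chosen) >= max_repeats:        # H-12: at most four repeats
                break
            if intensity <= last_intensity:       # H-9: must escalate
                break
            if not still_warranted(description):  # H-10: outcome still warranted
                break
            chosen.append(description)
            last_intensity = intensity
        return chosen

    # Escalating objects, as in the storehouse sequence (H-11 suggests
    # exaggerating object attributes rather than action attributes):
    falling = [("baseball bat", 1), ("pool balls", 2),
               ("horseshoes", 3), ("small anvil", 5)]
    print(plan_repetitions(falling))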
With the above rules we are now in a position to use repetition in jokes. However, we would still have problems with a joke such as the following. Recall the joke where a man approaches a freshly painted bench. The viewer expects the character to sit on it (H-5). The character notices the state of the bench just before sitting down (H-4), and manages to avoid sitting (which is a typical reaction of the character but unexpected by the viewer of the scene), after which he turns, only to fall over the litter bin (see H-2, where "falling over" interrupts "getting up"). The extra but crucial structural plot element in this case is not only the repetition of a strategy but the creation of two "butts", the character and the viewer. We therefore require an additional strategy:

H-Strategy 13: If the intention of a joke is derisive, then suggest a mishap, but do not resolve into mischief but into the expected reaction which avoids the mishap. Then repeat the strategy used on the resulting action, this time enabling a mishap. (Mischief + Frustrated Schadenfreude + Mischief + Schadenfreude + Superiority)

The strategies defined thus far reveal, firstly, that the focus on single actions puts the emphasis on derisive humour and, secondly, that strategies involving the character and the perceiver might be more complex, but are definitely more effective:

H-Strategy 14: For a single action, mischief is easier to achieve than ambiguity.

H-Strategy 15: Strategies that contradict the viewer's expectations before fulfilling them are preferable to those that present the material in a straightforward way.

However, it is usual for a character to be involved in several actions simultaneously, which has implications for both the construction and the content development of the presentation. The actions involved in an episode may be connected, in that they form a larger logical pattern, such as the different stages involved in obtaining a cup of coffee from a vending machine, or they may simply be performed at the same time for some particular reason. It should be clear that a familiar pattern more suitably meets the requirements of exposition, since the context is already given and need not be motivated.

H-Strategy 16: A sequence of actions that is meaningful is preferable for the construction of jokes to a sequence of unrelated actions.

Before investigating actions that take place within an existing context, it is pertinent to consider random actions taking place in the same temporal interval. Parallel actions give rise to the problem of creating a relationship between the actions, to place them in the same context. This can be achieved either by constructing a wider narrative context which can accommodate the actions, or by establishing a contextual relation based on conflict.

The first approach, the creation of a wider narrative context, requires a comparison of the given actions with a set of expositional actions within an expected chain of events. If a set can be found where one (or preferably more) actions match required actions, then the complete chain of events can serve as the basis for a joke.
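This matching of observed parallel actions against stored chains of events might be sketched as follows (Python; the event chains and the counting score are our own illustration of the idea, chosen to echo the example that follows):

    # Sketch: choose the exposition whose expected chain of events accounts
    # for the largest number of the observed actions (illustrative data).
    EVENT_CHAINS = {
        "obtain coffee from machine": ["approach machine", "search for money",
                                       "insert money", "take coffee"],
        "make a phone call":          ["approach phone", "search for money",
                                       "insert money", "dial number"],
    }

    def best_exposition(observed, chains=EVENT_CHAINS):
        # Ties would be broken by the suitability of the environment
        # (e.g. a visible coffee machine), which is not modelled here.
        def matches(chain):
            return len(set(observed) & set(chain))
        return max(chains, key=lambda name: matches(chains[name]))

    print(best_exposition(["search for money", "scratch head"]))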
Imagine a character standing in a large public indoor area, scratching his head and moving his other hand in a pocket of his trousers. In the far background we can see a coffee machine, and our character appears to be looking in this direction. A possible exposition is that our character would like to obtain some coffee, and is searching for money in his pocket. If several expositions are available, the most applicable (the one with the highest number of matching actions and a suitable environment) is chosen. The relevant narrative structures involved are exposition and addition, as it is most likely that several actions or objects must be added to form a suitable motivation for the joke. In the above example of the coffee machine, it would be essential to establish the character near enough to the machine to use it. The construction of jokes based on the interaction between the character and the coffee machine will be described later in this chapter.

The second alternative for creating a relationship between random actions, based on conflict, reflects the assumption that a number of a character's actions will share resources, i.e. subactions. Conflict can then be used to establish a derisive joke. Imagine, for example, a character sitting at a table for breakfast. He is reading a newspaper, and at the same time he is dipping a croissant into a cup of coffee. Both dipping and reading usually require looking. The character cannot look at the newspaper and the coffee at the same time. Thus, there is a conflict which can lead to an unexpected result for the character, e.g. dipping the croissant into a jar of mustard instead. With reference to the above, a new set of strategies can now be introduced.

H-Strategy 17: The combination of parallel actions into larger meaningful units provides more comic potential than a single action.

H-Strategy 18: The combination of parallel actions based on conflict is more straightforward than the combination of parallel actions into larger meaningful units.

H-Strategy 19: The combination of parallel actions based on conflict is preferable for the construction of derisive jokes.

H-Strategy 20: If parallel actions establish a conflict, then encourage the action with the stronger link to the shared resource, but then construct the joke on the basis of the other action, by using the existing strategies for a single action.

H-Strategy 21: The combination of parallel actions into larger meaningful units is preferable for the construction of incongruous jokes.

Referring back to the "croissant joke", we see that it can be created using H-Strategy 20. The conflict is established by using "reading" and "dipping". The use of "dipping" for the joke is justified by its looser relationship to looking. Imagine that we first see the character reading the newspaper; we can then generate the joke by using H-Strategy 2 for the action of dipping, where the mustard represents the suffering caused.

Thus far, the assumptions made are relatively restrictive, since they consider only a single character. However, actions usually involve objects, other characters, or even groups of characters or objects. A new level of complexity arises, as the idiosyncratic goals of each participant can converge or diverge. This requires the construction of a wider narrative context. We will first investigate the relationship between a character and an object, as in this case we are confronted with only one goal (in cases where the object is active we have to support two goals).
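H-Strategy 20 in particular can be read as a simple resource check over parallel actions. A minimal sketch (Python; the encoding of actions, resources and link strengths is our own illustration):

    # Sketch of H-Strategy 20: detect a resource shared by parallel actions,
    # encourage the strongly linked action, and build the joke on the other.
    def choose_joke_action(actions):
        """actions: {name: {resource: link_strength}}, higher = stronger."""
        shared = set.intersection(*(set(links) for links in actions.values()))
        if not shared:
            return None                    # no conflict, H-20 not applicable
        resource = shared.pop()            # e.g. 'looking'
        ranked = sorted(actions, key=lambda a: actions[a][resource])
        return {"shared resource": resource,
                "encourage": ranked[-1],   # the stronger link, e.g. reading
                "joke on":   ranked[0]}    # the weaker link, e.g. dipping

    breakfast = {"reading newspaper": {"looking": 0.9},
                 "dipping croissant": {"looking": 0.4}}
    print(choose_joke_action(breakfast))

The joke itself would then be generated by applying a single-action strategy, here H-Strategy 2, to the weakly linked action.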
Assume that a character approaches a coffee machine. Even before the interaction between human and machine takes place, a perceiver of the scene would have certain expectations concerning the events about to occur. These expectations form the foundation for possible gags. The important narrative elements are immediately apparent: additional repetition, deletion or twisted analogy.13 Since humans tend to empathise with other humans rather than with machines, it is necessary to work additionally on the mood of the character, i.e. we have to establish a sufficiently good mood to be downgraded. A possible sequence, therefore, could show the smiling character approaching the machine, inserting the money, receiving the coffee but no cup, and becoming angry (see H-Strategy 23 below). An example of twisted analogy might be the ejection of a chocolate bar instead of coffee.

13 Twisted analogy means that the behaviour of a participant, or the outcome of an event, is correct but for a different context.

H-Strategy 22: If the relationship between a character and an object can be established within a given sequence of actions (event), either delete actions from the character-related chain, repeat actions within the chain, or add actions from an analogous behavioural chain. (Superiority or enhanced Superiority)

H-Strategy 23: If the relationship between a character and an object can be established within a given sequence of actions (event), either delete actions from the object-related chain, repeat actions within the chain, or add actions from an analogous behavioural chain, so that the mood of the character is downgraded or he suffers in some way. (Schadenfreude + Superiority or enhanced Superiority)

The interaction between two human characters is more complicated than that between a character and an object, since for two characters we must deal with multiple intentions. As an example, consider one of the earliest films, L'Arroseur arrosé (the sprayer sprayed) by Louis Lumière. The film shows a gardener watering flowers with a hose. A boy sneaks up behind him and steps on the hose, thus blocking the flow of water. The gardener looks down the hose to see what is happening, the boy lifts his foot, and the gardener gets soaked.

In the above example it is the relationship between the two characters (the rascal and the victim) and the object that links them (the hose) that provides an extra level of complexity. Nevertheless, the comic structures involved are similar to the strategies introduced above. We continue to emphasise the action of one actor (the gardener watering the flowers, see H-1), and then interrupt the action to downgrade his mood or make him suffer (due to his unawareness, see H-2). However, this time the action is interrupted by a second character, who actually intends to make the first character suffer. Though the interruption is based on the relationship between a character and an object (gardener and hose), we cannot apply strategies of the type of H-Strategies 22 and 23, since they represent an internal relation, whereas in the above example the interruption is based on an external event (boy and hose). Thus, it is necessary to introduce a set of new strategies which cover the divergent intentions of two characters, in order to achieve a humorous scenario such as that described in the gardener example.
H-Strategy 24 represents the top-level rule for an event such as that of the gardener example, while H-Strategy 25 represents an example rule of the type "the biter will be bitten", and H-Strategy 26 describes a rule for an incongruous joke based on the relationship between two characters.

H-Strategy 24
A relationship between two oppositional characters should be established in such a way that the goal of one character is to interrupt the goal of the other in a way that is unexpected by the second character. The reaction of the second character must then be influenced by the first so that the second character's mood is downgraded or he suffers in some way. (Mischief + Schadenfreude)

H-Strategy 25
If the intention of a gag, based on the relationship between two characters, is derisive, encourage the context-related obvious chain of actions but then lapse into an alternative chain where the second character's mood is downgraded or he suffers in some way. (enhanced Schadenfreude + Superiority)

H-Strategy 26
A relation between two equally valued characters should be established in such a way that the context-related obvious chain of actions of one character is established but then lapses into an alternative chain where the reaction of the other character would make sense for a related but not explicitly provided context. (Ambiguity + Superiority)

An example of H-Strategy 25 is the typical cartoon situation, where one character tries to fool another character by smearing glue on a chair, and finally sits on the chair himself. An example of H-Strategy 26 is when a woman wants her husband, who objects to washing, to take a shower before going to bed. He agrees, only to be seen standing beneath the spray shielded with an open umbrella.

Further strategies can be imagined, e.g. the use of analogy to transform the behaviour of a character from human into automaton or vice versa, or to assign human behaviour to an animal, or vice versa. Structural complexity can be increased by introducing more characters with several assigned goals. An example can be found in the film Bringing up baby, where five characters in predefined roles (first partner, initial partner, second partner, conscience figure and blocking figure) attempt to keep the ideal couple (first partner, second partner) apart. The goals of each character are linked by a particular homogeneous narrative structure (commitment comedy) which provides all the functionality, i.e. introducing the flaw within the initial relationship, introducing the ideal partner, separating the ideal couple, reuniting the ideal couple, etc.14 Needless to say, the portrayal of such a "film world" or "genre" carries with it emotional overtones, and thus offers considerable potential for humour.

14 For a detailed analysis see Brunovska Karnick (1995, pp. 132 - 136).

The goal of this thesis is not to achieve the automatic creation of such complex structures as those described immediately above. Nevertheless, we claim that our above strategies are sufficient to enable the automatic generation of numerous meaningful and humorous film sequences. Table 3.1 represents the above strategies classified as either constructive or supportive humour strategies.

Constructive humour strategies: 2, 3, 4, 5, 6, 7, 8, 13, 20, 22, 23, 24, 25, 26
Supportive humour strategies: 1, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 21

Table 3.1 Classification of humour strategies

The classification of the humour strategies mirrors the two main layers of a narrative, i.e. form and substance (see section 2.2), where supportive strategies relate to form, while constructive strategies relate to substance.
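Read as a simple lookup, the classification of Table 3.1 might be represented as follows (again a purely illustrative sketch, not AUTEUR's data structures):

CONSTRUCTIVE = {2, 3, 4, 5, 6, 7, 8, 13, 20, 22, 23, 24, 25, 26}
SUPPORTIVE = {1, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 21}

def strategy_layer(n):
    """Map an H-Strategy number to the narrative layer it operates on."""
    if n in CONSTRUCTIVE:
        return "substance"  # constructive strategies build content
    if n in SUPPORTIVE:
        return "form"       # supportive strategies shape structure
    raise ValueError("unknown H-Strategy %d" % n)

print(strategy_layer(20))  # 'substance'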
Thus, the above classification of humour strategies reflects our assumption concerning the applicability of a planning approach to automated plot generation (see section 2.2). The influence of the strategy classes on the design of the planning system will be discussed in later chapters. The remainder of this chapter is devoted to defining a method for evaluating a joke.

3.4 Evaluation of comedy

The overall goal of the creation of a meaningful video sequence is that it should communicate the intended theme as successfully as possible, which in our case means that the sequence is as funny as possible. The measure of this achievement is usually the perceiver's reaction. Knowledge of the actual reaction of the perceiver is, of course, unavailable to an artificial system, unless that system provides an interface that permits the traditionally used mirth index to be applied.15 However, even if a system could apply the mirth index, it would still be necessary to provide certain measures during the actual generation process, as the mirth index is a post-indicator and can thus be applied only to a completed joke. Thus, indicators are required to support the choice of strategies, and to testify to the potential of the content directions to be taken. Examples of the former are meta-strategies such as H-Strategies 1 and 15. To testify to the potential of content directions, we allocate:

• a positive value for each resolved comic primitive per motivation, realisation and resolution, based on its structural importance;
• a negative value for each unresolved comic primitive per motivation, realisation and resolution, based on its structural importance;
• the degree of resolution for each comic primitive, based on the strength of the semantic relationship of the content elements involved;
• the degree of complexity (viewer / character emphasis).

Later in this thesis we will show that the degree of resolution for comic primitives, and the degree of structural complexity, are both expressed as applicability values, which are calculated, based on the particular H-Strategies used, during the generation process. The above enables the simultaneous calculation of two "joke values". One represents an ideal, and thus hypothetical, value which serves as the comparison measure for the actual joke to be created from the available material. If we calculate the relationship between the hypothetical and the real value as a percentage, and compare it with, say, a mirth index transformed into a percentile system (i.e. 90 - 100% => excellent, 70 - 89% => very good, and so forth), it is possible to actually rate the created joke. Of course, this does not guarantee that any viewer will find it humorous. A further advantage of the comparison approach described above is that it would enable a system to detect the weak parts of a joke and either alter, or at least suggest alternatives to, these parts.

15 The "mirth index" represents an ordinal scale from 0 (negative response) to 4 (audible laughter). For detailed information see Rothbart (1977, p. 93), and Gehring (1986).
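The calculation of the two "joke values" can be illustrated with a small sketch. The weights, the rating bands below "very good" and the example data are all invented for exposition; only the resolved/unresolved signs, the weighting by structural importance and degree of resolution, and the percentage comparison follow the scheme described above:

def joke_value(primitives):
    """Sum the contributions of the comic primitives: positive if resolved,
    negative if unresolved, weighted by structural importance and by the
    degree of resolution."""
    total = 0.0
    for p in primitives:
        sign = 1 if p["resolved"] else -1
        total += sign * p["importance"] * p["resolution_degree"]
    return total

def rate(real, ideal):
    """Express the real value as a percentage of the ideal value and map
    it onto mirth-index-like bands."""
    pct = round(100.0 * real / ideal, 1)
    if pct >= 90: band = "excellent"
    elif pct >= 70: band = "very good"
    elif pct >= 50: band = "good"    # bands below 'very good' are invented
    else: band = "weak"
    return pct, band

# Ideal (hypothetical) joke: every primitive fully resolved.
ideal = joke_value([
    {"resolved": True, "importance": 2.0, "resolution_degree": 1.0},  # incongruity
    {"resolved": True, "importance": 1.0, "resolution_degree": 1.0},  # timing
    {"resolved": True, "importance": 1.0, "resolution_degree": 1.0},  # readiness
])
# Actual joke assembled from the available material.
real = joke_value([
    {"resolved": True, "importance": 2.0, "resolution_degree": 0.8},
    {"resolved": True, "importance": 1.0, "resolution_degree": 0.9},
    {"resolved": False, "importance": 1.0, "resolution_degree": 0.3},
])
print(rate(real, ideal))  # (55.0, 'good')

A comparison of the per-primitive contributions of the ideal and the real joke is also what would allow a system to point at the weak parts of a joke, as noted above.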
3.5 Conclusion

Based on an investigation of the major humour theories, this chapter defined five crucial primitives for the automatic generation of humour, i.e. readiness, timing, exaggeration, incongruity and derision. We then introduced a novel method of combining these primitives with narrative structures, by introducing 26 humour strategies, which can support the generation of numerous meaningful and humorous film sequences. Moreover, we showed that each of these strategies supports the generation of either structure or content for a visual joke, and concluded that this will influence our design of the planning system (i.e. multi-dimensional), to be described in later chapters of this thesis. Finally, we described a novel mechanism to evaluate the potential funniness of a joke, which is correlated with the introduced strategies.

Chapter IV
Film

The main aim of this thesis is to investigate the potential for the automated generation of meaningful and emotionally stimulating film sequences. In dealing with such a task, it is necessary to investigate the medium itself, which obviously plays a central role in any communication. Thus, we now examine theories of film structure which will enable us to identify, and then represent, the knowledge elements and the filmic mechanisms necessary for the creation of meaning through visual material.

4.1 Cinematic meaning

We begin our analysis of film by adopting Tudor's perspective on cinematic meaning, as described in Table 4.1, which categorises the various sources of meaning in a film.

Channels \ Aspects of meaning | Cognitive | Expressive | Normative
Nature of film world | Factual nature of film world (genre) | Emotional meanings, associated with film world (genre) | Normative meanings implicit in film world
Thematic structure | Events in thematic development, e.g. plot | Emotional involvement in thematic structure (e.g. humour strategies) | Normative meanings implicit in thematic structure
Formal structure | Factual meanings conveyed by form | Emotional consequences of formal structure | Normative meanings conveyed by formal means

Table 4.1 Tudor's paradigm of cinematic meaning. Tudor (1974, p. 128). [Text in brackets added by the current author.]

Tudor's paradigm is organised around a cross-classification of the aspects of meaning and the channels through which meaning is communicated. The aspects of meaning can be subdivided into, firstly, cognitive aspects, which inform the spectator about actions, appearances, events, and so forth; secondly, expressive aspects, which evoke emotions; and, thirdly, normative aspects, which shape the ethical inferences the spectator draws from the film. Transmission channels are classified as:

Nature of film world
The characteristic body of human content or reality.

Thematic structure
The order of events and development of themes to construct the narrative.

Formal structure
The five categories of substance defined by Metz: the moving photographic image, the recorded phonic sound, the recorded musical sound, the recorded noise and the graphic tracing of written matter (Metz, 1976b, p. 586).

In chapters 2 and 3, we laid down a theoretical foundation for the cognitive and expressive aspects of the channels film world and thematic structure. It was shown that a narrative is pre-selected and arranged material, based on an intention to communicate a particular theme or idea to the perceiver. The chosen material is taken from characteristically conceptualised areas of human, social and natural content (genre). Furthermore, we considered how channels serve to evoke the viewers' emotions. In film, this is achieved through emotional overtones pertaining to the content of the material.
Within the narrational structure, emotional responses are evoked particularly by the thematic structure, e.g. humour strategies. Applying this theoretical background to film allows us to investigate the structure of film so that we can see why film "works". Meaning is always inextricably linked to the medium and its content, which form and carry the message. The goal of analysing the medium is to reveal the textual or formal aspects of film, and how these support the creation of meaning. We are specifically interested in the information carried by the visual image, its formal permutations and its capacity to communicate expressive meaning.1

1 The author is aware of the fact that the absence of sound is a strong qualitative drawback, since sound provides three out of the five standard categories of substance. Sound defines space and time, in addition to being a powerful commentative story element. Instrumental music, for example, was in use during the era of the silent movie (Güttinger, 1984). A recent example of the effective use of music is Richard Linklater's film Dazed and Confused, which offers a key to the feelings and ideas of the mid 1970s. Nevertheless, sound was excluded because its integration is too great a challenge for such a short-term project.

Normative aspects of meaning are not considered in depth in this thesis, though the author is aware of the relevance of questions relating to ideology or "world view" to a film's author. However, we leave the addressing of such questions to further research. The chapter closes with a discussion of a model of film editing that represents the process of generating meaningful film sequences.

4.2 Phenomenological Approaches to film

When Parisians filed into the Salon Indien, where the Lumières held their first film showing before a paying public, they were excited by the lifelike portrayal of motion, which is of course one of the striking features of film and the main reason behind the psychological effect of reality (Metz, 1974).2 Motion is an almost totally visual experience (a counterexample from the real world is a passing train, which can also be detected audibly and haptically) that cannot be reproduced unless one recreates the same order of reality, which means either repeating the impression of reality or repeating reality itself. The physiological phenomenon responsible for the illusion of movement is stroboscopic motion, which describes the effect of an image being retained on the retina of the eye for a short time after it is experienced. If the image is displaced by another image in quick succession, the illusion of movement is maintained (Hochberg, 1978).3

Unlike a photograph, which also confronts us with reality in the form of the objective reproduction of objects, film includes the element of motion. It is the combination of motion and the appearance of form which together "force" us to perceive a film as something real. The interaction between movement and form causes a liberation of the objects from the two-dimensional screen, so that they stand out as substantiated figures against their surroundings, or as Metz describes it: 'Movement brings us volume, and volume suggests life.' (Metz, 1974, p. 7).

2 However, the illusion of realism conveyed by photography and motion pictures is historical, because the public learned culturally to distrust them.
The photograph or film is understood as an object that has been worked on, with respect to composition and construction, under the influence of aesthetic or ideological norms (Barthes, 1977, p. 19). The analysis of the image later in this chapter provides additional evidence of this.

3 The standard projection rate for a film is 24 images per second. However, European television films use a speed of 25 frames per second, due to the frequency of 25 images per second used by European television systems (Monaco, 1981). The playback rate of a video camera is normally 30 frames per second.

It can be argued, however, that motion is not solely responsible for the creation of reality in film. Describing a still photograph as a frozen moment from the past (Bazin, 1967a), it is easy to understand why we do not take a photograph as something real, despite its objectivity. We know that this particular image once existed in front of the lens, but that we are, at the moment of observing, confronted with a mere reproduction. Although this fact is also indisputably true for images in film, i.e. the objects and characters are reproductions, we are still encouraged to believe that we are witnessing the occurrence of a real event. An example of the engagement of the film audience in a portrayed event occurred in connection with a scene from Polanski's "Rosemary's Baby" (see Picture 4.1). Ruth Gordon is sitting on a bed, talking on the telephone. One can see her back and parts of her head, but not her face. William A. Fraker, the film's cameraman, reported in an interview that he and Polanski actually observed members of the audience attempting to turn their heads to the right so that they could see around the wall and door frame to have a complete view of the character. So, reality assumes presence, which is created out of two parameters, space and time.

Picture 4.1 Ruth Gordon in Polanski's Rosemary's Baby (1968)

Metz (1974) demonstrated, in comparing film with theatre, that the latter cannot be a convincing duplication of life, because it is so obviously a part of life (e.g. social ritual, real actors, real stage, etc.). Despite the fact that the actors pretend to be in a different time and location, it is impossible for the spectator to accept the performance as reality, because he or she shares the same space as the actors. The impression of reality in the theatre is created indirectly and by virtue of a cultural awareness (Bettetini, 1973). Film, on the other hand, confronts the spectator with real people and objects, and their actions take place within a seemingly real environment.4 Moreover, there is no connection between the space of the spectator's reality (i.e. sitting in a chair) and the space of the fictional film reality (e.g. Marlon Brando sitting in a Cambodian temple during the Vietnam war). The result is a steady mixture of the reality of the ongoing film fiction and our perception of it.5 In other words, the denotative material of the film becomes real through our identifications and projections. One aspect of film reality is, therefore, our imagination, which means that film reality is created within us, the audience.

4 This is the characteristic body of 'human content', as discussed while describing the concept of genre in chapter 2, of which certain factors are constitutive for film reality. Within this reality it is narrative that develops 'anything from outer space to an inner dream-world.' (Tudor, 1974, p. 113).
5 With reference to the above, it should be clear that the reality reflected in a film is not absolute, as suggested by the theoretical approaches of Kracauer (1960) and Bazin (1967b, 1971), but rather multiple, and thus open to permutation and combination.
To intentionally stimulate the viewer's imagination, we require a functional understanding of the two formal structures within film: the content (realised through the image) and its temporal order (realised through montage). Since the relationship between the two representational systems is complex, it is useful to examine them separately to determine their relevance to the automated generation of meaning. We begin the investigation with the image.

4.2.1 Film Image

The strongest impression of a film comes from its images. After seeing a film we might not remember particular effects achieved by a skilled use of camera movements or editing techniques, but the effect of the images remains, such as the rainy, gloomy streets in Blade Runner, the expressionistic setting of Das Kabinett des Dr. Caligari, the old man in Murnau's film Der letzte Mann, who sits dejected in the lavatory after being relegated from his position as the doorman of the Hotel Atlantis to that of toilet attendant, or the impressionistic lighting in Murnau's Sunrise. Though the message of the images is perceived by virtue of visual perception, which is, in its physical aspects, the same for all perceivers, we can see from the above collection of visual memories that it is also related to the experience and knowledge of the viewer, and thus differs between people and cultures. In this respect, film perception is similar to language understanding. Hence, the language metaphor is a common feature of theoretical analyses of motion pictures.6 In fact, linguistic tools can be useful in the analysis of images. Visually literate people, those who have learned to comprehend visual material on anthropologic-cultural and psychological grounds, can understand film more readily, and so the analogy hints at shared communicational mechanisms which may thus be useful for the analysis of the medium. The analogy between film and spoken or written language is, however, inadequate (Bettetini, 1973), as it breaks down when one attempts to identify filmic equivalents to words and sentences. Given the assumption that the word is the smallest meaningful unit in spoken language, the question arises as to whether the single image represents the filmic equivalent.7 Obviously not, since each image includes an indefinite amount of visual information, which provides a continuum of meaning. Since it is this visual information that forms the minimal unit within a film, it seems more appropriate to see film as a semiotic system, where semiology is understood not as a translinguistic system for examining all sign systems in terms of linguistic principles (Barthes, 1967), but rather as a general discipline for the study of signs, in which linguistic and cinematic signs constitute a specific topic (see Eco (1977, 1985); Jakobson & Halle (1980); Peirce (1960); Saussure (1966)). Semiotic theories form the basis of the following discussion.

6 See especially Metz's article Problem of Denotation in the Fiction Film in Metz (1974, pp. 108 - 146), Barthes (1967; 1977) and Carroll (1980).
7 The current author is aware that the smallest meaningful unit within the sentence is actually the morpheme.

4.2.1.1 The sign

The content of an image is composed of objects, where each object is a sign.
According to Saussure (1966), a sign usually consists of two distinguishable components: the signifier (which carries the meaning) and the signified (the concept or idea signified). There are two significant aspects to Saussure's notion. Firstly, it is important to be aware that the signified is not a mere referent of the signifier. The relation between the two elements is not a naming-process only, as the signified resembles not a thing but a concept. Secondly, the relation between the signifier and the signified is arbitrary. Saussure states:

'The idea of "sister" is not linked by any inner relationship to the succession of sounds s-ö-r which serves as its signifier in French; that it could be represented equally by just any other sequence is proved by differences among languages and by the very existence of different languages: the signified "ox" has as its signifier b-ö-f on one side of the border and o-k-s (ochs) on the other. (...) The word arbitrary also calls for comment. The term should not imply that the choice of the signifier is left entirely to the speaker (...); I mean that it is unmotivated, i.e. arbitrary in that it actually has no natural connection with the signified.' (Saussure, 1966, pp. 67 - 69).

It is, in particular, the arbitrariness of the relationship between signifier and signified which enables the creation of higher-order sign systems and their diversity. However, the principle that the signifier has no natural connection to the signified is problematic for photographic images, where the resemblance of signifier and signified is a key factor. Unlike spoken or written language, the film image depicts, and the viewer does not usually have to struggle to identify what it shows. The denotative power of film, the optical pattern, communicates a precise knowledge, which releases the audience from the process of decision making but nevertheless leaves a problem of interpretation, as is discussed later in this chapter. Since Saussure's definition of a sign is unsatisfactory with respect to the relation between signifier and signified, we arrive at a more comprehensive definition by adopting Peirce's view:

'A sign, or representamen, is something which stands to somebody for something in some respect or capacity. It addresses somebody, that is, creates in the mind of that person an equivalent sign, or a perhaps more developed sign. That sign which it creates I call the interpretant of the first sign. That sign that stands for something, its object. It stands for this object, not in all respects, but in reference to a sort of idea, which I have sometimes called the ground of the representation.' (Peirce, 1960, p. 2228).

Eco (1977) argues that, even though Peirce's definition seems very similar to Saussure's, in that both base their view of a sign on the combination of a signifier and a signified, Peirce's definition is wider, as it actually incorporates three distinct phenomena of communication: the sign in relation to itself, in relation to the object, and in relation to the interpretant (Peirce, 1960, pp. 1529 - 1572). Thus, neither any kind of intention nor an artificial production is required. The arbitrariness of a sign is still assumed, as in Peirce's understanding it is described as the continual reference of one sign to another or to a string of signs.
This constant transformation of signs into other signs is not based on the representation of real factors with immediately motivated meaning, but rather on the presentation of effects of conventionalisation. This is highly significant, as it shows that the signification of visual material is based not only on its denotative characteristics but also, and more importantly, on its connotative units.

4.2.1.2 Sign and idea

Though the strong denotative quality of the iconic sign can be understood as that which the creator intended it to be, it should be clear at this point that different connotations can be attributed to the sign depending on the circumstances and abductive presuppositions of the receiver at the time of perception, along with the various codes and subcodes the receiver uses as interpretational channels. An image is a cultural product, and as such offers more than merely the sum of its denotative material. For example, an image of a rose in a political portrait of Richard III can be understood as that which it is, i.e. a rose. However, the connotation of the rose in this context means more than the rose itself, because, depending on its colour, it represents either the house of York (white) or the house of Lancaster (red). Furthermore, film-specific devices, such as camera angle, colours, clearness or vagueness of the image, etc., give rise to connotations that provide additional information. For example, a low-angle view of a white rose might indicate the dominance of the house of York. The same rose portrayed from an overhead view might give the opposite impression. Thus, an important feature of the signification within an image is the organisation of signs. Jakobson identifies two fundamental operations that exploit the organisation of signs:

'(1) Combination. Any sign is made up of constituent signs and/or occurs only in combination with other signs. This means that any linguistic unit at one and the same time serves as a context for simpler units and/or finds its own context in a more complex linguistic unit. Hence any actual grouping of linguistic units binds them into a superior unit. Combination and contexture are the same operations.

(2) Selection. A selection between alternatives implies the possibility of substituting one for the other, equivalent to the former in one respect and different from it in another. Actually, selection and substitution are two faces of the same operation.' (Jakobson & Halle, 1980, p. 74).

According to Jakobson, the application of the above operations to signs results in a system of meaning based on alternations and alignments.8 In the process of alternation (or choice) a sign is compared, not necessarily consciously, with potential but unrealised candidates of the substitutional space, i.e. choices such as camera angle (high - low), colour (bright - dull), appearance (fresh rose or fading), etc. The organisation of a paradigm is usually described as a vertical structure. The process of alignment (or combination) focuses on the order of signs, where the relationships between the signs resolve their meaning, or in Saussurean terms:

'Combinations supported by linearity are syntagms. The syntagm is always composed of two or more consecutive units [...]. In the syntagm a term acquires its value only because it stands in opposition to everything that precedes or follows it, or to both.' (Saussure, 1966, p. 123).
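Jakobson's two operations lend themselves to a direct, if simplistic, rendering. The sketch below is purely illustrative; the paradigms are invented, echoing the rose example above:

# Selection: a sign is chosen against the unrealised alternatives of its
# paradigm; combination: the chosen signs are ordered into a syntagm.
PARADIGMS = {
    "camera_angle": ["high", "low", "eye-level"],
    "colour": ["bright", "dull"],
    "rose": ["fresh", "fading"],
}

def select(axis, choice, paradigms=PARADIGMS):
    """Selection along the paradigmatic (vertical) axis."""
    assert choice in paradigms[axis]
    return (axis, choice)

# Combination along the syntagmatic (horizontal) axis.
syntagm = [select("camera_angle", "low"),
           select("colour", "bright"),
           select("rose", "fresh")]
print(syntagm)  # a low-angle, bright image of a fresh rose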
The paradigmatic and syntagmatic axes of meaning, as described in Figure 4.1, are the basic supporting structures for signification, for any symbolic process or system of signs. A particularly interesting point made by Jakobson is that a sign system does not consist only of the two fundamental structures (paradigmatic and syntagmatic), but that each crystallises into a rhetorical device, i.e. the paradigm into the metaphor and the syntagm into the metonym, which are opponents.9 This means that, for Jakobson, even these "free" variations deal with codes that are based on systems of opposition and difference within the language of a culture, a social group or an individual.

8 This extends Saussure's syntagmatic and associative understanding of the linguistic planes of meaning, as described in Saussure (1966, p. 123).
9 However, Whittock (1990) argues that there is no polarisation between the two rhetorical principles of organisation, as suggested by Jakobson, but rather that metonymy is one major type of filmic metaphor (the second being what Whittock calls distortion metaphors) (pp. 35 - 36).

[Figure 4.1: items of clothing (hat, scarf, sweater, pants, shorts, knickers, skirt, kilt, culottes, tights, socks, shoes) arranged along the paradigmatic and syntagmatic axes.]
Figure 4.1 Syntagmatic and paradigmatic structures of clothing (Monaco, 1981, p. 341).

This leads to the heart of signification and offers the link to Peirce's understanding of the relationship between object and sign. In his famous trichotomy, Peirce defines a sign as being either symbolic, iconic or indexical. The characteristics of each can be described as follows:

• Icon: a sign which represents its object mainly through its similarity with some properties of the object, based on the reproduction of perceptual conditions. A zebra, for example, can be identified from at least two characteristics - four-leggedness and stripes.
• Index: a sign which represents its object by an inherent relationship. Examples: a man with a rolling gait can indicate drunkenness, a sundial or clock can indicate the time of day, a wet spot on the ground indicates spilt liquid, etc.
• Symbol: a sign with an arbitrary link to its object (the representation is based on convention). Examples: the traffic sign for a forbidden direction, and the cross as an iconographic convention.

The above types are not mutually exclusive. Clearly, icons are the most prevalent within images. In the ensuing discussion, we first focus on icons, since they form the basis for the other two types of sign. Due to their "similarity" with the represented object, iconic signs are a major factor in achieving the effect of realism. However, Eco showed that the general arbitrariness of a sign is valid even for this type of sign:

'Let's look again at the frame indicated by Pasolini - a teacher talking to students in a classroom. Consider it at the level of one of its photograms, isolated synchronically from the diachronic flux of moving images. Thus we have a syntagm whose component parts we can identify as semes combined together synchronically - semes such as 'a tall, blond man stands here wearing a light suit...etc.' They can be analysed eventually into smaller iconic signs - 'human nose', 'eye', 'square surface', etc., recognizable in the context of the seme, and carrying either denotative or connotative weight. In relation to a perspective code, these signs could be analysed further into visual figures: 'angles', 'light contrasts', 'curves', 'subject-background relationships'.' (Eco, 1976, pp. 601 - 602).
Thus, an image is based on the triple articulation of photograms, iconic signs and iconic semes, and receives its expression by convention. From the above, it should be clear that an image itself should be considered to be a seme. In his pioneering structural analysis of the cinematic image, Eco (1977; 1985) further classifies the underlying code system for the triple articulation of the image described above. Eco argues that the signification of signs is based on a socially determined reticular system of small semantic systems and rules for their combination. He defines a number of such systems, of which those relevant to this thesis are:

Recognition codes
Structural blocks of perceptive qualifications (signifieds) which are transformed into semes, such as black stripes on white fur, based on which objects are recognised.

Iconic codes
Perceivable elements that can be subdivided into figures, signs and semes. A figure forms conditions for perception, such as relationships between object and background, contrast in light, or geometrical proportions.10 A sign denotes, using conventionalised graphical methods, units of understanding (nose, ear, sky, cloud), abstract models, or idealised diagrams of the object (the sun as a circle with thread-shaped beams). Semes are complex iconic phrases (the image of an object). Iconic codes change readily within the same culture, due to their contextual interlacing (a horse as part of a shop label may suggest the availability of equestrian products, while a horse on a traffic sign may suggest "beware, horses on road"). Figures represent structures on the syntagmatic axis, but only within the frame of an image. Signs and semes represent structures on the paradigmatic axis.

10 All of these codes have been developed and refined by other visual arts, i.e. painting, sculpture and photography. Arnheim (1956) proposes ten determinants: balance, shape, form, growth, space, light, colour, movement, tension and expression.

Iconographic codes
Semes of iconic codes are composed into complex and culturalised semes, e.g. "the four horsemen of the Apocalypse". Iconographic codes represent structures on the syntagmatic axis.

Rhetorical codes
Models or norms of communication which can be divided into rhetorical figures (e.g. metaphor), premises (e.g. a man riding along a never-ending prairie can connote loneliness), and arguments (which create syntagmatic connotations based on the succession or opposition of different images).

Stylistic codes
A stylistic feature, such as the mark of an author (for example, a man walking along a road tapering off into the distance suggests "Chaplin"), the typical realisation of an emotion (a woman who leans seductively against the curtain of an alcove suggests "Erotic of the Belle Époque"), or the typical realisation of an aesthetic, technical-stylistic ideal (as in cubism, where objects are portrayed in abstract, geometrical forms).

The organisational structure of the above described system of cultural units (interpretants) is based on semantic fields. Our understanding of a semantic field follows the definition of Bordwell (1989, p. 106): '...a conceptual structure which organises potential meanings in relation to others'.11 A semantic field can be built on various principles (Bordwell, 1989), such as:

clusters - terms within a semantic field that overlap semantically, e.g. synonyms;
doublets - semantic fields organised on the basis of polarities, e.g. oppositions;
proportional series - a series of oppositional doublets, e.g. female - male, passive - active, womb - phallus, etc.;
hierarchies - ordered semantic units based on relations of inclusion or exclusion, e.g. Pekinese/dog/animal/living thing.

11 See also Eco (1977, pp. 73 - 150).

The conclusion from the above discussion concerning the iconic sign is that the line between denotation and connotation is continuous, which becomes explicit when considering the sign as index. This type of sign is the most dynamic, as it is neither "identical", like the iconic sign, nor arbitrary, like the symbol. The images shown in Pictures 4.2 - 4.3 represent examples of how indexical signs create meaning. Spike Lee's movie "Do the right thing" portrays 24 hours of a stiflingly hot day in Brooklyn (see Picture 4.2). The concept of hotness is suggested in the film by showing "hot" colours (the red wall as background) for the three men sitting under the sunshade, objects like the sunshade, and aspects of appearance, such as the way in which the men are sitting.

Picture 4.2 An image taken from Spike Lee's Do the right thing (1989)
Picture 4.3 Liv Ullmann in Bergman's Shame (1968)

In Bergman's film "Shame", money is seen to be placed beside the head of the main female character (see Picture 4.3), who is lying on a bed. This creates the concept of prostitution. The above examples of indexes are metonymic, since an associated detail or notion is used to invoke an idea or represent an object. Related to this is the concept of a synecdoche, where the part stands for the whole or the whole for the part. A visual example of a synecdoche is marching feet representing an army.

At this point in our analysis of the image, it may seem that only paradigmatic choices are relevant to the signification of an image. This is not strictly true. Since an image creates a space, there are also essential syntagmatic influences on the understanding of an image. As an example of such an influence, we consider the three-dimensionality in a two-dimensional image, which works on composition planes: frame (the image), geography (bottom line to horizon) and depth.12 Example subcodes within these planes are depth perception or latent expectation. Depth perception is built upon convergence, relative size, density gradient and occlusion (overlapping). Thus, for example, the importance of an object is related to its position and size within a frame. However, social conventions dictate certain expectations about space (e.g. a text can be read from left to right, right to left, or from the top down). Other spatial expectations are, for example, that the bottom of an image is more important than the top, horizontals are more important than verticals, a diagonal line from bottom left to top right goes up, etc. Needless to say, there are more codes operating in a static image, e.g. form and line, symmetric versus slanting composition, lighting, and so forth. It should be stressed here that the diversity of the semantic system described above provides, with its combinatorial possibilities, the foundation for a subjective interpretation by each viewer, as mentioned earlier. The image, embedded in a myriad of perceptual, cognitive and cultural codes, is subjective in its accessibility. Consider Picture 4.4, which is taken from Bertolucci's "The Last Emperor".
12 Frame and image are not actually synonymous. The frame determines the limit of the image, and thus framing becomes part of the process of mise-en-scène, which takes place on the set and includes decisions as to the position of actors, placing of cameras, choice of lenses, etc. However, in film, the terms image and frame are used interchangeably.

Picture 4.4 Image from Bertolucci's The Last Emperor (1987)

One of the key structural elements in this movie is a colour code, which accompanies different stages of Pu Yi's life. For example, when he cuts his veins, he sees, for the first time in the film, red as a pure colour. This colour represents our beginning, our birth, according to Vittorio Storaro, the film's cameraman. The complexity of this image, however, derives from the added concept of suicide, which now merges with the idea of birth into that of rebirth. Codes can only realise their full potential impact if there is an awareness of them or, in other words, if they can relate to existing knowledge. The analysis of an image as an abstract element showed that it can be experienced optically (objectively, realistically) and mentally (subjectively). However, both levels are necessary for the creation of meaning. Figure 4.2 summarises the compositional and interpretational structures that enable the perceiver to understand an image.

Figure 4.2 The compositional and interpretational structures that make up the image (based on Monaco (1981, pp. 144 - 145)).

The results gained thus far from our analysis of the image system enable us to identify initial requirements for the representation of video and the architectural structure of an automated editing system. There is clearly a need for a representation of images which operates on the level of iconic signs and, perhaps, semes. The description of the images themselves should be as objective as possible, in order to facilitate the derivation of various connotative meanings of the material. This requirement dictates that the representation applied directly to the visual material should contain no connotative descriptions of that material (similar points are made by Parkes (1989a) and Davis (1995)). There should be a strict separation between the representation of cultural and visual codes in the form of semantic fields, and the mechanisms for traversing networks of these semantic fields, as sketched below. However, an exhaustive inventory of the codic, constitutive parameters of an image must always be limited by the extent of the representation used, and hence it is very doubtful that an automated system for the creation of meaningful sequences could ever operate in a completely domain-independent way.
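The separation demanded above might, purely illustratively, take the following shape (invented structures, not AUTEUR's): the annotation attached to the material stays denotative, while connotations are derived by traversing separately held semantic fields (cf. the principles listed in section 4.2.1.2):

# Strictly denotative annotation of the visual material (iconic signs and
# semes only; no connotative descriptions).
DENOTATIVE_ANNOTATION = {
    "shot": "s17",
    "signs": ["man", "coffee_machine", "hall"],
    "semes": ["man stands near coffee machine"],
}

# Connotations live apart, as semantic fields (clusters, doublets,
# hierarchies) that inference mechanisms can traverse.
SEMANTIC_FIELDS = {
    "clusters": {"hall": ["foyer", "lobby"]},
    "doublets": {"bright": "dull", "high_angle": "low_angle"},
    "hierarchies": {"pekinese": "dog", "dog": "animal",
                    "animal": "living_thing"},
}

def connote_up(term, fields=SEMANTIC_FIELDS):
    """Traverse a hierarchy towards ever more general cultural units."""
    chain = [term]
    while chain[-1] in fields["hierarchies"]:
        chain.append(fields["hierarchies"][chain[-1]])
    return chain

print(connote_up("pekinese"))  # ['pekinese', 'dog', 'animal', 'living_thing']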
4.2.2 Film Movement

Film is a dynamic medium, and therefore we must analyse not only images but the transitions between images. Of particular interest is the effect on the identifiable semantics of the single image when it appears in a shot, and how the syntagmatic behaviour of shots is to be specified.

4.2.2.1 From frame to shot

In our above analysis of the image, we discussed the relationship between image and frame. We showed that even the static image has underlying semantic structures, for example iconic codes, that transform the image into a compositional unit. Of higher complexity is the relationship between the frame and the shot, where a shot is 'a single piece of film, however long or short, without cuts, exposed continuously' (Monaco, 1981, p. 452). The additional element here is movement, which provides the basis for the understanding of action, distance and the relationship between characters, based on the relationship between images within a shot and their rhythmical variations.

In his work on visual codes, Eco (1976; 1985) showed that there is a difference between real physical actions and those represented in the cinematic medium:

'But passing from the photogram to the frame, the characters accomplish certain gestures: the icons generate kines, via a diachronic movement, and the kines are further arranged to compose kinemorphs. Except that the cinema is a little more complicated. As a matter of fact kinesics has raised the question of whether kines, as meaningful gestural units (and thus, if you like, equivalent to monemes, and definable as kinesic signs) can be decomposed into kinesic figures - i.e. discrete kine fractions having no share in the kine meaning (in the sense that a large number of meaningless units of movements can compose various meaningful units of gesture). Now kinesics has difficulty in identifying discrete units of time in the gestural continuum. But not so the camera. The camera decomposes kines precisely into a number of discrete units which still on their own mean nothing, but which have differential value with respect to other discrete units. If I subdivide two typical head gestures into a number of photograms (e.g. the signs 'yes' and 'no'), I find various positions which I can't identify as kines 'yes' or 'no'. In fact, if my head is turned to the right, this could either be a figure of a kine 'yes' combined with a kine 'nodding to the person to the right' (in which case the kinemorph would be 'I'm saying yes to the person on the right'), or the figure of a kine 'no' combined with the kine 'shaking the head' (which could have various connotations and in this case constitutes the kinemorph 'I'm saying no by shaking my head'). Thus the camera supplies us with meaningless kinesic figures which can be isolated within the synchronic field of the photogram, and can be combined with each other into kines (or kinesic signs) which in their turn generate kinemorphs (or kinesic semes, all-encompassing syntagms which can be added one to another without limit).' (Eco, 1976, pp. 602 - 603).

Eco not only shows the difference between real and cinematic action, but also provides a semiotic outline of the paradigmatic and syntagmatic dimensions of the shot, which places him unequivocally on the side of the "montage-roi", Eisenstein, whose approach to editing represents the most systematic attempt to address problems associated with the notion of fragmentation (Eisenstein (1948, 1951, 1988, 1991)).13 Eisenstein describes the shot as a fragment from which the total filmic expression is composed. He insists, at great length, on the possibility of mastering the constitutive parameters of the image for a successful shot analysis, and shows that each fragment features the same paradigmatic and syntagmatic structures as the image.14 For this reason, Eisenstein describes the frame as the cell of montage.15 The creation of a fragment through filming thus supports the specific translation of necessary actions in space into a cinematic vocabulary, based on which potential meaning can be extracted (see Eisenstein (1991, pp. 11 - 57)). Thus, elements that operate even in the static frame now take on extra connotative power due to their dynamic qualities.

13 This statement is not meant to diminish the achievements of other theoreticians such as Kuleshov, Pudovkin or Vertov.
14 Eisenstein was very much aware of the existence of these parameters, but did not describe them, except in the form of specific examples. The attempt to provide a complete collection of fragmentation rules remains, perhaps due to the nature of the material, limited to empiricism. See, for example, the excellent but restricted approaches of Arnheim (1983) and Burch (1981).
15 However, Eisenstein also described the shot as a montage cell (see, for example, Eisenstein (1988, p. 144)).
The compositional use of focus, for example, through which the foreground, middle ground or background are emphasised, guides the perception of a shot. If all planes are represented in deep focus, they are attributed the same level of importance, whereas emphasis can be achieved by the use of shallow focus. Citizen Kane, by Orson Welles, provides many well-known examples of the use of focus in these ways. Of even stronger impact than focus is camera movement. Also worthy of consideration here are the pan (from left to right or right to left around the imaginary vertical axis of the camera), the tilt (the up and down movement of the camera, rotating around the axis that runs horizontally through the camera head), and the roll, in which the camera moves about the longitudinal axis from the lens to the subject. The tilt, for example, presents the eye-level from which a scene is perceived. The tilt can affect the importance ascribed to an object (for example, high-angle shots may diminish the perceived importance of an object, as discussed earlier in the example of Richard III). Using the dynamic qualities of film, specific elements can, in one shot, directly provoke an emotional reaction. Imagine a shot in which the camera follows a character through a group of cheerful, passionate people. The appearance and disappearance of the group in itself connotes the character's sense of isolation. It should also be mentioned that the tempo of a shot can, in itself, provide information. The intense feeling of fast movement may excite, while calm movement, expressed, for example, through the slow rolling of waves filmed from a static camera position, may encourage feelings of relaxation. Related to tempo is the perceived duration of the shot. The actual duration of a long shot full of people and action may well be identical to that of a close-up of a face, and yet the latter will be perceived as being longer. Hence, the organisation of perceptible duration is more complex than the actual duration of a shot.16

16 For a discussion of the interesting relationship between rhythm and shot composition, see Eisenstein's article Vertical Montage in Eisenstein (1991), which provides diagrams in which a sequence of his film Alexander Nevsky is described in musical terms; see also the first two chapters of Burch (1981), which feature the use of analogy between serial music and montage.

It should be made clear that, in itself, the shot is an individualised unit with an independent semantics, based on the juxtaposition of the intra-frame components (for a similar argument see Parkes (1989a) and Davis (1995)). However, we must also consider the ways in which the content of a shot can be affected by other shots, which is the domain of montage.

4.2.2.2 Montage, or the semantics of fragmentation

A landmark in the understanding of film perception was the Kuleshov experiments (Kuleshov, 1974). Kuleshov found that the juxtaposition of two unrelated images would force the viewer to find a connection between the two. In one experiment, described by Pudovkin (1968), Kuleshov focused on the creation of artificial emotions.
He took a close-up of the well-known actor Iwan Moszhukin, with a vacant expression, and intercut it with shots of a bowl of soup, a dead man and a lascivious woman, to create three distinct sequences. Spectators to whom these three sequences were shown believed that the actor's facial expression showed hunger in the first sequence, sadness in the second, and passion in the last. Other experiments were concerned with the artificial creation of space and character:

'A few years later I made a more complex experiment. Khokhlova and Obolensky acted in it. We filmed them in the following way: Khokhlova is walking along Petrov Street in Moscow near the 'Mostrog' store. Obolensky is walking along the embankment of the Moscow River - at a distance of about two miles away. They see each other, smile and begin to walk toward one another. Their meeting is filmed at the Boulevard Prechistensk. This boulevard is in an entirely different section of the city. They clasp hands, with Gogol's monument in the background, and look - at the White House! - for at this point, we cut in a segment from an American film, The White House in Washington. In the next shot they are once again on the Boulevard Prechistensk. Deciding to go farther, they leave and climb up the enormous staircases of the Cathedral of Christ the Saviour. We film them, edit the film, and the result is that they are walking up the steps of the White House. [...] In the second experiment we let the background and the line of movement of the person remain the same, but we interchanged the people themselves. I shot a girl sitting before her mirror, painting her eyelashes and brows, putting on lipstick and slippers. By montage alone we were able to depict the girl, just as in nature, but in actuality she did not exist, because we shot the lips of one woman, the legs of another, the back of a third, and the eyes of a fourth. We spliced the pieces together in a predetermined relationship and created a totally new person, still retaining the complete reality of the material.' (Kuleshov, 1974, pp. 52 - 53).

The above experiments demonstrate two distinct, but mutually influential, aspects of our understanding of film:

• the meaning of a shot depends on the context in which it is situated;
• a change in the order of shots within a scene changes the meaning of the shot as well as the meaning of the scene.

Experiments investigating the "Kuleshov effect" ascertained the variability of the meaning of a shot within different contexts (Herman D. Goldberg, described in Isenhour (1975); Salomon & Cohen (1977); J.M. Foley, referenced by Isenhour (1975)). Experiments concerning contextual detail (e.g. direction of movement) were performed by Frith & Robson (1975), who showed that a film sequence has a structure that can be described through selection rules and/or combination rules. An example is the continuity of direction within movement, e.g. if a character leaves a shot to the right, we expect him to enter the next shot from the left. Gregory (1961) is responsible for some of the most significant analyses of the importance of context and order in film editing.
Gregory claims that not every combination of shots creates meaning, but that there are restricted conventions that can help to create larger meaningful entities. His key elements for creating meaning by joining shots are assertions and associative cues. An assertion is the relationship between two elements. There are many different types of such relationships. For example, the description of an attribute (such as red for a car) could be as important as a simple action (two men shaking hands). Consider a close-up of a woman who is looking down, followed by an image showing a hand holding an electric mixer directed into a bowl. The assertion made by this juxtaposition is that the woman shown in the first shot is preparing some food (Gregory, 1961, p. 39). Gregory argues that a given shot "A" can build divergent assertions with other shots by using various subsets of the information gathered from shot "A". This is especially important, as it means that the juxtaposition of shots can be analysed, in that the shot can be used as a variable collection of information rather than a fixed visual description.

Associative cues result from combinations of the indicators that make the creation of meaning possible. Gregory introduces two main groups of cues as being important in the creation of assertions. The first includes cues for the surrounding space. Most human activities, human roles or objects are associated with specific locations. The conceptualisation of space is, therefore, an elementary principle of the analysis and organisation of material in editing. The second type of cue is related to human actions, of which the above description of the woman cooking is an example. If only a small number of cues (or none) can be found, the two juxtaposed images invite a combined interpretation by virtue of their ordering, but are nevertheless perceived by the viewer as isolated units. In such a case, the resulting combination is usually meaningless and can be irritating. The main impact of Gregory's work is to show that the juxtaposition of shots is subject to a situation plan, in which the action, emotional circumstances and timing of the potential scene are defined. It is this plan which makes editing possible (Wulff, 1993).

Thus, montage makes a point. For this thesis, an important factor is the extent to which a pattern of fragmentation can add "emotional overtones" to a sequence. Put another way, we are interested in those elements of montage that add extra connotation to the emotional patterns already established by the narrative itself. However, we will limit ourselves to some types of montage, since a full exploration of montage is beyond the scope of this thesis.17 The types of montage we investigate are metric, rhythmic, tonal and polyphonic montage, all introduced by Eisenstein (1988; 1991).

17 Such a work could also pay attention to methods introduced by Arnheim (1983) (the formalistic approach), Balázs (1972) (the idea of micro-dramatics in the close-up), Pudovkin (1968) (relational editing, which supports narrative using contrast, parallelism, symbolism, simultaneity and leitmotiv) and Dziga Vertov, as described in Petric (1987) (notably interval and uninterrupted montage).

The criterion which establishes metric montage is the absolute length of the shots, where: 'Tension is achieved by the effect of mechanical acceleration through repeated shortening of the length of the shots while preserving the formula of the relationship between the lengths ('double', 'triple', 'quadruple', etc.).' (Eisenstein, 1988, p. 186). This means that shots can be combined according to particular rhythms, such as a march or a waltz. It is not necessary that the meter should be immediately recognisable, nor is it advisable to establish too complex a rhythm, but the rhythm is nevertheless a condition for the creation of feelings. An example is the Caucasian dance scene in Eisenstein's film October.
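Eisenstein's 'double', 'triple' and 'quadruple' formula can be read as simple arithmetic over shot lengths. The sketch below is illustrative only; the frame counts are invented:

def metric_cutting_list(base_frames, ratio=2, steps=4):
    """Repeatedly shorten the shot length while preserving the ratio
    between successive lengths (mechanical acceleration)."""
    length = float(base_frames)
    lengths = []
    for _ in range(steps):
        lengths.append(max(1, round(length)))
        length /= ratio
    return lengths

print(metric_cutting_list(96))        # [96, 48, 24, 12] - the 'double' formula
print(metric_cutting_list(96, 3, 3))  # [96, 32, 11]     - the 'triple' formula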
A second instance of metric montage is where the absolute length of each shot in a series is shortened, with the effect that a controlled increase in pace builds up to a climax. In short, the control of shot length and shot angle controls the tempo, and thus the emotional appreciation, of a sequence.

Rhythmic montage, the second category that provides extra connotative meaning by cinematographic means, is based on structures within the shots: 'Here it is quite possible to find a case of complete metric identity between shots and the reception of the rhythmic figures exclusively through the combination of shots in accordance with signs within the shot.' (Eisenstein, 1988, p. 187). The best, and Eisenstein's favourite, example is from Battleship Potemkin - the famous Odessa Steps sequence. This sequence is constructed through the contrast between the orderly troops and the fleeing, disorderly population. The final intensification is the switch from the marching of the soldiers to the rolling of the pram down the steps, where the relationship between the actions of pram and feet works as a direct accelerator for the rhythm (Eisenstein, 1988).

Tonal montage focuses on the emotionally dominant features of shots, represented through combinations of light, graphical elements (e.g. sharp-angled objects prevail over round objects), focus, and so on. In a narrative film, the story line dictates which features are initially likely to attract our attention; in a more abstract film the lighting may be critical, and the film may show the changes in the moonlight and its shadows. An example of tonal composition can be found at the beginning of the "Mourning for Vakulinchuk" sequence in Battleship Potemkin, where the montage builds on several foggy shots of the port of Odessa. The dominant feature here is the fog, and thus the sequence does not establish a spatial transposition but rather an emotional setting.

The conflicting relationship between metric, rhythmic and tonal montage is, for Eisenstein, one of the key elements of montage, and the aim is to resolve it. This idea of conflict leads to the idea of polyphonic montage, where shots are not mechanically joined along a dominant line, but sensitively orchestrated so that the perceiver receives a multitude of organised stimuli. It should be clear from the above discussion of montage that a shot, which is in itself a unit of composition with individualised semantics, can serve different semantic purposes when it is inserted between two other shots. New levels of meaning can be created in such a way - levels that can change or even override the individual meaning of a shot, or, as Eisenstein put it: '... the juxtaposition of two montage sequences resembles not so much their sum as their product.' (Eisenstein, 1991, p. 297).

4.2.2.3 The sequence - film relationship

Though this thesis does not deal with sequences, which represent an episode in the narrational model described in Figure 2.3, we mention them here for the sake of completeness.
Ruttmann's avant-garde documentary Berlin, die Symphonie der Großstadt is an excellent example of how montage can be used to build the structure of a whole film. However, the usual structure of film is not essayistic but narrational, and thus the relationship between sequences is based on narrational aspects. Metz (1974, pp. 108-146) describes a system of binary structures with which he attempts to synthesise various theories of montage into a logical pattern. A key product of this work is the set of binary oppositions defined for the film segment, i.e. that a film segment is either:
• autonomous or not
• chronological or not
• descriptive or narrative
• linear or not
• continuous or not
• organised or not.
However, based on the discussion of narrativity in chapter 2, and the above description of meaningful structures in visual material, we have strong reservations about the approach of describing a film through its syntax, a reservation which is partly shared by Metz, who asserted that the syntax of a film is understood because the film has been understood, and only when it has been understood. Nevertheless, there are sequence structures that can reinforce meaning based on human content and thematic structures. The temporal aspects of a film can reinforce meaning. For example, in the film High Noon, the real time of the film emphasises the structure of the sequences and thus their tempo and shifts. Tarantino's Pulp Fiction is an example of the exact opposite of this rigid structure. Here four stories are interwoven, and over time (at crucial points of the narrative) the seemingly disorganised pattern falls into place. The repetition of shot devices can also serve to reinforce meaning. A film composed mainly of close-ups excludes information about its setting and becomes claustrophobic, whereas a predominance of long shots emphasises context over characters.
The above analysis of the syntactic and semantic structures of visual material indicates that it is important to distinguish between filmic and cinematic codes. The latter codify the reproduction of reality using cinematographic devices, while the former codify communication based on narrative mechanisms. It is clearly the cinematic code which relies on the filmic code. It is essential that they are not confused. However, the work of Eco, Eisenstein, Kuleshov and Gregory tells us that film, though based on common human content and thematic structures, provides its own realities of time and space which are interwoven in the narrative structure. Figure 4.3 summarises the syntagmatic categories of film.
Figure 4.3 Syntagmatic categories of visual material (based on Monaco (1981, p. 145)): the frame lies on the synchronic axis of space; shot, sequence and film lie on the diachronic axis of time.
In order to support film editing, any direct representation of film should be restricted to “pure” content and exclude any narrative mechanism. Finally, it was suggested that the process of arranging visual material is plan-based. In the following section we investigate this process in more detail.
4.3 A model of film editing
There are two key problems that the video editing process needs to address. Firstly, the filmic material must be composed so that the film becomes perceptible in its entirety, or the effect of reality is lost. Secondly, the intended idea or theme must be communicated in such a way that the spectator can participate in the final product both emotionally and intellectually.
Our model of editing (see Figure 4.5 below) is based on a knowledge elicitation exercise which involved studying and interviewing editors at work in their own environment, and on editing theory (Beller (1993); Katz (1991); Oldham (1992); Rabinger (1989); Reisz & Millar (1969); Rosenblum & Karen (1979)).
A foremost task for an editor is the retrieval of information about the film, such as the topic, the story, the characters, the intention of the film and its target audience. The next step is to examine the available material. This is an extremely significant act, because now the editor forms a model of the different characters and the influence of different story lines. The editor retrieves information from and about the different pieces of film by browsing through the material to order and categorise it. This usually results in shots and takes being placed on separate "heaps", each heap representing a potential scene. Each heap and its elements are annotated in a list containing information such as heap identifier, shot identifier, shot length in frames and shot characteristics.
The act of observing the raw material evokes a chain of associations for the editor. He or she remembers events, persons or emotions, and these experiences may influence the created scene. At this stage, the editor, the assistant and the director often talk more about their own experiences than about the actual material. These conversations serve many purposes, the most important being to clarify the material and to compare it with other experiences in order to create a much subtler and richer concept, so that finally the audience is confronted with a theme which can be re-created by each spectator. Figure 4.4 describes the revised communication factors which influence editors' decision making and help them to predict the potential viewer's intellectual and emotional response to the film.
Figure 4.4 Influential communication factors (based on Tudor (1974, p. 31)): the editor's decisions are shaped by personality attributes, organismic attributes (e.g. male, adult, etc.), narrativity knowledge, thematic knowledge, editing knowledge, and outside cultural and social attributes; the film reaches the receiver, and produces its effects, through shared cultural and shared social structures.
The process of scene creation begins with a discussion of each scene with respect to the available material, its intention and its part in the overall story. While the image is highly important in making the visual statement, it is the specific interaction of shots (at the level of length, rhythm, graphical direction, darkness and lightness, colour, etc.) which produces meaning. Thus, every cut must support both the concept of the current scene and the overall appearance of the film. If the film is narrative in nature, then the editor pays particular attention to different forms of spatial and temporal continuity between juxtaposed shots, which may be based on the position of a character on the screen, on the location or on the actions performed. If the film is more abstract, the continuity may rather be based on compositional features such as graphical directions, or on rhythmical features such as speed of movement. Typical features of the editing process at this stage, which is known as the rough cut, are the insertion, elimination, substitution or permutation of shots within a scene, and of complete scenes within the overall structure of the film. These variations are necessary for shaping the scenes until their appearance and position within the film become stable.
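To make the bookkeeping of this stage concrete, the following sketch models the heap list and the four rough-cut operations described above. It is purely illustrative: the field and function names are hypothetical, and it is not the implementation used in AUTEUR.

# Illustrative sketch (hypothetical names): an editor's heap list and
# the rough-cut operations of insertion, elimination, substitution
# and permutation.
from dataclasses import dataclass, field

@dataclass
class Shot:
    shot_id: str
    length_frames: int
    characteristics: dict        # e.g. {"shot_type": "close-up", "light": "dark"}

@dataclass
class Heap:                      # one heap per potential scene
    heap_id: str
    shots: list = field(default_factory=list)

    def insert(self, shot, at):
        self.shots.insert(at, shot)

    def eliminate(self, at):
        del self.shots[at]

    def substitute(self, at, shot):
        self.shots[at] = shot

    def permute(self, i, j):
        self.shots[i], self.shots[j] = self.shots[j], self.shots[i]

scene = Heap("kitchen_scene")
scene.insert(Shot("s01", 120, {"shot_type": "close-up"}), at=0)
scene.insert(Shot("s02", 75, {"shot_type": "medium"}), at=1)
scene.permute(0, 1)              # reorder until the scene's appearance stabilises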
At the end of the rough cut stage, the film continues to lack a definite visual precision with respect to rhythm and technique, which it receives during the fine cut. The fine cut deals with the perception-related connection of two shots, which is given by their graphical appearance (contour, centre of sight, shared axes, etc.). In this stage, work on the overall context is replaced by a narrow field of activity, typically concerned with units of between 10 and 30 frames (Schumm, 1993).
Figure 4.5 Simplified model of the film editing process: retrieve film information (e.g. story, characters, intention); group shots and takes into heaps; register each heap in a list and set the relevant parameters; find the start point of the film; create a scene out of the related heap; control the effect of the scene (on a negative outcome, recreate the scene); control the effect on the overall story (on a negative outcome, redo the ordering); decide about the end; stop.
From the results of the above investigation of the editing process, we derived our simplified model of the editing process, as presented in Figure 4.5. This model covers only the juxtaposition of takes, shots and scenes and does not take sound editing into account. Furthermore, it emphasises only the rough cut. Finally, the complex interrelationships between different stages of the process (e.g. the influence of personal attributes on decisions, or the comparing of different solutions) are not specified in detail. However, the model serves as a workable approximation. It must be made clear that we are, at this stage, considering only the creation of a single scene where a start-shot is given. Thus, we focus on elements such as information retrieval, scene creation, the control of effects, and the reordering or recreation of scene structure. We exclude the creation of larger meaningful sequences of scenes.
4.4 Conclusion
In this chapter, we showed that there are two levels involved in the viewer's understanding of a film. Firstly, there is the optical level, which provides the perceiver with mainly denotative information, and secondly, there is a mental experience, which, based on cultural knowledge, provides predominantly connotative information. In order to allow different connotative meanings depending on the context in which the material is presented, we argued that a content-based representation of video must be as objective as possible, must not contain connotative descriptions, and must operate on the level of iconic signs and semes. As the organisational structure for signs and semes we identified semantic fields.
We pointed out that an image, in itself, is a compositional unit with individualised semantics, where the semantics may change if the image is combined with another image. The same process of overriding individual meanings through composition also arises in the juxtaposition of shots. We stated that a distinction between filmic and cinematic codes must be made, since the latter codify the reproduction of reality using cinematographic devices, while the former codify communication based on narrative mechanisms. As a result, we showed that film, though based on common human content and thematic structures, provides its own realities of time and space which are interwoven in the narrative structure. Finally, we illustrated that the process of arranging visual material is plan-based, by introducing a model of film editing which focused on narrative-oriented scene creation.
Thus, the analysis provided in this and the previous two chapters has defined the essential elements to support the automatic generation of meaningful and emotionally stimulating film sequences. It was shown that a model of film generation operates on two levels: firstly, on the surface level, which maps the concrete physical and social properties onto the visual material presented in film, and, secondly, on the underlying logic of the perceptual process.
A more detailed examination of the surface level of the generation model reveals that two main representational tasks can be identified. The first challenge is the problem of representing the optical pattern of images. The second representational task within the surface level of the generation model is concerned with an ontological representation describing the physical world and abstract mental and cultural concepts; representations that constitute the narrative "playing field". The main categories in this ontology are based on the elements described in Figure 2.4, i.e. events (actions and happenings), existence (characters and settings) and cultural codes. Obviously, both representational tasks mirror each other, in the sense that plot generation establishes the query for the retrieval of visual material, which must then be accessible on the basis of its content. However, film has its own "reality", which exerts considerable influence on the representation of common sense knowledge. The challenge for AI research is to synthesise the different representational requirements of common sense knowledge and film content, so that both can contribute to the editing of film.
The aim of the next three chapters is to describe the necessary representational structures of the surface level, with which an artificial system can then perform the task of developing an emotionally stimulating story by appropriately arranging the relevant film material. Chapter 5 is concerned with the representation of video content. Chapter 6 discusses representations of the background knowledge relevant to automated video editing. Finally, chapter 7 is concerned with the representation of narrative structures such as actions, events, and emotional codes, and also considers the representation of conceptual structures required for the construction of thematic effects such as ambiguity and mischief. The actual process of the generation of film sequences by our prototype system, AUTEUR, is described in chapter 8, with examples of created films being presented in chapter 9. However, before entering into a discussion of the above representational issues, we introduce several constraints on the ensuing investigation:
• In terms of shots, we assume short units with a restricted range of actors, actions and objects.
• In describing the representational structure of common sense knowledge, we do not intend to cover the whole spectrum of human knowledge18. In particular, the
discussion of genre in chapter 2 yielded the insight that the narrated world is a structured and stereotypical world. Hence, our aim is to provide structures for micro worlds that feature stereotypical actions, episodes and behaviour.
18 Human knowledge here describes that structured information which provides the perceiver with the ability of 'world making', which was described by Dudley Andrew as follows: 'Worlds are comprehensive systems which comprise all elements that fit together within the same horizon, including elements that are before our eyes in the foreground of experience and those which sit vaguely on the horizon forming a background. Those elements consist of objects, feelings, associations and ideas in a grand mix so rich that only the term "world" seems large enough to encompass it.' (Andrew, 1984, p. 38).
We introduce the above constraints as the time span of our research project is, of necessity, limited. Therefore, it is important to define a manageable problem space.
Chapter V
The representation of video content
The previous chapter defined the essential elements of the structure and function of video. It was shown that stills and film are representational systems with independent semantics, which may change if arranged according to the principles of montage. The aim of this chapter is to use the insights gathered from the discussion in chapter 4 to describe the necessary representational structures for video content. In particular, we wish to show how the optical pattern of images can be represented, which, as described in Figure 4.2, provides the receiver with mainly denotative information. As discussed in chapter 4, a content-based representation of video should be as objective as possible, in order to allow different connotative meanings to arise according to the context in which the material is presented.
Given the current state of the art in machine vision and image processing, we are able to automatically derive certain restricted representations of the content of video. However, the information so obtained is insufficiently rich (and will remain so for the foreseeable future) to create a representation of video content on which automated video editing can operate1. Thus, sufficiently detailed content annotations must be provided manually, to enable the artificial system to "perceive" the video material.
A scheme for the representation of video content must be concerned with more than the visual aspects of the shot. This scheme must also support the appropriate visual presentation of automatically generated, emotionally-motivated narrative sequences. Hence, we need to focus on the underlying organisational structure of the representation of video content, and its relationship to the representation of common sense knowledge (to be discussed in chapter 7), and the representation of editing knowledge (chapter 6).
1 Later in this chapter, we argue that image processing and machine vision may be used to support the creation and maintenance of structures for representing video content.
In this chapter, we first examine related work in the representation of video content. We then describe our approach for representing the structure, function and semantics of video. The chapter ends with a brief outline of how the proposed representational scheme can be applied automatically.
5.1 Related work
Film representation and automated editing is a relatively young field. Since the mid-1980s, attempts have been made to combine computer technologies and media studies to create artificial systems that embody mechanisms to interpret, manipulate or generate video. We now examine this related work.
5.1.1 Bloch and his machine for audio-visual editing
The first relevant contribution to the development of video content representation is that of Bloch (1986; 1988).
In his thesis, Bloch discussed in detail:
• the structural differences between the two media, language (spoken and written) and image (still and motion), and the implications of these for transforming written text into a visual construction;
• some important features of the juxtaposition of shots; for example, techniques for breaking down an action into shots, the principles of eye-line match and screen direction of character actions, and the importance of camera movement in relation to the movement of a character;2
• the process of film editing, which resulted in the design of a basic editing algorithm similar to that described in chapter 4 of this thesis.
2 Bloch's film analysis is based mainly on the work of Bazin (1967b; 1971), Burch (1981), Eisenstein (1948, 1951), Metz (1974, 1976b), Mitry, as described in Metz (1976a), and Jost. Since Bloch's thesis was written in French, he used the original texts. Most of these texts are available in English, except those by François Jost.
Bloch's object-oriented prototype system could generate short film sequences from a conceptual representation of two given stories.3 The first story mentioned two actors looking at each other. Bloch's system generated two visual versions: one version showed the two actors in the same shot, while the other combined two shots, where each shot contained one actor looking in the direction of the other. The second story presented a character walking down a spiral staircase, leaving a building, seeing a wallet and picking it up. The visual material used, shots stored on a video disc, contained simple actions which were specifically designed so that the problematic issues of matching point of view (camera position) and direction of actions between shots could be ignored.
In this section, we focus on Bloch's representation of video content. A detailed discussion of the importance of Bloch's work for the representation of editing techniques is given in chapter 6. Details of Bloch's approach to the actual editing process are discussed in chapter 8.
Bloch's perspective on film was strongly influenced by Metz's theory (Metz, 1974). Thus, Bloch divides film into segments (sequence shot, parallel syntagma, ordinary sequence, and so on), of which an autonomous shot forms the smallest unit. In his research, Bloch focused particularly on two aspects of the juxtaposition of shots. Firstly, he was concerned with matching two shots based on the line of sight of actors, their position within a shot, or a combination of both. The categories of the spatial transitions used were adopted from Burch (1981), i.e. inclusion, intersection and determinate or indeterminate proximity. Secondly, Bloch considered the problem of maintaining continuity, based merely on maintaining fluency of motion. Bloch divides motion into direction and speed. The direction of motion within a shot is described in terms of horizontal and vertical vectors. The speed of motion is defined as the relationship between the speed of the camera and the speed of the actions performed. The element of time, as represented by Burch's temporal taxonomy, is integrated with other meaningful elements, such as action and movement.
3 Bloch's conceptual representation of the story is influenced by the work on Natural Language Understanding of Schank (1982), Schank & Abelson (1977), Schank & Riesbeck (1981) and Schank's students Dyer (1982) and Lehnert, et al. (1983).
A brief overview of their work appears later in this thesis (section 7.1.4).
Thus, Bloch's approach to editing shots focuses on three key constraints, i.e. sight, position and motion, related to semantic units within a shot, i.e. actors, actions and location. Figure 5.1 shows the resulting representation of a shot in Bloch's scheme.

(deftrecord (#:machine:gen:plan plan)
  video      ;IMAGE DEBUT ET FIN DU PLAN
  action     [action of actors described as a triple vector (ActorId, ObjectId, Instant), e.g. (attend (actor (said)) (object gilles) (inst yeux))]
  acteurs    [list of character names]
  lieu       [location]
  ass        ;ASSOCIATION : UTILISE SEULEMENT POUR L'ORGUE
  type       [type of shot, e.g. plan autonome (PA)]
  interieur
  jour
  pos        [a list containing the ActorId-Shotposition, e.g. [said-gche, gilles-dte]]
  mvt        ;DIRECTION DU MOUVEMENT APPARENT
  vapp       ;VITESSE DU MOUVEMENT APPARENT
  regard     ;DIRECTION DU REGARD DU OU DES ACTANTS [a list containing the ActorId-Sightdirection, e.g. [said-dte, gilles-gche]]
  couleurs
  clarte     [clarity of light: value between 1 and 10]
  chaleur    [heat: value between 1 and 10]
  contraste  [contrast: value between 1 and 10]
  mca        ;DIRECTION DU MOUVEMENT CAMERA
  mpe        ;MOUVEMENT DES ACTANTS
  vpe        ;VITESSE DES ACTANTS
  effects)   ;EFFECTS SPECIAUX (POUR L'ORGUE)

Figure 5.1 Bloch's shot representation (Bloch, 1986, p. 149).4
The representation of directions is based on:
• ten elementary directions (left, up-left, up, up-right, right, down-right, down, down-left, front, back);
• circular direction, where the start position of the character is followed by the direction, e.g. cir: left > up-right;
• directions in sequence, e.g. left + cir: left > down-right + down (describes the movement of a character down a circular staircase).
4 Texts in [] were added by the current author.
The shot categories intérieur, jour, couleurs, clarté, chaleur, contraste, associations and effects were not used in the prototype. However, they give an impression of the wider shot representation considered by Bloch.
Bloch's work reflects a deep insight into the issues of the representation of video content. However, it has deficiencies in terms of the retrieval of denotative and connotative information. The distinction between character and object, for example, is blurred. The representation of space is rudimentary: for example, no distinction between foreground and background is made, and the relation between intérieur and position is left unclear. Furthermore, Bloch's representation lacks the ability to handle multiple overlapping actions and camera movements. Finally, the representation of point-of-view is not tackled, as noted by Bloch himself. Nevertheless, Bloch's approach provides an extremely useful basis for representing video content. In particular, Bloch provides semantic and syntactic structures of video content and introduces the key categories: action, character, location, context, shot type, relative position of characters and objects within a shot, movement and light (Bloch, 1986, p. 49).
5.1.2 Parkes and CLORIS
Parkes describes the CLORIS system (Parkes, 1987; Parkes & Self, 1988; Parkes, 1989a; Parkes, 1989b; Parkes, 1989c; Parkes, 1992). In CLORIS, a user can interrupt a moving film at any point (by executing a simple action with a “mouse”), and then use menu commands to pose questions to the system about the objects shown, and the events and states underway in the narrative at the interrupted point.
The video used in CLORIS was pre-shot videodisk material on the use of a micrometer. The CLORIS methodology for knowledge-based description of video sequences is derived from cognitive theories of visual information processing (Gibson, 1950; Gibson, 1971; Gregory, 1971; Kennedy, 1974), research on temporal knowledge representation (Allen, 1983), episodic memory (Lehnert, et al., 1983; Schank & Abelson, 1977; Schank & Riesbeck, 1981), conceptual graphs (Sowa, 1984), and cinema theory (Arnheim, 1983; Balázs, 1972; Bettetini, 1973; Carroll, 1980; Eisenstein, 1970; Metz, 1974; Monaco, 1981; Pudovkin, 1968; Spottiswode, 1955). Three overall dimensions of the representation of video material are defined, and are realised in the CLORIS system as the following components:
Domain Representation Inference System (DORIS)
In contrast to other multimedia researchers (Clark & Sandford, 1986; McAleese, 1985), Parkes understood that, even though a visual medium is specific, it is necessary to make a distinction between the actual portrayed object (e.g. Jim's house) and the class the object belongs to (e.g. houses). He states the following requirements of video representation:
'A representation language for photographic material needs to be able to maintain the distinction (for physical objects, at the very least) between concepts and instances of concepts.' (Parkes, 1989a, p. 73).
'A representation language for photographic material needs to be able to maintain the distinction between, and commonalities of, information about concepts and information about specific instances of those concepts.' (Parkes, 1989a, p. 75).
'A representation language for photographic material needs to be able to maintain a definitional type hierarchy which includes the concepts which have instances depicted in the visual material.' (Parkes, 1989a, p. 76).
'Objects do not exist merely on film, but have an independent existence in the real-world. [...], even if all the visual details of an object are presented, discussion may be required about the non-visual characteristics of that entity, its relationships to other objects in a domain, etc.' (Parkes, 1989a, p. 78).
In his representation for the background knowledge required by a viewer of a film, Parkes rigorously distinguishes between classes and instances of classes. CLORIS uses scripts (based on those described by Schank & Abelson (1977)) to represent stereotypical sequences of events. Type labels within the scripts feature in a type hierarchy, and some types are defined by means of type definitions and schemata. Events within scripts are represented by event-frame-rules, which capture the pre- and post-conditions for an event, and also represent information about the states that facilitate, accompany and result from an event. The constructs used to provide the descriptions of scripts were graph structures and event relations adopted from Sowa (1984). The description of temporal relations between events draws on interval-based temporal logic (Allen, 1983).
Image State Descriptions (BORIS)
A fundamental insight of Parkes' research is the understanding that the image has a 'continuum of meaning', depending on whether the image is displayed in a dynamic context, or simply as a still image (Parkes, 1989a, pp. 68 - 70).
Consequently, he rejects the keyword approach to the representation of images as insufficient, since keywords can neither provide the necessary relations between the objects within an image nor support the relations between images; moreover, keywords do not support inheritance. Additionally, Parkes develops a new conception of the basic unit for the representation of film content, the setting:
'Definition: a setting in a moving film is the unit of film associated with the longest time-interval over which the visual content of the film can be objectively described by using the same conjunction of formulæ. A setting description is such a conjunction of formulæ.' (Parkes, 1989a, p. 44)
Thus, a film setting could be a single frame or a collection of several hundred frames. Furthermore, Parkes states:
'The setting is the minimal described unit of film sequence at the level of events i.e. the constituent below which descriptions, at the level of events, are not attached.' (Parkes, 1989a, p. 44).
Settings have their own descriptions (so that the system can discuss the content of frames from the setting as pictures in their own right). If a sequence of events described by some script is depicted in a moving film, the script is specialised to refer to the particular objects and actors within that film, and the events within the script are associated with the settings over which those events are realised. As the system has access to the original, “abstract” script, it can infer which events are “present” in the narrative but have not been displayed in the film (what Carroll (1980) calls “linear deletion”). The concept of the setting enables Parkes to describe relations between images in terms of relations between settings, where such relations represent camera movements, such as pan, tilt and zoom. The major problem of Parkes' approach is that it can cope neither with multiple overlapping camera movements nor with overlapping setting structures.
Film Structure Knowledge (MORISS)
When answering questions, CLORIS would, if possible, use sequences of the film to support the text generated as the answer. However, this facility was rudimentary, and CLORIS would merely piece together sequences of film featuring a relevant concept or event, without concern for the rules of editing or montage.
Parkes' system was the first to demonstrate that content-based descriptions of moving film could be used by a system to intelligently discuss those films. His major contributions to the representation of video content are the identification of the setting and its context-dependent behaviour; the strict separation between instances and classes of objects within an image; and the emphasis on the objective description of objects in an image. As a result of these insights, Parkes describes the settings structure, which offers a facility for browsing images on a spatial basis. However, as with Bloch's scheme, Parkes' approach to content representation is deficient when it comes to the retrieval of denotative and connotative information within a setting. Furthermore, Parkes' scheme contains no knowledge of editing and montage.
5.1.3 Aguierre-Smith and the Stratification System
The research carried out in the Interactive Cinema Group at MIT is concerned with exploring the use of digital technologies to support the collection of, and access to, nonlinear media materials.
The aim of such research is to provide tools for the design or use of media units, such as interfaces for the annotation of video material or the tailoring of video news stories. Much of the research focuses on the indexing problem for video (Davenport, Aguierre Smith, & Pincever, 1991; Mackay & Davenport, 1989). Results of this work are systems for the annotation of video material which consider stream-based/keyword methods of representation (Aguierre Smith & Davenport, 1992; Aguierre Smith & Pincever, 1991), or stream-based/iconic approaches (Davis, 1993; Davis, 1995).
Aguierre-Smith designed the Stratification System to support an anthropological video study in the state of Chiapas, Mexico (Aguierre Smith & Davenport, 1992). The idea was to provide a number of researchers with random access to a video archive, in which video could be annotated with complementary or even contradictory descriptions. The video material was stored on a laserdisc. The annotations used in the Stratification System were keywords organised in hierarchical classes, which were implemented as directory trees in UNIX. The novel feature introduced by Aguierre-Smith was the multiple partially overlapping annotation, where each annotation is related to a precise time index (begin and end frame). Figure 5.2 illustrates the idea behind such a layered context representation, for a shot consisting of one hundred frames.
Figure 5.2 Layers of annotations for a 100-frame shot: the annotations "bike", "pepsi", "praying" and "garden" each span their own interval of frames between 0 and 100.
To provide a visual representation of the distinct layers of the representation, the Stratification System used a histogram, where the keyword classes are displayed as buttons along the y-axis and the time code (frame numbers on the laserdisc) forms the x-axis. Aguierre-Smith's stream-based content representation for video enables the dynamic development of context while maintaining the completeness of the original footage. The notion of multiple partially overlapping annotations establishes the Stratification System as a breakthrough in the effective representation of video content, despite its weaknesses, i.e. the keyword approach and the lack of a true representation of the semantics of the video.
5.1.4 Semantic and conceptual indexing for video
The work of Chakravarthy (Chakravarthy, Haase, & Weitzman, 1992; Chakravarthy, 1994) is in the tradition of standard record retrieval and deductive retrieval in databases. Chakravarthy's scheme provides computer access to a large database of semantic knowledge and rules that manage background knowledge to match user queries to the representation of stored pictures or video clips. The concepts described in the semantic network use the organisational structures of WordNet (Miller, Beckwith, Fellbaum, Gross, & Miller, 1993). We return to WordNet later.
Chakravarthy's representational scheme for visual material is based on a set of actions, including information about the agent, object, location, etc. If the picture or video clip does not show actions, then the annotated description describes only people, location or objects. However, the content representation of video does not provide information about the temporal relationships between different actions, nor does it represent cinematic features. The matching rules are of three classes (object, action, semantic relations), each providing heuristics for finding pictures or video clips that match the user's query.
The relations provided for each class are:
• object: A-KIND-OF, HAS-MEMBER, ASSOCIATED-WITH, PLAYS-ROLE-IN
• action: ENTAILS, CAUSES, TYPICAL-ACTION
• semantic relation: rules describing relations between entities, where matching rules make use of additional contextual information; e.g. combinations of LOCATION-OF and PART-OF may be used to create a rule that can retrieve visual material showing a part of an object in a particular location.
Chakravarthy's system enables obvious matches, such as presenting a Basset hound in response to a query for a dog. However, it can also perform more sophisticated matches, such as answering a query for the action "riding" by providing a clip of an astronaut driving a lunar buggy, or satisfying a query for an action in a hospital with a video clip of a doctor positioning a microscope for microsurgery in an operating room.
A related approach is taken by Lenat and Guha for their OPIAM system (Lenat & Guha, 1994). In this research project, the large semantic knowledge base and inference mechanisms of the Cyc system (Lenat & Guha, 1990) (also discussed later, in section 7.1.2) are applied to the representation and retrieval of still and moving images. The goal of OPIAM is that the captioner provides a fairly neutral statement of the content. This, together with the domain knowledge, is then used to generate indices on demand, at query time, to support the retrieval process. For example, the system satisfies a user query "Find images of shirtless young men in good physical health" by presenting clips annotated as "Pablo Morales winning the men's 1992 Olympic 100-meter Butterfly event" and "Three blond men holding surf boards on the beach".
However, there are several problems in the approaches of both Chakravarthy and Lenat & Guha. Firstly, inferences that are based on the indexing statement may lead to inappropriate retrieval results. Lenat's example of the three blond men holding surf boards on a beach may show men in good physical health, but this may not necessarily be the case. Secondly, neither system specifically orients its representation to the requirements of still and moving visual material in the ways we specified as necessary in chapter 4. This means that the systems do not take video-specific ontological properties and constraints concerning semantics and syntax into consideration. Thirdly, both systems represent the image or video content in an explicitly determined way, which reduces the possibility of exploiting such semantic and syntactic issues as those raised by the Kuleshov experiments (section 4.2.2.2). Nevertheless, the approach of using semantic background knowledge to represent and retrieve images and video enables the derivation of different connotative meanings depending on the context in which the material is presented (discussed in sections 4.2.1.2, 4.2.2.1 and 4.2.2.2). Furthermore, semantic networks add to an automated editing system by facilitating the drawing of inferences regarding which shots may be substituted for others, which in turn increases the fullness and accuracy of the representation of context.
As an alternative to natural language based query and search approaches, Domeshek and Gordon (Domeshek & Gordon (1995); Gordon & Domeshek (1995)) propose conceptual indexing, organised around cases in memory, that supports browse- and zoom-oriented retrieval.
Domeshek and Gordon base their work on Domeshek & Kolodner (1994), Kolodner (1993), Schank (1982), Schank & Abelson (1977), Schank, Kass, & Riesbeck (1994), and Riesbeck & Schank (1989), and have developed a stock video library for Andersen Telemedia, to promote video production for training purposes. The conceptual indexes are based on a canonical representation of concepts in particular cases (here for the domain of the everyday social world) and a specific vocabulary. Six types of indexes are suggested (Gordon & Domeshek, 1995)5:
• the scene content, based on abstract organisational schemes for people, activities and locations,
• the points illustrated by a clip, i.e. an abstract idea or concept communicated by the clip,
• the composition and camera work in a clip,
• the likely function of a clip in a larger narrative, i.e. as part of an interlude or prologue,
• information concerning the source of the clip,
• the relationship to other clips in the library.
5 The version seen by the current author during a visit to the ILS in August 1995 supported only indexing for the scene content and the points illustrated by a clip.
The semantic network of the system is composed of single nodes, where each node represents a set of disjoint concepts, and each concept contains the indexes for a case. The indexes for the scene content, for example, include information about the location, the events happening in the clip, the people and their roles, and objects. It must be stressed that the concepts forming the indexes of a case are unstructured, which allows the creation of simple frameworks and basic matching algorithms. The domain-dependent vocabulary for each index type is organised into several hierarchical categories. For the component "location", the relevant categories are specific places (organised by contained-in relationships), the function of individual places (e.g. library, submarine, etc.) and the type of place (e.g. natural place, man-made place).
The retrieval mechanism used is based on the zoom and browse approach developed for Ask Systems (Schank, 1994). Domeshek and Gordon adopt this approach by offering the user case indexes that allow the user either to return to the beginning of the retrieval process (zooming), or to navigate through related case indexes by following the system-provided links to other conceptual indexes related to the types of indexes on which the user is currently focused. The form of search suggested is, therefore, an incremental case discrimination based on the availability of indexes.
The importance of the case-based approach presented by Domeshek and Gordon is that a parallel can be drawn between retrieval in case-based reasoning systems and visually oriented storage systems. However, there are two drawbacks to the approach. Firstly, the visual material is understood as a text that communicates a particular idea or concept, and is thus described as if the idea directly coincides with the content. This is not the case, as our discussion of the connotative information provided in an image showed (sections 4.2.1.1, 4.2.1.2 and 4.2.2.1). Thus, the case-based approach presents serious problems when visual material needs to be resequenced or repurposed. Secondly, the structures in a case-based approach are based upon indexes for particular concepts in particular cases.
This means that a specific indexing vocabulary must be introduced, on which the assignment of concepts to cases can be based, which in turn restricts the indexes to those domains that have been analysed. As we will show later in this chapter, the case-based approach presented by Domeshek and Gordon not only serves the task of retrieval, but can also be valuable for the creation of meaningful narrative film sequences, as the collection of cases can, for example, represent a particular genre or theme (discussed in section 2.1.1). We will discuss this point later, in chapters 7 and 9.
5.1.5 Davis and Media Streams
A five-year programme of collaboration between BT (British Telecom) and MIT has attempted to develop automatic and semi-automatic tools for the construction, interaction and distribution of images and sequences of images (Pentland, Picard, Davenport, & Haase, 1993). The main emphasis of the research is the development of database-oriented mechanisms for storing (image representation) and retrieving (including browsing) images on the basis of their semantic content. A second aim is the development of user-friendly tools for the recording, annotation and presentation of images. An important result of the research is Davis' system Media Streams (Davis, 1993; Davis, 1995).
The main challenge in designing systems to promote the use or manipulation of video, such as interactive TV or video on demand, is defined by Davis to be the mastery of the video content representation problem. In his thesis, Davis states:
'Signal-based parsing and segmentation technologies must be combined with representations of the higher level structure and function of video data in order to support annotation, browsing, retrieval and resequencing of video according to its content. [...] The challenge is to develop usable technologies for the representation of video content that can leverage off of what machines can currently offer us and what humans can achieve with computational support. We are in need of technologies which add structure to the signal such that video data becomes a structured data type which can more effectively support current functionality and uses, and more importantly, enable new uses and functionality.' (Davis, 1993, p. 26).
Media Streams is the result of a number of influences, such as:
• dynamic memory, case-based reasoning and ontological and analogical knowledge representation (Haase, 1994; Lenat & Guha, 1990; Schank, 1982; Schank & Riesbeck, 1981),
• text and film analysis based on reader response theory (Bordwell, 1985; Bordwell, 1989; Iser, 1989; Iser, 1993),
• formalist, structuralist and semiotic approaches to film theory (Eco, 1976; Eco, 1977; Eisenstein, 1948; Eisenstein, 1951; Eisenstein, 1970; Kuleshov, 1974; Metz, 1974),
• the aesthetics and practice of the reuse of TV material by fans of TV series (Jenkins, 1992).6
Media Streams is an advanced system for the annotation, retrieval and browsing of video and audio7. Furthermore, it supports the repurposing of video. A key feature of Media Streams is an iconic visual language used to create temporally indexed, multi-layered content annotations that support the retrieval and repurposing of video descriptions. The design of Media Streams' annotation language is based on Davis' application of the distinction between the sequence-independent and sequence-dependent meaning of an image, and the necessity for an objective description of video content (see also the discussion of Parkes' work in section 5.1.2).
Davis' principal categories for the description of video content include the actions of humans and objects in spatial and temporal locations, also taking account of weather and lighting conditions. Furthermore, he highlights the important problem of representing cinematographic properties, such as camera movement and framing, or properties of the recording medium (colour, granularity, etc.), which also carry denotative and connotative meaning. To support the paradigmatic as well as the syntactic features of video, Davis introduces:
• A representation for actions based on Eco's triple articulation of codes of action (kinesic figure, kinesic sign and kinesic semes). The spatial decomposition of actions is organised around the body parts that participate in the action. The temporal decomposition of actions is based on a hierarchical organisation that describes longer sequences of actions as being composed of temporal sub-abstractions (Lenat & Guha, 1990). The direction of actions is related to the object or the screen position at which an action is oriented.
• A representation of character, which is oriented not towards identity but towards continuity. The description distinguishes between the actor and the role. Actor contains distinguishable descriptive elements such as sex, age, body type, skin colour, etc. The role of a character is based on his or her appearance. The uniform of a general, for example, identifies the character as such.
• A representation for objects, oriented to form and function.
• A representation of screen geometry, capturing spatial relations between objects in symbolic terms, such as "in front of", "on top of", "inside" etc., and the position of an object on the screen.
• A representation of location, stating the actual location of filming, and descriptions that distinguish between geographical and functional space.
• A representation of time, containing the actual time of filming and details of temporal aspects of the portrayed event (historical period, time of year, time of day, etc.).
• A representation of cinematographic devices, containing descriptions relating to the camera, the recording medium, and spatial and temporal transitions of the shot. The camera is represented in terms of descriptions of lens actions (framing, focus, exposure), tripod actions (angle, canting, motion) and truck actions (height and motion). The recording medium is described in terms of stock type, colour quality and colour grain. The spatial and temporal transitions within a shot are adopted from Burch (1981).
6 The list of references here contains only those which can be found in the current bibliography. For certain authors Davis refers to additional material. The references for Iser and Jenkins were added as they are of significant importance to Davis' approach to the representation of video content.
7 At the time of writing, Media Streams runs on an Apple Macintosh Quadra 950. The database contains 17 different videos with a total length of 24.07 minutes and 2090 annotations. Media Streams' visual annotation language is based on 3500 iconic primitives.
The above categories are organised in a cascading semantic hierarchy with increasing specificity of primitives on subordinate levels. The relationships between levels are represented as class/instance (adult/male/Paul), class/subclass (dog/Pekinese), whole/part (lamp/electric bulb) and term/co-occurring term (toothpaste/toothbrush).
The hierarchy is implemented in FRAMER, a knowledge representation database language developed by Haase (1994). The basic unit in FRAMER is called a frame, which can have other frames (called annotations) subordinate to it. Figure 5.3 shows a typical FRAMER structure. Inheritance, to be understood here as the basis for the paradigmatic ordering of units, is established as a relationship between a prototype (animals) and its spin-offs (Fido the Wonder dog).

animals
  fish
  birds
  amphibians
  mammals
    primates
    canines
      Fido the Wonder dog
        legs (ground: 4)

Figure 5.3 FRAMER structure for Fido the Wonder Dog's legs (taken from Davis (1995, p. 137))
Key elements of Media Streams' icon-based interface are the Media Time lines, where the iconic annotations for a particular piece of video are given temporal boundaries (in and out points) and semantic relations (spatial location, character, character action, object, object action) which connect the annotations in episodic structures, such as 'Mava lying on the beach' or 'A wave crashing into Mava' (Davis, 1995, p. 147).
Davis shows, with the above structures, that user queries can be mapped directly onto the concepts represented in the semantic hierarchy and matched against the indexes for each case. The strategies used are based on three types of similarity: semantic similarity, relational similarity and temporal similarity. The final result of a query is valued on the basis of:
• an exact match between the query and the retrieved video material,
• the hierarchical distance between the prototype of the query and the spin-offs in the match (the lower the better),
• the hierarchical distance between the prototype of the match and the spin-offs of the query (the lower the better),
• a match where the retrieved material and the query both form immediate spin-offs of a common prototype (the higher the better).
Media Streams can use inheritance inferences in the matching process. For example, in response to the query 'adult male eating food', Media Streams returns a shot of 'Steve Martin eating pizza', 'an elderly male eating food' and 'Charlie Chaplin eating a shoe' (Davis, 1995, pp. 193 - 194). Davis also provides examples where the user query is not satisfied by a matching sequence of the annotated video material but is composed out of parts from various video fragments. Davis refers to this retrieval strategy as retrieval-by-composition. The mechanisms behind Davis' retrieval strategy are based on the continuity of actor, role, location and/or action. Take the following query as an example:
'An adult female at a beach rotating her body clockwise and then a medium shot of an adult male waving his right arm with a boat in the shot'. (Davis, 1995, p. 188).
Media Streams retrieves two different sequences. The first shows a shot of Mava on the beach, turning to look off screen, followed by a shot of John waving from a boat. The second version contains the same shot of Mava, this time followed by a shot of a male sitting on a horse, waving a gun. In the background of the shot is a boat. A second example, which is of particular interest for our purposes, describes a query requesting an elderly female with mud on her head, followed by a shot of a laughing character of indeterminate sex. Davis points out that the retrieved result is not particularly successful in terms of spatial continuity, but that it works due to the presented action-reaction pair (Davis, 1995, p. 190).
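The hierarchy-based valuation just described can be made concrete with a small sketch. The following is illustrative only, with hypothetical names and an invented toy hierarchy; it is not Davis' FRAMER implementation.

# Illustrative sketch (hypothetical names; not Davis' FRAMER code):
# matches are valued by the hierarchical distance between query and
# annotation in a prototype/spin-off tree; smaller distances rank higher.

HIERARCHY = {                     # spin-off -> prototype (toy example)
    "Steve Martin": "male",
    "Charlie Chaplin": "male",
    "male": "adult",
    "elderly male": "adult",
    "adult": "person",
}

def prototypes(concept):
    """The chain of prototypes from a concept up to the root."""
    chain = []
    while concept in HIERARCHY:
        concept = HIERARCHY[concept]
        chain.append(concept)
    return chain

def match_value(query, annotation):
    """0 for an exact match; otherwise the prototype/spin-off distance."""
    if query == annotation:
        return 0
    if query in prototypes(annotation):      # annotation is a spin-off of the query
        return prototypes(annotation).index(query) + 1
    if annotation in prototypes(query):      # query is a spin-off of the annotation
        return prototypes(query).index(annotation) + 1
    return float("inf")                      # related only via a common prototype

print(match_value("male", "Steve Martin"))   # -> 1 (immediate spin-off)
print(match_value("adult", "Steve Martin"))  # -> 2 (more distant, ranked lower)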
Though Davis describes the above sequences, and other example sequences, in cinematic terms, it is obvious that Media Streams contains no explicit knowledge of concepts such as:
• methods for shot juxtaposition based on narrative-related editing strategies (e.g. action match),
• object specification based on shot transformation (e.g. long shot -> close-up),
• constructive strategies for the creation of emotional reactions, as assumed in the above example of the elderly woman.
The ability of Media Streams lies not in the creative combination of shots, but rather in the retrieval of video material that is similar or analogous to a given query, and is then presented in the order specified by the query. Thus, the narrative is explicitly specified by the query. The composition of the given examples is not based on knowledge of sub-narrative structures, organising principles for video sequences (such as continuity of graphical appearance between objects), or thematic intentions.
In conclusion, Media Streams is a useful system for the annotation and retrieval of digital video. Davis' stream-based ontology for video is an important development in the representation of video, with respect to its denotative, connotative and semantic features. Media Streams' strongest asset is that it demonstrates how an intelligent interface can facilitate the rapid annotation of large quantities of video. However, with respect to the repurposing of video, Media Streams plays a merely supportive role, since any requirements for mise-en-scène, ordering, structure, cinematography, and so on, must be explicitly stated in the query created by the user. Thus, strictly speaking, Media Streams is not a system for generating video sequences automatically.
5.2 An ontology for the representation of film content
We now combine the results of our discussion of the semantics and semiotics of film (chapter 4) with elements of the various approaches discussed above, to specify in detail our approach to the representation of video content.
An important point concerning the use of an iconic visual language for video annotation and retrieval is made by Davis, i.e. that there exists an analogy between the two visual systems of icons and video8. Though icons are not identical to video, they share the same parallel legibility in terms of 'gestalt view of features, foregrounding and backgrounding and spatial relationships' (Davis, 1995, p. 258). From this, Davis concludes that representations of visual media ought themselves to be visual. We share Davis' reservations about language-oriented representations of video, as far as the interface is concerned. However, here we are concerned not with the interface but rather with the underlying representational structures and units needed to support essential tasks in the editing process, i.e. the maintenance of continuity and temporal clipping. Thus, we need to develop descriptive computational structures for video content that can also be synchronised with structures representing the physical world and abstract mental and cultural concepts. Our proposed solution is based on a textually oriented representation that describes the semantic, temporal and relational features of video in a structured way, and uses a vocabulary based on a subset of natural language.
8 See also Eco (1977; 1985).
Nevertheless, the previous chapters emphasised the need to describe a communication system in its own terms, which may differ from those provided by written and spoken language. What should be noted, however, is that we use the connotative features of textual language for description, without claiming that the resulting meaning is directly linguistic. In other words, we use textual terms to express the salient features of video in a representational system whose structure is designed to match visual requirements. The advantage of such an approach becomes apparent when we later discuss representations of background knowledge (chapter 7), where the analytical nature of natural language, i.e. its capacity to generalise the abstract idea of a mental or cultural concept, is fully exploited.
We now address our proposed framework for the representation of video content. First, we state the assumptions made in our approach.
5.2.1 General concepts and assumptions
The following video representation formalism pays specific attention to the maintenance of objectivity in the description of shots, so that given shots can be used for a variety of purposes. The shot is a complex combinatorial system for the visual presentation of location, lighting, costume and the behaviour of subjects (mise-en-scène). An important and influential source for the representation of such primary visible properties is Marr's theory of vision (Marr, 1982). Two important points arise from Marr's work. First, the content of visual representation is always individuated by reference to the physical subjects, their properties and the relations among the subjects that are seen. Second, there is in principle the possibility that a person's visual interpretation based on objective and physical objects and properties might be mistaken, and require other modalities to rectify the mistake (for Marr, this might be achieved by using another sense, e.g. touch).
The preceding chapter discussed in some detail the two important and fundamental structures for the signification of any sign system, i.e. the paradigmatic and syntagmatic axes of meaning. The basic organisation of these structures is hierarchical, as is the organisation of the description of a shot itself, in that the description descends from general features to detailed specifications. This enables us to reduce redundancy, since valid relationships between descriptional units can be implicitly expressed (inheritance). A similar approach is provided by Parkes' setting structure (Parkes, 1989a).
Though the signification of film is strongly based on common human content and thematic structures, film clearly provides its own communicational mechanisms, i.e. cinematic and filmic codes, such as the use of fades and wipes as punctuation devices, or the use of the spatial relationship between camera and subject to create a three-dimensional space. Hence, our shot description features two main structures, cinematographic devices and denotative aspects. The denotative part of the shot representation is subdivided into two nested structures, the foreground and background, each containing information about the essential categories in the proposed ontology: action, character, object, relative position, screen position, geographical space, functional space and time.
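As a preview of the structures developed in the remainder of this chapter, the sketch below illustrates the two-part shot description just outlined. It is a minimal sketch with hypothetical field names, not the actual AUTEUR representation.

# Illustrative sketch (hypothetical field names): a shot description
# split into cinematographic devices and denotative aspects, with the
# denotative part nested into foreground and background layers.
from dataclasses import dataclass, field

@dataclass
class DenotativeLayer:
    actions: list = field(default_factory=list)
    characters: list = field(default_factory=list)
    objects: list = field(default_factory=list)
    relative_position: dict = field(default_factory=dict)
    screen_position: dict = field(default_factory=dict)
    geographical_space: str = ""
    functional_space: str = ""
    time: str = ""

@dataclass
class ShotDescription:
    shot_id: str
    cinematographic: dict                 # e.g. {"shot_type": "close-up", "camera": "pan left"}
    foreground: DenotativeLayer = field(default_factory=DenotativeLayer)
    background: DenotativeLayer = field(default_factory=DenotativeLayer)

shot = ShotDescription("shot-1", {"shot_type": "medium"})
shot.foreground.characters.append({"id": "actor-1", "role": "doctor"})
shot.background.geographical_space = "Paris"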
It can thus be said that the proposed structure for the content representation of a shot is decontextualised, as inspired by Eisenstein's concept of attractions (Eisenstein, 1948, p. 231). It is the decontextualisation of a shot that enables the selection of objective information, which can then be used as a basis for rearranging or constructing new connotative combinations, and thus new meanings. The aim of our content representation is to identify the visual aspects of the video which can be seen in the shot, rather than those one might infer. As our representation is language oriented, we face two major and related problems concerning the primary level of representation: objectivity and continuity.

As we are using textual terms, we must avoid any overly directive choice of labelling. Therefore, we introduce generic terms. For example, instead of instantiating the action of an actor as gorge, which implies greed, only the most general term, eat, is recommended. It is then the task of the system's inference mechanisms to use the wider context established through background knowledge, to either conclude greed, or reject this interpretation. Though the information relevant to the essential categories in the denotative dimension of the ontology should be sufficient for drawing inferences, it is not necessary to slavishly apply the principle of objectivity. To do so may, in any case, lead directly to a single interpretation. Consider, for example, a character wearing a white coat, white trousers, and white shoes, with a stethoscope around his or her neck. He or she appears to be a doctor, and should be labelled so. Further examples relate to the description of geographical or functional space. An image showing the Eiffel Tower can acceptably be labelled as Location: "Paris", or one showing four huge heads of American Presidents as Location: "Mount Rushmore" or "USA". Moreover, in the Paris example, there is yet scope for interpretation, to clarify whether Eiffel Tower should, for example, be understood in terms of the semantic tags "Eiffel Tower - Paris France - Life Style" or "Eiffel Tower - Engineering - Intelligence". Such interpretations depend on the wider context and the intended concepts to be communicated. In conclusion, the use of generic terms taken from natural language supports, but does not uniquely determine, the interpretation of a shot.

Related to the problem of objectivity is the problem of continuity. Kuleshov's experiments (section 4.2.2.2) and Eco's semantic systems (section 4.2.1.1) reveal that cinematic continuity can be achieved through different content categories:

actor       e.g. Charlie Chaplin
appearance  features of a character are shown without identifying the character
action      continuity can be achieved through reaction to, chronology of, or
            direction of, the action
location    e.g. to show a house and then a room

The essential aim of cinematic continuity is to hold constant the distinguishing details of a character or object over a number of shots, unless there is a reason to change the appearance of the character or object. Thus, continuity leads to the problem of identity. The representation of identity is a complex problem, particularly in film, a medium that does not suggest, as does language, but rather states. Imagine a number of shots to be joined, each of which shows a character described with the same values for various attributes (e.g. Race = "black", Role = "doctor", etc.).
For an artificial system which depends on the given information, the person in each shot would be assumed to be the same, even though each shot may, in fact, show an obviously different human being. The problem here is one of descriptional depth and its maintenance over time; a problem clearly related to the frame problem.9 Expedience dictates the need to reduce the descriptive richness to a manageable level. For this, we introduce an identifier, which is always used when a character or object is objectively distinguishable, e.g. we see the face of a character.10

To facilitate dynamic use of video material, we follow the stream-oriented approach, inspired by Aguierre Smith & Davenport (1992), in combination with Parkes' concept of the setting as outlined above (Parkes, 1989a).11 The usefulness of combining these two approaches lies in gaining the temporality of the multi-layered approach without the disadvantage of using keywords, as keywords have been replaced by a structured content representation. For each objectively described unit of video associated with a time-interval, it can be stated that the visual content of this video holds constant over the time interval, and what is invisible does not exist.

Hence, we associate each descriptive unit in both the content and the cinematographic section with a particular frame sequence. The connection between the different layers of a shot is realised by applying a triple identifier to each layer, which indicates the shot identifier, the start frame and the end frame. Thus, multilayered descriptive structures of video content are created, where the multiple aspects of description are held together by time (sameness of frames) and logical space (shot-id). For example, an actor may perform a number of actions in the same time span. The temporal relation between them can be identified using the start and end points with which those actions are associated. For example, they may all share the same start and end point, and may be performed simultaneously. In this way, complex structured human behaviour can be represented, and hence the video retrieved on this basis.

9 Some of the more important works discussing the frame problem are Hayes (1990), McCarthy & Hayes (1990), Raphael (1971), and Sandewall (1972).
10 This approach is in accordance with that of both Bloch and Parkes. A different opinion is expressed by Davis.
11 A similar approach can be found in research by Butler et al. (1996), who realises filmic principles as a "film grammar", so that the system can generate films of a certain type. The nonterminal symbols of the grammar represent groups of video segments (e.g. a close-up fragment), which are derived from rules based on filmic principles, such as parallelism, subjective shot or repetition. Setting descriptions (see above) are used as terminal symbols that offer the opportunity to realise event structures, which are then used to fire the rules of film principles. Butler's notation, like that of Parkes (Parkes, 1989a; Parkes, 1989b), is based on conceptual structures derived from conceptual graphs (Sowa, 1984). At the time of writing, Butler is completing his Ph.D. research.
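The triple identifier and the temporal comparison of layers might be realised along the following lines (a simplified Python sketch under our own naming assumptions, not the system's actual code; the same interval arithmetic also isolates the shot segment discussed next):

from dataclasses import dataclass

@dataclass
class Layer:
    # One descriptive unit of a shot, anchored by the triple identifier.
    shot_id: str       # logical space: the shot the layer belongs to
    start_frame: int   # first frame for which the description holds
    end_frame: int     # last frame for which the description holds
    category: str      # e.g. "action", "character", "object"
    value: str         # e.g. "eat", "sit", "talk"

def simultaneous(a, b):
    # Layers of the same shot describe simultaneous content if their
    # frame intervals overlap (sameness of frames, same shot-id).
    return (a.shot_id == b.shot_id
            and a.start_frame <= b.end_frame
            and b.start_frame <= a.end_frame)

def common_segment(layers):
    # The frame interval over which all given layers hold, or None if
    # there is none; this isolates the segment relevant to a query.
    start = max(l.start_frame for l in layers)
    end = min(l.end_frame for l in layers)
    return (start, end) if start <= end else None

# A hypothetical annotation of a 100 frame shot:
eat = Layer("s1", 0, 100, "action", "eat")
sit = Layer("s1", 0, 100, "action", "sit")
talk = Layer("s1", 30, 70, "action", "talk")
assert simultaneous(eat, talk)
assert common_segment([eat, sit, talk]) == (30, 70)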
Figure 5.4 shows a layered description of a shot consisting of a hundred frames, featuring the actions of a single character.

Figure 5.4 Actions annotated in layers in a 100 frame shot

The horizontal lines in Figure 5.4 denote actions, whereas the vertical lines delimit the various content-based layers that can be extracted from this shot. Applying this schema to all descriptive units enables the retrieval of particular material with no restrictions on the complexity level of a query. Take the simple example described in Figure 5.4. If there is a need for a character who eats, sits and talks simultaneously, we are now in a position to isolate the essential part of the shot, as shown in Figure 5.5.

Figure 5.5 Relevant shot segment for a query for all three actions

The relevant procedures for performing such retrieval and cutting are described in detail in chapter 6. Having introduced the overall structure of our shot representation, the next stage is to specify the units in the different representational categories. The discussion begins with cinematographic devices.

5.2.2 Cinematographic devices

The representation of cinematographic devices, as presented in Table 5.1, is derived from discussions with film editors and the analysis of film theory, as presented in chapter 4. Our aim is to facilitate the application of those cinematic codes that are related to the medium-specific technology, i.e. camera, lens, filmstock, as it is this technology which manifests itself in the medium's unique expressiveness (see Figure 2.4).

Name              Description
Shot ID           Identifier
Shotlength        in frames (25 frames per second)
Startframe
Endframe
Shot kind         a structure including:
                  lens movement    zoom-in [start camera dist., end camera dist.],
                                   zoom-out [start camera dist., end camera dist.]
                  masking          left, middle, right
                  lens state       deep focus, foreground-focus, background-focus
                  camera distance  extreme close-up, close-up, medium, medium long,
                                   long, extreme long
                  camera movement  pan_left, pan_right, tilt_up, tilt_down,
                                   roll_left, roll_right
                  camera position  left, middle, right
                  camera angle     overhead, high-angle, eye-level, low-angle
                  film speed       slow motion, normal, fast motion
Shot colour       list of the dominant colours; colour or black & white
Shot granularity  fine, medium, strong
Shot contrast     high, medium, low

Table 5.1 Representational structure for cinematographic devices

The apparent redundancy of representing both lens movement and camera distance is due to the fact that only one structure can be active at a given time. It should also be stressed that the unit Shot colour is not necessarily cinematographic, but is rather a shared feature of other codes. Shot colour has been designated as a cinematographic device because it is strongly related to shot granularity and contrast. Other refinements and extensions to the representation of cinematographic devices may be possible. For example, the representation does not deal with the stylistic device of a split screen. However, the representation is sufficiently rich to describe complex film-specific movements and expressive features, without unduly restricting possible connotative combinations of the described material.
5.2.3 Denotative aspects

The structures discussed in this section enable the description of visual information that supports modifications to the meaning of a shot, based on common human content, such as actions, characters and settings.12 The essential aim of the proposed representation is, therefore, to describe the complex actions of characters or objects in a geographical space within the three-dimensional space of a frame, but without ruling out the exploitation of the dynamic qualities of the medium. Additionally, we need to represent such features as colour, lighting, oblique versus symmetric composition, or depth perception, which, despite their physical denotative appearance, offer a suitable basis for the use of codes in the creation of meaning by automatically generated film sequences. Since a shot possesses features from a large number of categories, it is useful to separate the discussion of its components into two parts. We start with the description of representations of character, object and action, and then provide descriptions of the representations of space, lighting and time.

5.2.3.1 Character, object and action

Representing a character on the basis of his or her physical appearance is difficult, due to the large number of features involved. We have already discussed the compelling requirement to provide an objective character description and support essential continuity factors, such as identity or appearance. Our approach to representing character appearance (shown in Table 5.2) attempts to maintain a balance between the divergent aims of continuity, objectivity and computational efficiency, by establishing the essential distinguishable physical aspects of a human being.13

12 The term "setting" is henceforth used in the Blochian sense of "location" and not in the sense used by Parkes (see earlier in this chapter).
13 It must be stressed here that our representation is mainly intended for human beings, though we introduce the gender artificial, which hints at the possibility of extraterrestrials being described. However, most of these beings, as represented in films, possess many human features and are thus likely to conform to the presented structure.

Name        Description
Shot ID
Startframe
Endframe
Identifier  Identifier for a character, e.g. a name or a number
Gender      male, female, hermaphrodite, artificial
Age         e.g. young, old, 25, etc.
Race        e.g. black, white, Asian, etc.
Appearance  a structure including:
            role     e.g. lawyer, plumber, stewardess, etc.
            costume  kind    e.g. business suit, apron dress, overall, etc.
                     colour  a doublet list providing the major colour for the
                             top and bottom part, e.g. [black, white]
            appeal   e.g. casual, formal, etc.

Table 5.2 Substructure "character appearance"

The most critical of the above attributes is Appearance. The redundancy it reflects (e.g. role and costume may reveal the same information) may appear to be problematic. However, defining appearance in this way promotes computational efficiency, as the need for inferences about identity and continuity is reduced. The above representation provides only the meagrest details of a character.
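As an illustration, the substructure of Table 5.2 might be rendered as a record type (a sketch only; the fields follow the table, but the Python typing is our own assumption):

from dataclasses import dataclass
from typing import List

@dataclass
class Costume:
    kind: str          # e.g. "business suit", "apron dress", "overall"
    colour: List[str]  # major colours for top and bottom part, e.g. ["black", "white"]

@dataclass
class Appearance:
    role: str          # e.g. "lawyer", "plumber", "stewardess"
    costume: Costume
    appeal: str        # e.g. "casual", "formal"

@dataclass
class CharacterAppearance:
    shot_id: str
    start_frame: int
    end_frame: int
    identifier: str    # a name or a number
    gender: str        # male, female, hermaphrodite, artificial
    age: str           # e.g. "young", "old", "25"
    race: str          # e.g. "black", "white", "Asian"
    appearance: Appearance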
A character is, however, a dynamic, acting entity. As discussed in chapter 2, it is through particular actions that he or she is defined, especially since actions provide cues to the character's mental state or intentions. The problem is that an objective content representation of actions should only contain descriptions of objectively visible motion (see also Parkes (1989a)). These motions need not be represented down to their atomic units, such as representing 'walking' as a cyclic repetition of 'taking a step', as proposed by Davis (1995). By introducing temporally related annotations, we can simply represent complex patterns of human motion as single actions, examples being eating, reading, sitting, sleeping, and so forth. However, such a temporal-symbolic description of an action does not represent its emotional connotations. Moreover, some human actions specifically suggest emotions or intentions, i.e. gestures (Kendon, 1981; Ortony, Clore, & Collins, 1988; Wolff, 1972). Actions of the body, face, hands and limbs provide significant information either through motion (deliberately intended and expressed in some accepted code, such as winking, smiling, nodding or pointing) or statically (e.g. frowning and having the arms folded). By introducing shape-based representations of the body parts full body, head, hands and feet, and linking these to the temporal-symbolic representations of actions, we gain access to the indexical, metonymic system of gestures. The representation of gestures conforms to the principle of objectivity, because we specify that the meaning of a gesture is not explicitly stated in its representation but must be interpreted by the related inference rule for cultural behaviour (for example, shaking hands for greeting in western societies but bowing in eastern cultures).14 It must be stressed, however, that our representation of hand gestures is currently rudimentary, and should merely be understood as the first step towards a more complete representational scheme, in which detailed gestures of hand and fingers can be described.15

In addition to gestures as emotional indicators, we also include the speed of an action in the representational scheme, since this may provide information concerning the mood of a character. For example, an action which is performed slowly might indicate that the character is not in a hurry, and thus might either be relaxed and in a good mood, or bored and in an ambivalent or bad mood. Table 5.3 describes our action- and gesture-centred approach, which covers a sufficiently wide variation of human behaviour. The approach is inspired by Davis' body-centered structure for the representation of action (Davis, 1995, pp. 107-111). The attributes of the substructure Direction of action are taken from Bloch's representation (Bloch, 1986, pp. 140-141).

It may appear to be inefficient to relate information about body gestures to every single action. However, through the use of temporal multilayered representational structures, it is possible to automatically compare the time span for a newly annotated action with existing actions for a particular character. In the case of a match, only a link to the existing annotation for the body gestures must be established. If the result of the comparison is partially overlapping, either the existing gesture annotation must be temporally expanded or the undescribed gesture part of the new action must be annotated. Cases of total temporal mismatch need complete annotation of the actor's action, of course. In such a way, we ensure that given information in a shot description is not duplicated or altered, unless necessary.

14 For a discussion of cultural codes relating to gestures see also Bremmer & Roodenburg (1991) and Efron (1972). For a description of gestures for actors see Siddons (1968).
15 Approaches to the generation of gestures for automated agents are described in Russel, Starner, & Pentland (1995), Strassmann (1994) and Tosa et al. (1995). See also the work of the gesture and narrative language group at the MIT: (http://gn.www.media.mit.edu/groups/gn/)
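The linking strategy just described might be sketched as follows (hypothetical Python; the returned labels merely name the three cases discussed above):

def gesture_annotation_plan(new_span, existing_spans):
    # Decide how to annotate body gestures for a newly annotated action.
    # Spans are (start_frame, end_frame) pairs of actions already
    # annotated for the same character.
    s, e = new_span
    for es, ee in existing_spans:
        if (es, ee) == (s, e):
            return "link"                     # reuse the existing gesture layers
        if es <= e and s <= ee:
            return "extend-or-annotate-rest"  # expand the existing annotation,
                                              # or describe the uncovered part
    return "annotate-fully"                   # total temporal mismatch

assert gesture_annotation_plan((0, 100), [(0, 100)]) == "link"
assert gesture_annotation_plan((50, 150), [(0, 100)]) == "extend-or-annotate-rest"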
Name                 Description
Shot ID
Startframe
Endframe
Identifier           Identifier for a character, e.g. name or number
Relative Position    (Screen position first frame, screen position last frame),
                     e.g. (left, right), (left, middle), (right, right), etc.
Action               e.g. eat, drink, walk, read, etc.
Speed of action      e.g. slow, medium, fast
Direction of action  left, up-left, up, up-right, right, down-right, down,
                     down-left, front, back, circular
Bodygesture          a structure containing:
                     full body  horizontal, vertical, left-diagonal, right-diagonal
                     head       profile        right, left, half-left, half-right
                                movement       up-down, left-right, up, down,
                                               left, right, circle
                                eyebrows       up, down, straight, etc.
                                line of sight  left, right, straight, up, down, etc.
                                mouth          up, down, straight, open
                     hand       left   action/related object, e.g. (tap/table)
                                right  action/related object, e.g. (holding/head)
                     foot       left   e.g. tap, lift, etc.
                                right  e.g. tap, lift, etc.

Table 5.3 Substructure "actor action"

The representation of objects is based on similar structures to those defined for characters, but is much simpler, since for objects emotional reactions need not be considered. Nevertheless, objects possess shape, and can feature in events (i.e. have a function). Table 5.4 describes the structure for the representation of objects.

Name               Description
Shot ID
Startframe
Endframe
Identifier         Identifier for an object, e.g. a name or a number
Type
Shape              a structure containing form, colour, size
Relative Position  (Screen position at the start, screen position at the end),
                   e.g. (left, right), (left, middle), (right, right), etc.
Action
Speed of action    e.g. static, fast, slow, etc.

Table 5.4 Substructure "object"

We are aware of the need to represent groups (e.g. in mass scenes, such as an infantry offensive or a demonstration), but we have yet to address this issue. However, a group description of characters and objects would most likely focus on the size of the group (small, middle, crowd), on its constituents (male, female, extraterrestrials, mixed), its appearance (uniform, leisurely, etc.), its relative position on the screen and with respect to other objects or actors, its function, and its direction of movement.

5.2.3.2 Settings: space, time and lighting

In the original French, mise-en-scène means "putting in the scene" and refers to the compositional arrangement of subjects in a setting. Thus, mise-en-scène is mainly concerned with the creation of screen space. As described earlier, the syntax of cinematic space has two dimensions, the screen space and the space being portrayed (the location). The former is concerned with the limitations of the frame, and the latter with composition within the frame. If the camera tends to follow the movement of a subject (character or object), then the form of the frame is usually called "closed", whereas if the character leaves the frame and re-enters, then the form is considered open.
The correspondence between camera movement and movement within the frame forms one of the more sophisticated cinematic codes. The attributes Relative Position and Speed of action within the representation of character and object express the subject side, whereas the sub-structures camera movement, camera direction, camera angle and film speed represent the equivalent options for the cinematographic side. Furthermore, a frame provides a compositional balance that distributes masses and points of interest. The balance of a frame can either be strictly symmetric or provide a loose balancing of the frame's left, middle and right areas. Moreover, as discussed earlier, a frame provides depth. In chapter 4, we described the three compositional planes representing the three-dimensionality of the two-dimensional image: the frame itself, the geography (bottom line to horizon) and depth. Our representational ontology deals with the different planes within a frame, and the problems of compositional balance, by using the structures described in Tables 5.5 and 5.6 in combination with the attribute Relative Position, from the representational structure for character and object.

Name              Description
Shot ID
Startframe
Endframe
Identifier        Identifier for an object or a character, e.g. a name or a number
Spatial relation  e.g. above, under, behind, in front, etc.
Identifier        Identifier for an object or a character, e.g. a name or a number

Table 5.5 Substructure "relations"

Name        Description
Shot ID
Startframe
Endframe
Foreground  a structure containing:
            actors       list of identifiers for single characters
            agroup       list of identifiers for groups of characters
            objects      list of identifiers for single objects
            ogroup       list of identifiers for groups of objects
            composition  vertical, horizontal, canted-left, canted-right, neutral
Background  a structure containing:
            actors       list of identifiers for single characters
            agroup       list of identifiers for groups of characters
            objects      list of identifiers for single objects
            ogroup       list of identifiers for groups of objects
            composition  vertical, horizontal, canted-left, canted-right, neutral

Table 5.6 Substructure "deep-space composition"

The representational structures for depth and horizontal space provide sufficient information to support connotational inferences, such as the importance of frame sides based on character load, or the importance of a character related to his or her position within the frame or in relation to other subjects. Since it is possible to locate subjects in the horizontal frame space, as well as in the "imaginary" three-dimensional depth plane, our structures provide the essential information to establish continuity of common space between shots (as discussed in section 4.3), of which the 180˚ system is the most prominent example. The 180˚ system will be discussed in more detail in chapter 6.

A further critical feature of spatial content is the actual location, or setting, as it appears in the shot. Location is more than a simple identifier, as André Bazin states:

'The drama on the screen can exist without actors. A banging door, a leaf in the wind, waves beating on the shore can heighten the dramatic effect. Some film masterpieces use man only as an accessory, like an extra, or in counterpoint to nature which is the true leading character.' (Bazin, 1967b, p. 102).
One might think that an important component of the representation of spatial content would be the identification of the actual location where the material has been recorded (see, for example, Davis, 1995). This may well be important information for interpretational purposes, as outlined in section 2.1.3. However, considering the dual semantics of a shot, as demonstrated in the Kuleshov experiments described in chapter 4, it is apparent that this is not the case, unless the shot contains explicit cues which determine location. If a shot shows a stretch of sand, only the context can make clear whether the portrayed location is a dune in the Sahara or part of the beach on the island of Sylt. Nevertheless, there are critical aspects of location to be represented, such as formal characterisations of the geography or the functionality of the location.

However, a setting not only provides information about location, but also temporal cues. In discussing temporal cues, we do not mean those related to the duration of the portrayed event or action, since these can be deduced from the time span defined from the start to the end frame of the annotated unit. The temporal information we have in mind is content oriented. The costumes may suggest the epoch, or give cues concerning season or time of day. Representing such information explicitly may result in redundancy, especially when the costumes are described using the structures for object annotations described earlier. On the other hand, having the particular temporal information explicitly stated again reduces the need for inferencing.

Thus far, we have said much about the representation of objectively describable components within a shot with respect to their compositional impact. The sole essential category we have omitted is that of lighting. The importance of lighting is precisely described by Bordwell and Thompson, who write:

'In cinema, lighting is more than just illumination that permits us to see the action. Lighter and darker areas within the frame help create the overall composition of each shot and thus guide our attention to certain objects and actions. A brightly illuminated patch may draw our eye to a key gesture, while a shadow may conceal a detail or build up suspense about what may be present.' (Bordwell & Thompson, 1993, p. 152).

Thus, lighting is significant and needs to form a part of the representational structure for a setting, as seen in Table 5.7.

Name      Description
Shot ID
Startframe
Endframe
Time      a structure containing:
          epoch    e.g. middle-ages, 1994, 5000, etc.
          season   spring, summer, autumn, winter
          daytime  e.g. dawn, noon, afternoon, midnight, etc.
Location  a structure containing:
          geography   rural, populated, land, sea, outer space
          identifier
          function    indoor, outdoor, transparent16
Lighting  a structure containing:
          direction  front, back, side-left, side-right, high, low, overhead
          quality    e.g. soft, hard, light, opaque, etc.
          source
          physics    only applicable to outdoors, and featuring atmospheric
                     conditions, e.g. sunny, windy, etc.

Table 5.7 Substructure "setting"

16 This represents a function in between that of indoor and outdoor, such as a carriage or a room with a view.

5.2.4 Conclusion

We believe that the structured representation presented in Tables 5.1 - 5.7 is sufficient to describe the denotative aspects of film, in addition to its time and space dependencies, without restricting possible connotative combinations - the latter is especially a limitation of keyword-based or unstructured free-text annotations. However, our work on shot representation is but a first stage.
There are a number of problems yet to be solved, such as linking subjects to the information provided by the setting substructure lighting, which, at the moment, can be applied only to the setting in general. A further problem is spatial in nature. Imagine a shot in which the foreground of the left side of the frame shows half of a character's face, and in the background on the right hand side, a group of people sit around a table and gamble. The face is definitely a close-up, whereas the gambling scene is a long shot. Though the representation is able to distinguish between shot types during the process of juxtaposing shots, it is not possible so far to apply the same precision within the border of the frame - unless particular inference mechanisms are provided that use spatial and size information to establish such compositional relations automatically. Further research is needed to address these problematic areas.

It should be stressed that the proposed organisational structure for the representation of video content constitutes but a framework. Not every suggested attribute must be annotated - though it is apparent that the "vision" of an autonomous editing system depends entirely on the amount of information provided by the content annotations. Hence, the remaining question of particular interest is how much of the presented representation of video content can be provided automatically.

5.3 Technical environments for content annotation

The achievements of current research in video processing are far from being sufficiently sophisticated to produce representational structures such as those described above. In particular, the automatic parsing of high-level semantic and cinematic categories has yet to move beyond the investigation stage. However, there are relevant developments that can contribute to the automated process of video annotation with respect to:

• identifying camera motions, such as pans and zooms (Tonomura, Akutsu, Taniguchi, & Suzuki, 1994; Ueda, Miyatake, & Yoshizawa, 1991)
• detecting fades, wipes and dissolves (Aigrain & Joly, 1994; Zhang, Kankanhalli, & Smoliar, 1993)
• recognising scene boundaries for news (Zhang, Gong, & Smoliar, 1994), which is achieved by using a model designed to recognise particular types of shots
• performing macro-segmentation for documentaries (Aigrain, Joly, & Longueville, 1995), based on transition effect rules, shot repetition and similarity rules, editing rhythm and soundtrack rules
• segmenting digital video by using explicit models of video production to design feature extractors (Hampapur, et al., 1995a; 1995b)
• structuring video based on the correlation of colour between two adjacent frames in an image stream (Nagasaka & Tanaka, 1992)
• recognising gestures based on the segmentation of characteristic silhouette features, a priori knowledge of people and knowledge of the human body (Pinhanez & Bobick, 1995; Russel, et al., 1995)
• identifying object motion in constrained video (Herzog & Wazinski, 1994; Ueda, Miyatake, Sumino, & Nagasaka, 1993)
• semi-automated annotation of sets of images by using several vision-based texture models (Picard & Minka, 1995)
• grouping structural image features, such as brightness, edges, and texture features, which can then be transformed into a description of the most important attributes of a set of frames.
Detailed relationships between things, i.e. the geometry of a scene or a human face, are captured by the Karhunen-Loeve transform, the Wold transform being used for textural properties, e.g. orientation (Pentland, Picard, Davenport, & Haase, 1994; Picard & Liu, 1994; Ashley, et al., 1995).

For the foreseeable future, the bulk of the annotation, at the semantic level, will rely on human activity, supported by intelligent and semi-automated annotation systems. In this thesis, we are not concerned with environments to facilitate the annotation of video content. For now, we have omitted this area from our research, and simply refer to ongoing research on such interfaces by Davis (1995), Gordon & Domeshek (1995), Mills, Cohen, & Wong (1992), Oomoto & Tanaka (1993), Tonomura, et al. (1994), Ueda, et al. (1991), and Yeung et al. (1995), among others.

Chapter VI
The representation of knowledge for automated editing

The task of representing editing knowledge may, on first impression, appear to be simple, since at the physical level there are only two ways of joining shots. One can either overlap them or put them end to end. The editing model presented in chapter 4 showed that editing is a much more complex process, which encapsulates the retrieval, shaping and ordering of appropriate shots to support a cinematographically coherent and clear relating of the story, where the actions of characters are portrayed in an undistracting way. Moreover, the three essential processes of editing, i.e. the retrieval, ordering and shaping of shots, are all simultaneously subject to the narrational constraints of space, time and cause-effect.

The aim of this chapter is to present the representations of editing structures and mechanisms that are required to establish the link between the available video material and the narrative specification. It is important that the reader is aware that the presented editing mechanisms and related structures do not alter the logic of the narrative, but rather provide the knowledge for an appropriate presentation related to the content and the narrative intention of the scene. It must be reiterated that, as already mentioned in section 4.3, we do not intend to achieve an automated "fine cut" editing, but rather a joining of shots at the "rough cut" level, and that our approach is not directed towards the production of "art". Finally, the following investigation focuses merely on the creation of meaningful sequences. The combination of shots at higher narrative levels than the event, e.g. the episodic level, is not considered.

We begin our investigation with an analysis of the nature of joins between shots. In section 6.2 we describe one system for dealing with the problem of spatial and temporal continuity in video editing. Section 6.3 discusses related research into automated video editing. The chapter closes with a detailed description of the representations and strategies we have developed to facilitate automated editing.

6.1 Shot editing: Mixage and Cut

Joins between shots are of two main types. Firstly, shots can be overlapped (i.e. double exposure, dissolves or wipes). As described in chapter 4, these joins serve as punctuation devices within the syntax of larger narrative units, usually as end points. Since such devices relate to a narrative level that we are not considering, such joins do not feature in the ensuing discussion.
A further reason for excluding such joins from our investigation is that fade-outs, fade-ins, dissolves and wipes are optical effects, and are usually achieved in the laboratory. These technical devices are regarded by the current author as too complex to be considered at the current time, but should feature in future developments. The second, and more common, way of combining shots is the cut, which means juxtaposing the last frame of a shot with the first frame of the shot to be joined. However, there is a further option: the insert. An insert occurs when a shot or a chain of shots is spliced into another shot. Thus, the intention of an insert is not to support the continuous flow of information, as performed by a cut, but rather to introduce a temporal transformation, i.e. the expansion of information (as discussed in section 2.1.3).

Despite their importance as the primary means for editing, cuts are problematic in that they constitute a physical break, which might reduce the viewer's involvement in the presentation on spatial and temporal grounds. Thus, we need to perform cuts so that a smooth information flow from shot to shot results. In the early stages of the film industry, an editing system was established which uses strategies of mise-en-scène and cinematography to ensure visual continuity between shots. This system is called continuity editing, a style of editing used particularly in narrative-oriented film, and thus most relevant to our research.1 The following section gives a brief introduction to the basic underlying structure (the 180˚ system) and related cinematographic strategies.

1 There are of course other styles of editing, which might also suit the presentation of narrative film sequences, such as spatial and temporal discontinuity (e.g. the 360˚ space system, the jump cut or the nondiegetic insert, where a metaphorical or symbolic shot is inserted which is not part of the space and time of the narrative). Another editing style can be found in abstract films, where graphic and rhythmic dimensions have a much more substantial impact. However, these alternative models of editing are not investigated further in this thesis.

6.2 Spatial and temporal continuity in editing: the 180˚ system

Bordwell and Thompson describe the 180˚ system as follows:

'The scene's action - a person walking, two people conversing, a car racing along a road - is assumed to take place along a discernible, predictable line. This axis of action determines a half circle, or a 180˚ area, where the camera can be placed to present the action. Consequently, the filmmaker will plan, film, and edit the shots so as to respect this centre line. The camera work and mise-en-scène in each shot will be manipulated to establish and reiterate the 180˚ space.' (Bordwell & Thompson, 1993, p. 262).

The aim of the 180˚ system is to provide the viewer of a scene with a clear understanding of the position of characters, i.e. the spatial relationship between characters and the spatial relationship between each character and the setting. Figure 6.1 graphically describes the 180˚ system.

Figure 6.1 The 180˚ system (based on Bordwell & Thompson (1993, p. 263))

Imagine that A and B in Figure 6.1 are two conversing characters.
The simplest way of establishing the axis of action between A and B would be to use the shot provided from camera position 1, because both characters are present in the scene, and their spatial relationship need not be inferred by the viewer. For the viewer of the scene, it is clear that the spatial relationship between A and B is oppositional, and that A is located in the setting space to the left of B. Combining shot 1 with that taken from camera position 3, the viewer can see from the background that some common parts of the shot taken in position 1 appear, i.e. character A and parts of the scenery. Thus, the viewer becomes spatially oriented with respect to the scene and understands that the second shot presents the same space but from a different angle. However, if we now joined the shot taken from camera position 4, B would suddenly be surrounded by a different background, and, even more importantly, would have changed sides with A. The result would be a visual distraction, which should be avoided, unless such an exchange of character positions is motivated.

Now consider a similar situation, except that now both characters are moving, e.g. two people meet in a street. Assume that A moves from left to right and B approaches from right to left. Now imagine that the screen direction of character A changes, which means that he or she is now walking from right to left. Did the character turn around while the walking of B was shown, maybe because A did not wish to meet B? This may or may not be the case, but the important thing is that such a break in continuity can cause confusion. Thus, the 180˚ system allows the creation of a continuous space from autonomous shots, but constrains the order of shots, based on content attributes of spatial importance, such as the direction of characters' glances or their direction of movement and spatial relationships between character and setting.
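The core screen-side constraint of the 180˚ system can be made concrete with a small sketch (in Python, with invented structures: each shot description simply maps character identifiers to screen positions):

def respects_axis(shot_a, shot_b):
    # Reject a cut if any character visible in both shots changes sides
    # of the screen, the basic constraint of the 180-degree system.
    for character, position in shot_a.items():
        if character in shot_b and shot_b[character] != position:
            return False  # the character changed sides: visually distracting
    return True

# Joining the shots from camera positions 1 and 3 of Figure 6.1 keeps A on
# the left; the shot from position 4 swaps A and B and would be rejected:
assert respects_axis({"A": "left", "B": "right"}, {"A": "left"})
assert not respects_axis({"A": "left", "B": "right"}, {"A": "right", "B": "left"})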
Figure 6.1 not only demonstrates the importance of camera position, but also that the distance between the camera and the key event in the scene provides crucial information. In chapter 4, we showed that there are two cinematographic devices for controlling the distance, and thus the awareness space of the visual plot: the camera distance and the lens movement. Since there is no universal measure of camera distance, we use the classification system developed by Dziga Vertov (described in Petric, 1987), which divides camera distance into seven shot types, as described in Table 6.1.

Camera distance   Covered content space
extreme close-up  This shot isolates details, such as the lips or eyes of a face.
close-up          This type of shot typically exposes the head, hands, feet and
                  smaller objects. The intention is usually to highlight facial
                  expressions, gestures or particular objects.
medium close-up   A human body is shown from the head down to the chest. Gestures
                  and expressions are distinguishable.
medium            A human body is shown from the head to around the waist. Gestures
                  and expressions become more distinguishable.
medium long       Taking a human body as the measure, the subject is framed from
                  around the knees upwards.
long              Includes at least the full figure of subjects but the background
                  dominates.
extreme long      The human figure is almost invisible. Used for landscapes,
                  bird's-eye views (e.g. of cities).

Table 6.1 Relationship between camera distance and size of presented content space

Finally, there are a number of techniques for shot juxtaposition that support a smooth flow of content space. The following presentation, adapted from Bordwell & Thompson (1993, pp. 256-275), discusses only those strategies which are of importance for automated editing at the level of the sequence:

establishing shot
This is the first shot of a sequence, and describes the general location. The type of shot usually varies, depending on the functionality of the location. For an indoor location, medium or medium-long shots are preferred, whereas for outdoor locations, long shots are usually more successful. However, the establishing shot can also be a composed shot sequence [as demonstrated by Kuleshov (usually featuring shots of types between medium-long and medium close-up)]. Important analytical factors are camera position, camera angle, camera movement and, for the composed version, a memory for already established spatial relations. It is important that the spatial relationship between subjects is kept constant from this shot onwards, unless their movement motivates a re-establishment.

shot/reverse shot
A repetitive sequence of similar shot types, where each shot shows the opposite end of the established axis of action. Shots used for such a pattern are usually taken from behind the subject that forms the opposite end of the established axis (overhead angle). This pattern is used for action-reaction situations and is usually used for the visual breakdown of a scene.

re-establishing shot
This shot is usually used when subjects are added to, or removed from, a setting to re-establish the overall space. Thus, a pattern such as establishing shot - scene breakdown - re-establishing shot is common. The mechanisms involved are the same as for the establishing shot, except that now there is an additional subject introduced (either a character or a group of characters).

eyeline match
This tactic is used to combine two shots where one shot presents a character or group of characters looking at something, and the other shot presents what is being looked at. The shots can also appear in reverse order. The important feature here is that in neither shot are object and looker simultaneously present (see also Bloch, 1986).

action match
This is an editing device that shows the beginning of an action and continues the same action in the following shot. The important aspect here is to maintain a constant on-screen direction of movement. This is one of the most powerful editing devices for providing continuity, because it focuses the perceiver's attention on the motion of the action, and thus lowers his or her attention to differences resulting from the cut (see also Bloch, 1986).

point of view shot
This technique combines shot/reverse shot with eyeline match. It is normally used to show a scene observed by one or more characters. This tactic, as applied to one particular human action, is described in Figure 6.2, where the establishing shot of B is taken from camera position 1, whereas camera position 2 establishes the axis for the object of B's gaze. The essential analytical factors are the comparison of the camera angle for position 2, and the line of sight of the character portrayed from camera 1, which must be equal or related.
Figure 6.2 Schematic description of the POV shot (based on Bordwell & Thompson (1993, p. 273))

cheat cut
A perfect match between two shots, with respect to action or graphical pattern, cannot always be ensured. A cheat cut tackles the problem by using the power of narrative causality. The idea is to emphasise overall similarities of graphical pattern, i.e. constant screen position, constant direction of action and eyeline match, as major control devices which need to be fulfilled. Any extra accomplished constraint then adds to the viewer's acceptance of the cut.

The above techniques indicate that visual, content-oriented continuity between shots is mainly based on the direction of action, the relation between subjects (characters and objects) and the position of subjects in a setting. Moreover, there is a need to control the particular information presented, so that the viewer can be visually guided towards the intended understanding of the sequence. Thus, there is a need to combine the narrative logic (point of view, intention of an action in the given context, intention of the sequence, etc.) with the representational structures and editing mechanisms necessary for the automated generation of meaningful film sequences directed towards a particular emotional outcome, i.e. humour.

Before introducing our approach to the representation of the features discussed in the preceding paragraph, we provide a short review of existing systems that support content-based automated editing.

6.3 Related work

Little work has been carried out on content-based automated editing. The two major approaches are discussed in this section.

6.3.1 Splicer

Sack & Don (1993) describe a prototype video resequencing system called Splicer. The system consists of two main components:

• a knowledge base of around 50 video clips dealing with the Iran-Contra hearings. The clips are annotated on the basis of Sack's representation of point of view and bias in the news (Sack, 1993) and Don's work on narrative construction and point-of-view (Don, 1990). The annotations contain information about the speaker in the clip, the topic, and other features.
• montage rules, created by the user. These rules are written in a Prolog-oriented language, and represent, for example, the strategy point-counterpoint. The rules are used to compose sequences.

Splicer offers a spreadsheet-oriented interface, where the user chooses or creates strategies to establish relations for rows (Group_of_speaker) and columns (Topic_of_dialog). Given a query such as "Group_of_speaker = Topic_of_dialog = Contra-Issues", the system fills the relevant cells with clips. When the user selects one of the clips and applies one of the (rhetorical) Montage rules (e.g. point-counterpoint), the system starts with the selected clip and adds the related clips according to the given rule.

However, the order of shots used by Splicer does not reflect any cutting constraints. The Montage rules used are not editing rules in the sense that they create a visually coherent composition (there being no representation of cinematographic devices or spatial characteristics of actors, and so on), but rather create a coherent intellectual space. Thus, Splicer's contribution is to represent video material in such a way that rhetorical rules can be used to create micro-documentaries expressing distinct ideological points-of-view.
This is a similar endeavour to our own, except that we are interested in provoking an emotional reaction in the viewer, and we intend to achieve this by presenting narratives.

6.3.2 Bloch's machine for audio-visual editing

Bloch (1986) bases his approach to automated editing on the following two assumptions:

• Specific narrational conditions will enforce specific constraints on cuts.
• The generation process is based on knowledge of the number of cuts that are needed to create a sequence.

Bloch's research focuses on continuity editing (especially the maintenance of fluency of motion between shots), and specifically considers the constraints position, motion and glance:

Position
The formalisation is based on Burch's taxonomy for joining positions (Burch, 1981), e.g. when two characters are together in a relatively close shot, following shots should respect the established positions of characters (A on the left, B on the right). The relevant attributes are the character's physical position in terms of both the location and the screen.

Glance
Bloch distinguishes between shots where two characters or groups of characters are facing each other, and shots where one character looks at nothing in particular. His concern is with the eye-match of two characters. The constraints he introduces are that two characters facing each other shown in different shots must look in opposite directions, where the opposition is based on the line of sight of the character in the first shot. The directions of a character's sight are detected using vectors in the plane of the screen (discussed in section 5.1.1).

Motion
For motion, Bloch identifies as essential control attributes the speed and direction of actions performed by the characters. For speed, Bloch states that this should be the same between two joined shots. For direction, he states that changes should be avoided, and if necessary should be introduced by a shot in which the direction of the action is unidentifiable, directions he describes as front and back directions (discussed in section 5.1.1).

The above constraints are ordered in terms of importance, giving motion precedence over the other two, which are attributed with equal importance. Bloch uses the related constraints for each control attribute as the basis of a guide for construction, where the construction of the video sequence covers three main tasks. First, the construction process separates the given story into appropriate segments (usually an autonomous sequence, as described by Metz (1974)), based on parsing the punctuation and linguistic temporal forms (e.g. and, and then, etc.) in the story text. The second task within the process of sequence construction is to translate each segment into conceptual dependencies and to determine the number of necessary shots for the segment. The translation of an action into a CD is based on work by Schank (1982), Schank & Abelson (1977) and Schank & Riesbeck (1981). The representation of a CD contains the action to be performed, the id of the performer, the object the action is performed on, and a marker, stating if the action is an interaction (inst = yeux), a movement (inst = direction of movement) or of any other kind (inst = ' '). The instance of an action is provided by a pattern matcher that can associate particular actions with interactions or movement.
The decision concerning the number of shots used for generating the sequence is based on the number of CDs (one or several), and on the action type. For example, a story such as 'Gilles and Said speak with each other' can be represented by one conceptual dependency, but since the instance of the action is 'yeux', the action can be presented in one or two shots, which allows the following representation (text in * * added by the current author):

* described action to be shown in one shot *
(and-simul
  (attend (actor ("said")) (object "gilles") (inst yeux))
  (attend (actor ("gilles")) (object "said") (inst yeux)))
Type du segment -> ordinaire

* described action to be shown in two shots *
((attend (actor ("said")) (object "gilles") (inst yeux))
 (attend (actor ("gilles")) (object "said") (inst yeux)))
En 2 plans
Type du segment -> ordinaire

The third part of the construction process is to match the established CD against the content representation as described in Figure 5.1, based on rules that provide the relevant constraints concerning action, movement and glance, such as:

if a segment is to be constructed from two shots
and the number of characters is 2
and the instance = "yeux"
then directions of sight must be opposite each other
and screen positions of characters must be opposite each other.

Bloch's approach is useful in that it introduces a distinction between guided construction and case-dependent constraints. Furthermore, his approach provides a practical solution to the problem of representing the essential elements for continuity editing, i.e. direction and speed of movement, direction of sight and the position of characters on the screen and in relation to each other. Finally, he ranks editing strategies according to importance, e.g. the action match is more important than the eyeline match.

However, Bloch's approach to the automated juxtaposition of shots suffers from shortcomings that are partly related to problems in his scheme for representing video content, which was discussed in section 5.1.1. Bloch ignores several continuity problems on the graphical and spatial levels. An example of the former is the problem of light and colour changes, which is avoided by providing only black and white material, featuring the same level of light, which performs adequately in any possible combination. An important omission at the spatial level is that, in Bloch's scheme, there is no comparison of the content space of the two shots to be juxtaposed. In his thesis, Bloch shows that he is aware of this problem, by pointing out that his scheme cannot support decisions for cases where the number and identity of characters might be correct, but the location in both shots is different. Bloch's system would simply join the shots. Bloch mentions that he cannot provide a solution to such problems due to the rudimentary state of the proposed representation of content space (Figure 5.1 shows Bloch's shot representation). He therefore suggests that the background is kept single-coloured and free of objects, which renders the comparison of backgrounds unnecessary.

Though Bloch demonstrates an understanding of the process of presenting an event in various ways, by providing mechanisms for decomposing actions based on their type (looking, moving, other), he overlooks the problem that not only the visual presentation of an action can be decomposed, but also the presentation of a character.
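The two-shot rule quoted above might be operationalised as in the following sketch (Python with invented field names; it covers Bloch's glance and position constraints, while his motion constraint - equal speed, constant direction - would be checked analogously):

def yeux_constraints_ok(shot1, shot2):
    # For an interaction (inst = yeux) presented in two shots, the
    # characters' lines of sight and screen positions must be opposite.
    opposite = {"left": "right", "right": "left"}
    return (shot2["line_of_sight"] == opposite.get(shot1["line_of_sight"])
            and shot2["screen_position"] == opposite.get(shot1["screen_position"]))

# Said looks right from the left of the screen, so Gilles must look left
# from the right of the screen:
assert yeux_constraints_ok(
    {"line_of_sight": "right", "screen_position": "left"},
    {"line_of_sight": "left", "screen_position": "right"})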
This means that his system is in a position to present the action of "X looking at Y" in two shots, but could not find solutions for "A walks towards B", where the first shot shows feet walking to the right, and the second shot shows feet walking to the left. Related to the decompositional shortcoming of Bloch's approach is his apparent lack of attention to the relationship between shot kinds and their expressional influence on the presentation of a narrative-oriented video sequence. Both problems, the decomposition of characters and the relationship between different shot kinds, are resolved by providing video material of a particular style (e.g. shots where an eyeline match is required are only available as close-ups). Finally, Bloch himself acknowledges a problem relating to a deficiency of his scheme with respect to the connotative aspects of cinematographic devices in narratives, i.e. the inability to shape a single shot, or to order shots, according to their duration. Shots can only be used in their entire length, which means that trimming a single shot, or rhythmically structuring a sequence, are not possible. However, these are essential features of editing, particularly for emotionally stimulating sequences.

Nevertheless, Bloch's work certainly exerts an influence on our approach to automated continuity editing, which we present in the following sections. We use features of Bloch's work, such as guided construction and case-based constraints, and the use of action, position and sight as major control constraints, in deciding on the suitability of juxtaposing two shots. However, we also attempt to overcome a number of the shortcomings of Bloch's approach.

6.4 A novel approach to automated video editing

The first assumption reflected in the ensuing discussion is that plot structure provides the logical relationships between characters and their actions within a setting, but that automated editing visually supports the storytelling, which depends on the available material. However, the available material cannot be predicted in advance, and so Bloch is mistaken in assuming that the number of shots necessary for a presentation can be predefined. We are interested in the ability of an automated editing system to react flexibly to requirements for the visual composition of characters, their actions and the surrounding location, by making use of available video material, without being restricted to a predefined number of combinatorial possibilities. Moreover, it is possible that a narrative-based request for visual material cannot be fulfilled. In such cases, the logic of the narration must be altered, which is done not by the editing process, but by the procedures through which the narrative is constructed.

Our second assumption relates to the editing model that was described in section 4.3. We assume that, as the created visual material is experienced linearly, every item introduced, i.e. subject, action or setting, holds true for the sequence in which it is shown, and its parameters serve as the basis for logical relations within the causal chain of the narrative structure. For a character, this means, for example, that introduced features of his or her appearance remain constant until changed by the logic of the narrative.
The effect of this assumption is that the temporal and logical framework of the presentation is based on the narrative world, which may not coincide with the temporal or logical dimensions of the real world. In other words, the presented event is taken to be real, and that which we do not see does not exist. Related to the linear aspect of our second assumption is our third assumption, i.e. that whenever we juxtapose two shots A and B, we merely compare the last frame of shot A with the first frame of shot B, in the sense that we consider only those features of the shot representation that apply to the end of one shot and the beginning of the next.

We now present our scheme for continuity editing. We first describe the plot structure, which provides the basis for the editing process. This structure represents the result of constructive processes performed on narrative structures to be presented later in this thesis (chapter 7). The constructive processes themselves are described in chapter 8. We then present our approach to the control of spatial and graphical relations during the juxtaposition of two shots, with respect to both the shot content and the created awareness of space. Finally, we discuss the temporal and rhythmical constraints applicable in the joining of two shots. It must be stressed, however, that this separation of the constraints on continuity editing into distinct types is done merely to promote ease of presentation. As we will see in chapter 8, each of the different types of constraint applies simultaneously during the relevant editing processes of retrieving, ordering and shaping shots.

6.4.1 Plot requirements

In the editing model presented in chapter 4, the process of scene creation begins with a discussion of the scene with respect to the available material, its intention and its part in the overall story. In technical terms, this means that editing is based on a structured framework, which provides the narrative intentions of the scene and the required events, characters and locations for a particular stage in the event development (plot order was discussed in section 2.1.2). Thus, an event can be constructed from one or more sequences, depending on the event phases involved, i.e. motivation, realisation or resolution. The representational form of a sequence is presented in Figure 6.3.

Sequence-Structure
  Kind        motivation, realisation or resolution
  Intention   relations, details, action, interaction (and/or combinations are possible)
  Form        e.g. H-Strategy X, internal, external, single, composed, first-person, third-person
  Appearance  accelerate, steady, slacken
  Setting     substructure "setting", as described in Table 5.7
  Subjects    a structured set of descriptions for each subject required in the sequence. The description usually contains only the subject ID. However, if a subject is newly introduced, or its appearance changes (e.g. the age of a character), the set is enlarged. Note that this set also contains information concerning the mood of a character.
  Action      the actions performed in relation to subjects in this sequence, represented by the structure [tempform, single action, parallel actions, serial actions].

Figure 6.3 Plot requirements for the editing process

The parameter Kind represents the construction phase of the overall event to be portrayed (i.e. motivation, realisation or resolution, as described in section 2.1.2.1). The parameter Intention represents the main goal of the sequence.
The parameter Form provides information about the chosen H-Strategy, the point of view for the scene (viewer-oriented, i.e. third-person, or character-oriented, i.e. first-person), and how the overall composition is to be arranged (e.g. a single shot or a composed sequence). The parameter Appearance contains information about the visual rhythm of a sequence. A humorous sequence, for example, is likely to be of accelerated speed (as discussed in section 3.2.1.1). The parameters Setting, Subjects and Action are self-explanatory. Setting, subjects and actions provide much of the information needed to support the retrieval of relevant visual material. Following retrieval, a pool of shots is available, which forms the raw material for the editing process. Decisions as to the selection and ordering of shots are case-based, and derived from the actual information provided by Kind, Intention, Form and Appearance. Editing strategies for each control area are applied, if necessary, to provide the most suitable presentation of the narrative request. Thus, we extend Bloch's notion of case-based editing by introducing the consideration of relevant stylistic features into the editing process. The necessary representations for supporting such sequence structures are described in section 7.2, and the mechanisms for sequence construction are presented in chapter 8. For the moment, it is necessary to know only that the editing process can apply this structure to the related pool of shots (i.e. their relevant content descriptions) to order the shots in a cinematographically acceptable way.

6.4.2 Shot intention and the shape of the awareness space

We now investigate the shaping of the viewer's awareness of space, which is based on the logical relationship between camera distance and lens movement for the two shots to be joined. Earlier in this chapter, we noted that this relationship between two shots plays an important role in the ordering and visual presentation of the video material, as it strongly constrains the retrieval of potential shots for juxtaposition; this was not considered by Bloch. Take the establishing shot, as described in section 6.2, as a first example. The establishing shot usually provides a certain frame space, depending on the functionality of the location. This leads to the following editing strategy:

E-Strategy 1
If sequence.kind = motivation
then the camera distance of the shot to be chosen is
     long => location.function = outdoor
     medium long or medium => location.function = indoor

This means that one criterion for choosing an establishing shot from the pool of relevant shots is covered by the constraints described in E-Strategy 1. The above strategy enables us to determine the appropriate camera distance for a single start shot, but we cannot predict how the system should react if the pool of available shots does not provide a single shot which contains exactly the visual information required by the particular narrative situation. The problem is then to select shots which together provide the same establishing effect. To look at the problem from a different angle, assume that an alternative representation of content space is required, i.e. the creation of spatial relations from component parts shown in sequence, as examined by Kuleshov (1974, pp. 52 - 53), which might be applicable for story structures requiring suspense. It becomes clear that the establishing shot, as described in E-Strategy 1, is no more than a useful exception.
In fact, we need a representation that enables us to specify which types of shots can be joined to others. Vertov's classification of camera distances, previously introduced in Table 6.1, is of particular value for the creation of clearly perceptible scenes, since its description of the logical relationships between different camera distances provides many possibilities for shot juxtapositions, as described in Table 6.2.

[Table 6.2 is a 7 x 7 matrix relating the camera distance of shot A (rows) to that of the following shot B (columns), over the distances (1) extreme close-up, (2) close-up, (3) medium close-up, (4) medium, (5) medium long, (6) long and (7) extreme long; each entry is X (acceptable join), O (acceptable join under fall-back constraints) or blank (impossible join).]

Table 6.2 Spatial relationships between shots A and B in terms of camera distance

The information provided in Table 6.2 can be represented as a matrix of the form SDij, for all i, j where 1 ≤ i ≤ 7 and 1 ≤ j ≤ 7, and
   i represents the camera distance of a given shot
   j represents the camera distance in the shot to be joined
   Dij = X if the shots can be acceptably joined
   Dij = O if the shots can be acceptably joined, with the constraints that
        a) no bridge can be created, and
        b) the complexity of the shot with the longer camera distance is low (fall-back rule)
   Dij = ' ' if the shot combination is impossible

Such a matrix serves not only to specify the acceptable direct joining of shots, but also enables the calculation of suitable shot patterns when shots cannot be joined directly. Assume a sequence is required which represents the joke of a man who approaches a freshly painted bench, avoids sitting on it and, in doing so, falls over a litter bin. Assume that the establishing shot of the man walking is a long shot. The next thing to be done, according to the humour strategies described in chapter 3, is to motivate the mishap. A favourable shot type might be a close-up. Locating the relevant field in the matrix SD with a vector Vij, where i represents the shot type of the given shot and j the type of the shot to be joined, it is determined that the direct join is not recommended. Thus, a bridge must be created between the two shots. E-Strategy 2 describes the algorithm for this operation. The bracketed numbers represent an example, where a long shot (6) is to be joined with a close-up (2), resulting in a bridge described as a join between the long shot and a medium shot (4), and a join between the medium shot and the close-up. [2]

E-Strategy 2
If a bridge between two shots must be created                     [6,2]
then fill the vector Vij up so that i..j form a progression       [6,5,4,3,2]
     shorten the new vector V2 depending on the timing_strategy   [6,4,2]
     transform the vector V2 into vectors of the form of Vij      [6,4], [4,2]

[2] The timing_strategy is related to the ongoing narrative. The strategy might be an equivalence or a contraction. In the given example, the strategy is contraction.

The resulting list represents the bridge. However, the required shot types may be unavailable. In such cases, the system can use the fall-back rules, indicated by "O" in Table 6.2. It should be appreciated that this mechanism provides only one decision concerning the applicability of a join of two shots. It may be that the control mechanisms for the continuity of content space, as described later in this chapter, reject the suggested join. If that happens, the system must change the plot structure, so that it can be realised by the available visual material. The representation of spatial relationships between two shots in terms of camera distance supports decisions concerning the next shot type to be joined.
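For concreteness, the bridging procedure of E-Strategy 2 might be sketched as follows (Python is used purely for illustration; the function name, the thinning step and the sample matrix are our own, and the authoritative join entries are those of Table 6.2):

   # Sketch of E-Strategy 2. Camera distances are indexed 1..7 as in
   # Table 6.1; sd[a][b] stands for the matrix entry of Table 6.2.
   def build_bridge(i, j, timing_strategy, sd):
       step = 1 if j > i else -1
       # Fill the vector up so that it forms a progression, e.g. [6,5,4,3,2].
       progression = list(range(i, j + step, step))
       # Shorten the vector depending on the timing strategy; under
       # contraction intermediate steps are thinned out, e.g. [6,4,2].
       if timing_strategy == "contraction":
           thinned = progression[::2]
           if thinned[-1] != j:
               thinned.append(j)        # the target distance is always kept
           progression = thinned
       # Transform the vector into pairwise joins, e.g. [6,4] and [4,2],
       # each of which must be acceptable according to Table 6.2.
       pairs = list(zip(progression, progression[1:]))
       if all(sd[a][b] in ("X", "O") for a, b in pairs):
           return pairs
       return None                      # no bridge available; fall back or replan

   # Illustrative matrix with every join marked acceptable; the real
   # entries are those of Table 6.2.
   SD = {a: {b: "X" for b in range(1, 8)} for a in range(1, 8)}
   print(build_bridge(6, 2, "contraction", SD))   # [(6, 4), (4, 2)]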
In addition, the system is now also in a position to provide zoom-ins (i > j) or zoom-outs (i < j) automatically, in cases where a zoom is stylistically required but cannot be provided by the available material. However, Vertov's classification not only demonstrates logical relations between camera distances, but also hints at how camera distances can aid decisions taken to guide the interest of the viewer. The aim is to encourage the viewer to understand, through visual means, which aspect of an event is important and why. Thus, at this point, we are concerned with the awareness space of the viewer, rather than the presentation of the content space, which is discussed later in this chapter. Imagine that a medium shot showing two actors is to be combined with a shot portraying a particular emotion of one of the actors, where the emotion is of importance. A problem arises when the system must decide between two shots, both of which provide the appropriate content, but which vary in their camera distances (say one is a medium shot, and the other a close-up). The representation discussed thus far would derive no solution, as the joining of either of these shots is defined as acceptable. However, for a human, it is obvious that the close-up more suitably fulfils the requirement, since it highlights the relevant detail through a decrease of visual space. Thus, there is obviously a close relationship between the visual relevance of a scene, which can be determined by the nearness of the viewer to the mise-en-scène, and the narrative functionality of the particular sequence, which is totally dependent on the visual content. Figure 6.4 describes these relationships, according to their influence on the process of generating and interpreting video sequences.

Increase visual space     => first shot space < second shot space    = favour relation            <= Generalisation
Increase visual space     => zoom-out                                = favour relation            <= Generalisation
Descriptive visual space  => (first shot space = second shot space)  = favour action/interaction  <= Description
Decrease visual space     => first shot space > second shot space    = favour detail              <= Specification
Decrease visual space     => zoom-in                                 = favour detail              <= Specification
Decrease visual space     => masking                                 = favour detail              <= Specification
(read from left to right for generation, from right to left for interpretation)

Figure 6.4 Conceptual relationship between the space of visual awareness and narrative functionality

Based on this description of the relationship between camera distance and narrative functionality, an automated system is now in a position to compare the effects on space development of each join, by using editing strategies of the type described next.

E-Strategy 3
If sequence.intention = details
   and camera distance of shot A ≠ extreme close-up
   and camera distance of shot A > camera distance of shot B
then favour this shot

A more suitable strategy arises if shot B contains a zoom-in, as, in this case, the inter-shot transition is much smoother.
E-Strategy 4 if sequence.intention = details and camera distance of shot A ÿ extreme close-up and camera distance shot A start camera distance shot B and lens movement = zoom_in then favour this shot For the above example, it would turn out that a join of two medium shots would not fulfil E-Strategy 4, whereas the join between the medium and close-up would. As a result of this, a book-keeping mechanism is triggered, which is represented in EStrategy 3 and E-Strategy 4 by the "then" clause favour this shot. This mechanism is required because differing control attributes, e.g. Intention, Form and Appearance, with differing constraints, are queried before two shots are juxtaposed. 6: The representation of knowledge for automated editing 126 The "favour shot" book-keeping mechanism adds an applicability value to the evaluation value assigned to each relevant shot. The same evaluation value is used by all strategies applied to assess the applicability of shots for juxtaposition, whether or not they are concerned with the awareness space, the content space or temporal aspects. The evaluation value of a shot becomes important in cases where a choice between a number of applicable shots must be made. Since the presentation should be as faithful to the narrative request as possible, the system can then choose the shot with the highest evaluation value, as described in E-Strategy 5. E-Strategy 5 Compare the evaluation values for all relevant shots and choose the one with Evaluation_value = MAX. If there are several shots fulfilling this constraint, then use the first. It is desirable for a list containing the alternative candidates to be kept, for cases where a re-editing of this particular join is required due to unexpected subsequent plot changes. It must be stressed that relationships similar to the one between camera distance and narrative functionality can easily be established between further particular narrative requirements and other cinematographic devices, though the mechanisms (e.g. the "favour shot" book-keeping) would remain. For examples of such relationships see our description of tonal montage in section 4.2.2.2. By using the representational structures and editing strategies described in this section, an editing system will be able to decide which shot types can be juxtaposed with which others. Moreover, in cases where alternative candidate shots can be identified, inferences can be drawn on the basis of the formal and stylistic suitability of each shot introduced. The spatial relationship between two shots, based on the logical relationship between their respective frame size, is an important feature of continuity editing. However, it is, of course, essential to provide a consistent content space, as discussed next. 6.4.3 Automated establishment and maintenance of content space over several shots Imagine that a sequence structure, as described in Figure 6.3, requires the introduction of a new location featuring a number of interacting characters. 
It must be stressed that relationships similar to the one between camera distance and narrative functionality can easily be established between further particular narrative requirements and other cinematographic devices, though the mechanisms (e.g. the "favour shot" book-keeping) would remain the same. For examples of such relationships, see our description of tonal montage in section 4.2.2.2. By using the representational structures and editing strategies described in this section, an editing system will be able to decide which shot types can be juxtaposed with which others. Moreover, in cases where alternative candidate shots can be identified, inferences can be drawn on the basis of the formal and stylistic suitability of each shot introduced. The spatial relationship between two shots, based on the logical relationship between their respective frame sizes, is an important feature of continuity editing. However, it is, of course, essential to provide a consistent content space, as discussed next.

6.4.3 Automated establishment and maintenance of content space over several shots

Imagine that a sequence structure, as described in Figure 6.3, requires the introduction of a new location featuring a number of interacting characters. Referring to the editing strategy Establishing shot, as described in section 6.2, the establishing shot can be presented as:

E-Strategy 6
If a sequence is to be established where location of shot A ≠ location of shot B,
   or the sequence is the first sequence to be established
then create a memory structure of the spatial relations between all characters of shot B

The memory structure mentioned in E-Strategy 6 represents a hierarchically organised structure for describing the current location and the relationships between subjects in that location, as presented in Figure 6.5.

Location-Memory-Structure
  Start                          shot ID of the start of the location
  End                            shot ID of the end of the location
  List of Stable_position structures, each of which contains
    List_of_content_relations    spatial relations between subjects
    List_of_used_shots           list of shots used for this particular configuration of spatial relations

Figure 6.5 Memory structure for spatial relationships between subjects over a number of shots

The highest level of the Location-Memory-Structure contains the ID of the shot in which the current location first features, and the ID of the last shot in which the current location is found. The structure Stable_position holds details of the spatial relationships between subjects in that location until changes occur. The organisation of the different stable positions is sequential, where the ordering represents the narrative structure as provided by one or several sequences. The List_of_used_shots portrays the acceptable visual representation of events, based on the specified spatial relationships between subjects, as described in the List_of_content_relations. Thus, the List_of_used_shots represents the end result of the editing process. All other structures, i.e. the list of content relations and the top level of the memory structure for a particular location, serve a merely supportive role.
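In a modern notation, the structure of Figure 6.5 might be realised roughly as follows (a sketch under our own naming; the thesis does not prescribe this encoding):

   from dataclasses import dataclass, field
   from typing import List, Optional, Tuple

   @dataclass
   class StablePosition:
       # Spatial relations between subjects, e.g. ("said", "left-of", "gilles").
       content_relations: List[Tuple[str, str, str]] = field(default_factory=list)
       # Shots used for this particular configuration of spatial relations.
       used_shots: List[str] = field(default_factory=list)

   @dataclass
   class LocationMemory:
       start_shot: str                  # shot ID where the location first features
       end_shot: Optional[str] = None   # shot ID of the last shot in the location
       # Sequentially ordered stable positions, mirroring the narrative.
       stable_positions: List[StablePosition] = field(default_factory=list)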
We now return to the decomposition of character relationships. E-Strategy 6 presents an ideal case for the introduction of content space, this being when the narrative request can be satisfied by one shot. However, we have already discussed the possibilities of a composed alternative, either because no shot of the required type is available, or because the marked and abrupt shifts produced by cuts are required to direct attention. Figure 6.6 describes our approach to the decomposition of content space represented by a sequence of shots, which is based on:

• the number of required characters
• the spatial position of characters in a shot
• the spatial relationship between the characters.

[Figure 6.6 lays out, for 1, 2, 3, 4 and n subjects (where n > 4), the number and size of the characters to be portrayed and the spatial relationships involved; groups of more than four subjects are broken up into sub-groups of the smaller kinds, whose configurations are then reused. For a decomposition into one shot, the decomposition is based on the hierarchical knowledge representation of subjects and starts on the parts level; for a decomposition into two or more shots, the content of each shot should present the same hierarchical level within the knowledge representation, e.g. Instance - Instance, or Parts - Parts (the levels being Class, Subclass, Instance, Parts and Subparts). The legend distinguishes shots and cuts, and the types of spatial relations between subjects (line, triangle or half circle, square or circle, cross), both within a shot and created via a cut.]

Figure 6.6 Influence of sequence decomposition on the number and order of shots

The dimmed (red) text in Figure 6.6 outlines our representation of narrative structures, which is described in detail later in this chapter. From Figure 6.6, a number of decompositional editing strategies can be derived. E-Strategy 6 above is an example of the composition of the Location-Memory-Structure; E-Strategies 7, 8 and 9, which follow, are examples of decomposition strategies.

E-Strategy 7
If the relation between characters must be decomposed
then keep the number of characters for shot A as high as possible;
     then create a memory structure for the spatial relationships between the characters in shot A;
     then establish spatial relationships with the characters of the remaining shots, which all become part of the memory structure.

E-Strategy 8
If the number of characters = 2
then these combinations of screen positions of the characters are possible:
     shot A ([right]) with shot B ([middle]) => line
     shot A ([right]) with shot B ([left]) => line
or their permutations.

E-Strategy 9
If the number of characters = 3
   and the camera distance of both shots ≥ medium long
then these combinations of screen positions of the characters are possible:
     shot A ([left | right]) with shot B ([middle]) => circle / triangle
     shot A ([left | middle]) with shot B ([right]) => circle / triangle
     shot A ([middle | right]) with shot B ([left]) => circle / triangle
     shot A ([left]) with shot B ([middle]) with shot C ([left]) => line
or their permutations.

The editing strategies and memory structures introduced so far enable us to model the 180˚ system, though the location may need to be created over several shots. The number, or the positions, of characters may change within a location. Assume, for example, that three characters are sitting at a table. If one of them leaves the room, we can use the already established spatial relationship between the remaining two, but we must ensure that the relationship between the remaining two characters now represents a line. Hence, additional strategies must be introduced to cover cases such as adding or removing characters from a scene, or rearranging the relationship between characters due to their movement. The following strategy is an example of the changes made due to the disappearance of a character.

E-Strategy 10
If sequence = realisation or resolution
   and setting of sequence A = setting of sequence B
   and number of characters in sequence A < number of characters in sequence B
then introduce a new structure of stable position, where the spatial relationship between the remaining characters is downgraded, and the relations between the remaining characters and the removed characters are deleted.

Thus far, we have assumed that the location remains constant. However, this may not be the case. Imagine that the plot establishes a situation where two characters decide to leave a cafe. The current location would no longer be applicable, and a new location would need to be introduced. A further example might be a change in time, e.g. the same location but in a different era, which would also necessitate the re-establishing of spatial relationships.
The relevant editing strategy is as follows:

E-Strategy 11
If setting of shot A ≠ setting of shot B
then assign the shot ID of A to the parameter End of the Location-Memory-Structure,
     and use E-Strategy 6 with the ID of shot B.

We are now in a position to control the spatial relationships between subjects within a location. As stressed before, film is a dynamic medium. The position of an actor at the point of a juxtaposition between two shots is, therefore, part of a temporal scheme, usually based on actions. Thus, the influence of actions on continuity editing must be considered. Of particular interest are cases where there is a change in the awareness space between two shots to be joined, or where the narrative content needs to be decomposed.

6.4.4 The influence of action on continuity editing

If two shots are to be joined and an action is involved then, referring to the sequence structure as presented in Figure 6.3, there are more applicable constraints than the direction of movement suggested by Bloch. The first continuity problem for an action is related to the functionality of the action. Imagine a situation where an action such as tapping the fingers on a table is to be highlighted. Assume a medium shot of a character sitting at a table. Using the strategies introduced so far, an automated editing system could decide which shot type it should choose in order for the action to be highlighted, i.e. a close-up. However, the problem that arises is how to relate the different subjects of the two shots, i.e. the character in the medium shot and the hand in the close-up, since so far the system can only infer the spatial relations between the identified character and the table. Thus, a structure needs to be introduced to enable the system to establish relationships between the different detail and identification levels of subjects, as presented in the shot content. Referring back to Figure 6.6, we are now ready to consider the structure represented by the dimmed text. This knowledge structure describes a subject (character or object) as a hierarchical tree structure, where the root represents the class (e.g. human) and the leaves represent subparts (e.g. finger). Such representational structures are standard, and can be found in Davis (1995), Lenat & Guha (1990) and Parkes (1989a). The inheritance provided by such a structure allows the drawing of inferences based on the relationship between a detail and the whole (e.g. finger and character). Moreover, the different levels of detail provided by the hierarchical representation of subjects, as discussed immediately above, correspond to the content space of the shot types, as described in Table 6.1. This correspondence is particularly useful, since it allows the establishment of relationships between a camera distance (e.g. extreme close-up) and a level in the hierarchical representation of subjects (e.g. the leaves), as presented in Table 6.3. Such relationships enable the system to restructure the presentation of the narrative request at the action level, without changing the required logic. In other words, the introduction of relationships between a filmic device and conceptual structures of the "narrative world" supports a decomposed presentation of actions through continuity editing.
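By way of illustration, such a subject hierarchy, and the detail-to-whole inference it licenses, might be sketched as follows (the class and part names are mere examples, not data from the thesis):

   # Sketch of a hierarchical subject representation; each level can be
   # related to a camera distance, as laid out in Table 6.3 below.
   SUBJECT = {
       "class": "human",
       "instance": "character-1",
       "parts": ["head", "hand", "foot"],
       "subparts": {"hand": ["finger"], "head": ["eye", "mouth"]},
   }

   def whole_for_detail(subject, detail):
       # Infer the part to which a detail belongs (e.g. finger -> hand),
       # which in turn relates a close-up of the detail to the character
       # shown at a larger camera distance.
       for part, subparts in subject["subparts"].items():
           if detail in subparts:
               return part
       return subject["instance"]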
Value of camera device    Level of content detailness
extreme close-up          object: subparts [form, colour]
close-up                  character: subparts, one detail of either head, hand or feet; object: Instance shape
medium close-up           character: Instance, or parts of either Head+Id, Hand or Foot; object: Instance shape
medium                    character: Instance Appearance, Head; object: Instance shape
medium long               character: Instance Appearance; setting: Time, Location, Lighting; object: Instance shape
long, extreme long        setting: Time, Location, Lighting

Table 6.3 Relationship between camera distance and the hierarchical representation level of subjects

Thus, the system is now also in a position to create the action of a particular character, even if the required visual material is not directly available. Consider the above example of the character tapping his or her fingers on a table, but assume this time that there is no request for highlighting and no reference to fingers. Now suppose that there is no single shot showing the particular character doing the tapping. The system is still able to create the required action, by splitting the character and the action into separate sequences, where one presents the character and the second the action, as described below in E-Strategy 12. It must be stressed that E-Strategy 12 represents a case where the previous shot showed several actors, so that the system has to focus on the relevant actor first. Thus, it becomes apparent that the knowledge structures provided enable the system to react more flexibly than would be possible with only a two-shot solution.

E-Strategy 12
If an action for a character is required
   and there is no shot available to portray that action
then isolate the character in a shot;
     retrieve the body part related to the action;
     retrieve a suitable shot where a body part performs the required action;
     build a bridge into or out of this sequence if necessary;
     continue with the sequence which was interrupted by this subsequence.

E-Strategy 12 guides us directly into a further spatial continuity problem concerning actions and the juxtaposition of shots. When discussing the representation of shot content earlier, in chapter 5, we argued that some actions are static in nature, meaning that the subject is not moving significantly within the content space (e.g. gestures), and some are moving, meaning that the subject is changing position within the content space. If a subject only ever performed one action per sequence, we could ignore this distinction. However, the sequence structure introduced in Figure 6.3 makes it clear that this is not necessarily the case. Each sequence provides a number of options concerning the actions performed by each character, such as:

single action     The narrative requires that a character perform only this particular action, e.g. to walk. Since a single action can be either moving or static, there are no compositional problems here.

parallel actions  The narrative requires that a number of actions be performed at the same time (e.g. scratching one's head while drinking coffee). As described in sections 2.1.2.1 and 5.2.3, such a combination of several actions defines other functional elements, such as the intentions or moods of the characters. Here, the order of actions is of no importance as long as the impression of simultaneity is provided.

serial actions    The narrative requires that a number of actions be performed in a particular logical order, e.g.
the events involved when a character obtains coffee from a coffee machine. Here, the proper order of events should be maintained.

The combination of static and moving actions presents a problem only in cases where the automated system must decompose several actions into a chain of single actions, but should still provide the impression of simultaneity. E-Strategy 12 shows that a sequence can be decomposed into sub-sequences in cases where a particular visual event must be created. We use the same mechanism for the decomposition of actions performed in parallel, and introduce the following strategies:

E-Strategy 13
If several actions for a character are required
   and the actions should be performed simultaneously
then use the actions of type moving first;
     insert the actions of type static, which should be presented in the same shot type;
     finish with the actions of type moving;
     build a bridge into or out of this sequence if necessary;
     continue with the sequence which was interrupted by this subsequence.

E-Strategy 14
If several actions for a character are required
   and the actions should be performed simultaneously
   and there are no actions of type moving
then order the actions of type static, which should be of the same shot type, sequentially;
     build a bridge into or out of this sequence if necessary;
     continue with the sequence which was interrupted by this subsequence.

A similar approach can be applied in cases where a particular single action is required, but the retrieval process can provide only one shot in which several actions are performed simultaneously, among which the required action is but one. In this case, the difference from the above strategies is that the required action should be highlighted, which means that all other actions are to be ignored. The decomposition of serial actions can be achieved by using strategies of the type of E-Strategy 11.

Actions imply an actor, which leads to the fourth problem concerning actions and the juxtaposition of shots. Earlier, we demonstrated that the 180˚ principle supports the continuity system, so that a smoothly flowing space for the narrative action can be achieved. However, most actions are performed for a reason, often to provoke a reaction, or as a reaction. Editing techniques such as the shot/reverse shot, or the POV shot, embody editing principles for presenting an action as an action or as a reaction. These techniques are related to the point of view of a particular character in the film. The viewer, on the other hand, is an onlooker, and does not usually see things from the point of view of a particular character. It is therefore essential for an automatic editing system to distinguish between these two cases, so that the appropriate editing technique, with its related spatial constraints for the visual presentation, can be chosen. Referring to the sequence structure, as described in Figure 6.3, this distinction of point of view is provided by the structural element Form. If a sequence focuses on a presentation assuming the viewer as an external observer, the form will be external. If, on the other hand, the presentation requires that the viewer's perspective coincide with that of one of the characters, the form will be internal. It must be stressed, however, that though we are aware of the importance of this feature, our approach to it is but a first step towards a true representation. At the current stage of our research, we have dealt with only the external type of presentation.
Though we will subsequently refer to the POV shot, we do not refer to the merging of the point of view of the viewer and the portrayed character. Rather, we intend the POV shot to describe a reaction to the action "look". Since the system is now in a position to distinguish particular styles of presentation, a set of new editing strategies can be introduced to specify the spatial constraints on actions, based on their motivation and the number of characters involved. These strategies represent a number of editing techniques; the following examples are the shot/reverse shot (E-Strategies 15 and 16), the eyeline match (E-Strategy 17) and the eyeline match for a moving character, i.e. the POV shot (E-Strategy 18).

E-Strategy 15
If sequence.kind is realisation or resolution
   and sequence.intention = interaction
   and number of actors = 2
   and shot A contains both characters
   and the actions are of type static
   and camera angle of shot A is overhead
then camera distance of shot A = camera distance of shot B
   and character relation of shot A opposite to character relation of shot B
   and camera angle of shot B = overhead
   => favour this shot as shot B

E-Strategy 16
If sequence.kind is realisation or resolution
   and sequence.intention = interaction
   and number of actors = 2
   and shot A contains both characters
   and the actions are of type static
   and camera angle of shot A is overhead
   and the available shot B contains only the character for the reaction
then camera distance of shot A ≥ camera distance of shot B
   and camera position of shot B is middle
   and camera angle of shot B = Head.line_of_sight of the acting character in shot A
   => favour this shot as shot B

E-Strategy 17
If sequence.intention = relation
   and number of actors = 2
   and shot A contains 1 character
then camera distance of shot A = camera distance of shot B
   and line of sight of the character in shot A = camera position in shot B
   and camera angle of shot A = camera angle of shot B
   => favour this shot as shot B

E-Strategy 18
If sequence.intention = relation and action and interaction
   and number of actors = 1
   and the action is of type moving
then camera distance of shot A ≥ camera distance of shot B
   and line of sight of the character in shot A = camera position in shot B
   and camera angle of shot A = camera angle of shot B
   and direction of action in shot A = camera movement in shot B
   => favour this shot as shot B

The number of editing rules such as E-Strategies 15 - 18 is limited, as the number of characters involved is assumed to be no more than 3. The strategies and representational structures introduced so far would enable an automated editing system to react flexibly to narrative demands by providing visual presentations that show a constant content space with respect to the position, sight and direction of movement of subjects. The remaining problem in juxtaposing two shots, while preserving continuity of content, is the comparison of the surrounding space of the characters in both shots. This is discussed in the next section.

6.4.5 The comparison of surrounding content space and graphical pattern

Though the above structures and strategies provide a constant content space over various shot juxtapositions, a problem that remains is to compare the overall appearance of two shots to be joined, with respect to the similarity of the surrounding space and the overall graphical pattern. Earlier in this chapter, we discussed the influence of different camera distances on the shape of the awareness space of the viewer. We also mentioned that the comparison between shots is based on the last frame of shot A and the first frame of shot B.
These two schemes form the basis for our comparison mechanisms. Camera distances are either identical for two shots, or they differ, i.e. the camera distance in shot A is larger or smaller than that of shot B. Furthermore, in cases of a change of location, comparisons need only be performed on the IDs of the relevant subjects appearing in both shots. However, it is also advisable to compare the characters' screen positions (to keep the viewer's attention on the same spot) and the directions of actions, to fulfil the viewer's expectations of continuity (action match). These mechanisms are described in detail by Bloch (1986). A different situation applies if the location remains constant over two shots. Then it becomes necessary to compare the relationships between subjects and their positions in the setting. However, it is not always essential to compare every single relationship, due to the difference in camera distances. For example, if the order of the shots is a medium shot followed by a long shot, it is advisable that all the elements within the medium shot are represented in the long shot. This is not the case if the long shot is followed by the medium shot. With reference to the above, a new set of E-Strategies can be introduced. Again, not all possible strategies are expressed, but rather examples of the different types of strategy.

E-Strategy 19
If camera distance of shot A = camera distance of shot B
   and location of shot A ≠ location of shot B
   and subject(s) of shot A appear in shot B
then position of subject(s) in shot A = position of subject(s) in shot B
   and action direction of shot A = action direction of shot B
   => favour this shot as shot B

E-Strategy 20
If camera distance of shot A = camera distance of shot B
   and location of shot A ≠ location of shot B
   and character(s) of shot A do not appear in shot B
then action direction of shot A = camera direction of shot B
   => favour this shot as shot B

E-Strategy 21
If camera distance of shot B = medium close-up
   and location of shot A = location of shot B
   and sequence.intention is interactive
   and number of characters in shot B = 1
   and action of the character in shot B is of type static
then verify that all objects of the Location-Memory-Structure located around the character of shot B appear in shot B
   => favour this shot as shot B

E-Strategy 22
If camera distance of shot A is long
   and camera distance of shot B is medium long
   and location of shot A = location of shot B
   and character(s) of shot A appear in shot B
   and character action in shot A is of type moving
then verify that the objects that are in front of the character(s)' action direction in shot A appear behind the character(s)' action direction in shot B
   => favour this shot as shot B

E-Strategy 23
If camera distance of shot A is medium
   and camera distance of shot B is close-up
   and location of shot A = location of shot B
   and the subject of shot A appears in shot B
   and the action in shot A is of type static
then verify that the background of shot A = background of shot B
   => favour this shot as shot B

E-Strategy 24
If camera distance of shot A < camera distance of shot B
   and location of shot A = location of shot B
   and character(s) of shot A appear in shot B
   and character action in shot A is of type moving
   and position of character(s) in shot B is opposite to the position of character(s) in shot A
then no content comparison is necessary.
The need to individually specify the comparison mechanisms for different camera distances is unavoidable in cases where the camera distance is decreased, as the different levels of detail provided by different types of shots require particular mechanisms for comparing content. In cases where the camera distance between shots increases and the actions involve movement (E-Strategy 24), no comparison of content is necessary, since the action itself is sufficiently powerful to provide continuity. In such cases, however, exactly the right direction of movement and position of characters must be maintained.

Related to the comparisons made by E-Strategies 19 - 24 is the evaluation of the general graphical pattern (e.g. lighting, colours, composition). The link between two shots on the basis of patterns can be motivated by two requirements: first, to achieve contrast, and second, to establish a similarity. It may initially seem strange to assume that contrast can contribute to a smooth narrative flow. Within a particular sequence of a film it might be necessary to emphasise the spatial opposition of two characters by showing each of them against different light and colour backgrounds. However, it is usually similarities in graphical pattern which serve as a continuity control. It is not necessary for the match between colours, compositional features or directions of movement (camera and subject) to be identical across the cut, but the precision should be appropriate. An example of an editing strategy related to such problems is described below.

E-Strategy 25
If setting.location of shot A = setting.location of shot B
   and setting.time of shot A = setting.time of shot B
   and setting.lighting of shot A = setting.lighting of shot B
then shot granularity of shot A = shot granularity of shot B
   and shot colour of shot A = shot colour of shot B
   and shot contrast of shot A = shot contrast of shot B

The knowledge structures and their analysis mechanisms presented so far provide an automated editing system with the ability to combine shots in ways that satisfy the required narrative logic, even if the exact material required is not available. The ability to shape the viewer's awareness space enables the system to provide a kind of rhythmical structure to the sequence, even though, at the current stage of the discussion, this structure is solely content oriented. The remaining problem to be discussed is how this rhythm can be controlled in its temporal appearance. This issue is discussed in the next section.

6.4.6 Temporal and rhythmical relations between shot A and shot B

In chapter 2, we defined the three dimensions through which plot time leads the viewer to construct the story time, i.e. the order, frequency and duration of actions. The preceding sections of this chapter discussed the implications of the ordering of actions for the creation of visual space. Furthermore, we stated that continuity editing merely supports the intelligent sequencing and orchestration of the narrative chain of causality by providing appropriate visual material. Changes in the logic of the plot are a feature of story generation. The principal ordering mechanism for continuity editing is sequential. However, E-Strategies 13 and 14 show how editing can intelligently order shots of actions sequentially, and yet provide the impression of simultaneity.
A common violation of the temporal order of events found in film is the flashback or the flashforward, which are signalled either by a cut or by cinematographic devices such as dissolves. Since we exclude the use of dissolves at the current stage of our research, our approach supports only the cut, and we have shown above how the system can react to environmental changes (see E-Strategies 6 and 10). Though these mechanisms describe spatial changes, it would not be difficult, by simply focusing on the temporal attributes of a setting, i.e. epoch, season, daytime, to provide similar strategies for the appropriate visual presentation of a temporal change. Moreover, if emphasis is placed not on the creation, but rather on the interpretation of a sequence, we can imagine how such strategies for the detection of temporal differences may be used to interpret a cut between two shots as a temporal switch. Such results can then be used to evaluate, and thus prefer, certain cuts, to establish the appropriate presentation of a flashback or flashforward.

The second feature controlling the temporal shape of a sequence is the frequency of actions or events. While describing the relationship between events in section 2.1.2.1, we showed that repetition and heaping are common ways of intensifying significant information. Again, we face a problem similar to that of the control of temporal ordering: it is the plot structure which requires a repetition. However, having included the substructure Form in the sequence structure (see Figure 6.3), we are in a position to outline how an automated editing system can cope with the structural repetition of an event. Due to the sequential order of shots in the Location-Memory-Structure, it is possible to detect the sequence to be repeated, and along with it, all of the stylistic elements used for its presentation (e.g. camera distances, camera angles, colours, etc.). These stylistic and structural features can now either be copied or used for a similar scene with slight structural differences. [3]

[3] An example of such a relationship between events is the film Groundhog Day, by Harold Ramis, where the same day is repeated again and again, but with slight changes in the event structure.

The remaining feature of the temporal control of actions or events, i.e. duration, is more relevant to shaping the established order of a sequence. The shaping of duration by the editing process can be applied to the actual length of a shot, which exerts an influence on the content, and to the overall appearance of the whole sequence, which, in turn, influences the temporal rhythm. Both features are obviously correlated, since both are related to the shot length. We now discuss editing strategies for automated physical clipping that support the narrative flow of information with respect to temporal structure, paying particular attention to the transformational classes related to duration, i.e. expansion, equivalence and contraction (as discussed in section 2.1.3). Furthermore, we discuss the problem of creating an overall appearance for a sequence.

6.4.6.1 Preliminary remarks

Referring to the sequence structure (see Figure 6.3), we can identify two structures that hold information relevant to making decisions on the temporal portrayal of an action or event.
One is the sequence substructure Action, where the attribute tempform indicates whether the requested order of actions represents a screen time greater than the story time (expansion), a screen time equal to the story time (equivalence) or a screen time shorter than the story time (contraction). The second temporal indicator is provided by the sequence substructure Appearance, which provides the editing process with information concerning the overall rhythmical appearance of the sequence, i.e. a steady pace (shots are approximately of the same length), a dynamically slowing pace (steadily lengthening shots), or a dynamically accelerating pace (steadily shortening shots).

The problem with these temporal control attributes is that they are not only responsible for shaping the duration of the shot, but will already have influenced the previous steps, i.e. retrieval and ordering. Take the expansion of an action as an example. An expansion requires that an action be decomposed for presentation purposes, which obviously influences the retrieval and order of shots. However, if the required decomposition cannot be achieved, the related temporal clipping cannot be performed, and must be replaced with the now appropriate temporal control mechanisms, say those related to equivalence. As we see editing as a task which supports the automated generation of meaningful film sequences, our aim is to fulfil the content request, based on the temporal requirement if possible, and to perform the clipping necessary for the temporal requirement once a sequence is completely specified, and then only if the required material can be provided. The necessary information concerning the shots used for the presentation of an action can be retrieved from the Location-Memory-Structure.

The above indicates that there are two stages to the editing: firstly, the retrieval and ordering of shots and, secondly, the shaping of shots and the overall appearance of the scene. This corresponds to the description of our editing model in section 4.3. All forms of temporal clipping discussed in the following sections conform to this scheme.

6.4.6.2 Temporal clipping for action expansion

The effect of expansion is usually valid for a single action and can be achieved by repetition. The repetition is performed either through the use of an action match, or through the overlapping portrayal of the same event in shots taken from different camera positions. Since we do not support overlapping, we discuss only the former case. Figure 5.5 shows how our representation of video content, based on time intervals, supports the isolation of particular actions within a shot. The expansion of an action uses the same mechanism of temporal clipping, which means that an action is presented twice, where the repetition is either shown from a different camera position or as a detailed continuation, i.e. a zoom-in. Since it is confusing to show the same action twice, it is necessary to shorten the first presentation, which can be achieved through temporal clipping, i.e. a certain number of frames must be removed from the first shot. Two additional strategies are required to expand an action through automated editing, where E-Strategy 26 represents the constraints for retrieval and ordering, and E-Strategy 27 describes the temporal clipping.
E-Strategy 26
If sequence.action.tempform = expansion
   and the action is a single action
then favour decomposed forms of presentation where
     the camera distance of shot A > camera distance of shot B
     or the camera position of shot A ≠ camera position of shot B

E-Strategy 27
If sequence.action.tempform = expansion
   and the action is portrayed decomposed
then clip the last half of shot A
     and show the complete action in shot B.

6.4.6.3 Temporal clipping for the temporal equivalence of actions

A temporal equivalence between the progression of the narrative and the presentation time actually means that every action by characters or objects is shown without gaps. For editing, this means that the actions are shown in the required order and in full length. The standard indicator for temporal continuity based on visual presentation is the action match, as already described above. [4]

[4] The reader is reminded that we exclude sound, which is the other indicator of temporal continuity between shots.

The relevant strategy is as follows:

E-Strategy 28
If sequence.action.tempform = equivalence
then order parallel or serial actions as required
     and favour strategies based on the action match.

The problem which may arise here relates to our content representation of actions, which is based on a temporal-symbolic description. This information states only at which frame a certain action begins and at which frame it ends. Naturally, a system can calculate how long that performance takes (24 frames = 1 sec). However, provided with the sequence structure and the content description, the system is not yet in a position to detect how long the presentation should be. This problem is addressed later, when we consider physical cutting.

6.4.6.4 Temporal clipping for action contraction

As described in section 2.1.3, temporal contraction is a form of ellipsis. For editing, this means that the time taken by the portrayal will be less than that suggested by the story. For the most part, narrative-oriented ellipsis is provided in advance by the sequence. Imagine the following situations:

• A sequence shows a character wash his hands, button his shirt, drink coffee and leave a house. The unwanted time is eliminated here by the content. The appropriate juxtaposition of shots can be achieved by the strategies introduced above.

• A sequence shows stages in the career of a writer. The unwanted time is eliminated here by the content, and the appropriate juxtaposition of shots can again be managed by the above strategies.

• There are some clichéd presentations of time elimination, e.g. calendar leaves fluttering away, the shadow of a static object moving along a wall, clouds moving, clocks ticking, etc., which also represent content-based elimination of time.

All the above sequences embody the elimination of a considerable period of time. However, if the aim is to reduce a short period of time for a single action, then physical, elliptical editing can support the presentation. Imagine a sequence where a character must go from one location to another. Without using dissolves, there are two major ways of eliminating time:

• insert an event into the action which shows something different, and then return to the previous event. For the above example, this may be portrayed by showing the character walking, then a door, and then the character already doing something in the flat.
• show the character leave the frame, hold on the empty location, cut to the empty location of the new frame, and let the character enter. As an example, imagine a shot where the character begins to climb a flight of stairs. The shot ends with the character disappearing and a few frames of the empty stairs. The next shot first shows the staircase on a higher level, and then the character coming up.

The first example is again content related and can be covered by the editing strategies introduced thus far. The second of the above examples, however, requires the introduction of a new set of editing strategies, where E-Strategy 29 represents the constraint elements for the retrieval and ordering process, and E-Strategy 30 describes the temporal clipping.

E-Strategy 29
If sequence.action.tempform = contraction
   and the action is a single action
then favour decomposed forms of presentation where
     the camera distance of shot A = camera distance of shot B

E-Strategy 30
If sequence.action.tempform = contraction
   and the action is portrayed decomposed
   and the character leaves the frame
then clip not later than 24 frames after the first frame without the character in shot A
   and clip not later than 24 frames before the first frame where the character is in shot B.

6.4.6.5 Rhythmical shaping of a sequence

So far, we have examined various aspects of physical editing to support the visual presentation of temporal continuity. There are two more representational requirements of temporal clipping, both of which usually support the meaning of a video sequence. The first is related to the length of a particular shot. When we introduced our scheme for shaping the viewer's awareness space, we explained in detail the relationship between camera distance and shot content. A similar relationship applies between the camera distance and the viewer's ability to take in the information provided by a shot. It is possible to take in the image of a close-up shot in a relatively short time (approximately 2 - 3 seconds), whereas the full perception of a long shot requires more time. [5] Moreover, we explained above that the composition of shots may vary in the number of subjects, the number and speed of actions, and so on, which also influences the time taken to perceive the image in its entirety. Finally, the stage of a sequence in which a shot features also influences the time taken to perceive the entire image. For example, a long shot used in the motivation phase takes longer to appreciate, since the location and subjects need to be recognised, whereas in the resolution phase the same shot type may be shorter in duration, since in this case the viewer can orient himself or herself much more quickly.

[5] The time values used in all the following examples of editing strategies are based on estimates provided by the editors at the WDR.

Based on the preceding discussion, the following editing strategies can be derived. These apply, like the previous strategies concerning temporal cutting, during the second phase of the editing process:

E-Strategy 31
If camera distance of a shot ≤ close-up
then clip it to a length ≤ 60 frames.

E-Strategy 32
If close-up < camera distance of a shot < long
   and sequence.kind = motivation
then clip it to a length ≤ 108 frames.

E-Strategy 33
If close-up < camera distance of a shot < long
   and sequence.kind = motivation
   and the number of characters is > 2
then clip it to a length ≤ 12 * (number of characters - 2) + 108 frames.

E-Strategy 34
If camera distance of a shot > medium long
   and sequence.kind ≠ realisation or resolution
then clip it to a length ≤ 136 frames.

Strategies such as E-Strategies 31 - 34 indicate the need to trim a shot.
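Taken together, E-Strategies 31 - 34 amount to a small lookup from shot type, sequence kind and character count to a maximal shot length, which might be sketched as follows (the frame counts are those given above; the function name, the use of the Table 6.1 distance indices, and our reading of the damaged condition of E-Strategy 34 as sequence.kind ≠ realisation or resolution are assumptions of ours):

   # Sketch of E-Strategies 31-34: an upper bound, in frames, on the
   # length of a shot (24 frames = 1 second). Distances are indexed as
   # in Table 6.1: 1 = extreme close-up ... 7 = extreme long.
   def max_shot_length(distance, kind, characters=1):
       if distance <= 2:                           # E-Strategy 31
           return 60
       if 3 <= distance <= 5 and kind == "motivation":
           if characters > 2:                      # E-Strategy 33
               return 12 * (characters - 2) + 108
           return 108                              # E-Strategy 32
       if distance > 5 and kind not in ("realisation", "resolution"):
           return 136                              # E-Strategy 34
       return None          # no perception-based bound applies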
However, referring to the discussion of temporal equivalence, the above strategies may cause problems, as the actions portrayed may simply require more time than the rules suggest. Imagine, for example, a long shot during the motivation phase, which shows a character walking down a path, picking a flower, continuing to walk, and finally sitting down on a bench. If all of these actions are required by the narrative, it is not possible to trim the shot exactly to a length of, say, 120 frames, since we would then lose some of the actions. However, it is still possible to clip a number of frames from the beginning or the end of the shot. In Figure 5.4, we demonstrated how actions can overlap. In Figure 5.5, we showed how the start and end frame of an action can be used to fulfil a particular narrative request for actions, even if larger parts of the shot must be removed. The same mechanism can be used to trim a shot. The necessary steps are to identify the startframe of the first relevant action, and to detect any overlap with the second relevant action. If such an overlap exists, then it is possible to cut away the section of the shot in which the first action is performed in isolation. The same mechanism applies to the end of the shot. It is, of course, important that the established spatial and temporal continuity between the shot and its predecessor and successor remains valid. Figure 6.7 describes the application of edge trimming for a shot of 140 frames. The shot should portray an actor walking and then sitting, but should not be longer than 108 frames.

Figure 6.7 Trimming of a shot from 140 to 108 frames (the shot of frames 0 - 140 carries the action annotations eat, walk, sit and talk)

E-Strategies 35 and 36 represent examples of temporal clipping as discussed immediately above. E-Strategy 35 focuses on the elimination of frames from shots that feature close camera work, and are involved in an action match.

E-Strategy 35
If camera distance of a shot = extreme close-up or close-up
and the type of action is moving
and the number of frames in the shot is > 60
then start at the last frame of the shot,
count 60 frames backwards,
and cut the remaining frames back to the startframe of the shot.

E-Strategy 36 focuses on frame elimination from a single shot that sequentially portrays a number of actions.

E-Strategy 36
If close-up < camera distance < long
and sequence.action.tempform = equivalence
and number of frames > the length calculated in E-Strategy 32 or 33
and number of performed actions in the shot > 3
then verify the frame overlap of the first and second actions as X,
verify the frame overlap of the last and last-but-one actions as Y,
cut away the frames in which the first action is performed in isolation if X > 36,
and cut away the frames in which the last action is performed in isolation if Y > 36.

Thus, a system is now in a position to adjust the length of a shot to provide an appropriate perception time, without removing essential narrative elements. This means that the automated editing process, as described thus far, shapes the appearance of a sequence into a temporal rhythm which is merely content related. The remaining problem to be addressed is the shaping of the overall temporal appearance of a sequence with respect to a steady or dynamic pace, i.e. shaping the stylistic intention of a sequence by means of temporal clipping.
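The edge-trimming mechanism of E-Strategy 36 can be summarised in a short sketch. The interval layout is our own illustrative assumption: each annotated action is taken to be a (name, startframe, endframe) triple, ordered by start frame, and solo frames of the first or last action are only removed if the overlap with the neighbouring action exceeds the threshold.

    # A minimal sketch of edge trimming (assumed layout: (name, start, end)).
    def trim_edges(actions, min_overlap=36):
        """Cut away frames in which the first or last annotated action runs in
        isolation, provided it overlaps its neighbour by more than min_overlap
        frames, so that the action remains recognisable in the overlap."""
        first, second = actions[0], actions[1]
        before_last, last = actions[-2], actions[-1]
        start, end = first[1], last[2]
        if first[2] - second[1] > min_overlap:       # X: overlap of first two actions
            start = second[1]                        # drop the first action's solo frames
        if before_last[2] - last[1] > min_overlap:   # Y: overlap of last two actions
            end = before_last[2]                     # drop the last action's solo frames
        return start, end

    # A Figure 6.7-style example (hypothetical frame values):
    shot = [("eat", 0, 60), ("walk", 20, 110), ("sit", 80, 128), ("talk", 90, 140)]
    print(trim_edges(shot))   # -> (20, 128): 108 of the original 140 frames remain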
To provide the system with mechanisms that can achieve the above task, it is necessary to adjust the shot length designed for the temporal perception of a shot to the pattern required for a dynamic slowing or accelerating of pace. However, at the current stage of our research, we merely state that this should be done, as we are not yet in a position to provide a usable scale to direct the slowing or accelerating of the pace. It is therefore not possible to describe the temporal influence of these mechanisms on the shaping of single shots. For a more complete scheme for the temporal shaping of sequences, further research is required.

6.4.7 Conclusion

Based on the assumption that the editing process supports narrative through appropriate visual and cinematographic presentation, we have introduced a novel scheme for automated video editing. In this, we were influenced by structures introduced by Bloch (1986) and Parkes (1989). We have shown how cinematic constraints for one editing style (continuity editing) can automatically be applied to narrative structure, by supporting the essential narrative aspects of context and form (see Figure 2.4). We have introduced a method of relating the intention of a narrative sequence to the shot presentation, and demonstrated how the knowledge structures and analysis mechanisms introduced can provide an automated editing system with the ability to visually guide the viewer of a video sequence through a narrative, so that he or she is in a position to classify information as relevant or purely descriptive. Furthermore, we introduced a scheme for the establishing and maintaining of content space over several shots, based partly on strategies for the visual decomposition of a character and/or related actions, and partly on the decomposition of spatial relationships between characters and other characters, and between characters and their screen position. Using this scheme, we showed how an automated editing system can react flexibly to narrative requests, by making use of the available visual material, even if such material does not directly represent the narrative specification. In addition, we briefly discussed how the same mechanisms support not only the creation, but also the interpretation, of film sequences. Moreover, we presented a novel approach for comparing the content space of different shots, which is important for providing an overall spatial continuity within a sequence of juxtaposed shots. Finally, we demonstrated how a two-stage editing process, as presented in section 4.3, provides the ability to shape the visual presentation of a plot sequence by means of physical clipping, so that the temporal rhythm of the sequence supports the narrative intention.

However, the above approach is but a small step towards intelligent automated editing. The majority of the strategies can be refined in various ways, and, as outlined in the previous sections, there are a number of problems yet to be solved, e.g. the influence of the introduction of additional cinematic codes on the plot realisation, as described in Figure 6.3. Other problems relate to the temporal aspects of automated editing. Earlier, we discussed the two-way relationship between the overall rhythmical structure of a sequence and the temporal shape of single shots, but we did not adequately address the influence of the speed of an action on the rhythmical structure.
Finally, it is yet to be determined how successful the editing scheme would be in producing higher order narrative structures, i.e. episodes. Despite these shortcomings, the scheme presented is a small step towards capturing the conflicting relationship between metric, rhythmic and tonal montage as described by Eisenstein.

Chapter VII
The representation of narrative and thematic knowledge

The two preceding chapters discussed how film material and techniques can be represented so that they can be used in the cinematic portrayal of narrative events. We showed how various denotative aspects of video content can be described and thus made accessible. We presented particular mechanisms for the manipulation and physical shaping of video, and showed how these mechanisms can support the spatial and temporal credibility of the assembled video material.

The editing model in section 4.3, and the analysis of the story generation process described in chapters 2 and 3, make it clear that the development of a simplified, linearly structured narrative is part of the production stage described as post-production. This means that the linear flow of the plot, i.e. the introduction of characters and settings, and the development of the plot through various situations and causally linked events, exists before the film itself can be created. Referring to Figure 2.4, it can be stated that the description of the plot is designed in such a way that it supports the actual portrayal of the narrative, i.e. in the cases considered by this thesis, the plot must be designed in the form of a screenplay (as discussed in section 6.4.1). The major components from which a narrative is built (previously specified in Figure 2.4) are events (actions and happenings), physical entities (characters and settings) and cultural codes. Thus, a narrative depends on knowledge of the world, which is processed and arranged in a particular way to provoke perceptual activity within the audience. A key requirement for the effective automatic generation of a meaningful and emotionally stimulating portrayal of narrative events is the suspension of the viewer's disbelief (see also sections 2.1.2 and 3.2.1). That is, the viewer must be able to imagine that the logical relationships between characters and their actions within a setting are real.

The aim of this chapter is to present an ontological representation describing the physical world and abstract mental and cultural concepts that can support an intelligent system in creating credible narrative sequences. The intention is not to equip the system with extensive common sense knowledge. Rather, we intend to create an intelligent system that generates film sequences that are credible for the narrative world they present. Referring to the genre discussion in chapter 2, it is justified, we believe, to argue that the narrated world is a stereotypical world, which only resembles the world as human beings know it. Thus, a shallow level of knowledge of intentions, emotions and human activity, the physics of human beings and the physics of the micro-world in which the characters act will be sufficient for our purposes.1 Furthermore, the underlying representations and semantics of the "narrative world knowledge" must be related to the cinematographic structures and mechanisms introduced earlier, so that the translation between one representation and the other does not result in the loss of salient features of either representation.
The theoretical background for the ensuing discussion has two main features. Firstly, we use the results of our discussion, in section 4.2.1.2, of the relationship between the iconic sign and the idea it represents according to the creator's intention. In that discussion, we argued that differing connotations can be attributed to the sign, depending on the circumstances and abductive presuppositions of the receiver at the time of perception, along with the various legitimated codes and subcodes the receiver uses as interpretational channels. At the same time, we introduced the paradigmatic and syntagmatic axes of meaning (Figure 4.1), Peirce's trichotomy of a sign as being either symbolic, iconic or indexical (Peirce, 1960), and Eco's structural analysis of the cinematic image (Eco, 1977; Eco, 1985), together with his classification of the underlying code system for the triple articulation of an image. As the organisational structure for the above systems of cultural units we identified semantic fields, as described by Bordwell (1989, p. 106): '...a conceptual structure which organises potential meanings in relation to others'.2 As discussed in section 4.2.1.2, a semantic field can be constructed according to various principles, i.e. clusters, doublets, proportional series and hierarchies. These structural elements form one part of the ontology.

1 For similar approaches to the representation of credible animated characters in interactive virtual worlds see Bates (1992, 1994), Bates et al. (1992), Hayes-Roth (1995), Hayes-Roth et al. (1995), and Hayes-Roth et al. (1994).
2 See also Eco (1977, pp. 73 - 150).

The second theoretical influence on our approach to the representation of narrative knowledge arises from the discussions of narrative in chapter 2 and humour in chapter 3, which provide a basis for our approaches to temporality and causality. Before presenting our techniques for representing the physical characteristics of human beings and micro-worlds, and of action and event structures, including the representation of human intentions and emotions, we first discuss approaches to knowledge representation that influence our own approach.

7.1 Approaches to knowledge representation

7.1.1 Quillian and semantic networks

The notion of using semantic relations to provide background knowledge about the world was introduced by Quillian (1966; 1985).3 He presented a memory model that consists mainly of a large number of nodes and tokens interconnected by different types of associative links. In Quillian's model, a node is defined as an English noun, where the associative link refers directly to a configuration of other nodes that represent the meaning of the noun. A token, on the other hand, refers indirectly to another word concept, by having one special type of associative link that points to a concept's type node. By combining these individual semantic relations, Quillian's memory model establishes a network of semantic relations referring to a common word, which enables the system to draw inferences from word pairs such as "plant" and "man", represented by statements such as "A plant is not an animal structure." (Quillian, 1966, p. 253).

3 See also Woods (1985).

7.1.2 Miller, Bateman, Lenat and large databases of semantic relations

More recently, the creation of large databases of semantic relations has been considered by Miller et al. (1993), Bateman et al. (1996) and Lenat & Guha (1990).
Miller and his colleagues (Beckwith, Miller, & Tengi (1993), Fellbaum (1993), Fellbaum, Gross, & Miller (1993), Miller (1993), Miller et al. (1993)) describe WordNet, a manually constructed semantic network that serves as an on-line lexical reference system, the design of which is inspired by psycholinguistic theories of human lexical memory. The vocabulary used in WordNet represents approximately 95,600 different word forms (51,500 words and 44,100 collocations) describing some 70,100 word meanings. WordNet is organised around categories of nouns, verbs, adjectives and adverbs. The basic unit of WordNet is a synset, which represents a set of synonyms; a word form with several distinct meanings appears in several synsets. The most important lexical relationships between word forms within WordNet are synonymy and antonymy, which are available for all categories. Other semantic relations between word meanings are hypernymy and hyponymy (e.g. maple is a hyponym of tree, and tree is a hyponym of plant). The semantic system of hypernyms and hyponyms supports the lexical inheritance system. Other semantic relations provided by WordNet are, for example, substance-of, part-of and member-of for nouns, causes and entails for verbs, and pertains-to for adjectives.

The Generalized Upper Model developed by Bateman et al. (1996; 1994) is intended to be a domain- and task-independent general organisation of information in the context of text generation (English, German, Italian), but its linguistic model also suits Natural Language Processing applications. The Generalized Upper Model occupies a level of abstraction between surface linguistic realisations and conceptual or contextual representations. It enables abstraction beyond the concrete details of syntactic and lexical representations, while still enabling linguistic realisations to be solidly founded on objective criteria. The upper model is organised in terms of generalised linguistically-motivated ontological categories, such as:

• abstract specifications of process-types/relations and configurations of participants and circumstances (e.g., NONDIRECTED-ACTION, ADDRESSEE-ORIENTED-VERBAL-PROCESS, ACTOR, SENSER, RECIPIENT, SPATIO-TEMPORAL, CAUSAL-RELATION, GENERALIZED-MEANS),
• abstract specifications of object types, e.g., for semantic selection restrictions (e.g., DECOMPOSABLE-OBJECT, ABSTRACTION, PERSON, SPATIAL-TEMPORAL),
• abstract specifications of quality types, and the types of entities to which they may relate (e.g., BEHAVIOURAL-QUALITY, SENSE-AND-MEASURE-QUALITY, STATUS-QUALITY),
• abstract specifications of combinations of events (e.g., DISJUNCTION, EXEMPLIFICATION, RESTATEMENT).

Given the detail and consistency of both WordNet and the Generalized Upper Model, their organisation appears appropriate for the enforcing of ontological consistency in general domain modelling.

The goal of the Cyc system, developed by Lenat & Guha (1990), is to capture the common sense knowledge shared by most people, or in Lenat's words: 'The Cyc knowledge base is to be the repository of the bulk of the factual and heuristic knowledge, much of it usually left unstated, that comprises "consensus reality": the things we assume everybody already knows.' (Lenat & Guha, 1990, p. 28). Cyc uses a very large database of units, each of which describes a real-world object, a type of process, a particular event or an abstract idea.
Each unit is represented by a frame-based data structure composed of slots, each of which has a corresponding value, where the value of a unit is always a list of individual entries. This large, fixed ontology supports rule-based inference mechanisms, described as predicate-calculus-like constraints. The combination of a semantic ontology structure and formal logical inference mechanisms requires a consistent and correct representation of all knowledge units, which excludes the possibility of contradictory knowledge representations; logically coherent microtheories were therefore introduced, which can translate a representation from one context into another.4 A microtheory is internally consistent but can contradict other microtheories.

Cyc is a key project in knowledge representation, as it represents an attempt to overcome the brittleness and domain specificity of other approaches, through extensive representation of general knowledge (i.e. about people, objects, substances, events, sets, ideas, relationships, etc.), with which the system can analogise. However, the belief that first-order logical mechanisms solve the problem of translating between different representational structures seems inadequate to the current author (in section 5.1.4 we addressed the shortcomings of Cyc representations and inference mechanisms when used to represent and retrieve images).5 Nevertheless, the Cyc system serves as a helpful guide as to how to represent people, objects and settings.

4 As argued by Minsky (1988), a feature of intelligent behaviour is the ability to manage contradictory representations.
5 See also Bobrow & Winograd (1985), and Winograd (1985).

7.1.3 Haase's approach to memory-based representations

FRAMER, developed by Haase (1994), is a knowledge representation library designed to provide a platform-independent persistent object facility to support large database functionality. The organisational structure of FRAMER is a non-deterministic prototype-based inheritance mechanism. The basic descriptive element in FRAMER is a frame, a class object which can have other frames (called annotations) subordinate to it (Figure 5.3 showed a typical FRAMER representation). Frames are grounded by pointing to other frames or to domain objects, such as numbers, strings, vectors, procedures, bitmaps, etc. The annotation hierarchy and the prototype network allow the creation of semantic and episodic memory structures, where the behaviour of particular prototypes and their derivatives can be easily specified, which in turn eliminates the need for a second, independent constraint language such as CycL (Lenat & Guha, 1990). One component of FRAMER is FRAXL, an extension language that allows simple expression of common search and iteration schemata (e.g. A is the inverse of B, or C is a generalisation of D). Built on top of FRAMER is Mnemosyme (Chakravarthy, Haase, & Weitzman, 1992), an analogical knowledge representation system. Mnemosyme uses FRAMER's organisational structures, i.e. the annotation hierarchy, the prototype network and the ground pointer, to index and match examples of relations between descriptors under their common prototypes. In doing so, Mnemosyme offers a base ontology of differences and similarities between descriptors (prototypes) whose relational indices form the basis of matching and generalisation.
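The flavour of such prototype-based frame structures can be conveyed by a small sketch. This is not FRAMER's actual interface, merely a hypothetical rendering of its main organisational ideas: frames carry subordinate annotation frames and a ground value, and inherit from a prototype (FRAMER's prototype inheritance is non-deterministic; the sketch simplifies it to a single prototype chain).

    # A hypothetical, much simplified FRAMER-like frame (illustrative only).
    class Frame:
        def __init__(self, name, prototype=None, ground=None):
            self.name = name            # frame identifier
            self.prototype = prototype  # frame whose behaviour is inherited
            self.ground = ground        # pointer to a domain object: number, string, ...
            self.annotations = {}       # subordinate frames

        def annotate(self, name, **kwargs):
            self.annotations[name] = Frame(name, **kwargs)
            return self.annotations[name]

        def lookup(self, name):
            # Search own annotations first, then fall back on the prototype chain.
            if name in self.annotations:
                return self.annotations[name]
            return self.prototype.lookup(name) if self.prototype else None

    # e.g. a "walk" frame inheriting an annotation from a "motion" prototype:
    motion = Frame("motion")
    motion.annotate("domain", ground="movement of a body")
    walk = Frame("walk", prototype=motion)
    walk.annotate("body-part", ground="feet")
    print(walk.lookup("domain").ground)   # -> "movement of a body"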
Haase's method of developing semantic memory structures from episodic memory representations enables the creation of contradictory representations, through the indexing of variant examples under prototypes, and these do not, in contrast to Cyc (Lenat & Guha, 1990), suffer from loss of information in the translation between different representational structures. The potential of FRAMER and Mnemosyme for the representation and retrieval of video is demonstrated in Davis' work (Davis, 1995), as described in section 5.1.5. However, it is the use of prototypes and the focus on differences and similarities which makes Haase's work relevant to our knowledge representation scheme.

7.1.4 Schank's conceptual dependencies and dynamic memory

One of the major contributions to artificial intelligence research into the representation of action, story understanding and generation, and memory-based representation has been made by Schank and colleagues. The Schankian understanding of creativity in a cognitive system is centred on the ability to analyse stories effectively. The advantage of stories as a primary organisational structure of human cognition is that they can be indexed in multiple ways, due to their numerous references, which in turn makes them multiply accessible and relevant. Initial attempts to implement Schank's insight into human cognition focused on the problem of programming computers to understand stories. Major achievements of this research are the introduction of:

• conceptual dependencies, which describe human action through a small set of 12 composable primitives (ATRANS, PTRANS, MOVE, MTRANS, etc.) (Schank, 1972),
• scripts, which describe stereotyped sequences of events in a particular context (situational, personal or instrumental) (Cullingford (1978), DeJong (1983), Schank & Abelson (1977)),
• goals and plans, which represent high level structures that control understanding, and in particular story understanding (Carbonell (1978), Schank & Abelson (1977), Wilensky (1978)).

Extensions and revisions of the script formalism have led to the development of dynamic memory theory (Schank, 1982), in which reminding, indexing and retrieval are understood as basic cognitive processes, each of them represented as a memory structure which itself functions as a representational and retrieval mechanism. Schank's theory of dynamic memory introduces memory concepts such as MOPs (Memory Organization Packets), TOPs (Thematic Organization Points) and TAUs (Thematic Abstraction Units), which gave rise to a number of new approaches to story understanding and generation (Dyer (1982), Kolodner (1984), Lebowitz (1980), Lehnert (1983), Lehnert et al. (1983), Schank (1982), Schank & Riesbeck (1981), Wilensky (1983a, 1983b, 1983c)).

From dynamic memory theory, Schank developed the notion of case-based reasoning, an alternative to rule-based reasoning (Kolodner (1993), Riesbeck & Schank (1989), Schank (1982), Schank, Kass, & Riesbeck (1994)). The aim of the case-based reasoner is to solve 'new problems by adapting solutions that were used to solve old problems' (Riesbeck & Schank, 1989, p. 25). The task of generating a visually oriented narrative sequence can itself be thought of as a case-based problem (see the genre discussion in chapter 2), which involves tasks such as reasoning about, composition of, and adaptation of, the narrative sequence. Though Schank et al.
have developed video databases for interactive training and education, they have merely used film in an accompanying role. Their systems react to the user's state of interest by using internal story strategies to decide what type of story should be told, which then triggers a simple indexing scheme for the retrieval of the appropriate video (Burke (1993), Edelson (1993), Schank (1994), Smith et al. (1995)). A case-based reasoning approach designed particularly for the retrieval of video is described by Domeshek & Gordon (1995), Domeshek & Kolodner (1994), and Gordon & Domeshek (1995), and was discussed in section 5.1.4.

Our proposed knowledge representation is influenced particularly by scripts, TOPs, and TAUs, these being augmented in a network of detailed semantic fields of actions and subjects that is designed for the requirements and properties of the visual medium. Furthermore, we adopt certain features of case-based reasoning, though in a rudimentary way, since the advantages of case-based reasoning apply to higher narrational levels than those considered in this thesis.

7.2 Knowledge representation to support the creation of emotion provoking narrative sequences

The semantic and episodic memory structures introduced in this section are designed mainly to support the generation of narrative structures at the sequence level, based on the results of chapter 2. However, the structures to be described must meet the particular requirements of a visual presentation, and thus serve to create the plot requirements as presented in section 6.4.1 (see also Figure 6.3), and the humour strategies presented in chapter 3. Finally, the design of the semantic and episodic memory structures must mirror the ontology of the representation of film content, as presented in section 5.2, so that retrieval of appropriate video material can be achieved.

Our proposed structures will describe events only to the level of the sequence. We will but outline the connection of events to larger narrative structures, e.g. episodes. Furthermore, there will be no indication as to how the initial intention (external reason) for a story can be created from the structures to be introduced, since this thesis does not deal with this problem. In fact, it will be shown in chapters 8 and 9 that the starting point for the generation of visual stories is provided to the system in the form of a start shot.

We begin the description of our approach to the representation of background knowledge by considering actions, since actions are the core functional elements within an event. Actions define other narrative elements, such as the intentions, or the moods, of characters, and the importance of objects or locations. We then enhance the resulting semantic network for actions with event structures and with descriptions of abstract mental and cultural concepts.

7.2.1 Actions

Our ontology for the representation of video content, as described in section 5.2, is a structured textual representation of the semantic, temporal and relational features of video. The textual terms used in the ontology are generic, as any overly directive choice of labelling is to be avoided, as discussed in section 5.2.1. Thus, we also use generic terms for the representation of actions within common sense knowledge.
Our organisation of actions adopts the approach taken in WordNet (Fellbaum, 1993), discussed above, which classifies verbs mainly on the basis of semantic criteria in the following domains: bodily care and function, change, cognition, communication, competition, consumption, contact, creation, emotion, motion, perception, possession and social interaction. In the ensuing discussion, we will focus on motion, in order to exemplify the conceptual structure action and its role within the semantic net.

7.2.1.1 Conceptual structure

In describing the representation of denotative aspects of video (see section 5.2.3.1), we pointed out that there is no need to represent motions down to their atomic units, such as representing "walking" as a cyclic repetition of "taking a step". Rather, we represent such a complex pattern of human motion by the term walking. However, during the discussion of the influence of action on continuity editing (section 6.4.4), we showed that the mere generic term for a pattern (e.g. walking) is not sufficient to solve the problem of decomposing a narrative sequence into different shot types. As a solution to the problem of identifying different levels of detail in relation to the subject performing the action, we suggested a tree structure, where the root represents the class (e.g. human) and the leaves represent subparts (e.g. fingers).

Subpart and action are related. One feature of an action is the body part involved. For "walking" these are the feet, which are, due to the inheritance provided by the tree structure, automatically related to the legs and the body. However, there is not always a direct relationship between a body part and the performed action. Take, for example, the action "gliding", which is most likely to be related to skates rather than feet, or "eating", which can be performed with the fingers but usually involves cutlery. Since these objects are related to particular body parts in a performed_by relation, they also inherit the same feature.

Some motion actions, such as "slipping", require objects of a particular physical appearance so that the action can happen. For example, to slip on something, the something must either be slippery (e.g. ice) or it must be round (e.g. marbles). In our attempt to support such divergent aims as objectivity and computational efficiency, we suggest the feature related objects for the conceptual structure, which allows the designer to consider the retrieval requirements set by the representation of the video content. We are aware that the introduction of such a simplifying feature reduces the extent to which inferences can be drawn. On the other hand, the inference chain via the substance or shape, passing through size and finally concluding with the object, would lead to a similar result. We believe that complicated structures should only be introduced if their necessity becomes apparent during system development.

Actions are usually performed in locations. The set of objects relevant to an action may alter according to the location. We can slip, for example, on a number of things, e.g. a banana peel, marbles, soap, or ice. However, if the location is indoors we would, in a stereotypical case, not expect to slip on ice. Thus, the nature of the location is important to the conceptual structure of an action.
Referring to the discussion of spatial continuity editing (sections 6.4.4 and 6.4.5, and Tables 5.5 and 5.6), we can further conclude that the conceptual structure of an action must provide information concerning the spatial relationship between body part and objects, as well as the relationship between objects and location.

The representation described thus far covers only the physical aspects of an action. However, an action provides more information. We usually expect an action to be performed for a reason, and for it to lead to a certain result. In other words, the perception of a behavioural pattern triggers assumptions concerning the intention or goal behind an action, and guides our expectations, depending on the context, towards a specific outcome. Moreover, we also assume that the performance of an action leads to a stabilisation of, or a change in, the character's emotional state, e.g. the achievement of a goal may make the character happy. The humour strategies described in chapter 3 manipulate these expectation patterns to encourage the perceiver of a scene to laugh (see 3.2.4 on incongruity and 3.2.5 on derision). The gesture-action centred approach illustrated in section 5.2.3.1 relies on the viewer's expectations in making emotional states and intentions visible. The conceptual structure of an action therefore requires representational features describing the intention or goal of an action, a set containing actions which might be performed as a result of the action, and a description of the emotional state of an actor after the action is performed.

Referring to the above requirements for action representation, Table 7.1 introduces a simplified conceptual structure for the action "slip", which describes physical and mental features of the action.

Name                           slip
Domain                         motion
Nature of location             outdoors
Set of objects                 [banana_peel, dog_shit, soap, ice]
Body part / related object     [shoe]
Location                       [road]
Relation Location -> Object    under
Relation Object -> Body part   under
Intention                      [unintentional]
Result actions                 [sit, lie, kneel, shake, look_back]
Result mood                    [anger, rage, astonishment]

Table 7.1 Conceptual structure for a representation of the action "slip"

It must be stressed that the parameters of such a conceptual structure represent causal links to other conceptual structures of actions (e.g. Result actions), or conceptual links to objects (e.g. Set of objects), emotions (e.g. Result mood) or definitions (e.g. Nature of location).

A number of parameters of the conceptual structure presented in Table 7.1 have lists as their values. These lists represent the cultural and subjective influence of the designer on the conceptual structures (subjectivity in narrativity and humour is discussed in chapters 2 and 3), since the order represents the importance of a link (left to right representing decreasing importance).6 Associating a value with a link or unit in a structure is important, since the generation of meaningful narrative sequences relies on the detection of dominant information and, at least in cases of generating humour, on pattern-breaking information.

6 The current author is aware of the fact that lists represent but one way of representing valued links. Another would be to introduce numerical values or additional hierarchically oriented text tags, as used in the semantic relation Subaction described in section 7.2.1.2.
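A conceptual structure such as that of Table 7.1 maps naturally onto a simple record. The following sketch is one possible encoding; the field names follow the table, while the dictionary layout is our own illustrative choice.

    # Illustrative encoding of the conceptual structure in Table 7.1.
    # List order encodes decreasing importance of a link, as discussed above.
    slip = {
        "name": "slip",
        "domain": "motion",
        "nature_of_location": "outdoors",
        "set_of_objects": ["banana_peel", "dog_shit", "soap", "ice"],
        "body_part_related_object": ["shoe"],
        "location": ["road"],
        "relation_location_object": "under",
        "relation_object_body_part": "under",
        "intention": ["unintentional"],
        "result_actions": ["sit", "lie", "kneel", "shake", "look_back"],
        "result_mood": ["anger", "rage", "astonishment"],
    }

    # The most important result mood is simply the leftmost entry:
    dominant_mood = slip["result_mood"][0]   # -> "anger"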
The valuing of links is also used in the hierarchical structures provided in the conceptual representation of objects and of abstract mental and cultural concepts, where, firstly, the highest level in the hierarchy represents the most general and thus least important structure, and, secondly, the first item within a hierarchical level represents a more important link than the following items. Thus, retrieval of relevant material for narrative story generation operates on the basis of the difference between link values within conceptual substructures, or on the hierarchical distance between prototypes and their subtypes, where the importance of a subtype is detected by its position within the hierarchical level.

7.2.1.2 Semantic relations

Conceptual structures for actions, as described in the previous section, form but one type of node in our proposed semantic network. Other nodes, such as conceptual structures for subjects and emotional states, will be described later in this section. The semantic relations between conceptual structures of actions support the application of the humour strategies described in chapter 3, and the retrieval mechanisms related to the editing strategies presented in chapter 6. Each action concept is part of a network that supports the following links:

Synonym links
During the discussion of the influence of action on continuity editing (section 6.4.4), we discussed problems associated with the different levels of detail that might occur while decomposing a sequence based on different shot types. Since our ontology for the representation of video content pays specific attention to the maintenance of objectivity, a close-up of a moving head cannot be described with a term like "walking", because the actual action is not visible. Thus, the description of the action in this case is likely to be "moving". However, providing a synonym relation between "walking" and "moving" would make the shot accessible.

Subaction links
In discussing the establishing of derisive gags based on several actions performed by one character in the same temporal interval (represented by H-Strategies 16 - 21 in chapter 3), we pointed out that the crucial element is that of a conflict of shared resources, i.e. subactions. The example we gave was of the man who is reading the newspaper and at the same time is dipping a croissant into a cup of coffee. At a certain point, he dips the croissant into a jar of mustard. The conflict in this case is based on the shared subaction "looking". Hence, it is necessary to provide subaction links that point to actions performed concurrently to, and that form a conceptual unity with, a main action. However, it is not sufficient to place these types of links in order, to indicate their value to selection mechanisms. Since a conflict is only a conflict if the particular subaction is relevant for both main actions, subaction links are given tags from an arbitrary qualitative modal scale, e.g. necessary, non-essential, etc. Again, it is the decision of the implementor of the semantic network to allocate these tags.

Opposition links
Oppositions are the key elements in the generation of humour.
They are required for the generation of exaggeration, incongruity and derision. Thus, opposition links point to antonymous actions, and their functional form is that of doublets.

Ambiguous links
The possibility of interpreting an action or expression in two or more distinct ways provides an excellent source for the generation of meaningful narrative sequences, particularly those of humorous inclination (H-Strategies 6 - 8 in chapter 3). Hence, it is necessary to provide links that point to actions serving as subactions for other actions, e.g. for the actions "cry" and "laugh" a subaction might be "shake".

Association links
To increase the fullness and accuracy of our contextual representations we introduce this type of link, which points to actions that may be associated with the current action, e.g. "sit" and "listen".

We predict that the five types of semantic links between conceptual action structures defined above, in addition to the causal and conceptual links defined, will be sufficient to meet the representational requirements of a substantial portion of narrative generation strategies, beyond those we have described in chapter 3. However, further research into the representation of semantic knowledge-based story generation must be carried out to verify this. Figure 7.1 describes a simplified part of a semantic network for the motion "walk".

Figure 7.1 Semantic subnet for the action "walk" (conceptual structures for actions such as move, look, listen, glide, slip, collide and meet, and other conceptual structures such as Road, Shoe, Pleasure and Motion, connected by opposition, synonym, subaction, association, conceptual and causal links)

7.2.2 Abstract concepts

The conceptual structures for the physical representation of characters and objects were described previously (in section 7.1.3) as being located in a hierarchical prototype-based structure, following the analogical knowledge representation developed by Haase (1994). The relevance of analogical reasoning based on differences and similarities for the retrieval of appropriate video material is described by Davis (1995), as outlined in section 5.1.1.5. The significance of analogy in the narrative generation process was discussed in section 3.3.

In the preceding paragraph, the term "object" is meant in two ways: firstly, physical objects (e.g. a house or a car) and, secondly, abstract objects (e.g. time or justice). We make this distinction to allow for two different types of visualisation. If physical objects or characters appear in the causal chain of a plot, their attributes provide the necessary information to establish and maintain the sequence structure as described in Figure 6.3. If, on the other hand, an abstract object is required, then the internal cases for visualisations are activated. These form a collection of cases for visual clichés, such as montage ellipsis (passing time), e.g. leaves fluttering away, newspaper presses churning out an extra edition, shadows passing over a wall, and so forth. The form of such a conceptual structure is presented in Table 7.2, showing a simplified example for the conceptual structure of "time".

Abstract concept name    Representation structure [character/object, action]
time                     [[shadows], [passing]]

Table 7.2 Simplified conceptual structure for the abstract object "time"
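As a minimal sketch, such visual clichés could be stored as a lookup table from abstract concepts to character/object-action pairs. The structure below restates Table 7.2 in executable form; the additional cases are taken from the montage ellipsis examples mentioned above, and the layout itself is our own illustrative choice.

    # Illustrative store of visual clichés for abstract concepts (cf. Table 7.2).
    VISUAL_CLICHES = {
        "time": [
            (["shadows"], ["passing"]),
            (["leaves"], ["fluttering_away"]),
            (["newspaper_presses"], ["churning"]),
        ],
    }

    def visualise(concept):
        """Return the most important (leftmost) visualisation case for a concept."""
        cases = VISUAL_CLICHES.get(concept)
        return cases[0] if cases else None

    print(visualise("time"))   # -> (['shadows'], ['passing'])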
A conceptual structure for an abstract object can also be used as an example case for generating new visual representations for the abstract object, by using either the character/object slot or the action slot as the basis for analogical inferencing. While we have not investigated the introduction of learning strategies into our system, we are aware of the potential of such representations in this area.

To support the task of generating emotionally stimulating narrative sequences, there are two remaining abstract concepts to be described: emotional and visual concepts. Both types of concept are related through actions. Since actions are the main way in which the perceiver can infer the actor's mood or intentions (as discussed in section 5.2.3.1), both concepts are strongly related. However, for the sake of clarity, each will be presented separately.

7.2.2.1 Emotions

In chapter 3, we described how particular narrative strategies can arouse distinctive emotional responses in the perceiver of a scene, i.e. laughter. To support the generation of emotionally stimulating narrative sequences, it is necessary to provide a conceptual representation of emotions, such as rage, grief, love or happiness. In describing the representation of an action as a conceptual structure, we showed how the action is related to an emotion (see the "Result mood" slot in Table 7.1). The emotional concept related to the emotional type by the conceptual link takes the form of a doublet, as described in Table 7.3.7

Type specification    Emotional token
Rage                  [Displeasure, Distress]

Table 7.3 Representation of an emotional doublet

7 Our model of emotions is inspired by the theory of Ortony, Clore, & Collins (1988), and Wolff's psychological analysis of gestures (Wolff, 1972).

The type specification contains descriptions of an emotional result state related to a particular action. The emotional token represents an emotional classification, based on two hierarchically ordered classes, i.e. pleasure and displeasure, where each class in itself denotes a hierarchical representation of emotional states, i.e. for pleasure (delight, ecstasy, euphoria) and for displeasure (unease, dissatisfaction, distress). Each of the states represents a link to visual concepts, which are described in the next section. The representation of emotions suggested here enables us to distinguish between moods, not only within a class (due to the hierarchical ordering of emotional states) but also between classes, which provides a means of detecting whether an emotional state should be up- or downgraded, an important factor in the creation of humour (as, for example, described for derision in section 3.2.5).

7.2.2.2 Visualisations

The following approach to the visualisation of emotions is similar to that of the visualisation of abstract objects described earlier. However, for emotions, the aim is slightly different, because we actually use a hierarchical descriptive set of positive and negative emotions as the basis for gesture or action oriented visualisations. From our representation of emotions as described in Table 7.3, we can create structures to represent a wide range of emotional types. Table 7.4 shows a visual concept for the emotional class "pleasure".
Emotional class name    Body part          Action
pleasure                Head [lip, up]     whistle

Table 7.4 Simplified representation of the emotion class "pleasure"

The structure in Table 7.4 reveals that an emotion can be represented as a gesture, which in its form mirrors the representation of actions in the video content (Table 5.3), or, alternatively, simply by using an action as an index to that state.

7.2.3 Events

Earlier in this thesis, we discussed narrative principles (sections 2.1 and 3.1) and showed that a plot is a causal-chronological entity of related events within a given temporal and spatial framework. As a basis for the mechanisms that create or interpret such a dynamic structure, we identified three types of schemata: prototypes (identifying types of person, actions, localities, etc.), templates (common story formats) and procedures (to organise the search for appropriate relations of causality, time and space). Furthermore, in section 2.1.2.1 we argued that an event is a discrete structure of archetypal behaviour patterns and cultural stereotypes, consisting of preconditions (motivation), main-conditions (realisation) and post-conditions (resolution), where each condition comprises a number of dominant or free functional elements. Thus, it is the event structure which provides, at the sequence level at least, the logical and temporal order for the process of generating a narrative sequence.

Referring to our comments concerning the representation of the temporal aspects of video content (section 5.1.2.1), our remarks concerning the cause and effect chain of the narrative structure in the automated editing process (section 6.4), and deriving from Schank's representation of events, we are now in a position to present our approach to representing event structure. Our event structure representation names the event, the number of actors or objects involved and their gender (the description "any" in Table 7.5 represents the fact that the gender can be either male or female), the intention behind the event, the main actions of the actors involved, and a link to the next higher element within the story structure. The main action is divided into the three event stages, motivation, realisation and resolution, each containing a sequence of actions for each actor. Table 7.5 shows a simplified event representation for a "meeting".

Name           meeting
Actor number   2
Gender         any            any
Intention      meet
Motivation     [walk]         [wait]
Realisation    [look at]      [look at]
Resolution     [shake_hand]   [shake_hand]
Episode        date

Table 7.5 Structure of an event, i.e. "meeting"

The overall description of the mental situation in the event is provided by the parameter Intention, which establishes a conceptual link to the described action and thus inherits the related information concerning moods and emotions. The relevant actions for each actor in an event stage are sequentially organised. This means that there is either one action, or a list of actions which are causally related. The relation between actions performed by different characters in each event stage is oriented towards the crosscutting technique in film editing, which alternates shots from one line of action in one place with other events in another place, and thus creates a sense of cause and effect by binding the actions together. The same technique is used here, but instead of binding shots we combine lists or single actions.
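In executable form, the event structure of Table 7.5 might look as follows. The dictionary layout and key names are our own illustrative rendering; per-actor action lists are kept in parallel, mirroring the crosscutting organisation just described.

    # Illustrative encoding of the "meeting" event structure (cf. Table 7.5).
    meeting = {
        "name": "meeting",
        "actor_number": 2,
        "gender": ["any", "any"],                        # one entry per actor
        "intention": "meet",                             # conceptual link into the action net
        "motivation":  [["walk"], ["wait"]],             # actions of actor 1 and actor 2
        "realisation": [["look_at"], ["look_at"]],
        "resolution":  [["shake_hand"], ["shake_hand"]],
        "episode": "date",                               # link to the next higher story element
    }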
For example, imagine the event of a character attempting to obtain coffee from a coffee machine. Table 7.6 describes a simplified action representation for the three event stages of the "getting coffee" event.

           Motivation   Realisation                                Resolution
Actor      [approach]   [[search_money, insert_money+], [wait]]   [look_change, take_cup+, leave]
Machine    []           [[process], [provide_cup+]]               []

Table 7.6 Actions in the event "getting coffee"

During the motivation stage, the actor approaches the machine. In the realisation stage of the event, we may see the actor searching for, and then inserting, the change. Then we see the machine operating, then the actor waiting, and finally the machine providing the cup and the coffee. In the resolution stage of the event, we may see the character looking for change from the machine, then taking the coffee, and then finally leaving. Some of the actions performed by the actor and the machine are essential for plot understanding and others are not. Those of importance are tagged (i.e. marked with "+" in Table 7.6), indicating that these actions must be presented for an unmistakable perception of the scene.

We now see that it is the information gathered in the event structures which serves as the material for the sub-structures Characters and Actions of the Sequence structure, as described in Figure 6.3. However, which event stage, or which actions related to an event stage, are required depends on the external reason for telling the story, i.e. on the thematic and narrative strategy in use at the time. Assume, for example, that the above event of the character and the coffee machine is to be transformed into a gag. Depending on the context, we might need only the realisation part, and from the actions only those tagged with a +, to fire some of the rules described in section 3.3, so that a humorous sequence can be created (see the example jokes created by H-Strategies 22 and 23, described in chapter 3).

Referring to our earlier discussion of the temporal and rhythmical relationships between shots, we can now also describe how the creation of a narrative-oriented ellipsis (as described in sections 2.3.1 and 6.4.6.4) can be supported. In chapter 3, we asserted that timing is essential for the creation of a visual joke, which usually means that the introduction and punch line should be kept short. Alternatively, imagine a sequence where a character washes his hands, buttons his shirt and leaves the room. The event structure, as described in Table 7.5, supports such aims, since it provides an indexing scheme for essential causal event elements. Thus, only the barest level of information can be selected, and, if necessary, depending on the narrative strategy, even a sub-selection of those elements can be provided.

It should be made clear at this point in the discussion that the different event stages, in combination with the related semantic network for actions, not only influence the logic of the ongoing story but also affect the choice of appropriate visual material during the generation process, as schematically represented in Figure 7.2.

Figure 7.2 The relation between the narrative logic (motivation => realisation => resolution) and the choice of the visual material used to represent it (the available video material is narrowed down to the scene-relevant material)

Finally, the parameter Episode of the event structure establishes a link to prototypical episodes.
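The "+" tagging gives the generator a simple way to compress an event to its causally essential actions, e.g. when only the realisation stage of the coffee-machine event is needed for a gag. A sketch of such a selection, under the same illustrative encoding as before:

    # Illustrative selection of dominant ('+'-tagged) actions from an event stage.
    getting_coffee = {
        "motivation":  {"actor": [["approach"]], "machine": [[]]},
        "realisation": {"actor": [["search_money", "insert_money+"], ["wait"]],
                        "machine": [["process"], ["provide_cup+"]]},
        "resolution":  {"actor": [["look_change", "take_cup+", "leave"]], "machine": [[]]},
    }

    def dominant_actions(event, stage):
        """Return only the actions tagged as dominant ('+') in one event stage."""
        selected = {}
        for role, groups in event[stage].items():
            selected[role] = [a.rstrip("+") for group in groups
                              for a in group if a.endswith("+")]
        return selected

    print(dominant_actions(getting_coffee, "realisation"))
    # -> {'actor': ['insert_money'], 'machine': ['provide_cup']}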
Episodes represent descriptions of stereotypical chains of events. They are described by scripts that feature two parts, resources and cases. Resources are represented as descriptions of the number and type of characters involved, their appearance and the environment in which they act. The structure parameters characters, appearance and environment serve as pointers to related conceptual structures, which provide the detailed representations. Cases comprise a collection of events, organised according to the narrative stages motivation, realisation and resolution. Note that events in each narrative stage are also indexed as dominant or free. Episodes are grouped into four classes: social (date, party, etc.), public (e.g. a demonstration), business (e.g. an appointment, or a conference) and private (e.g. having sex). The classes and related prototypes of episodes support the application of story templates (genres), which determine:

• The theme of a plot
• The relevant narrative primitives and their potential presentation. For example, a typical detective story introduces the characters (introduction - setting), then shows the murder (conflict) and then introduces the detective (introduction - point of view).

The discussion of episodes, and of their relationship to story templates, is included here for the sake of completeness, since this thesis does not deal with the creation of such complex structures. However, it is important to suggest how the established event structure could be accommodated into a larger development environment, since it is events that form the building blocks of episodes and thus of stories. Further research is needed to establish if this is indeed the case.8

8 For interesting and related approaches see Brooks (1995, 1996), Davenport (1994), Davenport & Murtaugh (1995), Galyean (1995), and Ryan (1991). Examples of advanced story development software are Dramatica (1996) and Storyline Pro (1993).

Figure 7.3 describes a simplified subnet containing the structures discussed above.

Figure 7.3 Semantic subnet (conceptual structures for actions, e.g. walk, move, look, listen, glide, slip, collide and meet; conceptual structures for physical objects, e.g. Road and Shoe; emotional concepts (doublet and class), e.g. Rage; and events and episodes, e.g. Meeting and Date; connected by opposition, synonym, subaction, association, conceptual and causal links)

7.2.4 Conclusion

The above scheme for the representation of events provides a coherent action-reaction dynamic, which supports both the external and internal reasons for telling a story (see section 2.1.1). Due to the introduction of the event stages motivation, realisation and resolution, the indexing of dominant action functions, and the ability to provide a number of cases for a particular event, the scheme is also sufficiently flexible to serve different types of narrative strategies.

Throughout this thesis, we have argued that film, though based on common human content, provides its own reality. At the current stage of our research, we believe that our approach provides sufficient causal structure to support the automatic generation of credible narrative sequences. In terms of the editing strategies described in chapter 6, we also showed how actions of different event stages can be ordered in such a way that a sequence of actions can be created, where certain actions are inferred but not shown (as demonstrated by the Kuleshov examples in section 4.2.2.2).
Thus, our approach to representing the knowledge of the media world (genres), in combination with related presentational and cultural knowledge (e.g. editing and emotional codes), seems more appropriate for supporting the viewer's cognitive process of sequence construction (inferential activity) than traditional AI representations of the world (common sense). Hence, we have not introduced spatio-temporal logic (Allen (1990, 1991), Del Bimbo et al. (1992, 1993), McDermott (1978, 1990), Parkes (1989a), and Shoham (1987)). However, we concede that our approach is specifically designed for the needs of the current task, and it may prove inadequate for other tasks, such as interpreting a video sequence. Further research on the inferencing schemes required to interpret video sequences would be worthwhile.

Earlier in this thesis, we discussed the nature of film editing and narrativity, and drew attention to the cultural and subjective influences on both of these memory-based processes. Stories and images are vehicles by which thoughts are transmitted from one mind into another. Thus, how we understand is affected by what is in our mind. That is, image and scene understanding is different when different memory structures control the process. In a small way, the influence of subjectivity can be found in our use of valued links and in the ways in which our semantic nets are organised, which also reflects the cultural influence.9

9 However, a much stronger cultural influence can be found in the design of the editing and humour strategies, which at the current stage of the research clearly reflect European culture.

A significant drawback of the above representation scheme could be argued to be the complexity of the representation and its implications for the cost of implementation, as any concept or event structure (which represents a concept in an episode) must be created before it can be assigned to a case in memory. However, the structures introduced above could, we believe, be understood more readily by the user than could schemes based on logical equations. At the current stage of our research, we have no experimental evidence as to how long it would take to teach users to create and maintain the knowledge structures we have specified. The design and implementation of environments to facilitate users in providing the necessary representations is a problem for future research, and is not considered further in this thesis.

Chapters 5, 6 and 7 have introduced the structures and operational mechanisms for representing video content, editing knowledge and narrative-related "common sense" knowledge. The next chapter describes the architecture of AUTEUR, our implemented prototype based on the proposed representational structures. In chapter 9 we present example film sequences edited by AUTEUR, and describe how the system created those sequences.

Chapter VIII
AUTEUR: An architecture for automated video story generation

The previous chapters provided the representational structures, and the narrative and editing strategies, which will enable an automated system to generate emotionally stimulating video stories at the level of events. The aim of this chapter is to demonstrate the applicability of the proposed ontologies and methodologies, by introducing the architecture of a prototypical system that can generate humorous visual stories.
However, before we introduce our architecture, we first give a brief overview of existing systems for the automated generation of visual stories.

8.1 Related work

Despite recent rapid developments in multimedia research, little attention has been given to the development of architectures to support the automated generation of visual stories.1 Two important approaches, by Sack & Davis (1994) and Bloch (1986), are briefly discussed in this section.

8.1.1 Sack & Davis' video generator IDIC

Sack & Davis (1994) describe a video generator called IDIC, which creates new nonverbal "Star Trek: The Next Generation" (STTNG) trailers from an archive of existing trailers for STTNG episodes. The theme of the trailer is specified by the user, who can choose from four basic narrative structures: threat, negotiation, fight and rescue.

1 AI research on multimedia, with respect to the automatic generation of presentations, has been more concerned with media such as text (Riesbeck & Schank, 1989) and graphics (André, 1995; Feiner & McKeown, 1991). Some contemporary AI research investigates the generation problem for video (Brooks, 1995; Brooks, 1996) or animation (Butz, 1995; Strassmann, 1994).

IDIC's video generating architecture consists of two main components: a modified version of Newell and Simon's General Problem Solver (GPS) (Newell & Simon, 1990; Norvig, 1992) for planning the new trailer, and Media Streams (Davis, 1995), which is used to annotate and retrieve the appropriate footage. The GPS operators used by Sack and Davis are based on four goals (threaten, negotiate, fight and rescue). The structure of a GPS operator (a list of preconditions, an add-list and a delete-list) is used to represent the content requirements of the first scene of a sequence (preconditions), the content requirements of the next scene of the trailer (add-list), and the type of action that can be omitted (the action of the operator). The action of the GPS operator is assumed to be inferred by the viewer in the cut between two scenes of the sequence. Figure 8.1 illustrates the structure of a GPS operator used in IDIC.

    ;fight -> threaten
    (make-op :action 'threaten-renewed-violence
             :preconds '(fight)
             :add-list '(threaten)
             :del-list '(fight))

Figure 8.1 GPS operator "threaten renewed violence" in IDIC (Sack & Davis, 1994, p.5)

In order to satisfy a narrative goal, the GPS created by Sack and Davis applies sequences of operators. However, unlike a conventional planner, which is oriented towards short solutions, the story GPS selects the operator with the most unsolved preconditions, which provides more complex stories with unexpected turns. The story board produced by the GPS then serves as the basis for IDIC to retrieve the appropriate visual material. The available STTNG trailers, which serve as the visual source, are annotated in such a manner that sequences of a trailer are indexed with appropriate GPS goals.

IDIC's major contribution is to show that GPS-style planners are adequate tools for the automated generation of video stories. Due to their preconditions, post-conditions and the goal-oriented action that serves as a linkage between the different operators, GPS planners function in a way very similar to the shot transitions provided by cuts in a video sequence. Moreover, IDIC demonstrates that the goal satisfaction of the planner itself must be adjusted to the task, which in the case of story telling is to provide longer and more convoluted event sequences.
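To make this selection heuristic concrete, the following Prolog sketch (our illustration, not IDIC's code, which is written in Lisp; the operator facts are invented) shows GPS-style operators as facts, together with a selection rule that prefers the operator with the most unsolved preconditions:

    :- use_module(library(lists)).   % member/2, last/2

    % gps_op(Name, Preconditions, AddList, DeleteList) - invented examples
    gps_op(threaten_renewed_violence, [fight], [threaten], [fight]).
    gps_op(negotiate_truce, [threaten, negotiate], [rescue], [threaten]).

    % Number of preconditions not yet satisfied in the current state.
    unsolved(Preconds, State, N) :-
        findall(P, (member(P, Preconds), \+ member(P, State)), Ps),
        length(Ps, N).

    % Select the operator with the most unsolved preconditions, so that
    % satisfying it forces further planning and a more convoluted story.
    select_op(State, Name) :-
        findall(N-Op, (gps_op(Op, Pre, _, _), unsolved(Pre, State, N)), Pairs),
        msort(Pairs, Ascending),
        last(Ascending, _-Name).

With State = [fight], select_op/2 binds Name to negotiate_truce, since both of its preconditions are unsolved, whereas threaten_renewed_violence has none; a conventional planner would prefer the latter.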
Though the re-implementation of GPS in IDIC yields an appropriate approach to generating coherent trailers, IDIC is unable to deal adequately with narrative continuity and complexity. For a trailer, it is not necessary to tackle issues such as spatial or temporal continuity (discussed in section 6.4), since the viewer accepts that the scenes presented are taken out of context. Hence, IDIC's matching mechanism of linking one GPS goal to each video segment is insufficiently powerful for creating credible sequences. Moreover, the keyword-oriented approach of IDIC's video annotation does not allow modifications based on connotative combinations, since an action labelled "fight" or "negotiate" completely specifies a single meaning. Thus, more complex techniques than matching goals to video sequences are required, in order to relate multiple video annotations to the automatically generated story board.

Related to the problem of matching video annotations to a story board is the problem of creating an optimal presentation. Since IDIC's aim is to create event structures which are based on the fulfilment of one single goal, it is of no importance if several video segments can be matched against the goal. The system can use any of them. The structure of stories is more complex than that dealt with by IDIC. As argued in the preceding chapters, the appropriate match for the next video sequence not only requires an analysis of the intention of the current narrative state, in terms of the internal and external reason of the story, but also requires an analysis of the important stylistic features of preceding shots. The constant switching between the forward chaining development of the story and the backward chaining required to specify the stylistic presentation of the sequence cannot be achieved by a single-dimensional planner such as IDIC's. Thus, while IDIC's approach is promising with respect to establishing the framework of a story (i.e. supporting the establishment of genres and themes), we argue, as in previous chapters, that planning a story and presenting it are two different, though related, tasks, which require distinct strategies. Therefore, the aim should be to develop a multi-planner based architecture which combines mechanisms for managing narrative continuity and complexity on the action, event and episode level, along with mechanisms for controlling stylistic features related to narrative issues (e.g. themes and genres) and medium-oriented presentational aspects (e.g. editing techniques).

8.1.2 Bloch's machine for audio-visual editing

In section 5.1.1, we discussed Bloch's approach to the representation of video content. In section 6.3.2, we discussed his approach to automated editing. We now discuss the architecture of Bloch's audio-visual editing machine (Bloch, 1986).

Bloch's audio-visual editing machine consists of two videodisc players, a video monitor and an SM190 minicomputer. Both videodisc players hold identical copies of the same videodisc, which contains sixty purpose-filmed shots of an actor engaged in simple physical actions, such as walking, picking up an object or looking at another actor. Visual continuity between shots is achieved by using the principle of "virtual editing", where one videodisc player displays a shot, while the other searches for the first frame of the next shot.
Following the first shot, the second videodisc player plays the next. The control program for Bloch's editing machine is object-oriented, and distinguishes between five operators: choice, construction, appreciation, correction and projection. Each operator, effectively an object, consists of a collection of specialists for particular tasks in the class. The generation process for the video story, as described in section 6.3.2, uses the conceptual representation of a given simple story as a starting point, and then steps through the stages of the editing model. Figure 8.2 shows the main stages of Bloch's editing model.2

• Choice
• Construction
• If Appreciation is bad then Correction
• Projection

Figure 8.2 Bloch's editing model (Bloch, 1986, p. 133)

2 Translation by the current author

The tasks of the different operators from Figure 8.2 are:

Choice: This operator determines the type of segment needed and establishes the style of the visual presentation (e.g. portray Gilles and Said looking in two shots, in the form "Gilles is looking to the right" and "Said is looking to the left").

Construction: Depending on the type of segment, different types of specialist are activated. For example, one specialist task is to present a simple action by juxtaposing a large number of shots with the aim of providing the impression of fluid motion.

Appreciation: This operator provides particular functions for evaluating motion, position and looks in shots. For example, two shots may be juxtaposed on the basis of the positions of characters. This triggers an attempt to establish the shared context of both shots, such as whether the shots contain the same actors, and in the appropriate positions. If this is not the case, the join is indexed as faulty.

Correction: This operator provides specialists in the form of heuristics to improve faulty joins. For example, a juxtaposition judged to be faulty due to lack of continuity in motion may be corrected by inserting an undetermined motion shot.

Projection: This operator organises the "virtual editing" of the videodiscs.

The relevance of Bloch's architecture is twofold. Firstly, he introduces specialised procedures for the control of the editing process, which embody an editing model similar to our own, as described in section 4.3. However, Bloch's aim is neither to provide an automated story generator (the narrative is given), nor to provide mechanisms to react to cases where the required material is not directly available. Rather, Bloch's aim is to create the optimal presentation for the given story from the available video material. Hence, there is no need in his top level algorithm to iterate through choice, construction, appreciation and correction, as our model suggests. Bloch's algorithm applies correction only when the visual appreciation for a particular join is determined to be low.3

The second important feature of Bloch's architecture is the object-oriented nature of his editing model. This approach is of particular importance, as it enables the handling of the divergent narrative mechanisms, presentation mechanisms and representational structures, on the basis of communication between object classes. The object-oriented implementation of Bloch's architecture provides an extremely useful basis for tackling the problem of synchronising the story planning process with the generation of an acceptable visual presentation.
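To summarise this control regime, the following Prolog fragment sketches our reading of Figure 8.2 (the join data and the correction heuristic are invented for illustration; Bloch's implementation is object-oriented and not in Prolog): correction is invoked only for joins whose appreciation is judged bad, and there is no global iteration back through choice and construction.

    % Illustrative stubs: appreciation verdicts and one correction heuristic.
    appreciation(join(shot_a, shot_b), good).
    appreciation(join(shot_b, shot_c), bad).        % discontinuity in motion
    correction(join(shot_b, shot_c),
               join(shot_b, undetermined_motion_shot, shot_c)).

    % Keep good joins as they are; repair faulty ones before projection.
    appreciate_and_correct([], []).
    appreciate_and_correct([J|Js], [J|Ks]) :-
        appreciation(J, good), !,
        appreciate_and_correct(Js, Ks).
    appreciate_and_correct([J|Js], [K|Ks]) :-
        correction(J, K),
        appreciate_and_correct(Js, Ks).

    % ?- appreciate_and_correct([join(shot_a,shot_b), join(shot_b,shot_c)], S).
    % S = [join(shot_a,shot_b), join(shot_b,undetermined_motion_shot,shot_c)]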
3 It has now become apparent that, to Bloch, the generation of film sequences is seen in terms of montage rather than editing. This is why his representation of video content does not allow the further editing of the shot itself (for a detailed discussion of Bloch's representation of video content see section 5.1.1), which also influences his approach to automated video editing, as described in section 6.3.2.

8.2 A proposed architecture for the editing of theme oriented video stories

The aim of our proposed architecture is to establish a system that synchronises automatic story generation for visual media with the stylistic requirements of narrative and medium related presentation.

In section 2.2, we discussed the relationship between the two main layers of a story, i.e. structure and content, and showed that both simultaneously serve the narrative purposes of form and substance, as described in Figure 2.4. In order to achieve the required interaction between the structure and content levels, we identified a two layer planning system as an appropriate approach to generating both the plot and its visual presentation. This was contrasted with a grammar-oriented approach, which could support only plot generation. Furthermore, we showed that it is the substance and expression of content which establishes the impression of a well-formed plot and also enables the creation of distinguishable plots. The subsequent investigation of plot content, and its formal expression for the example of humour in chapter 3, led to strategies for the automated generation of such content. These strategies were underpinned by the ontological representation, described in chapter 7, of the physical story-world and of abstract mental and cultural concepts. Moreover, our investigation of the film realisation of narrative led to the introduction, in section 4.3, of a procedural model of the editing process, our representation of video content and, in chapter 6, knowledge structures and related processing mechanisms to enable automated video editing. Our analysis of the editing process suggested that the process of arranging visual material is best considered to be a planning problem.

In order to demonstrate the applicability of our described representational structures and narrative and editing strategies, we now present an architecture for a prototype. The architecture is realised in the current version of our experimental system AUTEUR (Artificial Intelligence Utilities for Thematic Film Editing using Context Understanding and Editing Rules), implemented in Sicstus Prolog on a SUN Sparc workstation.

We first provide a general overview of the architecture, and then describe the components in detail. Chapter 9 describes humorous film scenes actually generated by AUTEUR.

8.2.1. Overview

The aim of AUTEUR is to automatically generate a video story. The user provides AUTEUR with the identifier of a startshot and the thematic orientation of the event to be created. The thematic orientation for the presented examples is humour. A possible query might be go(10, humour), where 10 is the id of the startshot, and humour represents the theme. The overall tasks carried out by AUTEUR are described in Figure 8.3.

1. Analysis of the startshot, in terms of establishing the appropriate thematic strategy for the event to be established.
2. Development of the story in accordance with the thematic strategy chosen.
3. Establishing of the appropriate form of presentation for the thematic orientation and story content.
4. Retrieval of the appropriate visual material.
5. Editing of the material.
6. Presentation of the visual story.

Figure 8.3 Tasks performed by AUTEUR

Referring to the editing model described in section 4.3, all steps between task 1 and 6 are performed repeatedly, since any of the steps may fail, and restructuring must then be carried out. Since the aim of AUTEUR is to perform the task of story generation and its visual presentation as autonomously as possible, the design of the architecture is inspired by work on planning (see André (1995), Cawsey (1990), Hayes-Roth (1985), Hayes-Roth & Hayes-Roth (1990), Korf (1990), Newell & Simon (1990), Sacerdoti (1977, 1990, 1990b), Schank (1981), Smith & Witten (1991), Tate et al. (1990), Wilensky (1983a, 1983b, 1990), Wilkins (1990)), and ideas propagated by the research on Autonomous Agents (see Bates (1994), Hayes-Roth (1995), Maes (1991, 1994)).

The proposed architecture, as described in Figure 8.4, consists of three major units: one unit embodies the DB for video material, the DB for video representation, and the Knowledge Base (the resource unit); the second unit covers the Editor and Retrieval System (the construction unit); and the third contains the Development Tools and the Interface (the development unit).

The resource unit reflects, in its structure, our assumptions concerning representation (see sections 5.2, 7.2 and 6.4) and contains the actual visual material (i.e. video clips), the content annotations for each video clip, and a knowledge base holding the ontological representation of the "media-world", mental and cultural concepts, and some knowledge related to abstract filmic and editing representations (e.g. the spatial relationships between two shots, represented as a two dimensional array - see Table 6.2).

[Figure 8.4 Proposed architecture for the creation of a visual story of emotional impact: the Editor (Structure Planner and Content Planner) and the Retrieval System, the DB of video material, the DB of video representations (semantic content and visual information), the Knowledge Base (filmic and conceptual knowledge), the Interface, and the Development tools.]

The construction unit of the architecture embodies a controller module (the Editor in Figure 8.4), which consists of two separate planning systems. The Structure Planner deals with structural details of the plot and its visual presentation. The Content Planner deals with the actual plot content and its stylistic presentation. The Retrieval System serves as the link between the controller module, the resource unit and the user interface. The development unit reflects our discussion of semi-automated video content annotation systems (see section 5.3) and the creation and maintenance of representational knowledge structures (i.e. the editing rules and the semantic net). From the discussion in previous chapters, it should be clear that this part of the architecture is merely theoretical. We include it only for completeness.

We now discuss the components of the architecture and their respective functionality. We also discuss the interfaces between the components. We begin with the resource unit, and then describe the construction unit.

8.2.2. The video database

The video database is a collection of digitised video material.
Our approach to the representation of video content allows for the use of video clips of arbitrary content and length. However, it is expected that the clips will usually refer to a single domain. Furthermore, due to the lack of appropriate tools for automated video annotation (as discussed in section 5.3), we assume that the shots will usually be short in duration and restricted in their range of actors, actions and objects. Which of the standard formats for digital video is chosen depends on the capturing tools provided by the development unit. Our current implementation of the video database uses the MPEG format. For experimental purposes, the video database is currently restricted to approximately 40 shots, each between 1 and 15 seconds in length. The shots are purpose-filmed for our research and feature simple physical actions of an actor or object, performed in a normal physical environment.

8.2.3. The video representation

The content representation for the available video clips uses the structured textual approach for the categories cinematographic devices, character, object, action, composition and setting, as described in section 5.2. The annotations were created in MicroEmacs. An intelligent, semi-automated graphical interface, providing a simple video editing suite and an intelligent text editor that supports the user with appropriate, and preferably pre-filled, forms for the category to be annotated, would have been useful. This is not so much because the annotation process is time consuming, but rather because it is tedious. Hence, it is not surprising that quite some time was spent on finding misspellings, or values attached to the wrong attribute. Nevertheless, the representational content structures proved to be manageable in this form.

8.2.4. The Knowledge Base

The Knowledge Base contains the conceptual structures for characters, objects, actions, events, episodes, and abstract concepts such as emotions and visualisations, in the form of a semantic net, as described in section 7.2. Since AUTEUR is implemented in Sicstus Prolog, the value mechanisms for links in the semantic net are held in the form of lists, as described in sections 7.2.1.1 and 7.2.1.2.

8.2.5. The Editor - a controller module

The architecture of the editor embodies the separation of the two main story layers, defined in section 2.2, i.e. structure and content. Each layer is provided with its own planning system. The content planner is additionally assisted by two specialised planners, i.e. the Visual Designer and the Visual Constructor. The communication between the different planning systems is based on the memory structures Sequence-Structure (shown in Figure 6.3) and Location-Memory-Structure (shown in Figure 6.5).

8.2.5.1 The Structure Planner

In the discussion of themes in chapter 2, we showed that the external point influences the scene structure. The task of the Structure Planner (Figure 8.4) is to organise the strategies for realising the required theme. The Structure Planner is involved in the creation process from the outset, by providing the analysis of the given startshot, which in turn is used to establish the first Sequence-Structure. This analysis, based on the startshot description in the video representation, provides information about the number of actors, groups of actors, objects and groups of objects. Each of these units is related to particular information about their actions, i.e.
sequences of actions, actions performed concurrently (i.e. performed during the same sequence of frames), and single actions. The information acquired is stored in the Setting, Subject, and Action fields of the first Sequence-Structure.

Since the mood of characters is important in visual humour, the startshot analysis also attempts to derive information about this. The analysis uses heuristics referring to visual expressions or the speed of an action (discussed in sections 7.2.2.1 and 7.2.2.2), e.g. a smile supports the assumption of pleasure. The result is a mood description, such as "pleasure", combined with a numeric value that represents the system's certainty that this mood is suggested by the chosen material.

The next task of the Structure Planner is to use the results of the startshot analysis to establish the appropriate thematic strategy (see task 1 in Figure 8.3). The discussion of humour in chapter 3 revealed that two of the five identified humour primitives, i.e. incongruity and derision, have implications for the construction of humorous events. Recall that each of these primitives covers various humorous concepts (see section 3.2.4 for incongruity and section 3.2.5 for derision). The process of providing the general direction of the visual gag is based on strategies which are described in chapter 3 as supportive (Table 3.1). As the first example of a supportive strategy, take H-Strategy 1, which says: An action forms the most suitable subject for a joke, then an actor, then an object and finally a location. Hence, the system will first investigate options concerning actions. As a further example of a supportive strategy, consider H-Strategy 14, which states: For a single action, mischief is easier to achieve than ambiguity. If the available shot portrays a single action, the overall aim of the story construction will be to generate a derisive joke.

The preparation process ends with the completion of the Sequence-Structure, by instantiating the fields Kind, Intention, Form and Appearance. The field Kind is instantiated with the current narrative stage of the event, which is either motivation, realisation or resolution (see section 2.1.2.1). The choice of narrative stage is driven by the relevant supporting humour strategies, in combination with the results of the startshot analysis. In the preparation phase it may be the case that the startshot provides all the necessary information to set the scene. For example, the relevant strategies for derisive jokes focus on single actions and the mood of a character. Assume that, firstly, only one action is represented in the video representation of the startshot, and, secondly, that the certainty factor for the mood is sufficiently high (e.g. the value is equal to 1). Such a case would cause the system to instantiate Kind with the value Realisation, rather than Motivation. It must be stressed that the same evaluation mechanisms are used to determine the use of the remaining narrative stages, except that in those cases it is the previously used shots which are analysed, rather than the startshot. The instantiation of Intention and Form is based on the actor, action and location analysis of the startshot. For example, if one of the results of the startshot analysis is that the joke is based on an action, Intention will be instantiated with action, and Form will be instantiated with the chosen strategy type (e.g. misfortune) and the strategy name (e.g. H-Strategy 4).
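Since AUTEUR is implemented in Prolog, such a structure can be pictured as a Prolog term. The following rendering is a sketch only (the functor and the field encodings are ours, not necessarily AUTEUR's internal format); the values anticipate the banana skin example of chapter 9, including the Appearance value discussed next:

    % Hypothetical rendering of a completed motivation Sequence-Structure.
    sequence_structure(
        kind(motivation),                      % current narrative stage
        intention(mood+event),                 % what this phase must establish
        form(misfortune, h_strategy_4),        % strategy type and name
        appearance(accelerate),                % presentation pace (see below)
        setting(single(2, [path, meadow]), group(1, [trees])),
        subjects([mood(frank, pleasure, 0.5), mood(frank, hurry, 0.5)]),
        action([frank, [walk]])).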
Since the overall theme for the sequence is humour, the system automatically attaches the value "accelerate" to Appearance (the reason is described in sections 3.2.1 and 3.2.2, which discuss the humour primitives readiness and timing). Other themes, e.g. tragedy, might result in different values being chosen for the Appearance slot.

Once the essential structural elements for the event to be created have been declared, the Structure Planner supports the construction process by providing the Content Planner with additional information concerning the current event phase. The decision process of the Structure Planner is based on feedback from the Content Planner. The Content Planner provides information about the meaning of the story, the motivated event, mood, action, actors involved, etc. The Structure Planner compares the actual event provided by the Content Planner with the suggested content in the original Sequence-Structure. The results of this comparison serve as a guide to establishing the Sequence-Structure for the following event phase. For example, a new character may have been introduced due to motivational content requirements, and must now be included in the storyline. This might affect the presentational structure in terms of changing from a single event presentation to a parallel event presentation. A further example is when the Content Planner produces a storyline based on the given Sequence-Structure, but there is insufficient visual material to portray the story. A way out of this dilemma is to apply an alternative strategy. The Structure Planner may, for example, switch from misfortune to stupidity, because the existing material supports such a change of strategy. In such cases, it is the responsibility of the Structure Planner to initiate the reorganisation of the material, for example by revisiting an earlier creation phase (e.g. moving from realisation to motivation), which may cause the re-establishment of the startshot, insofar as it has already been edited.

As the aim of our research, at the current stage, is not to achieve the automatic generation of complex story structures such as episodes or features, the tasks of the Structure Planner are fairly limited. In particular, the preparation phase is where this planner does most of its work. However, we assume that the introduction of more complex story structures would require additional meta-compositional rules, similar to the supportive strategies for humour. The establishment of genre structures, which are highly related to the development of a theme, as described in section 2.1.1, would be particularly suitable. In this case, the Structure Planner would need to be developed further, and may need components such as a Theme Analyser and a Genre Analyser (analogous to the Visual Designer and Visual Constructor which support the Content Planner).

8.2.5.2 The Content Planner

While the Structure Planner is concerned with the external point, or theme, of the video story, the Content Planner is concerned with its internal point. The Content Planner specifies the content of the relevant event phase, and is therefore responsible for the actual application of particular strategies. Hence, the Content Planner uses the humour strategies that were described in chapter 3 as constructive.
Depending on the strategy and its specification provided by the Structure Planner, the Content Planner uses the conceptual structures gathered from the semantic net of the Knowledge Base (shown in Figure 8.4) to attempt to construct a coherent scene. Take H-Strategy 4 as an example:

H-Strategy 4: If the action portrays an intention (goal), interrupt the action, in a way that is expected by the character, so that the goal is unfulfilled and the character's mood is downgraded or he suffers in some way. (Mischief + Schadenfreude + Superiority)

The strategy is designed to create a derisive joke (Mischief + Schadenfreude + Superiority). For its motivation phase it requires a goal-oriented action and a mood which can be downgraded. The realisation requires an interruption of the action, which hinders the success of the goal. The resolution requires the representation of a downgraded mood, or an event or action that represents the suffering of the actor. However, recall from Figure 7.2 that as the story progresses through the time phases, the choice of content elements decreases.

To fulfil such a strategy, the Content Planner consists of three independent planners, one for each creation phase. The motivation, realisation and resolution planners are essentially specialists for the content generation of their respective event phase. Each planner uses the event-related Sequence-Structure and the semantic net of the Knowledge Base as the primary sources for its story planning process. Additionally, each collaborates with the Visual Designer and the Visual Constructor to establish the visual presentation of its respective part of the scene.

For example, assume that a favourable mood and a necessary action (e.g. "walk") are available from the startshot. The first event phase to be created is thus the realisation. The goal of the realisation planner is to interrupt an action in a way which is expected by the character. An interruption requires the planner to find an oppositional action for the action actually being performed. Hence, the planner traverses the oppositional links (as described in section 7.2.1.2) of the semantic net, from the conceptual structure of the current action (i.e. "walk"), in such a way that the highest valued link (e.g. "slip") is chosen first. The connected conceptual structure serves as the content material for the realisation phase. The Content Planner constructs a query which may request a story element showing the character slipping. This query is sent to the Visual Designer, which may, or may not, arrive at a suitable visual solution. If the Visual Designer does not succeed, the Content Planner continues the investigation of the semantic net, by first processing any unexplored oppositional links. If this also proves unsuccessful, the synonym links from the actual action are traversed, so that the oppositional links from the synonymous actions can be processed in turn. In cases where no solution exists, the Structure Planner is informed that a change of strategy is required. However, recall that there are usually several versions of a particular strategy, which provide the system with the ability to create stronger or weaker jokes (compare, for example, H-Strategy 2 and H-Strategy 4 in chapter 3).
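This descending traversal has a direct Prolog reading. The sketch below is illustrative only (the link facts and their values are invented; AUTEUR's actual net and value mechanism are those of section 7.2): direct opposition links are tried highest value first, followed by oppositions reached via synonymous actions.

    :- use_module(library(lists)).   % member/2, reverse/2

    % opposition(Action, OpposedAction, LinkValue) - invented examples
    opposition(walk, slip,    0.9).
    opposition(walk, stumble, 0.7).
    opposition(walk, collide, 0.6).
    synonym(walk, stroll, 0.8).
    opposition(stroll, trip, 0.5).

    % Candidate interruptions: direct oppositions first, best value first...
    interruption(Action, Opposed) :-
        findall(V-O, opposition(Action, O, V), Pairs),
        msort(Pairs, Ascending),
        reverse(Ascending, Descending),
        member(_-Opposed, Descending).
    % ...then the oppositions of synonymous actions.
    interruption(Action, Opposed) :-
        synonym(Action, Synonym, _),
        interruption(Synonym, Opposed).

    % ?- interruption(walk, X).
    % X = slip ; X = stumble ; X = collide ; X = trip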
Analogous to the descending order of the constraints for the content retrieval query, the first applicable event phase planner will attempt to establish the strongest content direction for a joke (i.e. using the content strategy with the highest number of associated content primitives) and then gradually weaken the visual gag as necessary. Thus, H-Strategy 4 would be chosen first, followed by H-Strategy 2. This strategic behaviour reflects the aim of producing the best possible joke in the circumstances.

A further important task of the Content Planner is to determine the point of view from which the story is seen to be taking place. For example, the Content Planner must decide whether it is appropriate to tell the story from a particular character's perspective. Such a decision is influenced by the Form field in the Sequence-Structure of the related event phase. The aim of AUTEUR is to stimulate an emotional reaction in the viewer of the story, so the general goal is to tell the story from the viewer's perspective. However, H-Strategy 4 indicates a change in point of view, i.e. the expectation by the character that the attempted action may not succeed in its aims. Such cases will cause a point of view change from the third person narrative style to a character reaction. The influence is manifested in a change to the Form field (from third-person to first-person) and will have consequences for the work performed by the Visual Designer, which now has to search for shots that succeed as eye-line matches, instead of shots that provide continuity on the basis of motion.

Finally, it is the Content Planner which evaluates the constructed visual joke. While establishing the event phases, the Content Planner calculates a numeric value that represents the extent to which the content and stylistic requirements in the different event phases have been fulfilled. The evaluation heuristic used is based on the evaluation function described in section 3.2.3. The calculated evaluation index is then compared with a percentile system, resulting in the rating of the visual gag.4

8.2.5.3 The Visual Designer

The Visual Designer supervises the retrieval of the required video material. The content based query (order of shots) is provided by the motivation, realisation or resolution planner of the Content Planner, depending on the current generation phase. The aim of the Visual Designer is to collect the most appropriate video material in terms of content and style. The architecture of the Visual Designer is based on our discussion of the space, action and content requirements of continuity editing (section 6.4). The mechanisms used are based on the representation of the permissible relationships between shots (shown in Figure 6.2), the relationships between camera distances, the conceptual structures for subjects, i.e. characters and objects (shown in Figure 6.3), and editing strategies, as described in sections 6.4.1 - 6.4.5. The structures representing the relationship between shots, and the relationship between particular denotative aspects of the video content and structural elements of the conceptual structures of subjects (i.e. characters and objects), are part of the Knowledge Base (shown in the overview of the architecture given in Figure 8.4). The editing rules are part of the Visual Designer.
Thus, the resources used by the Visual Designer during the process of establishing a visual representation for a particular event phase are:

• the related Sequence-Structure and, connected with it, the Location-Memory-Structure (see Figure 6.5),
• the Knowledge Base of editing knowledge (e.g. the spatial relationships between two shots, represented as a two dimensional array, as described in Table 6.2) and the semantic net,
• the data base containing the video representation.

An example of the use of these resources is the reaction of the Visual Designer when it is unable to retrieve the visual material specified by the Content Planner. In such cases the Visual Designer uses the strategy of query decomposition, as described in sections 6.4.3 and 6.4.4. The output from the Visual Designer is a shot list for the related event phase, representing the content query in the most appropriate visual terms, which is then transferred via the Content Planner to the Visual Constructor, as described in the next section.

Note that, since the Sequence-Structures for all event phases and the Location-Memory-Structure are kept in memory, each joke is remembered as a case, represented by the steps involved in its generation. The Visual Designer can therefore use these as example cases, or to avoid recreating the same joke.

4 The percentile system is defined as:
90 - 100 % => good
80 - 89 % => reasonable
70 - 79 % => weak
60 - 69 % => very weak
0 - 59 % => completely unfunny

8.2.5.4 The Visual Constructor

The Visual Constructor receives an annotated shot list from the Visual Designer, and actually performs the detailed joining of the specified shots. The Visual Constructor operates at the cutting level. This can mean that a shot is truncated, if it is too long for the required purpose (e.g. two seconds exposure to a close-up is often sufficient for the viewer to appreciate what is being shown). Cuts may also be motivated if only part of a particular shot is required. This can be particularly important for maintaining continuity, for example in the case of inserts that break the flow of action, where the actual screening time of the insert must be considered. The strategies used for such operations are based on our discussion of the temporal and rhythmic relationships between two shots, in section 6.4.6. The output from the Visual Constructor is an ordered list of the shot identifiers, along with frame numbers for each shot. This list specifies the scene that is to be displayed.

8.2.6 The Retrieval System and Interface

The Retrieval System converts the final stream of shot ids and frame numbers (stored in the List_of_used_shots, from Figure 6.5) into a file specifying the actual presentation of the video story at the event level. The file lists the appropriate MPEG files along with associated start and end frames, which can be shown in a small "projection" window on the workstation. The presentation environment is written in X and uses SUN Video Technology/XI Library. At its current stage of development, the interface features two windows: an X window, where the user enters the query to start the story generation process, e.g. go(10, humour), and a window in which the video is actually displayed. The simplicity of the interface reflects the status of AUTEUR as being purely a research platform.
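As a small implementation aside, the percentile verdict of footnote 4 maps directly onto a Prolog predicate (the mapping is the footnote's; the formulation as a predicate is ours):

    % rating(+EvaluationIndex, -Verdict): footnote 4 as a lookup.
    rating(V, good)               :- V >= 90, V =< 100.
    rating(V, reasonable)         :- V >= 80, V < 90.
    rating(V, weak)               :- V >= 70, V < 80.
    rating(V, very_weak)          :- V >= 60, V < 70.
    rating(V, completely_unfunny) :- V >= 0,  V < 60.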
During the above description of the architecture, however, we outlined some tools that would be required to make AUTEUR user friendly. Our research project is necessarily focused on the representational and structural aspects of the automated generation of video stories. We thus defer considerations of the user interface to a later date.

8.3 Conclusion

Based on the representational knowledge structures for video content and for narrative and thematic knowledge, the narrative strategies for humorous stories at the event level, and the techniques for automated editing discussed in the previous chapters of this thesis, we have introduced the overall architecture of AUTEUR, a system for the automated generation of emotionally stimulating visual stories at the level of events. We introduced the three major units and their components, i.e. construction (Retrieval System and Editor), resources (Data Base for video material, Data Base of video representation and the semantic Knowledge Base) and development (Development Tools and User Interface). Furthermore, we described the tasks of each component and the interaction between them. Finally, we showed how the representation structures generated by AUTEUR, i.e. the Sequence-Structures for the different event phases, and the Location-Memory-Structure, can be used to assist the generation process by serving as example cases, which help to avoid producing the same visual gag twice.

The next chapter discusses examples of humorous film stories actually created by AUTEUR.

Chapter IX
The operation of AUTEUR: Show me a joke

In this chapter, we consider three examples of humorous films actually produced by AUTEUR, to demonstrate the operation of the architecture described in the preceding chapter. We begin with an example of visual humour that is based on a single action by one character. The major feature of this example is the decomposition of action and character appearance. The second example highlights the generation of a visual gag that involves addressing the problem of parallel actions performed by a character, and also shows how AUTEUR avoids producing the same joke twice. The third example describes the development of a joke based on the interaction between an object and a character. Each example is discussed according to the relevant generation phases, i.e. preparation, motivation, realisation and resolution. A detailed description of example 1 is given, while examples 2 and 3 are described in less detail.

It is extremely difficult to present a sufficiently detailed description of the operation of a system such as AUTEUR. The following examples have therefore been simplified. For example, we provide only a sketch of the content representations involved (see Tables 5.1 - 5.6). The Appendix contains an illustration of a typical generation trace produced by AUTEUR, which represents the analytical development of a story. Note also that actual shots of film are used by AUTEUR, but that these are described below in the form of a single representative image from each shot.

9.1 The banana skin joke

Suppose that the query by the user is go(12, humour), where 12 represents the ID of the shot shown in Figure 9.1.

Figure 9.1 Startshot for the banana skin joke

9.1.1 Preparation phase

As described in the previous chapter, the Structure Planner first analyses the content representation of the given start shot, in terms of the number of characters, objects or groups, related actions, moods and location.
The construction of the mood is based on visual expressions and actions (type, speed, order) related to an actor (see also section 7.2.2). The result is a list of possible moods, such as pleasure, hurry, etc., each tagged with a certainty value, which must be higher than 0.35 for the mood to be considered as an element of the list of assumed moods. Figure 9.2 presents an example of the type of information which may be extracted by the Structure Planner.

Shotid: 12
Shotkind: long
Actors: Single (1, [Frank]), Group (0, [])
Objects: Single (2, [path, meadow]), Group (1, [trees])
Actions: sequence: no, parallel: no, single action: walk
Mood: [Frank [pleasure+0.5, hurry+0.5]]

Figure 9.2 Result of startshot analysis for the banana skin joke

The Structure Planner uses the information described in Figure 9.2 to establish the Sequence-Structure (for an example of such a structure, see Figure 6.3) for the current generation phase. The first decision made concerns the selection of the most suitable humour strategy. As there is only one person, one action, and nothing ambiguous about the action (one action in a long shot), AUTEUR will suggest misfortune as the humour type for the joke (see H-Strategy 14). The next step for the Structure Planner is to determine the phase of construction. Since the concept of misfortune requires a mood deterioration, the Structure Planner evaluates the mood of the relevant character. Neither of the moods "pleasure" and "hurry" indicated by Figure 9.2 reaches the certainty value of 0.75 that would indicate that the mood of the character is clearly perceptible from the shot; thus, the mood is indexed as "to be motivated". Since there is but one character, performing one action only, there is no need to motivate the action. However, since the conceptual structure of "walk" can be associated with a larger logical sequence, e.g. a meeting, the motivation of an event is also suggested (according to H-Strategy 16 of chapter 3). On completion of the preparation phase, the first Sequence-Structure may be instantiated as shown in Figure 9.3.

Sequence-Structure
Kind: motivation
Intention: mood+event
Form: misfortune
Appearance: accelerate
Setting: Single (2, [path, meadow]), Group (1, [trees])
Subjects: [Frank [pleasure+0.5, hurry+0.5]]
Action: [Frank, [walk]]

Figure 9.3 Motivation Sequence-Structure for the banana skin joke

The Structure Planner now instructs the Content Planner to proceed, by sending a list of supportive humour strategies (e.g. H-Strategy 15).

9.1.2 Motivation phase

The first task of the Content Planner is to identify the appropriate strategy for the humour type, which is misfortune (see the field Form in Figure 9.3). Based on the set of supportive humour strategies provided by the Structure Planner, the Content Planner evaluates the available information concerning the visual material. H-Strategy 15 suggests that a more complex strategy will increase the chance of producing a good joke. For a single character performing a single action, the appropriate strategy is H-Strategy 4.¹

In the light of the chosen strategy, the Content Planner attempts to create a motivation for a mood and an event. The first aim is to create a visual representation that suggests that the character either feels pleasure, or is in a hurry. The second aim is to establish an event that corresponds with the chosen mood. AUTEUR first attempts to establish a suitable mood.
The Knowledge Base contains a number of mood concepts, along with related actions (such as those shown in Tables 7.3 and 7.4). For example, pleasure may be associated with smiling, whistling, and picking flowers. Since the first action in the list of associations represents the strongest visualisation of the mood, AUTEUR uses smiling for its first attempt to establish a visual representation of "pleasure". Traversing the associative links of "smiling" and "walk", AUTEUR infers that a person can walk and smile at the same time. As a result, a query is sent to the Visual Designer to retrieve an appropriate visual representation for "Frank walks and smiles". The query also specifies the underlying intention for the material (i.e. highlighting a mood) and the ID of the shot to which the motivation is related (i.e. 12).

To solve the task of finding appropriate visual material for the query, the Visual Designer uses two knowledge structures: the Knowledge Base, i.e. the spatial relationships between two shots, represented as a two dimensional array, as shown in Table 6.2, and the conceptual relationships between visual space and narrative functionality, represented in the form of clauses (see Figure 6.4). From the representational structure for the cinematic devices in shot 12 (see also Table 5.1), which is stored in the DB of video representations, the Visual Designer detects that the start shot is of type "long". From the conceptual relationship between visual space and narrative functionality, the Visual Designer infers that motivation favours detail, which is related to a decrease of space. This information results in the Visual Designer exploring the array of spatial relationships between shots. Suppose that the Visual Designer could neither retrieve shots of the most appropriate type (e.g. medium), nor build a bridge (according to E-Strategy 2, in section 6.4.2). The Visual Designer must now rely on a fallback option, which allows the join of a "long" shot and a "close-up" shot for purposes such as the motivation of a mood. Now, assume that the Visual Designer can retrieve a number of shots from the DB of video representations which are annotated with the required action, "walk", for the given character (Frank), and which also represent a facial expression of a smile, as shown in Figures 9.4 - 9.6.

Figures 9.4 - 9.6 Three possible motivation shots for the banana skin joke

1 A more appropriate strategy is actually H-Strategy 13. However, we have chosen H-Strategy 4 to promote ease of presentation.

The representation of each of the shots is compared with that of the shot to be joined (i.e. 12) on the basis of spatial continuity (appearance of character, spatial relations between character and object, and appearance of setting, including such features as lighting, location, season and time of day), action match (e.g. comparison of direction), and temporal continuity (e.g. speed of action). Furthermore, each of the shots is analysed on the basis of stylistic features. Since the intention of the join is a motivation, a zoom-in is stylistically desirable. Based on the evaluation process, the Visual Designer chooses the shot represented by Figure 9.5, as its content description indicates that it most obviously presents the mood, is consistent with the direction of action, and represents a zoom-in.
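The choice among the candidate shots can be pictured as a simple scoring over the continuity and style criteria just listed. The sketch below is ours (the shot facts, criteria and weights are invented for the example; AUTEUR's actual evaluation uses the richer representations of chapters 5 and 6):

    :- use_module(library(lists)).   % last/2

    % shot(Id, FacialExpression, ActionDirection, StylisticDevice) - invented
    shot(shot_9_4, smile,   left,  none).
    shot(shot_9_5, smile,   right, zoom_in).
    shot(shot_9_6, neutral, right, none).

    % Score a candidate against the query: mood clearly visible, action
    % direction consistent with the startshot (here: right), zoom-in present.
    score(Id, Score) :-
        shot(Id, Expression, Direction, Device),
        (Expression == smile -> E = 2 ; E = 0),
        (Direction == right  -> D = 1 ; D = 0),
        (Device == zoom_in   -> Z = 1 ; Z = 0),
        Score is E + D + Z.

    best_shot(Best) :-
        findall(S-Id, score(Id, S), Pairs),
        msort(Pairs, Ascending),
        last(Ascending, _-Best).     % shot_9_5 for the facts above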
The successful establishment of the mood "pleasure" enables the Content Planner to specify an event, by constructing an appropriate goal for the character. Thus, the Content Planner traverses the causal links provided by the attribute Intention of the conceptual structure for "walk", to detect usable event structures (see Table 9.1). One such goal might be to meet another person. Thus, the Content Planner attempts to construct the representation for a meeting by using the relevant conceptual structure. Possible event structures might be a male meeting a female for a date, and so on.

Name: meeting
Actor number: 2
Gender: any, any
Intention: meet
Motivation: [walk], [wait]
Realisation: [look at], [look at]
Resolution: [shake_hand], [shake_hand]
Episode: date

Table 9.1 Conceptual structure for the event "meeting"

Since the Content Planner is currently operating in the "motivation phase", it also targets the motivation parts of the event structure "meeting", such as one person waiting and the other walking, or, perhaps, both characters walking. In our example, the action of one character is already specified, i.e. "walk", so AUTEUR searches in the motivation field of the conceptual structures for "meeting" for a corresponding action performed by the other character. Assume that the conceptual structure described in Table 9.1 is the one chosen by the Content Planner. Thus, the corresponding action is "wait", which can be performed by a character of either sex. The above information is used by the Content Planner to send a query to the Visual Designer, which this time has to retrieve a shot of either a female or a male character who waits (or performs a synonymous action) in surroundings similar to those provided by the start shot (see Figure 9.1). The Visual Designer may suggest the shot represented by Figure 9.7.

Figure 9.7 Event shot for the banana skin joke

Due to the chosen strategy (H-Strategy 4), the Content Planner orders the actions according to their appearance in the shot sequence (event action, character action (startshot and motivation)) and transfers this information to the Visual Designer. The task of the Visual Designer is to decide how the shots should be joined. The juxtaposition of "event shot" and "start shot" is a simple join of two long shots in the given order. The juxtaposition of "start shot" and "motivation shot" is to be performed as an insert, since the motivational intention of the join, in combination with the particular pairing of shot types (a close-up motivating a long shot), highlights this solution. The actual juxtaposition is performed by the Visual Constructor. The insertion and related trimming for the motivation shot is carried out according to E-Strategies 31 - 36.

The Content Planner now generates a Location-Memory-Structure (see Figure 6.5), in which the ids of the shots are stored in their established order (9.7, 9.1, 9.5, 9.1, where the numbers refer to the Figures above). Since all content and stylistic requirements for the visual presentation of the motivation phase have been achieved, the Content Planner marks the evaluation value for the motivation as successful. Finally, the Content Planner indicates the status of the ongoing story to the Structure Planner, by providing a list such as that shown in Figure 9.8.
Strategy used: H-Strategy 4
meaning of the story: date
motivated event: meeting
mood of main character: [Frank, pleasure+1.0]
action main character: walk
number of characters: 2

Figure 9.8 Status of the banana skin joke after the end of the motivation phase

The Structure Planner uses the story status, as described in Figure 9.8, to decide on the next phase in the generation process. The comparison between the Sequence-Structure for the motivation phase (see Figure 9.3) and the status of the motivation phase (see Figure 9.8) reveals that a new character, the waiting man, has been introduced. Since both characters are still apart, and the new character is passive (i.e. waiting), the inference drawn by the Structure Planner is that the joke is still to be based on the action of the main character. However, the introduction of the new character changes the single-person environment into a person-person environment, which is indicated by a change of the Intention field of the motivation Sequence-Structure from event to parallel_event. Moreover, the fields Setting, Subjects and Action in the motivation Sequence-Structure need to be updated with information about the second character. Since no change in strategy is indicated, the humour type (misfortune) remains, and the established strategy (H-Strategy 4) is added.

The next activity of the Structure Planner is to decide on the next phase of the generation process. Since the emphasis continues to be on one character, there is no change in action, and there is no change in the strategy type, the Structure Planner suggests a realisation phase and then instantiates the relevant Sequence-Structure, as specified in Figure 9.9.

Sequence-Structure
Kind: realisation
Intention: action
Form: misfortune, H-Strategy 4
Appearance: accelerate
Setting: Single (2, [path, meadow]), Group (1, [trees])
Subjects: [Frank [pleasure+1.0]]
Action: (uninstantiated)

Figure 9.9 Realisation Sequence-Structure for the banana skin joke

Finally, the Structure Planner instructs the Content Planner to continue the generation process.

9.1.3 Realisation phase

The first task of the Content Planner is to retrieve the name of the action or event which forms the basis of the joke. This is indicated by the uninstantiated Action field of the realisation Sequence-Structure (see Figure 9.9). To retrieve the required action, the Content Planner uses the realisation requirements provided by H-Strategy 4. The aim of H-Strategy 4 is to violate a character's goal under two constraints: firstly, that the mishap should be simple, and secondly, that the mishap should be expected by the character.

Thus, the Content Planner first retrieves the action (walk) of the main character (Frank) from the motivation Sequence-Structure. The violation requirement of H-Strategy 4 causes the Content Planner to traverse the outgoing opposition links of the conceptual structure "walk". The oppositional links are chosen because the concept of "interrupt" is related to the concept of "perform opposition". The links lead to conceptual structures for oppositional actions for "walk", such as fall, slip, stumble, or collide with. In decreasing order of the importance of the opposition links, the Content Planner attempts to instantiate an oppositional concept. Take the conceptual structure of slip as an example, which is described in Table 9.2.
Name: slip
Domain: motion
Nature of location: outdoors
Set of objects: [banana_peel, dog_shit, soap, ice]
Body part / related object: [shoe]
Location: [road]
Relation Location -> Object: under
Relation Object -> Body part: under
Intention: [unintentional]
Result actions: [sit, lie, kneel, shake, look_back]
Result mood: [anger, rage, astonishment]

Table 9.2 Conceptual structure for a representation of the action "slip"

An important task for the Content Planner is to detect whether the result mood of "slip" can fulfil the required mood deterioration. Since anger, rage and astonishment are related to the emotional token of the opposite classification type to pleasure, i.e. displeasure (as discussed in section 7.2.2.1), the mood deterioration can be established. Thus, "slip" is a feasible action for foiling the action "walk". The Content Planner can now present the Visual Designer with a number of queries, considered in descending order of suitability, such as:

• find a shot where the actor slips, where the object the character slips on is found in the startshot
• find a shot where a body part slips on an object, where the object is found in the startshot
• find a shot where the actor performs an action that is associated with slip, and a shot showing an object that is also associated with slip
• find a shot where the actor performs a slip-related action.

Suppose that the first query above is not satisfied, but that the second is successful. The next step for the Content Planner is to satisfy the expectation constraint of H-Strategy 4, which requires that the character is aware of the object. Since none of the retrieved shots portrays the line of sight of the character, the Content Planner uses the substructure "Bodygesture" from the action representation of the actor (see Table 5.3), taken from the last shot of the previous Sequence-Structure in which the character appears. The comparison of the Relation Object -> Body part (see Table 9.2) with the line of sight of the Bodygesture (see Table 5.3) reveals that the character, Frank, is not actually looking at the ground. There is thus no reason for the Content Planner to assume that the character is aware that he is about to slip on an object, and so H-Strategy 4, which requires an expected mishap, is not applicable.

The Content Planner could now continue the search for another suitable opposition action. However, assume that the Content Planner investigates weaker mishap strategies, such as H-Strategy 2, which requires an unexpected mishap. This strategy corresponds with the material of the motivation phase and would fulfil the constraint of unexpectedness, since in the shots of the motivation phase the character is not looking towards the ground. As a result, the Content Planner changes the strategy ID in the realisation Sequence-Structure from H-Strategy 4 to H-Strategy 2, but marks the switch as "imp-up-strat", which indicates that the joke could be improved by using a higher order strategy. This index is not particularly significant for the generation process, but may become important when, at later stages, the system is asked to account for its generation and evaluation process. It may be the case, for example, that the user does not approve of the supplied joke, which may further require that the system can suggest improvements that could be made.
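The mood-deterioration test just described can be sketched in Prolog as follows (the emotional classifications below are simplified stand-ins for the doublet and class mechanism of section 7.2.2.1):

    :- use_module(library(lists)).   % member/2

    % Simplified emotional classification - invented for the example.
    emotion_class(pleasure, positive).
    emotion_class(anger, negative).
    emotion_class(rage, negative).
    emotion_class(astonishment, negative).
    opposite(positive, negative).
    opposite(negative, positive).

    % An action deteriorates the current mood if one of its result moods
    % belongs to the class opposite to that of the current mood.
    deteriorates(CurrentMood, ResultMoods) :-
        emotion_class(CurrentMood, Class),
        opposite(Class, Opposite),
        member(Result, ResultMoods),
        emotion_class(Result, Opposite).

    % ?- deteriorates(pleasure, [anger, rage, astonishment]).   % succeeds,
    % so "slip" can foil "walk" for a character in the mood "pleasure".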
Additionally, the Content Planner attempts to apply the constructive H-Strategies related to misfortune, which might improve the joke. An example is H-Strategy 5, which states: "If the intention of the joke is derisive, reveal the point in advance (enhanced Schadenfreude)". As a result, the Content Planner sends a query to the Visual Designer, requesting a detail shot of the object the character is to slip on; the spectator will then anticipate the mishap, which is predicted to increase the potential success of the joke. Assume that this query is successful, and that the retrieved shot is that shown on the left of Figure 9.10. The Visual Designer then analyses the content and style of the potential material (in terms of spatial and temporal continuity), following which the Visual Constructor specifies the detailed joining of the material. The final outcome is a two-shot scene, as suggested by the stills in Figure 9.10.

Figure 9.10 Realisation part for the banana skin joke, generated out of two shots

The Content Planner now evaluates the realisation part of the joke. Since all requirements of the strategy could be fulfilled, the evaluation value is in the range "good". Finally, the Content Planner updates the Location-Memory-Structure, and then indicates the status of the realisation phase to the Structure Planner, by providing a structure such as that shown in Figure 9.11.

Strategy used: H-Strategy 2+(imp-up-strat)
meaning of the story: mishap
motivated event: slip on banana_peel
mood of main character: [Frank, pleasure+1.0]
action main character: slip
number of characters: 1

Figure 9.11 Status of the realisation phase of the banana skin joke

The Structure Planner once again compares the status of the generation phase (realisation) with the related Sequence-Structure (see Figure 9.9). Since the action for the main character is not indicated in the Sequence-Structure, the Structure Planner updates it with the action provided by the status information (i.e. "slip"). Furthermore, the strategy type must be updated, since the strategy has changed. The next step for the Structure Planner is to decide if a resolution stage is required. Following the concept of misfortune, the Structure Planner investigates the content representation of the shots generated in the realisation phase with respect to the portrayal of a mood change, either by showing a reaction or a gesture. Since such information cannot be found in the constructed material, the Structure Planner suggests a resolution phase, and then instantiates the relevant Sequence-Structure, resulting in a structure such as that shown in Figure 9.12.

Sequence-Structure
Kind: resolution
Intention: action
Form: misfortune, H-Strategy 2
Appearance: accelerate
Setting: [Single (2, [path, meadow]), Group (1, [trees])]
Subjects: [Frank [pleasure+1.0]]
Action: slip

Figure 9.12 Resolution Sequence-Structure for the banana skin joke

Finally, the Structure Planner instructs the Content Planner to continue with the generation process.
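A minimal sketch of this update step, under assumed field names, might look as follows: the uninstantiated Action field is filled from the phase status, and a resolution phase is requested whenever no shot of the realisation portrays a reaction or a gesture.

    def update_sequence_structure(seq, status):
        """Fill the uninstantiated Action field and refresh the strategy."""
        if seq.get("Action") is None:
            seq["Action"] = status["action"]
        seq["Form"] = "misfortune, " + status["strategy"]
        return seq

    def resolution_required(shot_contents):
        """Misfortune needs a visible mood change: a reaction or a gesture."""
        return not any(s.get("reaction") or s.get("gesture") for s in shot_contents)

    seq = {"Kind": "realisation", "Action": None, "Form": "misfortune, H-Strategy 4"}
    status = {"action": "slip", "strategy": "H-Strategy 2+(imp-up-strat)"}
    update_sequence_structure(seq, status)
    print(seq["Action"])                              # slip
    print(resolution_required([{"action": "slip"}]))  # True -> add a resolution phase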
9.1.4 Resolution phase

Using the conceptual structure for slip (see Table 9.2), the Content Planner constructs a request for video material that portrays an appropriate reaction by the character. The reaction is composed by considering possible resulting states and, if pertinent, the relevant object. For the case of slip and banana peel, one possibility is to request a shot of the character lying on the ground, or a shot of the character looking angrily down at either the banana peel or something not represented in the shot, or a shot of the character simply looking angry, and so on (see Result actions and Result mood in Table 9.2). The strategy for choosing between alternatives is based on a preference for reactions over moods, and, among competing potential results, the choice is based on the value of the relevant link. Assume that the Content Planner sends the following request to the Visual Designer: provide a shot where the character "Frank" looks back at the object (banana_peel).

In order to ensure a consistent filmic style throughout the scene, the Visual Designer attempts to satisfy the request of the Content Planner in terms of stylistic aims and the shot history (Location-Memory-Structure). This can be important, for example, if earlier decisions concerning editing techniques are to be repeated (examples being the consistent use of close-ups for highlighting, or cutting without bridging). Since the realisation phase of the generation predominantly makes use of shot types between "medium" and "close-up", the Visual Designer attempts to answer the query with a shot from within this range of types. Figure 9.13 shows a frame from a shot that realises this aim with respect to the existing chosen material.

Figure 9.13 Retrieved shot for the resolution phase of the banana skin joke

Once the Visual Constructor has established the join, the Content Planner evaluates the resolution part of the joke. Since a reaction, rather than a gesture, is shown, the resolution is valued as "good", even though the reaction itself is the weakest within the set of reaction links, i.e. the last element of the list (see Result actions in Table 9.2).

The Content Planner then applies the evaluation values gathered for each generation phase to evaluate the humour level of the joke. Since each of the generation phases produces the required content, though with varying degrees of success, the value of the stylistic units is high enough for a good ranking. The overall verdict is therefore "good". However, due to the need to downgrade H-Strategy 4 to the simpler H-Strategy 2, the originality is assessed as "average".

Following the Content Planner's indication of the successful termination of the joke generation process, the Structure Planner seeks options for developing further jokes from the existing story line, or attempts to provide the specification of an appropriate conclusion to the scene. Thus, the Structure Planner will first suggest the repeated application of the misfortune strategy. For our meeting example, this would mean that events should be generated in which the main character continues to attempt to reach the meeting after slipping on the banana skin, and is subjected to further mishaps, such as falling over a bench, missing a bus, failing to hail a taxi, and so on. Let us assume, for simplicity, that no visual material corresponding to such situations is available. The Structure Planner will now perform a cross-comparison of the established Sequence-Structures and the Location-Memory-Structure, to ensure that all generation goals have been satisfied.
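Before following that comparison through for the example, here is a hedged sketch of such a goal check: the phases covered by the generated Sequence-Structures are compared with the phases the event concept demands. The record shapes and names are assumptions made for illustration.

    EVENT_PHASES = ("motivation", "realisation", "resolution")

    def missing_phases(event, sequence_structures):
        """Return the event phases not yet covered for `event`."""
        covered = {s["Kind"] for s in sequence_structures if s.get("event") == event}
        return [p for p in EVENT_PHASES if p not in covered]

    seqs = [{"Kind": "motivation", "event": "meeting"},
            {"Kind": "realisation", "event": "slip"},
            {"Kind": "resolution", "event": "slip"}]
    print(missing_phases("meeting", seqs))  # ['realisation', 'resolution']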
For the above example, the comparison between the Sequence-Structures and the conceptual structure of the event "meeting" reveals that the generated video material provides neither the realisation nor the resolution stage for a meeting. Thus the Structure Planner instructs the Content Planner to generate:

a) a realisation and resolution phase for a meeting of the characters; or
b) a resolution phase for a meeting of the characters; or
c) a decomposed version of either of cases (a) or (b).

Assume that each of the above requests is unsuccessful. The Structure Planner then determines that the shot in Figure 9.7 is superfluous material. This initiates the sending of a re-editing plan to the Content Planner. The first step for the Content Planner is to instruct the Visual Constructor to erase the shot from the Location-Memory-Structure. The Content Planner then adjusts the fields Setting, Subjects and Action in the relevant Sequence-Structures. In the given example, only the motivation Sequence-Structure is affected, since no other Sequence-Structure contains information about the character to be removed. Finally, the Content Planner re-calculates the evaluation value of the joke, which, for the given example, remains the same. The final version of the banana skin joke, which is roughly 20 seconds long, is suggested by the stills in Figure 9.14 (read from left to right, top to bottom).

Figure 9.14 The banana skin joke generated by AUTEUR

9.2 The lamp post joke

For the second example, assume that the Content Planner has indicated the successful completion of the banana skin gag, as described above. As already mentioned, the Structure Planner next pursues options for developing further jokes from the existing story line. Thus, the Structure Planner uses the banana skin joke as a case basis for the creation of other jokes, guided by the repetition and exaggeration strategies presented in chapter 3 (H-Strategies 9 - 12).

9.2.1 Preparation phase

The first task for the Structure Planner, in the preparation phase of a repetition, is to find an appropriate startshot. Hence, an analysis of the motivation Sequence-Structure of the banana skin joke is performed, to retrieve information about the main actor, his or her actions, and the relevant location. The Structure Planner uses the results of this information retrieval process to generate a find-startshot query, which, for our example, may be: "Find a shot with Frank walking, or performing a similar or related action, in a setting similar to the one of shot 12, without returning shot 12". This query is sent to the Content Planner, which instructs the Visual Designer to retrieve a shot based on the provided content requirements. Figure 9.15 shows a frame from a shot that meets the content requirements.

Figure 9.15 Startshot for the lamp post joke

Based on the retrieved startshot for the new sequence, the Structure Planner performs the content analysis. The result of this is a structure such as that shown in Figure 9.16.

Shotid: 25
Shotkind: medium
Actors: Single (1, [Frank]), Group (0, [])
Objects: Single (2, [newspaper]), Group (0, [])
Actions: sequence: no; parallel: [read, move]; single action: [read, move]
Mood: [Frank [pleasure+1.0, hurry+0.5]]

Figure 9.16 Result of startshot analysis for the lamp post joke
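Recast as a small data type, purely as an illustrative reading of Figure 9.16 rather than AUTEUR's internal format, the startshot analysis might look like this:

    from dataclasses import dataclass

    @dataclass
    class ShotAnalysis:
        shotid: int
        shotkind: str   # e.g. "medium", "close-up"
        actors: list    # (grouping, count, [names]) entries
        objects: list
        actions: dict   # sequence / parallel / single-action readings
        mood: dict      # per-character emotional tokens with certainty values

    shot25 = ShotAnalysis(
        shotid=25, shotkind="medium",
        actors=[("Single", 1, ["Frank"]), ("Group", 0, [])],
        objects=[("Single", 2, ["newspaper"]), ("Group", 0, [])],
        actions={"sequence": None, "parallel": ["read", "move"],
                 "single": ["read", "move"]},
        mood={"Frank": {"pleasure": 1.0, "hurry": 0.5}},
    )
    print(shot25.actions["parallel"])  # ['read', 'move']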
The task of the Structure Planner is now to establish the first Sequence-Structure for the current generation phase, and a crucial choice is that of the next appropriate humour strategy to be applied. Since there is already a case representing a joke (the banana skin joke) stored in memory, the Structure Planner first of all attempts to analyse the applicability of the previously used strategy. However, a comparison of the results from the current startshot analysis (see Figure 9.16) with the motivation Sequence-Structure of the banana skin joke reveals that the two startshots differ in the type of actions shown, i.e. the current startshot offers parallel actions. Thus, the Structure Planner infers that the humour type "mishap" is, in the given circumstances, inapplicable.

The Structure Planner next attempts to establish the appropriate humour strategy, based on supportive strategies such as H-Strategies 17 - 19, which suggest the creation of relationships between parallel actions, either by constructing a wider context which can accommodate the actions, or by establishing a contextual relationship based on conflict. Following H-Strategy 18, which is devoted to the construction of a conflict-oriented relation, the Structure Planner launches a request to the Content Planner to investigate whether the actions (i.e. walk and read) performed by Frank in shot 25 (Figure 9.15) are mutually conflicting. A conflict-oriented relation between random actions is based on the assumption that the related actions share resources (e.g. subactions), and that these shared resources are of importance for all of the actions involved, which is indicated by qualitative modal scales such as necessary, useful, etc. (as discussed in sections 3.3 and 7.2.1.2). Hence, the Content Planner investigates the subaction links leaving the conceptual structures of "reading" and "moving" (or the subaction links of their synonyms), to detect whether both actions share the same subaction(s), and whether the potential links are tagged with a qualitative modal scale greater than or equal to "useful". An outcome of this investigation might be that both "read" and "walk" share the same subaction, i.e. "looking", where the link from "read" is tagged with "necessary", and the link from "walk" is tagged with "useful". The Content Planner orders the actions according to the importance of the subaction (i.e. [read, walk]) and returns this list to the Structure Planner. Since the list is non-empty, the Structure Planner assumes that a conflict can be established, and can now apply H-Strategy 19, which suggests that a conflict relation between actions indicates derisive humour. Thus, the humour type for the joke to be created is misfortune.
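The conflict test just described can be sketched as follows; the modal scale, the subaction tables and all names are invented for the example.

    MODAL_SCALE = {"useful": 1, "necessary": 2}  # qualitative modal scale

    SUBACTIONS = {  # action -> {subaction: modal tag on the link}
        "read": {"looking": "necessary", "holding": "useful"},
        "walk": {"looking": "useful", "stepping": "necessary"},
    }

    def conflict_list(a, b, threshold="useful"):
        """Return [a, b] ordered by the importance of a shared subaction,
        or [] if no sufficiently important subaction is shared."""
        t = MODAL_SCALE[threshold]
        shared = [s for s in SUBACTIONS[a]
                  if s in SUBACTIONS[b]
                  and MODAL_SCALE[SUBACTIONS[a][s]] >= t
                  and MODAL_SCALE[SUBACTIONS[b][s]] >= t]
        if not shared:
            return []
        s = shared[0]
        return sorted([a, b], key=lambda x: -MODAL_SCALE[SUBACTIONS[x][s]])

    print(conflict_list("read", "walk"))  # ['read', 'walk'] -> conflict established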
The next task for the Structure Planner is to determine the construction phase. As misfortune requires a resulting mood deterioration, the Structure Planner evaluates the mood of the relevant character. As "pleasure" is already marked with a certainty value of 1.0 with respect to the content of the shot (25 - Figure 9.15), the Structure Planner rejects the necessity to motivate the mood. Since the shot in Figure 9.15 contains parallel performed actions, it may be necessary to highlight one of these actions. This is investigated by the Structure Planner. The analysis of the current startshot (see Figure 9.16) indicates that the shot type is "medium". The Structure Planner investigates the conceptual relationship between camera distance and the hierarchical representation of subjects (see Table 6.3), and so determines that a "medium" shot favours the appearance of particular body parts. Hence, the Structure Planner compares the list of actions provided by the startshot analysis with the conflict list provided by the Content Planner. The result of this comparison is that "read" features in both lists, whereas the other two actions ("move" and "walk") each appear in only one list. Hence, the Structure Planner concludes that "read" is already highlighted. As conflict is used to establish the joke, no further event motivation is required. This means that neither the action, the mood nor the event needs to be motivated. Thus, the Structure Planner decides that the motivation phase can be omitted, and the generation phase for the Sequence-Structure will be "realisation".

However, constructing a realisation Sequence-Structure requires the identification of the appropriate H-Strategy. H-Strategy 20 suggests that the realisation phase for jokes based on conflict should emphasise the stronger of the parallel actions (i.e. read), but base the joke on the weaker one (i.e. walk). Thus, the joke to be created is single-action oriented. To transform the action status, i.e. parallel into single, the Structure Planner once again attempts to analyse the applicability of previously used humour strategies. This time, it turns out that H-Strategy 2 is indeed applicable. According to H-Strategy 12, a repetition strategy, H-Strategy 2 can be repeated, as so far only one joke has been generated.

On completion of the preparation phase, the Structure Planner first generates a Location-Memory-Structure, so that the startshot can be stored, then sets the evaluation value for the motivation phase to "good", and finally generates the first Sequence-Structure for the current joke, which may result in a structure such as that shown in Figure 9.17.

Sequence-Structure
Kind: realisation
Intention: action
Form: misfortune + H-Strategy 2
Appearance: accelerate
Setting: [Single (1, [newspaper]), Group (0, [])]
Subjects: [Frank [pleasure+1.0]]
Action: walk

Figure 9.17 Realisation Sequence-Structure for the lamp post joke

The Structure Planner then instructs the Content Planner to proceed with the generation of the realisation phase.

9.2.2 Realisation phase

In discussing the realisation phase of the banana skin joke, we explained the major mechanisms for generating the core parts of a joke. However, there is a slight difference in the behaviour of the Content Planner in the case of example 2, as the system is in repetition mode. In general, the Content Planner performs as described in section 9.1.3. The Content Planner uses the realisation requirements provided by H-Strategy 2 on the action "walk", which means traversing the opposition links of the conceptual structure for "walk", to retrieve the action or event which forms the basis of the joke. However, since the system is now in repetition mode, the Content Planner also considers already generated jokes, so that they will not be repeated.
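A minimal sketch of this duplicate check, under an assumed case-memory layout (the precise rule is given next):

    def novel_candidates(candidates, case_memory):
        """Filter out oppositional concepts already used to resolve a stored joke."""
        used = {case.get("resolution_action") for case in case_memory}
        return [c for c in candidates if c not in used]

    cases = [{"joke": "banana skin", "resolution_action": "slip"}]
    print(novel_candidates(["slip", "collide_with"], cases))  # ['collide_with']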
In concrete terms, before the Content Planner attempts to instantiate an oppositional concept, it compares the potential action of the oppositional concept with the action of the resolution phase of similarly structured jokes; only if both actions differ is the potential concept investigated further.² For our example, this means that the conceptual structure for "slip" will be ignored, because a similar joke already exists. Let us assume that, in this case, the conceptual structure for "collide with" represents the retrieved oppositional action for "walk". Assume also that no visual material exists to show the character, or a part of the character, colliding with a suitable object. Finally, assume that the system can retrieve an image of an object which accords with the location and the intention of the requested action (e.g. a lamp post). The result of the generation process for the realisation phase may then be "character Frank collides with a lamp post while walking", represented by a medium shot which zooms into a lamp post, as suggested by the still in Figure 9.18.

² In cases where no resolution is available, because the realisation already includes the required information, the Content Planner retrieves the information from the content representation of the relevant shot.

Figure 9.18 Realisation part for the lamp post joke

Though the system is in a position to provide a credible visual presentation, the actual action (i.e. colliding) could not be represented visually. Thus, the realisation requires that the viewer infers the collision. This fails to fulfil the requirements of timing and readiness (discussed in sections 3.2.1 and 3.2.2) for the punch line. This is the basis of the Content Planner's evaluation of the realisation part of the joke, as shown by the following inference: no presentation of the action, but a visual presentation of a related object => evaluate the event phase as poor.

The technical details of the construction of the remainder of the realisation phase for the lamp post joke are identical to those described for the realisation of the banana skin joke, such as the comparison between Sequence-Structure and joke status, and the generation of the resolution Sequence-Structure.

9.2.3 Resolution phase

For the resolution phase, assume that the same generation mechanisms can be applied as described for the banana skin joke. Note that the system retrieves a resolution shot similar to the one for the banana skin joke (see Figure 9.13), except that, in this case, the shot realises the different body-object relation by showing the character looking upwards while turning round. The final version of the lamp post joke is suggested by the stills in Figure 9.19.

Figure 9.19 The lamp post joke generated by AUTEUR

The Content Planner evaluates the resolution part of the joke as "good", since a reaction is shown, rather than a gesture. However, because the most important event phase, i.e. the realisation phase, failed to generate an acceptable visualisation, the overall verdict on the above joke is "poor". Nevertheless, the stylistic impression is evaluated as better than average, mainly due to the compact presentation of the motivational aspects in one shot.

9.3 The bus joke

For our final example, assume that the Content Planner has indicated the successful completion of the lamp post joke.
Furthermore, suppose that the system allows only one repetition of a joke based on the same action, which means that from now on jokes on "walking" are not supported, unless they are related to a different humour strategy. Finally, recall that the system continues to be in repetition mode.

9.3.1 Preparation phase

The first task for the Structure Planner is to find an appropriate startshot. The analysis of the motivation Sequence-Structures of the previously generated jokes provides usable information for the potential character (i.e. Frank) and the location (i.e. outdoors), but fails for the retrieved action, which is, in each case, "walk" and, as stated above, cannot be used. In response to the need to explore alternative links, the Structure Planner uses the conceptual link "domain" of the conceptual structure "walk" ("domain" is discussed in section 7.2.1), which leads to the conceptual structure of "motion". This concept embodies a set of links to actions or events which share an abstract criterion (e.g. motion = the change of location of a body with regard to another body or a reference system). A simplified set of actions for motion might be [fly, drive, swim, using_transport]. The first three elements of this list represent actions. AUTEUR initially attempts to construct a joke based on one of these actions, using the processes described above (i.e. try to establish a startshot featuring the existing main character and a related action, analyse the strategies used for their re-usability, and so forth). Let us assume that each of these attempts is unsuccessful. When the Structure Planner is confronted with the event using_transport, a request is sent to the Content Planner to retrieve an event structure for using_transport which agrees with the information gathered about the main character and location. The Content Planner may retrieve an event structure such as that described in Table 9.3.

Name: using_transport
Actor number: 2
Gender: male/female
Intention: move

            Motivation   Realisation   Resolution   Episode
character   [stand]      [catch]       [sit in]     []
bus         [come]       [stop]        [leave]

Table 9.3 Structure of the event "using_transport"

Since the event structure provides a motivation, a realisation and a resolution phase, it provides the Structure Planner with a template for the generation process. The Structure Planner is once again in the preparation phase, and in repetition mode, so the first task is to find an appropriate startshot. The Structure Planner uses the sequentially organised actions of the motivation phase (see Table 9.3) to generate a number of retrieval queries, such as:

• Find a shot of the character Frank standing, or performing a similar or related action, in a setting related to a bus, with a bus approaching; or
• Find a shot of an approaching bus, or of a bus performing a similar or related action, and a shot of the character Frank standing, or performing a similar or related action, in a setting related to a bus.

These queries are sent, in descending order of importance, to the Content Planner, which in each case first retrieves and adds, if possible, descriptions of appropriate objects for the environment (e.g. [bus stop]), by analysing the conceptual structure for "bus", before instructing the Visual Designer to retrieve a shot based on the given content requirements. Figure 9.20 shows two frames of a sequence the Content Planner may return.

Figure 9.20 Startshot sequence for the bus example
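As a sketch, the translation of the event template's motivation phase into ordered retrieval queries might look as follows; the template shape and the query encoding are illustrative only.

    EVENT = {"name": "using_transport",
             "motivation": {"character": "stand", "bus": "come"}}

    def motivation_queries(event, actor):
        """Build retrieval queries from the motivation phase, most specific first."""
        m = event["motivation"]
        return [
            f"shot: {actor} {m['character']}s (or similar/related action) "
            f"in a bus-related setting, with a bus approaching",
            f"shot: bus {m['bus']}s (or similar/related action); plus "
            f"shot: {actor} {m['character']}s in a bus-related setting",
        ]

    for query in motivation_queries(EVENT, "Frank"):
        print(query)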
The Structure Planner performs a content analysis for each of the startshots retrieved for the new sequence. The task of the Structure Planner is now to establish the first Sequence-Structure for the generation of the current joke. The choice of an appropriate humour strategy is based on the comparison of the current analysis of the startshot sequence with the motivation Sequence-Structures of existing jokes. The result is that, due to the similarity in action and character number between the two current startshots and previously used startshots, H-Strategy 2 can be reused. The repetition is directed according to H-Strategy 12. Since the same character features as the butt in each of the previously generated jokes, the Structure Planner decides that, for the current joke, it is also this character's action on which the joke is to be based. However, since the Structure Planner is actually using the conceptual structure for the event "using_transport" as the generation template, the action is embedded in an event and thus the Intention of the joke is "action+event". Since two shots have been provided, both causally linked through the logic of the event-phase motivation, and each shows one subject performing one action, the Structure Planner decides that the Form of the joke must be "composed". The required actions are provided, the shot type for both shots is "medium", and the event is established, so the Structure Planner infers that no motivation phase is necessary. On completion of the preparation phase, the Structure Planner generates the Location-Memory-Structure, instantiates the evaluation value for the motivation phase as "good", and finally generates the Sequence-Structure for the current joke, as shown in Figure 9.21.

Sequence-Structure
Kind: realisation
Intention: action+event
Form: misfortune + H-Strategy 2
Appearance: accelerate
Setting: [Single (3, [bus-stop, street, tree]), Group (0, [])]
Subjects: [Frank [pleasure+1.0]]
Action: catch

Figure 9.21 Realisation Sequence-Structure for the bus joke

The Structure Planner then instructs the Content Planner to proceed with the generation of the realisation phase.

9.3.2 Realisation phase

The task of the Content Planner is essentially to establish a misfortune related to the action of "catch". The Content Planner therefore attempts to find an opposition action for "catch", which may be "miss". Analysing the conceptual structure of "miss", the Content Planner detects that one meaning of "missing" requires an object which moves away from a character. The object may be of the type [taxi, bus, plane], and the related motion may be of the type "drive" or "leave". The Content Planner uses the above information for the action "miss" to request from the Visual Designer a shot showing a bus moving away, in an environment similar to the one provided in the startshot sequence. Assume that the Visual Designer retrieves an appropriate shot.

The next step for the Content Planner is to establish the action of the character. The result of traversing the opposition links of "miss" might be the action "look after". The Content Planner sends a query to the Visual Designer, asking for a shot showing the character in the same location as in the startshot sequence, performing a similar action, and where the direction of the character's sight matches the direction of the bus.
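The two retrieval steps just described, the object moving away and then the actor looking after it, might be sketched as follows, with all names and shapes assumed for the example:

    MISS = {"object_types": ["taxi", "bus", "plane"],
            "object_motion": ["drive", "leave"]}

    def realisation_queries(actor, obj, environment):
        """Queries for the 'miss' realisation: the object moving away,
        then the actor looking after it with a matching sight direction."""
        if obj not in MISS["object_types"]:
            return []
        return [
            {"object": obj, "motion": MISS["object_motion"],
             "environment_like": environment},
            {"actor": actor, "action": "look after",
             "sight_direction": "matches object motion",
             "environment_like": environment},
        ]

    for query in realisation_queries("Frank", "bus", ["bus_stop", "street", "tree"]):
        print(query)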
Assume that such a shot can be retrieved. On completion of the realisation phase, a possible outcome may be the shot sequence suggested by the stills in Figure 9.22.

Figure 9.22 Realisation sequence for the bus joke

9.3.3 Resolution phase

For the resolution, assume that the same generation mechanisms as described for the previous jokes are applied. However, in this example, the system retrieves a shot showing a gesture (i.e. the facial expression of anger), rather than the performance of an action, for the resolution. The final version of the bus joke is suggested by the stills in Figure 9.23 (read from left to right, top to bottom).

Figure 9.23 The bus joke as generated by AUTEUR

As the content of each of the three phases (motivation, realisation and resolution) is, in itself, exactly as required, the joke is valued as "good" by the Content Planner. Note that the event of missing a bus is not, in itself, necessarily funny. However, AUTEUR generates the joke because the system is in repetition mode; the joke is that, once again, the same character experiences a mishap.

Assume that no further repetitions can be generated. The final task of the Structure Planner is to instruct the Retrieval System to present the generated streams of shot ids to the user, in the order of their generation, i.e. the banana skin joke, followed by the lamp post joke, and finishing with the bus joke.

9.4 Conclusion

The examples presented above should provide the reader with a suitable impression of the operation of AUTEUR. Though AUTEUR is but a prototypical system, we have demonstrated here that it is capable of generating visual jokes based on single and parallel actions for one character, or on single-action interaction between two active characters. We have also demonstrated the interaction between the multiple planners involved in the editing process. Finally, we have demonstrated the automatic evaluation of a joke according to a humour scale.

The AUTEUR system, and its theoretical foundations, should be regarded as a platform that demonstrates the potential for automated thematic film editing in restricted, yet complex, domains. Despite its complexity, AUTEUR is capable of generating only a restricted range of humorous film sequences. However, this complexity is a necessary basis for improving the system so that it can provide more sophisticated humour and, eventually, deal with themes additional to that of humour. We refer to these points in the next chapter.

Chapter 10
Achievements and conclusions

10.1 Achievements

This thesis has presented a planner-based approach to the application of video semantics and theme representation in the automated editing of visual stories at the level of events. In order to understand the dynamics of editing, the current author carried out a "knowledge elicitation" exercise that involved studying and interviewing editors at work in their own environment, i.e. the cutting rooms of the WDR¹. In common with those involved in the enterprise of knowledge elicitation in other domains, I found that the expertise of the editors could not be codified in the form of simple rules. The complexity involved in the video editing process is obscured by the fact that those involved appear to manage it effortlessly, though many different influences, and knowledge at varying levels of detail, are involved.

¹ WDR (Westdeutscher Rundfunk - Köln) is the largest television broadcasting centre in Germany.
Some of the complexity of the task is reflected in the different influences to be found in this thesis, and in the architecture of AUTEUR. A central contribution of my research is the novel application of theories of cinema, humour and narrativity. In particular, these theories have informed my development of:

• A textual representation which describes semantic, temporal and relational features of video in hierarchically organised structures, and which overcomes the limitations of keyword-based approaches. The essential categories for the proposed ontology are action, character, object, relative position, screen position, geographical space, functional space and time.

• A set of 26 humour strategies, which combine the logic of narrative structures with functional operators of comic primitives. The identified primitives within humour are grouped into two classes: supportive primitives (exaggeration, timing and readiness) and constructive primitives (incongruity and derision). Furthermore, I described a simple, but novel, mechanism to evaluate jokes.

• A simplified model of the editing process, which covers the juxtaposition of takes, shots and scenes, for the rough-cut stage.

• A set of 37 editing strategies, which introduce novel schemes for the appropriate visual and cinematographic presentation of a narrative event. The strategies support continuity editing, and focus on the essential narrative aspects, context and form. I described a mechanism for relating the intention of a narrative to its presentation in shot form. Furthermore, I introduced an analytic mechanism with which a system can visually guide the viewer of a video sequence, so that he or she can identify information as relevant or purely descriptive. Moreover, I introduced a scheme to establish and maintain spatial and temporal continuity over several shots, based partly on the visual decomposition of a character and/or actions, and partly on the decomposition of spatial relationships between characters, and between characters and their screen positions. Finally, I demonstrated strategies which shape the temporal rhythm of a sequence by means of physical clipping.

• An ontological representation of narrative elements such as actions, events, and emotional and visual codes, based on a semantic net of conceptual structures related via six types of semantic links (e.g. synonym, subaction, opposition, ambiguity, association, conceptual). A coherent action-reaction dynamic is provided by the introduction of three event phases, i.e. motivation, realisation and resolution.

In order to demonstrate the applicability of the above representations, I implemented an experimental system, AUTEUR, which achieves a limited degree of success in producing humorous film sequences. The architecture of AUTEUR embodies the synchronisation of structural and stylistic requirements for automatically generating visual stories, by providing a planning system for each overall development level (i.e. structure and content).

10.2 Conclusions

The AUTEUR system and its theoretical foundation are best regarded as a platform that demonstrates the feasibility of automated thematic film editing in restricted, yet complex, domains.
We showed how automated film editing can support automated help and, despite the implementational complexity of the system, that the extracted principles can also provide a source of help to designers and workers in other related fields. However, a number of problems are yet to be solved, and though most of them were touched upon in this thesis, it is useful to revisit the major issues and reflect on the strengths and weaknesses of this work.

The structured textual approach to the representation of video content provides an objective representation and thus does not restrict the possible connotative meanings of the material. At this moment in time, the structures provided appear to be sufficiently rich to describe complex film-specific features and the denotative aspects of film, though some of the structures are rudimentary, such as the representation of gestures and the representation of groups of subjects. In theory, the current representation enables the annotation of video material of arbitrary length and content. It should be noted that the notion of arbitrary re-use of video material is illusory, since, in general, the computational effort involved in comparing large numbers of highly varied content-based descriptions to establish continuity would result in an unacceptable degradation of system performance. However, if we focus on domain-dependent applications, a suitable selection of material would reduce this complexity. For example, we may ensure that our database contains only shots in which:

• a small set of characters are found in similar locations,
• the locations are simple,
• the actions are available from different angles, points of view, etc.

The strongest qualitative drawback of the representation is the absence of sound. At the current stage of my research, I cannot predict the extent to which the introduction of sound would necessitate the re-design of representational structures for video content, but it is obvious that at least the structures for "actor action", "object action" and "setting" would require modification. Nevertheless, sound would enhance AUTEUR's abilities as a generator of meaningful sequences.

AUTEUR is a research platform, and as such achieves limited success in automatically generating video material to suggest a given theme. The current version of AUTEUR produces only a restricted range of humorous scenes, predominantly of the so-called slapstick style, e.g. "slipping on a banana skin". However, AUTEUR achieves this in ways that take account of knowledge of filmic acceptability.

The discussion of the cinematic image showed that a very large number of codes are involved in the meaning of images. The current version of AUTEUR makes use of but a few of these, such as:

• emotional codes, which are represented as conceptual structures (e.g. for rage or pleasure), and their visualisation in the form of gestures of body, face, hands and limbs (e.g. smiling, nodding, pointing), or actions (e.g. whistling as an indication of pleasure);

• cinematic codes, as exemplified by the relationship between camera distance and hierarchical representations of subjects, in combination with conceptual relationships between filmic devices and narrative functionality. Additionally, there are the spatial relationships between shot distances;

• cultural codes, in that our representations of jokes, editing and story structures reflect the humour, film and narrative schemes specific to European culture.
Each of the above code systems could be dealt with more extensively. For example, the emotional codes could be improved by extending our action-gesture-centred approach, used in the representation of video content, with a more complete representational scheme for hand and body gestures. A useful augmentation to the cinematic codes would be colour codes (e.g. bright colours support the impression of a good mood, and there are relationships between colours and certain abstract concepts). Further refinements of the cinematic codes could be achieved by including additional cinematographic aspects of video content (e.g. camera angle, or shot contrast) or denotative aspects of video content (e.g. season, or structures of objects - form, colour, size, etc.), and linking these to conceptual structures of abstract concepts. It must be stressed that these amendments would not necessitate great change to the existing representation structures, but would greatly improve AUTEUR's inferencing capabilities.

Spatial and temporal continuity editing is relatively well provided for by our editing scheme. There is a need, however, to improve the temporal aspects of physical editing and, in particular, to adequately address the influence of the speed of action on the rhythmical structure of a sequence, which is important in slowing and accelerating the pace of a sequence.

At this moment in time, I make no claim as to the general applicability of the humour techniques embodied in AUTEUR, particularly since so far they cater only for the interaction between, at most, two characters. In order to enhance AUTEUR's ability to automatically generate a meaningful humorous sequence, further work should be devoted to the definition of richer strategies for particular styles of visual humour (e.g. incongruous humour), or of more sophisticated humour strategies, such as transforming the behaviour of a character from human into automaton, or vice versa. With a larger set of humour strategies, it would then be necessary to perform more experiments involving many more conceptual descriptions in a more extensive semantic net, along with a much larger video database. This would enable us to assess the performance of our system in a more realistic situation.

The real challenge, however, is the consideration of narrative structures larger than single events. The generation of higher-order narrative structures, e.g. episodes, requires more consideration of such stylistic features as genre, and of different themes, such as "tragedy", and I assume that this will also lead to a need for an increase in the number of supportive and constructive thematic strategies. Thus, to use AUTEUR as the basis of a truly flexible system for the generation of meaningful thematic sequences at any narrative level, major amendments to the Structure Planner and Content Planner would be needed. Future work should also focus on the incorporation of tools to support video annotation, the provision of conceptual representations, and the editing of thematic and film editing strategies.

10.3 Postscript

This thesis has shown that an artificial system can, in some respects, autonomously create emotionally stimulating visual events, and it is my hope that the work undertaken will provide input into research into the automated generation of video scenes, and into the interpretation and analysis of video.
The future may see the emergence of systems that can actively influence our creative work with, and improve our understanding of, still and moving images. If this thesis helps to solve some of the problems associated with the development of such systems, then it will have achieved much more than the author dared to hope.
First Cut: Conversations with Film Editors. Berkeley: University of California Press. Olson, E. (1968). The Theory of Comedy. Bloomington: Indiana University Press. Oomoto, E., & Tanaka, K. (1993). OVID: Design and Implementation of a VideoObject Database System. IEEE Transactions On Knowledge And Data Engineering, 5(4), 629-643. Ortony, A., Clore, G. L., & Collins, A. (1988). The Cognitive Structure of Emotions. New York: Cambridge University Press. Parkes, A. P. (1987). Towards a Script-Based Representation Language for Educational Films. Programmed Learning and Educational Technology, 24(3), 234 246. Parkes, A. P. (1989a) An Artificial Intelligence Approach to the Conceptual Description of Videodisc Images. Ph.D. Thesis, Lancaster University. Parkes, A. P. (1989b). The Prototype CLORIS system: Describing, Retrieving and Discussing Videodisc Stills and Sequences. Information Processing and Management, 25(2), 171 - 186. Parkes, A. P. (1989c). Settings and the Settings Structure: The Description and Automated Propagation of Networks for Perusing Videodisk Image States. In N. J. Belkin & C. J. van Rijsbergen (Ed.), SIGIR ’89, (pp. 229 - 238). Cambridge, MA: Parkes, A. P. (1992). Computer-controlled video for intelligent interactive use: a description methodology. In A. D. N. Edwards &. S.Holland (Eds.), Mulimedia Interface Design in Education (pp. 97 - 116). New York: Springer-Verlag. Bibliography 227 Parkes, A. P., & Self, J. (1988). Video-Based Intelligent Tutoring Of Procedural Skills. In ITS-88, (pp. 454 - 461). Montréal: June 1 -3, 1988. Peirce, C. S. (1960). The Collected Papers of Charles Sanders Peirce - 1 Principles of Philosophy and 2 Elements of Logic, Edited by Charles Hartshorne and Paul Weiss. Cambridge, Massachusetts: The Belknap Press of Harvard University Press. Pentland, A., Picard, R., Davenport, G., & Welsh, B. (1993). The BT/MIT Project on Advanced Tools for Telecommunications: An Overview (Perceptual Computing Technical Report No. 212). MIT. Pentland, A. P., Picard, R., Davenport, G., & Haase, K. (1994). Video and Image Semantics: Advanced Tools for Telecommunications (Technical Report No. 283). MIT. Petric, V. (1987). Constructivism in Film. Cambridge: Cambridge University Press. Piaget, J. (1970). Structuralism - edited by Chaninah Maschler. New York: Basic Books Inc. Picard, R. W., & Liu, F. (1994). A new Wold ordering for image similarity (Technical Report No. 237). MIT. Picard, R. W., & Minka, T. P. (1995). Vision texture for annotation. Multimedia Systems, 3(1), 3 - 14. Pinhanez, C. S., & Bobick, A. F. (1995). Intelligent Studios: Using Computer Vision to Control TV Cameras. In J. Bates., B. Hayes-Roth & H. Kitano (Ed.), IJCAI-95 Workshop on AI and Entertainment and AI/Alife, (pp. 69 - 76). Montréal, Canada: August 19. Price, G. (1973). A grammar of story: An introduction. The Hague: Mouton. Propp, V. W. (1968). Morphology of the Folktale. University of Texas Press. Pudovkin, V. I. (1968). Film Technique And Film Acting. London: Vision Press Limited. Quillian, M. R. (1966). Semantic Memory. In M. Minsky (Eds.), Semantic Information Processing (pp. 227 - 270). Cambridge, Mass.: MIT Press. Quillian, M. R. (1985). Word Concepts: A Theory and Simulation of Some Basic Semantic Capabilities. In R. J. Brachman & H. J. Levesque (Eds.), Readings in Knowledge Representation (pp. 97 - 118). Los Altos: Morgan Kaufmann. Rabinger, M. (1989). Directing - Film Techniques and Aesthetics. Boston: Focal Press. Raphael, B. (1971). The Frame Problem in Problem Solving Systems. In N. 
Findler & B. Meltzer (Eds.), Artificial Intelligence and Heuristic Programming (pp. 159 169). New York: American Elsevier. Raskin, V. (1985). Semantic Mechanisms of Humor. Dordrecht: D. Reidel Publishing Company. Bibliography 228 Reisz, K., & Millar, G. (1969). The Technique of Film Editing. New York: Focal/Hastings House. Ricoeur, P. (1985). Time and Narrative. Chicago: The University of Chicago Press. Riesbeck, C. K., & Schank, R. C. (1989). Inside case-based reasoning. Hillsdale, New Jersey: Lawrence Erlbaum Associates. Rosenblum, R., & Karen, R. (1979). When The Shooting Stops, The Cutting Begins. New York: Da Capo Press, Inc. Rothbart, M. K. (1976). Incongruity, Problem-Solving and Laughter. In T. Chapman & H. Foot (Eds.), Humor and Laughter: Theory, Research and Applications (pp. 37 54). New York: John Wiley & Sons. Rothbart, M. K., & Pien, D. (1977). Elephants and Marshmallows: A Theoretical Synthesis of Incongruity-Resolution and Arousal Theories of Humour. In A. J. Chapman & H. C. Foot (Eds.), It’s a funny Thing, Humour (pp. 37 - 40). Oxford: Pergamon. Rumelhart, D. E. (1975). Notes on a schema for stories. In D. G. Bobrow & A. Collins (Eds.), Representation and Understanding (pp. 211 - 236). New York: Academic Press. Rumelhart, D. E. (1977). Understanding and summarizing brief stories. In D. Laberge & S. J. Samuels (Eds.), Basic processes in reading: Perception and comprehension (pp. 265 - 303). Hillsdale, N.J.: Lawrence Erlbaum Associates. Russel, K., Starner, T., & Pentland, A. (1995). Unencumbered Virtual Environments. In J. Bates., B. Hayes-Roth & H. Kitano (Ed.), IJCAI-95 Workshop on AI and Entertainment and AI/Alife, (pp. 58 - 62). Montréal, Canada: August 19, 1995. Ryan, M.-L. (1991). Possible Worlds, Artificial Intelligence and Narrative Theory. Bloomington: Indiana University Press. Sacerdoti, E. D. (1977). A Structure for Plans and Behaviour. New York: Elsevier. Sacerdoti, E. D. (1990a). The Nonlinear Nature of Plans. In J. Allen, J. Hendler, & A. Tate (Eds.), Readings in Planning (pp. 162 - 170). San Mateo: Morgan Kaufmann Publishers. Sacerdoti, E. D. (1990b). Planning in a Hierarchy of Abstraction Space. In J. Allen, J. Hendler, & A. Tate (Eds.), Readings in Planning (pp. 98 - 108). San Mateo: Morgan Kaufmann Publishers. Sack, W. (1993). Coding News And Popular Culture. In The International Joint Conference on Artificial Intelligence (IJCA93) Workshop on Models of Teaching and Models of Learning. Chambery, Savoie, France. Sack, W., & Davis, M. (1994). IDIC: Assembling Video Sequences from Story Plans and Content Annotations. In IEEE International Conference on Multimedia Computing and Systems. Boston, Ma: May 14 - 19, 1994. Sack, W., & Don, A. (1993). Splicer: An Intelligent Video Editor (Unpublished Working Paper). Bibliography 229 Salomon, G., & Cohen, A. A. (1977). Television formats, mastery of mental skills, and the acquisition of knowledge. Journal of Educational Psychology, 69, 612 - 619. Sandewall, E. (1972). An Approach to the Frame Problem and Its Implementation. In B. Meltzer & D. Mitchie (Eds.), Machine Intelligence 7. Edinburgh: Edinburgh University Press. Saussure, F. d. (1966). Course in General Linguistics - edited by Charles Balley, Albert Sechehaye and Albert Riedlinger. New York: McGraw-Hill. Schank, R. C. (1972). Conceptual Dependency: A theory of natural language understanding. Cognitive Psychology, 3, 552 - 631. Schank, R. C. (1982). Dynamic memory. New York: Cambridge University Press. Schank, R. C. (1991). 
Case-based teaching: Four experiences in educational Software Design. (Technical Report No. 7). Institute for Learning Sciences, Northwestern University. Schank, R. C. (1994). Active Learning through Multimedia. IEEE MultiMedia, 1(1), 69 - 78. Schank, R. C., & Abelson, R. (1977). Scripts, Plans, Goals And Understanding. Hillsdale, New Jersey: Lawrence Earlbaum Associates. Schank, R. C., Kass, A., & Riesbeck, C. (1994). Inside Case-Based Explanation. Hillsdale, N.J.: Lawrence Erlbaum Associates. Schank, R. C., & Riesbeck, C. K. (1981). Inside Computer Understanding. Hillsdale, New Jersey: Lawrence Erlbaum Associates. Schopenhauer, A. (1966). The World As Will And Representation. New York: Dover Publications, Inc. Schumm, G. (1993). Feinschnitt - die verborgene Arbeit an der Blickregie. In H. Beller (Eds.), Handbuch der Filmmontage - Praxis und Prinzipien des Filmschnitts (pp. 224 - 225). München: TR-Verlagsunion. Segre, C. (1979). Structure and Time - Narration, Poetry, Models. Chicago: The University of Chicago Press. Shoham, Y. (1987). Temporal Logics in AI: Semantical and Ontological Considerations. Artificial Intelligence, 33(1), 89 - 104. Shultz, T. R. (1976). A Cognitive-Developmental Analysis of Humor. In T. Chapman & H. Foot (Eds.), Humor and Laughter: Theory, Research and Applications (pp. 11 36). New York: John Wiley & Sons. Siddons, H. (1968). Practical Illustrations of Rhetorical Gesture and Action. New York: Benjamin Blom, Inc. Smith, B. K., Agganis, A., & Reiser, B. J. (1995). Children and Artificial Life Revisited. In J. Bates., B. Hayes-Roth & H. Kitano (Ed.), IJCAI-95 Workshop on AI and Entertainment and AI/Alife, (pp. 6 - 13). Montréal, Canada: August 19, 1995. Bibliography 230 Smith, T. C., & Witten, I. H. (1991). A Planning Mechanism for Generating Story Text. Literary and Linguistic Computing, 6(2), 119 - 126. Sowa, J. F. (1984). Conceptual Structures: Information Processing in Mind and Machine. Reading, MA: Addison-Wesley Publishing Company. Spottiswode, R. J. (1955). A Grammar Of The Film - an analysis of film technique. London: Faber & Faber. Stein, N. L., & Glenn, C. G. (1979). An analysis of story comprehension in elementary school children. In R. O. Freedle (Eds.), New directions in discourse processing Norwood, N.J.: Ablex Pub. Corp.. Sternberg, M. (1978). Expositional Modes and Temporal Ordering in Fiction. Baltimore: The Johns Hopkins University Press. Storyline Pro (1993). Truby’s Writer Studio. Developed by John Truby. http://hollywoodnetwork.com/hn/shopping/kiosk/wcs40.htm Strassmann, S. (1994,). Semi-Autonomous Animated Actors. In National Conference on Artificial Intelligence (July 31 - August 4),(pp. 128 - 134). Seattle, Washington: AAAI Press. Striedter, J. (1971). Russischer Formalismus: Texete zur allgemeinen Literaturtheorie und zur Theorie der Prosa. München: Fink. Suleiman, S. R. (1983). Authoritarian Fictions: The Ideological Novel As a Literary Genre. New York: Columbia University Press. Suls, J. (1977). Cognitive and Disparagement Theories of Humour: A Theoretical and Empirical Synthesis. In A. J. Chapman & H. C. Foot (Eds.), It’s a funny Thing, Humour (pp. 41 - 45). Oxford: Pergamon. Suls, J. M. (1972). A two-stage model for the appreciation of jokes and cartoons: An information-processing analysis. In J. H. Goldstein & P. E. McGhee (Eds.), The Psychology of Humour (pp. 81 - 100). New York, London: Academic Press. Tate, A., Hendler, J., & Drummond, M. (1990). A Review of AI Planning Techniques. In J. Allen, J. Hendler, & A. 
Tate (Eds.), Readings in Planning (pp. 26 49). San Mateo: Morgan Kaufmann Publishers. Thorndyke, P. W. (1977). Cognitive structures in comprehension and memory of narrative discourse. Cognitive Psychology, 9, 77 - 100. Tonomura, Y., Akutsu, A., Taniguchi, Y., & Suzuki, G. (1994). Structured Video Computing. IEEE MultiMedia, 1(3), 34 - 43. Tosa, N., & et al. (1995). Network Neuro-Baby with robotic hand. J. Bates., B. Hayes-Roth & H. Kitano (Ed.), IJCAI-95 Workshop on AI and Entertainment and AI/Alife, (pp. 48 - 53). Montréal, Canada: August 19, 1995. Tudor, A. (1974). Image And Influence. London: George Allen & Unwin Ltd. Ueda, H., Miyatake, T., Sumino, S., & Nagasaka, A. (1993). Automatic Structure Visualization for Video Editing. In ACM & IFIP INTERCHI ’93, (pp. 137 - 141). Bibliography 231 Ueda, H., Miyatake, T., & Yoshizawa, S. (1991). IMPACT: An Interactive NaturalMotion-Picture Dedicated Multimedia Authoring System. In Proc ACM CHI ’91 Conference on Human Factors In Computing Systems, (pp. 343-450). van Dijk, T. (1972). Some aspects of text grammars: a study in theoretical linguistics and poetics. The Hague: Mouton. Wilensky, R. (1978) Understanding goal-based stories. Ph.D., Yale University. Wilensky, R. (1983a). Planing and Understanding - A Computational Approach to Human Reasoning. Reading, Massachusetts: Addison-Wesley Publishing Company. Wilensky, R. (1983b). Points: A Theory of the Structure of Stories in Memory. In W. G. Lehnert & M. H. Ringle (Eds.), Strategies for Natural Language Processing (pp. 345 - 376). Hillsdale, New Jersey: Lawrence Erlbaum Associates. Wilensky, R. (1983c). Story grammars versus story points. The Behavioral and Brain Sciences, 6(4), 579 - 623. Wilensky, R. (1990). A Model for Planning in Complex Situations. In J. Allen, J. Hendler, & A. Tate (Eds.), Readings in Planning (pp. 263 - 274). San Mateo: Morgan Kaufmann Publishers. Wilkins, D. E. (1990). Domain-independent Planning: Representation and Plan Generation. In J. Allen, J. Hendler, & A. Tate (Eds.), Readings in Planning (pp. 319 335). San Mateo: Morgan Kaufmann Publishers. Winograd, T. (1985). Frame Representations and the Declarative/Procedural Controversy. In R. J. Brachman & H. J. Levesque (Eds.), Readings in Knowledge Representation (pp. 357 - 370). San Mateo, California: Morgan Kaufmann Publishers. Wolff, C. (1972). A Psychology of Gesture. New York: Arno Press. Woods, W. A. (1985). What’s in a link: Foundations for Semantic Networks. In R. J. Brachman & H. J. Levesque (Eds.), Readings in Knowledge Representation (pp. 218 - 241). San Mateo, California: Morgan Kaufmann Publishers. Wulff, H. J. (1993). Der Plan macht’s. In H. Beller (Eds.), Handbuch der Filmmontage - Praxis und Prinzipien des Filmschnitts (pp. 178 - 189). München: TRVerlagsunion. Yeung, M. M., Yeo, B., Wolf, W. & Liu, B. (1995). Video Browsing using Clustering and Scene Transitions on Compressed Sequences. In Proceedings IS&T/SPIE ’95 Multimedia Computing and Networking, San Jose. SPIE (2417), 399 413. Zhang, H., Gong, Y., & Smoliar, S. W. (1994). Automated parsing of news video. In IEEE International Conference on Multimedia Computing and Systems, (pp. 45 54). Boston: IEEE Computer Society Press. Zhang, H., Kankanhalli, A., & Smoliar, S. W. (1993). Automatic Partitioning of FullMotion Video. Multimedia Systems, 1, 10 - 28. Zillmann, D., & Cantor, J. R. (1976). A Disposition Theory of Humour and Mirth. In T. Chapman & H. Foot (Eds.), Humor and Laughter: Theory, Research and Applications (pp. 97 - 115). 
Filmography

Abbott and Costello meet the Mummy (Charles Lamont, USA - 1955)
Airplane! (Jim Abrahams, David Zucker & Jerry Zucker, USA - 1980)
Alexander Nevsky (Sergei M. Eisenstein, USSR - 1938)
Arroseur arrosé, Le (Louis Lumière, France - 1895)
Back to the Future III (Robert Zemeckis, USA - 1990)
Bananas (Woody Allen, USA - 1971)
Battleship Potemkin (Sergei M. Eisenstein, USSR - 1925)
Berlin, die Symphonie der Großstadt (Walter Ruttmann, Germany - 1927)
Big (Penny Marshall, USA - 1988)
Bill and Ted's Excellent Adventure (Stephen Herek, USA - 1989)
Blade Runner (Ridley Scott, USA - 1982)
Brats (James Parrott, USA - 1930)
Bringing up Baby (Howard Hawks, USA - 1938)
Citizen Kane (Orson Welles, USA - 1941)
Dazed and Confused (Richard Linklater, USA - 1993)
Do the right thing (Spike Lee, USA - 1989)
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (Stanley Kubrick, GB - 1963)
Ferris Bueller's Day Off (John Hughes, USA - 1986)
Gold Rush, The (Charles Chaplin, USA - 1925)
Groundhog Day (Harold Ramis, USA - 1993)
Home Alone (Chris Columbus, USA - 1990)
Idle Class, The (Charles Chaplin, USA - 1921)
Immigrant, The (Charles Chaplin, USA - 1917)
Kabinett des Dr. Caligari, Das (Robert Wiene, Germany - 1919)
Last Emperor, The (Bernardo Bertolucci, China, Italy, UK - 1987)
Letzte Mann, Der (Friedrich Wilhelm Murnau, Germany - 1924)
Modern Times (Charles Chaplin, USA - 1936)
Monty Python's the Meaning of Life (Terry Gilliam & Terry Jones, GB - 1983)
Mr. Deeds Goes to Town (Frank Capra, USA - 1936)
Naked (Mike Leigh, UK - 1993)
Naked Gun, The (David Zucker, USA - 1988)
Naked Gun 2 1/2, The (David Zucker, USA - 1991)
Night at the Opera, A (Sam Wood, USA - 1935)
October (Sergei M. Eisenstein, USSR - 1927)
Pulp Fiction (Quentin Tarantino, USA - 1994)
Rear Window (Alfred Hitchcock, USA - 1954)
Rosemary's Baby (Roman Polanski, USA - 1968)
Shame (Ingmar Bergman, Sweden - 1968)
Shoulder Arms (Charles Chaplin, USA - 1918)
Snow White and the Seven Dwarfs (Walt Disney & David Hand, USA - 1937)
Sunrise (Friedrich Wilhelm Murnau, USA - 1927)
Tin Toy (John Lasseter, USA - 1988)
Toy Story (John Lasseter, USA - 1995)
Trainspotting (Danny Boyle, GB - 1995)
Take the money and run (Woody Allen, USA - 1969)
Wayne's World (Penelope Spheeris, USA - 1992)

Appendix

The following is a generation trace produced by AUTEUR (see chapter 9).

(dolphin)~/auteur4> sicstus
SICStus 2.1 #7: Tue Mar 16 09:53:11 GMT 1993
?- [start].
{consulting /tmp_mnt/home/fn/auteur4/start.pl...}
{/tmp_mnt/home/fn/auteur4/start.pl consulted, 170 msec 7184 bytes}
yes
?- system.
yes
?- go(10,humour).
Start instantiation :
==============
Instantiate databases:
{consulting /tmp_mnt/home/fn/auteur4/db_shot.pl...}
{/tmp_mnt/home/fn/auteur4/db_shot.pl consulted, 870 msec 40800 bytes}
{consulting /tmp_mnt/home/fn/auteur4/dicsem.pl...}
{/tmp_mnt/home/fn/auteur4/dicsem.pl consulted, 140 msec 10048 bytes}
{consulting /tmp_mnt/home/fn/auteur4/diccon.pl...}
{/tmp_mnt/home/fn/auteur4/diccon.pl consulted, 160 msec 11152 bytes}
{consulting /tmp_mnt/home/fn/auteur4/dicvisual.pl...}
{/tmp_mnt/home/fn/auteur4/dicvisual.pl consulted, 90 msec 7488 bytes}
Load modules:
{consulting /tmp_mnt/home/fn/auteur4/startshot_analysis.pl...}
{/tmp_mnt/home/fn/auteur4/startshot_analysis.pl consulted, 650 msec 21008 bytes}
{consulting /tmp_mnt/home/fn/auteur4/story_planner.pl...}
{/tmp_mnt/home/fn/auteur4/story_planner.pl consulted, 460 msec 14400 bytes}
{consulting /tmp_mnt/home/fn/auteur4/scene_planner.pl...}
{/tmp_mnt/home/fn/auteur4/scene_planner.pl consulted, 520 msec 22496 bytes}
{consulting /tmp_mnt/home/fn/auteur4/scene_analyser.pl...}
{/tmp_mnt/home/fn/auteur4/scene_analyser.pl consulted, 380 msec 11232 bytes}
{consulting /tmp_mnt/home/fn/auteur4/scene_creator.pl...}
{/tmp_mnt/home/fn/auteur4/scene_creator.pl consulted, 340 msec 6096 bytes}
{consulting /tmp_mnt/home/fn/auteur4/motivation.pl...}
{/tmp_mnt/home/fn/auteur4/motivation.pl consulted, 2020 msec 51200 bytes}
{consulting /tmp_mnt/home/fn/auteur4/realisation.pl...}
{/tmp_mnt/home/fn/auteur4/realisation.pl consulted, 1040 msec 27744 bytes}
{consulting /tmp_mnt/home/fn/auteur4/resolution.pl...}
{/tmp_mnt/home/fn/auteur4/resolution.pl consulted, 700 msec 14448 bytes}
{consulting /tmp_mnt/home/fn/auteur4/repetition.pl...}
{/tmp_mnt/home/fn/auteur4/repetition.pl consulted, 350 msec 19696 bytes}

Start instantiation :
==============
Instantiate constants / variables :
Startshot : 10
Theme : humour
perform : []
Used jokes : []
filenumber : 1

Start material organiser :
=================
[walk+1+32]
Analysisset : [1,0,1,0,[frank,[[],[],[walk+1+32]]],[pleasure/1/32+0.5],[[path/1/32],[],[]]],[],[],[]]
Successfully finished: Material organiser.

Start scene creation :
===============

MOTIVATION
Analysisset : [1,0,1,0,[frank,[[],[],[walk+1+32]]],[pleasure/1/32+0.5],[[path/1/32],[],[]]],[],[],[]]
Interpretation : []
Planlist : [event+s_action+[misfortune,ambiguty,stupidity]]
Shotlist : []
Plan : event+s_action+misfortune
Event : meeting
Overall idea is : date
Possible mood shots : pleasure -> [13/29/37,14/47/58,16/118/140,18/1/32,21/1/32,22/15/27,22/1/32,23/1/27,24/1/27,25/1/27,29/1/18,6/63/72]
Possible action shots : search not necessary since it is a s_action.
Shotlist-Action / action / editing kind : [10/1/32] walk [no]
The mood shot to be added : 22/1/27
Shotlist-Mood / mood / editing kind : [10/1/18,22/1/27,10/29/32] pleasure insert
The shot to be added for the event : 25/1/27
Shotlist-Event / event / editing kind : [25/1/27,10/1/18,22/15/27,10/29/32] meeting+1 join1

REALISATION
Analysis set : [1,0,1,0,[frank,[[],[],[walk+1+32]],[pleasure/1/32+0.5],[[path/1/32],[],[]]],[],[],[]]
Interpretation : [[walk,[no],[10/1/32]],[pleasure,[insert],[10/1/18,22/15/27,10/29/32]],[meeting+1,2,[join1],[25/1/27,10/1/18,22/15/27,10/29/32],25]
Planlist : event+s_action_actors+misfortune+unexpectedness
Shotlist : [25/1/27,10/1/18,22/15/27,10/29/32]
try to create event person-person oriented joke :
Could not use any of the known strategies for realisation plan. Try something else.
REALISATION
Analysis set : [1,0,1,0,[frank,[[],[],[walk+1+32]],[pleasure/1/32+0.5],[[path/1/32],[],[]]],[],[],[]]
Interpretation : [[walk,[no],[10/1/32]],[pleasure,[insert],[10/1/18,22/15/27,10/29/32]],[meeting+1,2,[join1],[25/1/27,10/1/18,22/15/27,10/29/32]],25]
Planlist : event+s_action+misfortune+unexpectedness
Shotlist : [25/1/27,10/1/18,22/15/27,10/29/32]
try to create an event single action person joke :
Could not use any of the known strategies for realisation plan. Try something else.

REALISATION
Analysis set : [1,0,1,0,[frank,[[],[],[walk+1+32]],[pleasure/1/32+0.5],[[path/1/32],[],[]]],[],[],[]]
Interpretation : [[walk,[no],[10/1/32]],[pleasure,[insert],[10/1/18,22/15/27,10/29/32]],[meeting+1,2,[join1],[25/1/27,10/1/18,22/15/27,10/29/32]],25]
Planlist : event+s_action+misfortune+unexpectedness
Shotlist : [25/1/27,10/1/18,22/15/27,10/29/32]
try to create a single action person joke :
Actionshotlist : [[],[4/49/56,5/16/34],[]]
Possible shots for highlighting the object : [1/7/43,2/51/74]
The shot to be added : 1/7/43
The realisation sequence is : [1/13/43,4/49/56]
The main part is Shot / Shotkind / Joinkind : 4/49/56 medium join1
It reads / action / bodypart / object / location : slip+1 shoe banana path
Realisationlist : [10/1/18,22/15/27,10/29/32,1/13/43,4/49/56]

RESOLUTION
Analysis set : [1,0,1,0,[frank,[[],[],[walk+1+32]],[pleasure/1/32+0.5],[[path/1/32],[],[]]],[],[],[]]
Interpretation : [[walk,[no,[10/1/32]],[pleasure,[insert],[10/1/18,22/15/27,10/29/32]],[no,1,[join1],[10/1/18,22/15/27,10/29/32]],[slip+1,shoe,banana,path,join1,[1/13/43,4/49/56]],75]
Planlist : s_action+misfortune+unexpectedness
Shotlist : [10/1/18,22/1/27,10/29/32,1/13/43,4/49/56]
Possible shots for the resolution : [6/63/72]
The shot to be added : 6/63/72
The final sequence is : [10/1/18,22/15/27,10/29/32,1/13/43,4/49/56,6/63/72]
AUTEUR values the created joke as good.

REPETITION
===========
[walk+1+32,read+1+32,scratch+1+25,poke+1+25]
The new startshot is : 24
[[walk,read]+1+32,[scratch,poke]+1+25],[walk+1+32,read+1+32,scratch+1+25,poke+1+25][[read,walk]+1+32]]

MOTIVATION
Analysis set : [1,0,2,0,[frank,[[],[[read,walk]+1+32],[walk+1+32,read+1+32,scratch+1+25,poke+1+25]],[pleasure/1/32+0.5,hurry/1/32+0.5],[[newspaper/1/32,path/1/32],[],[]]],[],[],[]]
Interpretation : []
Planlist : [parallel+[misfortune,stupidity]]
Shotlist : []
Plan : parallel+misfortune
Chosen action : read
Possible mood shots : pleasure -> search not necessary since mood is defined
Possible action shots : search not necessary since it is a s_action.
REALISATION
Analysis set : [1,0,2,0,[frank,[[],[[read,walk]+1+32],[walk+1+32,read+1+32,scratch+1+25,poke+1+25]],[pleasure/1/32+0.5,hurry/1/32+0.5],[[newspaper/1/32,path/1/32],[],[]]],[],[],[]]
Interpretation : [[walk,[no,[24/1/27]],[pleasure,[no,[24/1/27]],[no,1,[[no]],[24/1/27]],25]
Planlist : s_action+misfortune+unexpectedness
Shotlist : [24/1/27]
try to create a single action person joke :
Objectshotlist : [[],[],[3/1/11]]
Possible shots for the object : [3/1/11]
The realisation sequence is : 3/1/11
The main part is Shot / Shotkind / Joinkind : 3/1/11 medium join1
It reads / action / bodypart / object / location : collide+2 body lamppost path
Realisationlist : [24/1/27,3/1/11]

RESOLUTION
Analysis set : [1,0,2,0,[frank,[[],[[read,walk]+1+32],[walk+1+32,read+1+32,scratch+1+25,poke+1+25]],[pleasure/1/32+0.5,hurry/1/32+0.5],[[newspaper/1/32,path/1/32],[],[]]],[],[],[]]
Interpretation : [[walk,[no,[24/1/27]],[pleasure,[no,[24/1/27]],[no,1,[[no]],[24/1/27]],[collide+2,lamppost,body,no,path,join1,[3/1/11]],45]
Planlist : s_action+misfortune+unexpectedness
Shotlist : [24/1/27,3/1/11]
Possible shots for the resolution : [29/1/18,30/1/18]
Possible shots for the resolution : [29/1/18,30/1/18]
Possible shots for the resolution : [29/1/18,30/1/18]
Possible shots for the resolution : [29/1/18,30/1/18]
Possible shots for the resolution : [29/1/18,30/1/18]
Possible shots for the resolution : [29/1/18,30/1/18]
Possible shots for the resolution : [29/1/18,30/1/18]
Possible shots for the resolution : [7/76/86]
The shot to be added : 7/76/86
The final sequence is : [24/1/27,3/1/11,7/76/86]
AUTEUR values the created joke as poor.

REPETITION
===========
Try to create something using the higher concept : movement. The action used is : fly.
Try to create something using the higher concept : movement. The action used is : drive.
Try to create something using the higher concept : movement. The action used is : swim.
Try to create something using the higher concept : movement. The action used is : using_transport.
[stand+1+27]
The new startshot is now : 25.
The plan is : scen+using_transport+s_action+[misfortune,ambiguty,stupidity]

MOTIVATION
Analysis set : [1,0,1,0,[frank,[[],[],[stand+1+27]],[pleasure/1/27+1.0],[[busstop/1/27],[],[]]],[],[],[]]
Interpretation : []
Planlist : [scen+using_transport+s_action+[misfortune,ambiguty,stupidity]]
Shotlist : []
Plan : scen+using_transport+s_action+misfortune
Possible mood shots : pleasure -> search is not necessary since mood is defined.
The first scenario shot to be added : 25/1/27
The second scenario shot to be added : 26/1/27
Shotlist-Scenario / actor1 / action / actor2 / action / Shotlist : frank stand bus come [25/1/27,26/1/27]

REALISATION
Analysis set : [1,0,1,0,[frank,[[],[],[stand+1+27]],[pleasure/1/27+1.0],[[busstop/1/27],[],[]]],[],[],[]]
Interpretation : [using_transport,[frank+[stand],[25/1/27]],[bus+[come],[26/1/27]],30]
Planlist : scen+using_transport+s_action+misfortune+schar
Shotlist : [25/1/27,26/1/27]
The shotlist for the realisation : 28/1/27
Scenario reads : actor1 / action : bus leave

RESOLUTION
Analysis set : [1,0,1,0,[frank,[[],[],[stand+1+27]],[pleasure/1/27+1.0],[[busstop/1/27],[],[]]],[],[],[]]
Interpretation : [using_transport,[frank+[stand],[25/1/27]],[bus+[come],[26/1/27]],[bus+[leave],[28/1/27]],70]
Planlist : scen+using_transport+s_action+misfortune+schar
Shotlist : [25/1/27,26/1/27,28/1/27]
The shotlists (action/mood) for the resolution : [29/1/18] [30/1/18]
Scenario reads : actor1 / action / mood : frank look anger
AUTEUR values the created joke as good.

REPETITION
===========
No further repetition possible.

The executable file is called auteurfile1. The content is :
#!/bin/sh
/usr/local/video/bin/edit_movie /usr/local/video/data/fn-bus1.Mpeg1 Mpeg1 1 27 /usr/local/video/data/fn-bus2 1 27 /usr/local/video/data/fn-bus3.Mpeg1 Mpeg1 1 27 /usr/local/video/data/fn-bus4.Mpeg1 Mpeg1 1 18 /usr/local/video/data/fn-bus5.Mpeg1 Mpeg1 1 18
That's all folks.

The executable file is called auteurfile2. The content is :
#!/bin/sh
/usr/local/video/bin/edit_movie /usr/local/video/data/fn-walk2.Mpeg1 Mpeg1 1 27 /usr/local/video/data/fn-lamppost.Mpeg1 Mpeg1 1 11 /usr/local/video/data/fn-lamppost.Mpeg1 Mpeg1 76 86 /usr/local/video/data/fn-bus1.Mpeg1 Mpeg1 1 27 /usr/local/video/data/fn-bus2 1 27 /usr/local/video/data/fn-bus3.Mpeg1 Mpeg1 1 27 /usr/local/video/data/fn-bus4.Mpeg1 Mpeg1 1 18 /usr/local/video/data/fn-bus5.Mpeg1 Mpeg1 1 18
That's all folks.

The executable file is called auteurfile3. The content is :
#!/bin/sh
/usr/local/video/bin/edit_movie /usr/local/video/data/fn-walk1.Mpeg1 Mpeg1 1 18 /usr/local/video/data/walk1.Mpeg1 Mpeg1 15 27 /usr/local/video/data/fn-walk1.Mpeg1 Mpeg1 29 32 /usr/local/video/data/fn-banana1.Mpeg1 Mpeg1 13 43 /usr/local/video/data/fn-lamppost.Mpeg1 Mpeg1 49 56 /usr/local/video/data/fn-walk2.Mpeg1 Mpeg1 1 27 /usr/local/video/data/fn-lamppost.Mpeg1 Mpeg1 1 11 /usr/local/video/data/fn-lamppost.Mpeg1 Mpeg1 76 86 /usr/local/video/data/fn-bus1.Mpeg1 Mpeg1 1 27 /usr/local/video/data/fn-bus2 1 27 /usr/local/video/data/fn-bus3.Mpeg1 Mpeg1 1 27 /usr/local/video/data/fn-bus4.Mpeg1 Mpeg1 1 18 /usr/local/video/data/fn-bus5.Mpeg1 Mpeg1 1 18
That's all folks.

Successfully finished: scene creation.
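For orientation, two conventions in the trace are worth noting: shot descriptors have the form FileNumber/StartFrame/EndFrame (so 10/1/18 denotes frames 1 to 18 of shot file 10), and each completed joke is finally written out as a shell script that realises the chosen sequence through the edit_movie tool. The following Prolog fragment is a minimal illustrative sketch, not part of AUTEUR itself, of how a final shot list could be mapped onto such a command line; the shot_file/2 facts are hypothetical stand-ins, inferred from the generated scripts above, for entries in the system's shot database (db_shot.pl).

% Sketch only: print an edit_movie command line for a final
% sequence such as [10/1/18,22/15/27,10/29/32].

% Hypothetical mapping from shot file numbers to MPEG files,
% inferred from the auteurfile scripts in the trace above.
shot_file(10, '/usr/local/video/data/fn-walk1.Mpeg1').
shot_file(22, '/usr/local/video/data/walk1.Mpeg1').

% write_edit_command(+Shots): emit one invocation of the
% edit_movie tool for a list of File/Start/End descriptors.
write_edit_command(Shots) :-
    write('/usr/local/video/bin/edit_movie'),
    write_shots(Shots),
    nl.

% write_shots(+Shots): append " Path Mpeg1 Start End" for each
% shot descriptor; / is left-associative, so File/Start/End
% destructures a term such as 10/1/18 into its three parts.
write_shots([]).
write_shots([File/Start/End|Rest]) :-
    shot_file(File, Path),
    format(' ~a Mpeg1 ~d ~d', [Path, Start, End]),
    write_shots(Rest).

Called as write_edit_command([10/1/18,22/15/27,10/29/32]), the sketch prints a single command line of the kind stored in the auteurfile scripts above.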