AUTEUR: The Application of Video Semantics and Theme Representation for Automated Film Editing

Submitted for the Degree of Ph.D.
August 1996
by Frank-Michael Nack (M.Sc.)

Abstract

This thesis presents a planner-based approach to the application of video semantics and theme representation to the automated editing of visual stories at the level of events. The research draws on film theory (to define methods for automated editing, and to model the fundamental units of the image and the conceptual relationships between image, shot and sequence); on narrative and humour theory (in an attempt to automatically generate emotion-provoking and credible film narrative); and on Artificial Intelligence (planning, story generation, and knowledge representation). The aim of the research is to define techniques to automatically assemble video sequences that realise a given theme, ultimately to be used to assist in the presentation or interpretation of video. This thesis introduces representational structures for the semantic, temporal and relational features of video; representations and strategies to support automated editing, in combination with a simplified model of the film editing process; and representations for narrative structures such as actions, events, and emotional codes. These representations and strategies form the basis of an intelligent prototype system, AUTEUR, which generates humorous, non-verbal video sequences. AUTEUR is also described in this thesis.

For my little rascals Annika, Merlin and Nuredin

Acknowledgements

First, and above all, I would like to thank my supervisor Alan Parkes, who allowed me to follow my research ideas while supporting me with insightful discussions and suggestions. Alan, you taught me much more than how to become a researcher, and I am thankful that your financial support made it possible for me to attend conferences. I am also profoundly grateful for your humour, which cheered me up when Teutonic gloom darkened into Götterdämmerung. It was a pleasure working with you.

Many thanks to my examiners, Philippe Aigrain and Paul Brna, who both agreed to take this extra work on despite their loaded schedules. An extra sprinkling of chocolate on top for Paul, for sharing his deep insight into Prolog with me, and for many, many hours of enlightened discussions, patient advice and encouragement.

I wish to thank Lancaster University and SECAMS (School of Engineering, Computing and Mathematical Sciences) for supporting my research through their sponsorship. I am very grateful to the WDR for supporting my work by offering access to their practical editing sessions. I wish also to express thanks to everyone from EB- und Filmbearbeitung for the friendly atmosphere I enjoyed there. Some people deserve special mention: Mrs. Edith Perlaki and Mrs. Britta Sörensen, who took more time out of their busy schedules to discuss film and comment on their work than I dared expect, and Mrs. Gailenheuser and Mrs. Zeusch-Fren, for giving me an insight into the editing of news, trailers and smaller features. Finally, many thanks to Mrs. Unverdroß, who used her administrative talents to allow me to work free of worries.

I owe a lot to Séverine Menu and Philippe Robin, who helped me to translate Gilles Bloch's thesis into English. Many thanks to my office mate, Sean, a faithful friend and intellectual companion.
There are no words enough to express my thanks to my friends for their love and devotion along this journey: Yasemin, Sakis, Dankmute, Angeliki, Raquel, Séverine, Mark, Tamsin, Paola, Stuart, Marco, Sarah, John, Michael, Nora, Lars, Judith and Jürgen. You have given me the best years of my life.

Most of all, I thank my parents, who have never ceased to encourage and support me since the beginning. I know their love and dedication is the key to what I have become, and I am most grateful to have them as role models for optimism and integrity.

Contents

Chapter 1  Introduction
  1.1 Automated video editing: a scenario
  1.2 Methodologies
    1.2.1 Artificial Intelligence (AI)
    1.2.2 Film theory
    1.2.3 Narrative theory
    1.2.4 Humour theory
  1.3 The thesis: an overview
Chapter 2  Narrativity
  2.1 Narrative principles
    2.1.1 Theme
    2.1.2 Order
      2.1.2.1 The event
      2.1.2.2 The relation between events
    2.1.3 Time
    2.1.4 Space
  2.2 From structure versus content to structure and content
Chapter 3  Humour
  3.1 Humour - assumptions and definitions
    3.1.1 Cognitive-Perceptual Class
    3.1.2 Social-Behavioural Class
    3.1.3 Psychoanalytical Class
  3.2 Humour primitives
    3.2.1 Readiness
    3.2.2 Timing
    3.2.3 Exaggeration
    3.2.4 Incongruity
    3.2.5 Derision
  3.3 Humour strategies
  3.4 Evaluation of comedy
  3.5 Conclusion
Chapter 4  Film
  4.1 Cinematic meaning
  4.2 Phenomenological Approaches to film
    4.2.1 Film Image
      4.2.1.1 The sign
      4.2.1.2 Sign and idea
    4.2.2 Film Movement
      4.2.2.1 From frame to shot
      4.2.2.2 Montage, or the semantics of fragmentation
      4.2.2.3 The sequence - film relationship
  4.3 A model of film editing
  4.4 Conclusion
Chapter 5  The representation of video content
  5.1 Related work
    5.1.1 Bloch and his machine for audio-visual editing
    5.1.2 Parkes and CLORIS
    5.1.3 Aguierre-Smith and the Stratification System
    5.1.4 Semantic and conceptual indexing for video
    5.1.5 Davis and Media Streams
  5.2 An ontology for the representation of film content
    5.2.1 General concepts and assumptions
    5.2.2 Cinematographic devices
    5.2.3 Denotative aspects
      5.2.3.1 Character, object and action
      5.2.3.2 Settings: space, time and lighting
    5.2.4 Conclusion
  5.3 Technical environments for content annotation
Chapter 6  The representation of knowledge for automated editing
  6.1 Shot editing: Mixage and Cut
  6.2 Spatial and temporal continuity in editing: the 180° system
  6.3 Related work
    6.3.1 Splicer
    6.3.2 Bloch's machine for audio-visual editing
  6.4 A novel approach to automated video editing
    6.4.1 Plot requirements
    6.4.2 Shot intention and the shape of the awareness space
    6.4.3 Automated establishment and maintenance of content space over several shots
    6.4.4 The influence of action on continuity editing
    6.4.5 The comparison of surrounding content space and graphical pattern
    6.4.6 Temporal and rhythmical relations between shot A and shot B
      6.4.6.1 Preliminary remarks
      6.4.6.2 Temporal clipping for action expansion
      6.4.6.3 Temporal clipping for the temporal equivalence of actions
      6.4.6.4 Temporal clipping for action contraction
      6.4.6.5 Rhythmical shaping of a sequence
    6.4.7 Conclusion
Chapter 7  The representation of narrative and thematic knowledge
  7.1 Approaches to knowledge representation
    7.1.1 Quillian and semantic networks
    7.1.2 Miller, Bateman, Lenat and large databases of semantic relations
    7.1.3 Haase's approach to memory-based representations
    7.1.4 Schank's conceptual dependencies and dynamic memory
  7.2 Knowledge representation to support the creation of emotion provoking narrative sequences
    7.2.1 Actions
      7.2.1.1 Conceptual structure
      7.2.1.2 Semantic relations
    7.2.2 Abstract concepts
      7.2.2.1 Emotions
      7.2.2.2 Visualisations
    7.2.3 Events
    7.2.4 Conclusion
Chapter 8  AUTEUR: An architecture for automated video story generation
  8.1 Related work
    8.1.1 Sack & Davis' video generator IDIC
    8.1.2 Bloch's machine for audio-visual editing
  8.2 A proposed architecture for the editing of theme oriented video stories
    8.2.1 Overview
    8.2.2 The video database
    8.2.3 The video representation
    8.2.4 The Knowledge Base
    8.2.5 The Editor - a controller module
      8.2.5.1 The Structure Planner
      8.2.5.2 The Content Planner
      8.2.5.3 The Visual Designer
      8.2.5.4 The Visual Constructor
    8.2.6 The Retrieval System and Interface
  8.3 Conclusion
Chapter 9  The operation of AUTEUR: Show me a joke
  9.1 The banana skin joke
    9.1.1 Preparation phase
    9.1.2 Motivation phase
    9.1.3 Realisation phase
    9.1.4 Resolution phase
  9.2 The lamp post joke
    9.2.1 Preparation phase
    9.2.2 Realisation phase
    9.2.3 Resolution phase
  9.3 The bus joke
    9.3.1 Preparation phase
    9.3.2 Realisation phase
    9.3.3 Resolution phase
  9.4 Conclusion
Chapter 10  Achievements and conclusions
  10.1 Achievements
  10.2 Conclusions
  10.3 Postscript
Bibliography
Filmography
Appendix

Pictures, Figures, Tables

Pictures
Picture 4.1 Ruth Gordon in Polanski's Rosemary's Baby (1968)
Picture 4.2 An image taken from Spike Lee's Do the Right Thing (1989)
Picture 4.3 Liv Ullmann in Bergman's Shame (1968)
Picture 4.4 Image from Bertolucci's The Last Emperor (1987)

Figures
Figure 2.1 The structure of communication (based on Tudor (1974, p. 31))
Figure 2.2 The narrational process (based on Bordwell (1985, p. 50))
Figure 2.3 Relationships between plot structures
Figure 2.4 Relationship between narrative elements (adapted from Chatman (1978, p. 26))
Figure 3.1 Relationship between incongruity and narrative structures
Figure 3.2 Relationship between derision and narrative structures
Figure 4.1 Syntagmatic and paradigmatic structures of clothing (Monaco, 1981, p. 341)
Figure 4.2 The compositional and interpretational structures that make up the image (based on Monaco (1981, pp. 144-145))
Figure 4.3 Syntagmatic categories of visual material (based on Monaco (1981, p. 145))
Figure 4.4 Influential communication factors (based on Tudor (1974, p. 31))
Figure 4.5 Simplified model of the film editing process
Figure 5.1 Bloch's shot representation (Bloch, 1986, p. 149)
Figure 5.2 Layers of annotations for a 100 frame shot
Figure 5.3 FRAMER structure for Fido the Wonder Dog's legs (taken from Davis (1995, p. 137))
Figure 5.4 Actions annotated in layers in a 100 frame shot
Figure 5.5 Relevant shot segment for a query for all three actions
Figure 6.1 The 180° system (based on Bordwell & Thompson (1993))
Figure 6.2 Schematic description of the POV shot (based on Bordwell & Thompson (1993, p. 273))
Figure 6.3 Plot requirements for the editing process
Figure 6.4 Conceptual relationship between the space of visual awareness and narrative functionality
Figure 6.5 Memory structure for spatial relationships between subjects over a number of shots
Figure 6.6 Influence of sequence decomposition on the number and order of shots
Figure 6.7 Trimming of a shot from 140 to 108 frames
Figure 7.1 Semantic subnet for the action "walk"
Figure 7.2 The relation between the narrative logic and the choice of visual material used to represent it
Figure 7.3 Semantic subnet
Figure 8.1 GPS operator "threaten renewed violence" in IDIC (Sack & Davis, 1994, p. 5)
Figure 8.2 Bloch's editing model (Bloch, 1986, p. 133)
Figure 8.3 Tasks performed by AUTEUR
Figure 8.4 Proposed architecture for the creation of a visual story of emotional impact
Figure 9.1 Startshot for the banana skin joke
Figure 9.2 Result of startshot analysis for the banana skin joke
Figure 9.3 Motivation Sequence-Structure for the banana skin joke
Figures 9.4 - 9.6 Three possible motivation shots for the banana skin joke
Figure 9.7 Event shot for the banana skin joke
Figure 9.8 Status of the banana skin joke after the end of the motivation phase
Figure 9.9 Realisation Sequence-Structure for the banana skin joke
Figure 9.10 Realisation part for the banana skin joke, generated out of two shots
Figure 9.11 Status of the realisation phase of the banana skin joke
Figure 9.12 Resolution Sequence-Structure for the banana skin joke
Figure 9.13 Retrieved shot for the realisation phase of the banana skin joke
Figure 9.14 The banana skin joke generated by AUTEUR
Figure 9.15 Startshot for the lamp post joke
Figure 9.16 Result of startshot analysis for the lamp post joke
Figure 9.17 Realisation Sequence-Structure for the lamp post joke
Figure 9.18 Realisation part for the lamp post joke
Figure 9.19 The lamp post joke generated by AUTEUR
Figure 9.20 Startshot sequence for the bus example
Figure 9.21 Realisation Sequence-Structure for the bus joke
Figure 9.22 Realisation sequence for the bus joke
Figure 9.23 The bus joke as generated by AUTEUR

Tables
Table 3.1 Classification of humour strategies
Table 4.1 Tudor's paradigm of cinematic meaning (Tudor, 1974, p. 128)
Table 5.1 Representational structure for cinematographic devices
Table 5.2 Substructure "character appearance"
Table 5.3 Substructure "actor action"
Table 5.4 Substructure "object"
Table 5.5 Substructure "relations"
Table 5.6 Substructure "deep-space composition"
Table 5.7 Substructure "setting"
Table 6.1 Relationship between camera distance and size of presented content space
Table 6.2 Spatial relationships between shots A and B in terms of camera distance
Table 6.3 Relationship between camera distance and hierarchical representation level of subjects
Table 7.1 Conceptual structure for a representation of the action "slip"
Table 7.2 Simplified conceptual structure for the abstract object "time"
Table 7.3 Representation of an emotional doublet
Table 7.4 Simplified representation of the emotion class "pleasure"
Table 7.5 Structure of an event, i.e. "meeting"
Table 7.6 Actions in the event "getting coffee"
Table 9.1 Conceptual structure for the event "meeting"
Table 9.2 Conceptual structure for a representation of the action "slip"
Table 9.3 Structure of an event "using_transport"
Chapter I
Introduction

One of the strongest memories of my childhood is of my first film experience, Disney's animated film Snow White and the Seven Dwarfs. Since then, I have been fascinated by the mystique of this artificial world of light and imagination, and consequently hours upon hours of my youth were taken up by watching TV, or sitting in the darkness of a cinema to follow the action on the screen. Astonishingly enough, film has never been part of the educational system as I know it, though the medium is one of the strongest intellectual and emotional influences in our culture. Curricula were designed around the traditional subjects of arts and science, and I remember that only during the last year of school did we see some films in our German course. We discussed the content of these films in comparison with the novels from which they had been adapted, and that was as far as film analysis went. Unfortunately, skills for creating, manipulating and analysing film were not developed.

Today the situation could be different. Access to tools for creating, manipulating and playing with moving images is much more widespread. Due to rapid developments in entertainment technology (e.g. camcorders, TVs, video recorders) and computer hardware, it is now affordable to produce videos at home, in the office or at school. Moreover, research has produced computer-based environments to assist with the editing, annotation and retrieval of video (see Aigrain & Joly (1994); Aigrain, Joly & Longueville (1995); Bloch (1986); Chakravarthy (1994); Davis (1995); Gordon & Domeshek (1995); Parkes (1989a); Pentland et al. (1994); Picard & Minka (1995); Sack & Davis (1994); Sack & Don (1993); Tonomura et al. (1994); Ueda et al. (1993); Yeung et al. (1995)). Further developments in environments such as the Internet will, in the near future, provide users with even more possibilities to create films, as users will be able to access large video databases, which offer them the kind of visual material they cannot shoot or synthesise themselves.

Visual communication is a creative process, like writing a letter or telling a story. Whereas for written or spoken language we are usually experienced in both producing and receiving information, for visual statements we are consumers rather than producers. Nevertheless, while we usually understand what a film is showing us, we do not understand it on the basis of our knowledge of the filmic mechanisms used, but rather because we understand the film as a unified structure that results from the composition of all its elements (Metz, 1974). It is not uncommon, therefore, that after seeing a film, people wonder why it affected them in the way it did, and they wish that someone could tell them. Thus, we face the ironic situation that while there are more possibilities than ever to become creative in a visual sense, most people still lack the necessary skills, i.e. the "selection, timing and arrangement of given shots into a film continuity" (Reisz & Millar, 1969, p. 15), to make video part of their daily communication. The aim of my research is to change this situation through the development of tools to support the process of transforming ideas into appropriate visual presentations, or to assist in the understanding of video documents, whether for professional or personal purposes.
1.1 Automated video editing: a scenario

Let us look at a hypothetical and idealised intelligent video desktop publishing system of the future, in order to understand the motivation behind my research. Suppose a user wishes to create a multimedia document that describes the work of a particular director. The document is to provide an understanding of the director's work and its specific filmic expression. Though the user has access to a database of digitised video material of work by the director, and about the director (e.g. film essays), there will be the need to summarise some of the material visually. Such summaries can be automatically generated by the system, based on specifications made by the user. For example, the user may specify the sequences to be combined and the time span of the summary.

In cases where users edit sequences themselves, there is supporting software available which, analogous to spell checkers and grammar checkers, offers the user feedback regarding whether the created sequence is visually acceptable, and if that is not the case, how it can be improved. In the latter case, the system may itself generate a sequence, so that the user can compare it with his or her own, thus providing feedback which may influence the nature of the overall work.

Sometimes the user will have problems in understanding a particular filmic technique or presentational style, e.g. there may be a problem with a particular cut, or uncertainty may arise as to how the emotional effect of a particular sequence should be achieved. In such cases, the system must be able to dynamically generate appropriate visual examples in order to clarify the problem. Additionally, there may be options for the user to experiment with the particular filmic technique or style, which requires that the system can analyse and criticise the results achieved by the user.

To deal adequately with the above situations, it is essential that the system is equipped with editing techniques and content-based representations of video material. The latter are particularly important, as editing requires content-based retrieval of video material. Our research focuses on the definition of methods to support automated editing of video, based on representations of the semantics and syntax of video, and on the conceptual representation of thematic goals and narrative. To reduce the complexity of this open-ended research programme, we have set ourselves the initial goal of producing short sequences of video that realise the theme of humour. Our approach draws on diverse research disciplines, each contributing ideas and/or technologies, and has resulted in a prototype system, AUTEUR, that has achieved a limited degree of success in producing humorous film sequences. The system is far from complete, and it should be appreciated that what is being presented in this thesis is but a small step toward the goal of intelligent video editing systems. Nevertheless, one real contribution of this research is the conceptual representation of the many types of knowledge and skills required by video editing.

1.2 Methodologies

This thesis concerns the application of video semantics and theme representation to the automated editing of visual stories at the level of events. Such an endeavour is, of necessity, of a multidisciplinary nature, and draws on research results from four distinct fields: Artificial Intelligence (AI), film theory, narrative theory and humour theory, as is now briefly described.
1.2.1 Artificial Intelligence (AI)

As stated above, the goal of this thesis is to provide a machine with the ability to autonomously generate emotionally stimulating visual events. Since our aim is the computer modelling of a hitherto human task, many of the problems we face are those addressed by AI, which attempts to study 'mental faculties through the use of computational models' (Charniak & McDermott, 1985, p. 6). In order to understand the functions of human memory, perception and emotion, research in AI has developed representational techniques and mechanisms enabling a computer to perform tasks such as story understanding, deduction, information retrieval and planning. Our aim is to provide a machine with appropriate knowledge and inference mechanisms so that it can plan events with the intention of making them humorous, and then present the events in the most visually credible way. Therefore, we discuss the relevant AI techniques for search, planning and story generation. In particular, the work of Parkes (1989a) on the representation of video content, Bloch (1986) on automated video editing, and Schank and his students (Lehnert (1983); Lehnert et al. (1983); Riesbeck & Schank (1989); Schank & Abelson (1977); Schank (1982); Wilensky (1983a, 1983b)) on story understanding and memory-based representation has contributed to the thesis in this respect.

1.2.2 Film theory

As our work is devoted to the definition of methods to perform automated editing of film, we use the theoretical constructs provided by film theory, and the practical experience of film makers and editors, to identify and model the fundamental units of the image, and the conceptual relationships between image, shot and sequence. In keeping with the initial goal of this thesis, i.e. the automatic production of short sequences of video that realise the theme of humour, we focus our investigation of film theory on narrative film making, which also influences our point of view on editing, in that we concentrate on the "continuity editing" style. Though we touch on the "is film a language" debate in this thesis, we concentrate on a semiotically oriented approach to the description of film content, mainly influenced by the work of Eco (1977, 1985), Jakobson & Halle (1980) and Peirce (1960). The most pervasive influence of film theory on our work, with regard to film editing and its influence on the description of film structure, has been that of the formative movement, represented by Eisenstein (1988, 1991), Kuleshov (1974) and Vertov (as described in Petric (1987)), and the work on context and order by the cognitive psychologist Gregory (1961).

1.2.3 Narrative theory

In attempting to automatically generate filmic narrative, we provide an abstract computational model of the causal and temporal relationships between events, people and objects, drawing particularly on the ideas of the Russian Formalist movement (see Lemon & Reis (1965)), which aimed at classifying the artistic skills applied in the process of creating a piece of prose, literature or film. Though directed at a receiver, narration is first of all a formal communication system for the dynamic interaction between narrator, content and audience, influenced by social, cultural and medium-specific entities. As our aim is to generate film sequences that provoke an emotional reaction, we are particularly interested in the influence the narrative has on the ways in which the viewer understands the film's portrayal of events.
The theoretical analyses of Bordwell (1985, 1989; Bordwell & Thompson, 1993), Chatman (1978) and Tudor (1974) have contributed to this thesis in this respect.

1.2.4 Humour theory

As stated above, the goal is to automatically create emotion-provoking video sequences. The emotional reaction we have in mind is laughter. The decision to use humour as our target was based on two factors. Firstly, comedy is appreciated by most people, and thus it seems reasonable to use humour as the reason for telling a story. Secondly, research on the problem of automatic generation of visual humour would help in the evaluation of the many theories of humour that have arisen in such diverse research fields as psychology, philosophy and linguistics.

1.3 The thesis: an overview

We have presented the overall aim of our research, and briefly described the diverse influences on it. We now give an overview of the structure of this thesis. Note that, due to the multidisciplinary nature of our work, the relevant "literature review" for each domain has been integrated into the relevant chapter.

Chapters 2, 3 and 4 provide the theoretical analyses of narrativity, humour and film. These specify the essential elements of each domain that are required to support the process of automatically editing visual stories.

Chapter 2 identifies and analyses the key narrative principles, i.e. theme, order, event, time and space, and the relationships between them. The results of the analysis lead to a critique of the merely structural approach to computer-based story understanding and generation, as proposed by AI research in the form of story grammars. In place of this approach, a multi-dimensional planning approach to visual story generation is suggested, which supports the two main representation layers of a story, i.e. structure and content.

Chapter 3 provides a brief introduction to humour theory, grouping humour into three classes, i.e. cognitive-perceptual, social-behavioural and psychoanalytical. The functional humour primitives identified are then combined with the logic of narrative structures to form a set of strategies that automatically generate humorous versions of event sequences. The chapter also describes a novel mechanism to evaluate jokes.

Chapter 4 discusses formal and textual aspects of film, paying particular attention to the information carried by the image, its formal permutations, and its capacity to communicate expressive meaning. We investigate the semantic relationships between images within the shot, and the semantic relationships between shots. The chapter concludes with our simplified model of the editing process.

Chapters 5, 6 and 7 describe our computational structures to support the retrieval, rearrangement and presentation of video for visual story generation. These structures derive their theoretical basis from the preceding three chapters.

Chapter 5 is concerned with the representation of video content and is thus mainly related to chapter 4. We critically examine related work from AI. We then describe our representational structures for the semantic, temporal and relational features of video.

Chapter 6 discusses the background knowledge necessary for automated video editing. We provide a brief introduction to the system of continuity editing, a style of editing that provides temporal and spatial continuity between juxtaposed shots.
We then discuss and evaluate research into automated editing, and finally discuss in detail our own representations and strategies for automated editing. The structures and strategies developed establish a link between video content (chapter 5) and narrative specifications (chapter 7), so this chapter draws heavily on chapters 2, 3 and 4.

Chapter 7 describes our representations for narrative structures such as actions, events and emotional codes. We discuss and evaluate the major relevant AI approaches to knowledge representation. Drawing on the theoretical foundation provided in chapters 2 and 3, we then describe our approach to the shallow representation of the physical world and of abstract mental and cultural concepts.

Chapter 8 demonstrates the practical applicability of the work presented in this thesis by discussing the architecture and function of our prototype system AUTEUR (Artificial Intelligence Utilities for Thematic Film Editing using Context Understanding and Editing Rules).

Chapter 9 analyses three example humorous films that were actually produced by AUTEUR.

Chapter 10, the conclusion, discusses the achievements of the research, and points the way to future research in this area.

Chapter II
Narrativity

The telling of stories is a pervasive aspect of our lives, because it helps to shape our experience by structuring the events of our encounters with reality. Narrating means making a comment about a certain event, following an idea about the medium and form of presentation which is grounded in one's own motivational and psychological attributes. Narrating is a targeted phenomenon: there is a receiver, and the narrator's perception of him or her will doubtless have an impact on the outcome of the story. Moreover, both narrator and receiver do not exist in a vacuum but share a social environment which adds extra structures to the narrational process. Thus, narration is a dynamic process of interaction in a partly given social context, where 'the interaction encompasses the communicator, the content, the audience and the situation' (Janowitz & Street, 1966, p. 209). Figure 2.1 diagrammatically describes the structure of communication.

[Figure 2.1 The structure of communication (based on Tudor (1974, p. 31)): narrator, medium, receiver and effects, shaped by personality attributes, task-specific knowledge, organismic attributes (e.g. male, adult), outside cultural and social attributes, and shared cultural and social structures.]

It should be noted, though, that the influences on the communicator are essentially the same as those on the receiver, but that the former uses them for construction while the latter uses them for interpretational purposes. Since this thesis deals with the creation of thematic film sequences, it focuses naturally on the narrator. However, the receiver cannot be neglected, because it is his or her potential intellectual and emotional response which is to be predicted. Within the above model of the communication process (Figure 2.1), the structural analysis of narration proposed in this chapter is best located in the part labelled task-specific knowledge.
The intention of the analysis is not to present an in-depth account of narration in its entirety, but rather to lay one part of the foundation for our approach to automated film editing by reflecting on the process of narration.[1] The problem is to identify the devices and syntactic structures that must be considered when modelling the dynamics of the narrational process. The aims of narration are essentially the same for film, literature, plays or pantomime, even though each medium uses its specific features to shape the process into an appropriate form. A more detailed analysis of the structures relevant to the understanding of film will be given in chapter 4.

[1] Detailed analyses of narrative can be found in the literature referenced in this chapter.

2.1 Narrative principles

Identifying the elements involved in narrative, and the relationships between them, was an enterprise of the Russian Formalist movement (1918-30), a group of critics who developed the notion of an opus as the sum of all applied artistic skills. Their aim was to understand and describe these skills in their respective functionality (practical, theoretical, symbolic and aesthetic) within prose, metre, literary evolution, genre and film theory. The philosophical context of Russian Formalism exerted great influence on the development of early film theory, for example on Eisenstein, Arnheim and Balázs, and, in the 1960s, influenced the development of film semiotics. The resulting analysis, discussed below, is strongly related to the ideas of the Formalist approach, especially to Tomashevsky's "Thematics" (Lemon & Reis, 1965).

Within narratology, a distinction is drawn between the story being told (the Fabula) and the form in which the story is presented (the plot).[2] The Fabula is understood to be the entire structure of causal-chronological joint events within a given time and space. The Fabula is the result of a dynamic process based on assumptions and inferences. The basis of these intellectual mechanisms are three types of schemata, as described in Bordwell (1985):

- prototypes, which organise the identification of types of persons, actions, localities, etc.;
- templates, which articulate common story formats, where each formal element represents a structural story movement, realising the stages through which the agent of the story must pass, such as Orientation, Complication, Evaluation, Resolution and Coda;[3]
- procedures, which organise the search for appropriate motivations and relations of causality, time and space.

[2] It should be noted that this distinction has been made since Aristotle's Poetics, where the distinction is between mimesis, the creative imitation of human behaviour, and muthos, the organisation of the events (Aristotle, 1968; Golden & Hardison, 1968; Ricoeur, 1985). In particular, Boris Tomashevsky's 'Thematics' has much in common with the Poetics.

[3] This model of a story format is from Labov and Waletzky, described in Segre (1979, p. 46). For contrasting models see, among others, Bremond as described in Segre (1979), Greimas (1983) and Propp (1968).

Since all these processes are subjective, so is the Fabula. However, the Fabula is not the story as the receiver learns it from the actual chronological order presented, but rather the action, event or episode itself. The Fabula is thus to be seen as an elaboration. The plot, on the other hand, shares the same events with the Fabula, but in the plot '...events are arranged and connected according to the orderly sequence in which they are presented...' (Tomashevsky, quoted in Lemon & Reis, 1965, p. 67).
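The Fabula/plot distinction can be made concrete computationally. The following is a minimal, purely illustrative sketch in Python (the names StoryEvent, fabula_order and plot_order are our own and do not correspond to any structure used by AUTEUR): the Fabula is the causal-chronological ordering of a set of events, while the plot is a presentation ordering of the same events chosen by the narrator.

    from dataclasses import dataclass

    @dataclass
    class StoryEvent:
        label: str
        story_time: int  # position in the chronology of the story world

    # The same events underlie both Fabula and plot.
    events = [StoryEvent("investigation", 2),
              StoryEvent("murder", 1),
              StoryEvent("arrest", 3)]

    def fabula_order(events):
        # Fabula: the causal-chronological order of the events.
        return [e.label for e in sorted(events, key=lambda e: e.story_time)]

    def plot_order(events, presentation):
        # Plot: the order in which the narrator presents the events,
        # which may dislocate story time (e.g. a flashback).
        known = {e.label for e in events}
        return [label for label in presentation if label in known]

    print(fabula_order(events))
    # ['murder', 'investigation', 'arrest']
    print(plot_order(events, ["investigation", "murder", "arrest"]))
    # ['investigation', 'murder', 'arrest'] -- the murder shown as a flashback

The point of the sketch is only that one underlying event structure admits many presentation orders; it is exactly such dislocations that the following paragraphs discuss.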
The plot, therefore, accepts all of the temporal and spatial dislocations which the creator, having an intention-directed end in mind, has put into effect. The plot is therefore '...an expression of an I external to it.' (Barthes, 1977, p. 110). In film in particular, the impression can arise that the "I" of the agent which puts forward the story is identical with the receiver (Metz, 1982). Since the plot is the actual arrangement of the Fabula, it can be concluded that narration is precisely manifested within the dramaturgical principles of the plot. However, the structural conception behind the plot (the dramaturgical process) is not sufficient for the required presentational task, and the additional aspect of style is needed, where style stands for the systematic use of media-specific devices. In film this may concern, among other things, camera work and editing. The relationship between plot, style and Fabula is described in Figure 2.2.

[Figure 2.2 The narrational process (based on Bordwell (1985, p. 50)): narration arises from the plot, which combines the Fabula with style.]

From the above analysis it should be clear that the plot is responsible for the perceptional design and for attracting and holding attention, whereas the Fabula is '... a theoretical elaboration of fundamental importance for describing the plot by contrast, for it constitutes a touchstone, a means of measuring dislocations there realized.' (Segre, 1979, p. 15). It is now appropriate to characterise the underlying principles of the plot.

2.1.1 Theme

There is initially a purpose to a story, which unites the story's separate elements. This purpose, which can be called an external reason (Wilensky, 1983a, 1983b) or idea, is the theme of the story. To make a story coherent, at least one theme must be represented. However, several overall themes are often represented, and, at the same time, each part of the story may reflect its own theme.

A theme actually performs two concurrent tasks. First, it arouses the interest of the receiver. In this respect, the underlying selection process for a theme deals essentially with general human emotions, such as love, grief, lust, guilt, and so forth, which must be elaborated within particular and well-formed material.[4] The choice of a theme has, therefore, a significant influence on the material forming the characteristic body of the presented story, or, in other words, the reality on which the story is based. Thus, there is a link between theme and genre.[5] A genre can be seen as an abstract network of features, such as a set of possible narrative objects (Aristotle's Poetics provides a set for tragedy; chapter 3 of this thesis provides a set for comedy), and characteristic objects and actions from the real world, upon which the individual plot defines its structure.

[4] This material can be seen as the internal reason for a story (Wilensky, 1983a, 1983b), such as the reason a character behaves in particular ways. The internal and external reasons must both be taken into account to communicate the intended idea.

[5] For film in particular, genre is to be understood as 'a world common to a range of films which may vary considerably in the detail of their narrative' (Tudor, 1974, p. 111).
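Purely as an illustration of this reading of genre as a feature network (the feature sets below are invented for the example and are not drawn from AUTEUR's knowledge base), such a network can be sketched as follows; combining genres, discussed next, then amounts to merging feature sets:

    # A genre sketched as an abstract network of features: narrative
    # objects plus characteristic real-world objects and actions.
    # All feature values here are illustrative assumptions.
    western = {
        "narrative_objects": {"duel", "revenge", "frontier justice"},
        "objects": {"horse", "revolver", "saloon"},
        "actions": {"ride", "draw", "chase"},
    }
    comedy = {
        "narrative_objects": {"incongruity", "derision"},
        "objects": {"banana skin", "custard pie"},
        "actions": {"slip", "collide"},
    }

    def combine(*genres):
        # Combining genres unions their feature sets, supplying aspects
        # of a theme that a single genre alone lacks.
        merged = {}
        for genre in genres:
            for feature, values in genre.items():
                merged.setdefault(feature, set()).update(values)
        return merged

    comic_western = combine(western, comedy)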
Examples of genres are the Russian folktales described by Propp (1968) or, for film, the screwball comedy as described in Brunovska Karnick (1995) and Gehring (1986). Thus, genre can be seen as the macro structure of a plot. However, genre implies generality, which leads to the reduction of reality into stereotypes, as described by Lippmann:

'A pattern of stereotypes is not neutral. It is not merely a way of substituting order for the great blooming, buzzing confusion of reality. It is not merely a short cut. It is all of these things and something more. It is the guarantee of our self-respect; it is the projection upon the world of our own sense of our own values, our own position and our own rights. The stereotypes are, therefore, highly charged with the feelings that are attached to them. They are the fortress of our tradition and behind its defences we can continue to feel ourselves safe in the position we occupy.' (Lippmann, 1934, p. 96).

It is justified, we believe, to argue that the individual plot often reflects many genres. The reason for this is that a particular genre might be limiting for the development of a theme, a problem that can be solved by combining it with other genres, whose stereotypical representations of reality feature the missing aspects of the theme. An example of the combination of genres is Zemeckis' film Back to the Future Part III, which combines the genres of western, adventure film, romantic love story, science fiction and comedy.

The influence of the theme on a narrative is that it determines the choice of human, social and natural material and the order of its presentation. Hence, the theme is crucial to the creation of meaning, which happens on a semantic and semiotic level. However, a discussion of the semantic and semiotic aspects of film will be deferred until chapter 4.

The second role of a theme within a narrative is to stimulate and maintain interest. The effect of a theme depends, in this case, on the intended emotion the theme should evoke, since emotions are a powerful medium for maintaining attention. For example, the aim might be to provoke anger, joy or disturbance, which must be developed from within the story. Thus, a theme evokes and develops feelings of hostility or sympathy according to a system of emotional strategies. The structure of these emotion-related strategies, for the example of humour, will be discussed in chapter 3.

2.1.2 Order

Narration is a structure-oriented activity that begins in the mind of the narrator but is completed in the mind of the receiver.[6] The plot must be ordered, therefore, in such a way that the relationships between the portrayed events, and the meaning being derived from these events and episodes (a series of events), encourage the receiver to perform causal-chronological inferencing. It should be mentioned here that the current author does not believe in film as a hermetic system of obvious logic which can be completely provided by its creator. The decoding of the different layers of a film by the receiver might result in a different meaning from that intended by the creator. Thus, we agree with reader-response theory.[7] This theory describes the process of creating meaning as a communication process between the end-product and the receiver. However, we do not follow its extension, that the origin of the piece of work, i.e.
its creation process, is of no importance, as expressed by Barthes, who wrote: '...a text is made of multiple writings, drawn from many cultures and entering into mutual relations of dialogue, parody, contestation, but there is one place where this multiplicity is focused and that place is the reader, not, as was hitherto said, the author. The reader is the space on which all the quotations that make up a writing are inscribed without any of them being lost; a text's unity lies not in its origin but in its destination.' (Barthes, 1977, p. 148).

[6] See also Ricoeur (1985, p. 46) and his interesting division of mimesis into mimesis1, the representation of human action in its semantics, its symbolism and its temporality; mimesis2, the mimesis of creation; and mimesis3, the interpretation of the mimesis, which connects the process of writing with reading and thus provides a theory of the relationship between narrative and time.

[7] Represented in film theory by Bordwell (1985, 1986, 1989, 1993).

The significance of narration is to establish the relevance of an action, event or episode by its position within a logical bond, since each (i.e. action, event, episode) is the consequence of the other. For example, an obscure relationship between events might lead to the idea of superfluousness, while an unsuitable link might give the impression of incoherence.

When discussing the theme, we argued that the plot is based on conventionalised event and object structures.[8] This conventionalisation is essential to the process of plot construction, since conventions transform an event into a self-regulated, i.e. closed and self-maintained, structure, or as described by Piaget:

'... what they [closure and self-maintenance] add up to is that the transformations inherent in a structure never lead beyond the system but always engender elements that belong to it and preserve its laws. Again, an example will help to clarify: In adding or subtracting any two whole numbers, another whole number is obtained, and one which satisfies the laws of the "additive group" of whole numbers. It is in this sense that a structure is "closed", a notion perfectly compatible with structures being considered a substructure of a larger one; but in being treated as a substructure, a structure does not lose its own boundaries, the larger structure does not "annex" the substructure; if anything, we have a confederation, so that the laws of the substructure are not altered but conserved and the intervening change is an enrichment rather than an impoverishment.' (Piaget, 1970, p. 14; comments in [] added by the current author).

[8] "Object" means here either a character or a physical object.

The above has two implications for the construction of a narrative, one concerning single events, and the other the relationships between events.

2.1.2.1 The event

To us, an event is a closed, self-maintained structure composed of pre-conditions, main-conditions and post-conditions. The pre-conditions perform a type of introduction of the characteristic objects, and of the locations, activities or moods of characters, necessary for the main part of the event. We refer later to this structural element as the motivation. Once an object or activity is perceived, the object, the action or the perceived mood of the character suggests certain possible events that can occur, and expectations that need to be realised. The appropriate plot structure for the main-condition of the event will be described as the realisation. The particular realisation may lead to certain reactions being expected, which provide additional clarifying
The particular realisation may lead to certain reactions being expected, which provide additional clarifying information. These post-conditions are part of a phase which will henceforth be referred to as the resolution.

For example, imagine the event of a character attempting to obtain coffee from a coffee machine. During the motivation phase, the character approaches the machine and searches for change. In the realisation phase of the event, the actor inserts the change. While the machine is operating, the character waits, and finally the machine provides the cup and the coffee. In the resolution phase of the event, the character looks for change, takes the coffee, and finally drinks the coffee while walking away.

In defining an event as a single and discrete structure that establishes archetypical behaviour patterns and cultural stereotypes, it is clear that certain functional elements - we actually understand actions as being the core functional elements of events - are more relevant to the event, or to a particular stage within it, than others.9 The indispensable functional elements are the dominant or bound functional elements of the event, whereas those that are not vital to the chronological causality are called free functional elements. Though free functional elements are not essential for the story, they can serve as the icing on the cake, since their digressive nature can enrich the presentation and thus support the well-formedness of the plot. In the above example of the coffee machine, the searching for change by the character in the motivation phase is an example of a free action, and the change itself is a free object, whereas the drinking is a dominant action and the coffee machine a dominant object.

9 We understand actions as being the core functional elements within an event, since they define other functional elements, such as the intentions or moods of characters, or the importance of the objects involved in the actions.

2.1.2.2 The relation between events

The relationship between events can serve to support resolution by answering the question 'What will happen next?'.10 In this respect, the relationship between events within a narrative is based on a hermeneutic code, as described by Barthes:

'Let us designate as hermeneutic code ... all the units whose function is to articulate in various ways a question, its response, and the variety of chance events which can either formulate the question or delay its answer; or even constitute an enigma and lead to its solution.' (Barthes, 1974, p. 17).

10 The author is aware of the fact that a narrative can serve other purposes apart from development, for example to display events as a state of affairs, as Chatman (1978) points out, suggesting Virginia Woolf's novel Mrs. Dalloway as an example of this second kind of narrative. However, this thesis is not focused on such open narrative structures.

The three essential elements for causality within an event, i.e. motivation, realisation and resolution, also provide the basis for the correlative, logical relations between events. Additionally, we can identify major and minor events. Major events are the driving force of the plot: they raise questions and answer them, and thus function as selective branching points for the possible paths through the narrative.
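The event structure just described is directly formalisable. As a minimal sketch (in Python; the class and field names are our own illustration, not a representation prescribed by this chapter), an event can be captured as a closed structure of motivation, realisation and resolution phases, whose actions are marked as bound or free:

    # Sketch of an event as a closed structure of three phases; 'bound'
    # marks dominant functional elements, free ones are dispensable.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Action:
        name: str
        bound: bool  # True: bound/dominant element; False: free element

    @dataclass
    class Event:
        motivation: List[Action] = field(default_factory=list)
        realisation: List[Action] = field(default_factory=list)
        resolution: List[Action] = field(default_factory=list)

        def skeleton(self) -> List[str]:
            """The causally indispensable chain: bound actions only."""
            phases = (self.motivation, self.realisation, self.resolution)
            return [a.name for phase in phases for a in phase if a.bound]

    # The coffee-machine event of the text:
    coffee = Event(
        motivation=[Action("approach machine", True),
                    Action("search for change", False)],   # free action
        realisation=[Action("insert change", True),
                     Action("wait", True),
                     Action("machine delivers cup and coffee", True)],
        resolution=[Action("look for change", False),      # free action
                    Action("take coffee", True),
                    Action("drink coffee while walking away", True)])

    print(coffee.skeleton())   # the bound chain that carries the causality

Removing a free action from such a structure leaves the skeleton, and hence the chronological causality, intact, which is precisely the property that distinguishes free from bound elements above.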
It is obvious, for example, that Hitchcock's film Rear Window depends upon Mrs Thorwald's murder and the main character's injury, which forces him to stay at home and thus results in him investigating the neighbourhood with his camera. Minor events serve the purpose of aesthetic enrichment, since they can easily be removed from the causal chain without affecting the overall understanding of the plot. Examples of minor events in Rear Window are the accompanying courtyard stories.

Figure 2.3 illustrates the relationship between the logical stages motivation, realisation and resolution and plot structure.

[Figure 2.3: Relationships between plot structures. A plot decomposes into episodes, an episode into events, an event into actions, and an action into subactions; at each level the elements are grouped into motivation, realisation and resolution stages.]

The analysis thus far has demonstrated that, although the structure of a plot is restricted by the choice of theme and the self-maintenance and closure of event structures, a plot can be distinguished within its particular presentation from any other plot which might share the same internal structure. The reason for this is that the narrator can organise the events in a great many ways. Depending on the theme(s) and the related genres, the stylistic aim of the narration may be to emphasise or conceal certain plot-events, to highlight events by providing additional detail, to omit events, to order events sequentially, to present them out of chronological sequence (e.g. flashback), and so forth. The following serve as stylistic devices that alter the way information in a narrative is perceived:

Exposition: provides crucial information in the plot to encourage curiosity.
Retardation: increases suspense by delaying information, as often used in detective stories.
Digression: of which commentary is an example.
Omission: of which incompleteness is an example.
Redundancy: intensifies significant information through repetition or opposition.11

11 For an exhaustive description of these devices and their perceptual and cognitive consequences, see (Bordwell, 1985). For a discussion of retardation in particular, see Sternberg (1978), and for a discussion of redundancy, see Suleiman (1983).

Finally, since the plot order encourages the perception of information, it is, of course, important to determine how much knowledge a plot can establish (objectivity versus subjectivity), how much recognition can be produced in the perceiver, and how communicative the plot is, i.e. '... how willingly the narration shares the information to which its degree of knowledge entitles it.' (Bordwell, 1985, p. 59). These categories are very important for the manipulation of temporal, spatial and causal-chronological factors to cue and guide the perceiver's activity. Thus, the combinatorial possibilities for plot ordering are enormous, but are nevertheless limited by the semantics and semiotics of the available material.

2.1.3 Time

As described above, the plot is a dynamic process. Thus, there is mediation between time and narration, which is succinctly expressed by Paul Ricoeur:

'Time becomes human to the extent that it is articulated through a narrative mode, and narrative attains its full meaning when it becomes a condition of temporal existence.' (Ricoeur, 1985, p. 52).

The three key elements of the representation of time within a narrative are duration, frequency and order. The relevance of order and frequency (e.g.
repetition) to the composition of the plot was discussed above. The following discussion therefore concentrates on duration, which is particularly important in film.

A narrative has a beginning and an end, between which it relates a sum of events, where each event can be either with or without a conclusion. The narrative itself is always a closed sequence (i.e. the last image describes the definite end of the film). A narrative is, then, a temporal sequence comprising two temporal schemata, one being the time of the story that is told and the other the time over which the story is told.12 The relation established in a narrative is, therefore, a time-in-time relation, whereas any description creates a space-in-time relation and an image forms a space-within-space relation (Metz, 1974). A simple example should illustrate this. Imagine a sequence showing successive shots of the sea followed by a shot of a boat sailing across a stretch of water. The first motionless shot of the sea is simply an image of the sea (space-in-space), whereas the successive shots form a description of the area (space-in-time) and the crossing boat establishes the narrative (time-in-time).

12 There are actually three time levels if we consider the time when the story was created. However, this level is more important for the interpretation of a story. For the understanding of Thomas Mann's novel Dr. Faustus (Mann, 1980) it is important to know that it was actually written during the time when the internal narrator is writing his report on the life of his friend Adrian Leverkühn. A further example is the film Trainspotting by Danny Boyle. Boyle creates a realistic view of the use of heroin and, due to its use of similar structures and stylistic means, a direct contradiction of Tarantino's artistic view presented in Pulp Fiction.

Thus, a narrative can be described as a system of temporal transformations. The system covers the time relations between fabula (TF), plot (TP) and performance (TPE) time and contains three transformational classes: equivalence, contraction and expansion (Bordwell, 1985). Equivalence can be described as the relation TF = TP = TPE. Contraction, where TF is reduced, can be divided into two subsystems:

Ellipsis: TF > TP and TP = TPE; a plot discontinuity marks the omitted portion of TF.
Compression: TF = TP and TP > TPE; no plot discontinuity, but a condensed duration of performance time.

Expansion, where TF is expanded, can also be considered as consisting of two subsystems:

Insertion: TF < TP and TP = TPE; a plot discontinuity marks the added data.
Dilation: TF = TP and TP < TPE; no plot discontinuity, but TF and TP are elongated.

The importance of the temporal transformational classes for film is discussed in chapter 4.

2.1.4 Space

Objects and subjects and their actions, i.e. events, appear in a spatial-referential frame which provides the perceiver with relevant information about the current surroundings, positions and assumed directions. However, the articulation of space is not really bound to specific narrational principles but is rather a cognitive-perceptive process based on lighting, sound, image composition and, especially in film, editing. Thus, space is discussed in detail in chapter 4, in which the perceptual and cognitive influences on the understanding of film are considered.
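The transformational classes of section 2.1.3 reduce to simple comparisons between the three durations. The following sketch (Python; the function name and the use of plain numbers for durations are our own illustration) makes the classification explicit:

    # Classify the relation between fabula (tf), plot (tp) and performance
    # (tpe) durations into the transformational classes described above.
    def temporal_class(tf: float, tp: float, tpe: float) -> str:
        if tf == tp == tpe:
            return "equivalence"
        if tf > tp and tp == tpe:
            return "contraction: ellipsis"     # discontinuity marks omitted TF
        if tf == tp and tp > tpe:
            return "contraction: compression"  # condensed performance time
        if tf < tp and tp == tpe:
            return "expansion: insertion"      # discontinuity marks added data
        if tf == tp and tp < tpe:
            return "expansion: dilation"       # elongated presentation
        return "mixed"                         # combinations of the above

    # Ten minutes of fabula shown in a one-minute sequence with cuts:
    print(temporal_class(tf=600, tp=60, tpe=60))   # contraction: ellipsis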
2.2 From structure versus content to structure and content

The underlying assumption of the above analyses is that a plot is a dynamic psychological entity that refers to mental or conceptual objects such as themes, goals, events or actions. In fact, the dynamics within plot construction are twofold. On the one hand, the intentions of the narrator must be achieved, i.e. to present the material as plausibly and succinctly as possible. The impact of this relies on articulational techniques, i.e. communication strategies between narrator and receiver. On the other hand, the dynamics within the material must be considered, since these form the bases of the plot, i.e. thematic structures.

Referring to the plot structure of Figure 2.3, we see that episodes, events, actions and subactions are organised hierarchically. It is, therefore, suitable to characterise a plot in terms of states and transformational rules, an idea which was propagated by the structuralist movement. Structuralism formulated a rationalised and deductive approach to narration, which considered narrative structure as analogous to language structure and thus linked structure with the determination of content (see Barthes (1967, 1977); Chomsky (1965); Lévi-Strauss (1968, 1977); Metz (1974); Price (1973); Propp (1968); Striedter (1971); van Dijk (1972)).13

13 This position was adopted by Barthes, who understood language as the master system for all other communication systems: 'Now it is far from certain that in the social life of today there are to be found any extensive systems of signs outside human language' (Barthes, 1967, p. 9). 'It is true that objects, images and patterns of behaviour can signify, and do so on a large scale, but never autonomously; every semiological system has a linguistic admixture' (Barthes, 1967, p. 10).

An analogous approach to the representation of narrative structure, in the field of Artificial Intelligence, considered the applicability of story grammars to text understanding, where the main influences came from Propp's work on Russian folktales and Chomsky's transformational grammar (see Colby (1973); Kintsch & van Dijk (1978); Lakoff (1972); Mandler (1977); Rumelhart (1975, 1977); Stein (1979); Thorndyke (1977)).

The main arguments against this approach are advanced by Black & Wilensky (1979).14 They show not only that the formal properties of the grammars are insufficient, but also that the computational costs of the representation are too high: the number of deletion and reordering transformations in the proposed grammars becomes extremely large, and yet the grammars are unable to produce a sufficiently varied set of stories. Finally, Black and Wilensky show that the semantic interpretation which should be supplied by the grammars is actually needed in order to apply the grammars. Black & Bower (1980) establish experimentally that the structures of story grammars do not reflect the human memory structures that are related to story parts. Graham (1983) argues that the finite set of lexical items within the story grammars reflects the assumption that the propositions of stories are, analogously to sentence structures, also finite. However, story propositions are defined recursively and thus their depth is arbitrary, from which it can be concluded that their propositional space is infinite. Hence, the analogy with sentence structures does not hold. Black & Wilensky, Black & Bower, and Graham arrive essentially at the same result: a story is a mental process based on different aspects of people's knowledge, of which structure is but one (see also Wilensky (1983a, 1983b, 1983c)).

14 For a summary of this debate see Andersen & Slator (1990).
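Graham's recursion argument is easy to reproduce with a toy grammar. In the following sketch (Python; the productions are our own illustration, loosely in the spirit of Rumelhart (1975), and not quoted from any of the cited grammars), an episode may embed a further episode, so derivations can nest to arbitrary depth and the space of derivable stories is unbounded:

    import random

    # Toy story grammar; terminals are quoted. The second EPISODE production
    # is recursive, so story propositions can nest to arbitrary depth.
    GRAMMAR = {
        "STORY":    [["SETTING", "EPISODE"]],
        "SETTING":  [["'characters and location are introduced'"]],
        "EPISODE":  [["EVENT", "REACTION"],
                     ["EVENT", "EPISODE"]],        # recursive production
        "EVENT":    [["'something happens'"]],
        "REACTION": [["'a character responds'"]],
    }

    def expand(symbol):
        if symbol.startswith("'"):                 # terminal symbol
            return [symbol.strip("'")]
        production = random.choice(GRAMMAR[symbol])
        return [leaf for s in production for leaf in expand(s)]

    print(expand("STORY"))    # one of infinitely many derivable stories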
Referring to Figure 2.1, we understand the plot to be a communicational process which is organised around surface structures (expression) and deep structures (content). However, as described in the above discussions of theme and genre, and of the relationship between events, it becomes clear that the communicational process needs further distinctions than content and expression. These additional differentiations concern substance and form, where substance represents the natural material for content and expression, while form represents the abstract structure of relationships which a particular medium demands (Chatman, 1978). Figure 2.4 shows the relationships between the differing structures found in narratives.

[Figure 2.4: Relationship between narrative elements (adopted from Chatman (1978, p. 26)). Content (the plot) divides into the form of content, i.e. events (actions and happenings) and existents (characters and settings), and the substance of content, i.e. people, things, etc. as preprocessed by the author's cultural codes. Expression divides into the form of expression, i.e. the structure of narrative transmission, and the substance of expression, i.e. its manifestation (verbal, cinematic, pantomimic, etc.).]

The theory of narration and the history of story grammars in AI show us that a story is a representational system based on two main layers, structure and content, each serving two distinct purposes simultaneously, these being form and substance. Hence, a primarily structure-oriented approach to plot generation is not appropriate. A planning approach seems more promising, as then the different levels can be separated, as they should be, while maintaining the interaction between the structure and content layers. Furthermore, we have seen that it is, in actuality, the substance of content and expression which is responsible for the well-formedness of a plot and the distinctions between plots. Chapters 3 (substance of content, form of expression) and 4 (substance of expression) will discuss these matters for two particular examples, humour for the former and film for the latter.

Chapter III
Humour

As described in the previous chapter, the theme is the link between plot construction and plot understanding. This chapter discusses, in detail, the emotional implications of our chosen theme of humour, and presents a system of strategies for evoking and developing a humorous reaction.

Emotions are subjective physical and psychological phenomena. Examples are excitement, relief, pleasure and aversion. Every feeling, such as hate, anger or joy, has a specific affiliated emotional response, which might vary in intensity. However, there is one particular emotional reaction that is associated with a wide range of feelings, and that is laughter. A cause of laughter might be when we see a bride drop a piece of chocolate cake on her wedding dress, when we draw a winner, or when we lose our mind. Thus, laughter can be associated with happy or bitter feelings; it can be sarcastic, ironic, nervous or insane - but, more importantly, it is never rationalised at the time. Laughter arises spontaneously from the unconscious (Jordan, 1975). The ability to appreciate the comic seems to be shared by all people, a factor which can be expected to help in the evaluation of any results.
Hence, humour was chosen as the example emotional theme for the film sequences to be constructed. Given the aim of this thesis, it is necessary to investigate laughter, or rather the processes which give rise in the mind to that pleasing sentiment of which laughter is the physical sign.

3.1 Humour - assumptions and definitions

It is commonly assumed that, regardless of actual content, jokes have the same underlying features.1 However, humour is obviously particularly conditioned by the sociocultural background of creator and perceiver. The underlying themes and concepts on which humour is based vary according to culture, social class, role, sex, and so forth. Furthermore, cultures establish role relationships and other rules relating their elements and, therefore, offer contextual clues signifying that specific occasions are appropriate for the creation of humour.

1 There is, for example, little shared content or plot structure in such diverse films as Dr. Strangelove, Modern Times, Monty Python's The Meaning of Life, Airplane! and Take the Money and Run.

Several disciplines feature research on humour, among them psychology, philosophy, semiotics, sociology and linguistics. In order to provide a framework for our discussion of the major theories of comedy, we adopt the approach of Raskin (1985) in grouping humour into three main classes: cognitive-perceptual, social-behavioural and psychoanalytical. Since this thesis deals with the automatic generation of film sequences, a further limitation is that the discussion concentrates mainly on visual humour.

3.1.1 Cognitive-Perceptual Class

The theories represented in the Cognitive-Perceptual class consider forms of humour based on incongruity or exaggeration. The first theoretical definition relevant to this class was given by Aristotle in his Poetics, where he describes the ridiculous as:

'... some error or ugliness that is painless and has no harmful effects. The example that comes immediately to mind is the comic mask, which is ugly and distorted but causes no pain.' (Aristotle, 1968, p. 9).

The three essential comic principles recognised by Aristotle are exaggeration, readiness and incongruity.

The simplest way to create humour is to exaggerate salient traits. The actors in Greek plays, for example, always performed with stylised masks, portraying a type of, rather than an actual, character. Exaggeration was used as a poignant device to emphasise a personal shortcoming, which for comic purposes was often the grotesque. This means that an exaggerated facial expression, or perhaps an overly large nose, or the extravagant appearance of an actor is likely to be funny in itself. The same concept can readily be found in silent movies, where appearance almost immediately implies character.2 A villain, for example, had to be 'big, fat, expensively-dressed and moustachioed' (Jordan, 1975, p. 6). In this view, character is seen in terms of appearance rather than personality, since personality is subordinated to a defective and eccentric notion of people.

2 To some extent this also applies to contemporary films.

A further way to create humour is to imitate, or parody, existing works. The exaggerated imitation of a work or genre requires more ingenuity than character parody, though exaggeration is common to both. Examples of parody plots are Abbott and Costello Meet the Mummy (as in many of their films, they parodied the horror genre), Bananas (a series of parodies of specific films) or The Naked Gun (a parody of detective stories).

Lightheartedness is one of the basic tenets of much comedy.
An action or event is only considered to be funny if it takes place in a comic climate, a term coined by Gerald Mast (1979). It is obvious that, even in a tragedy, there is potential for some comic touches. However, the overall aim of tragedy is to promote painful emotions, which then become a catalyst for the intended catharsis of the audience. Comedy, on the other hand, avoids pain by creating a sphere of sympathy and reduced importance or worthlessness.3 The aim is to suspend audience disbelief, and establish a frame of mind in which comic events are expected. This can be achieved by introducing insignificant subject matter or by reducing a serious subject to trivia, as, for example, in the film Dr. Strangelove. In such a context, it is also helpful to provide a character with either positive attractiveness (grace, poetic charm, wit, or good looks) or grotesque, stylised wish fulfilment (Jordan, 1975), which enables the audience to feel empathy, or even identify, with the character.

3 See Olson (1968, pp. 36-37), who describes comedy as 'the imitation of a worthless or valueless action.'

The fundamental notion behind the principle of incongruity is that laughter is based on the recognition of the dissimilarity between the way things are and the way they should be, a relation which is concisely described by Schopenhauer:

'In every case, laughter results from nothing but the suddenly perceived incongruity between a concept and the real objects that had been thought through it in some relation; and the laughter itself is just the expression of this incongruity. ... All laughter therefore is occasioned by a paradoxical, and hence unexpected, subsumption, it matters not whether this is expressed in words or in deeds.' (Schopenhauer, 1966, p. 59).

Incongruity can be based on different rhetorical strategies, such as inappropriateness, paradox, or dissimilarity, depending on the intended type of comedy, i.e. ludicrous or derisive humour, satire, irony, pun or wit.4 Additional principles for incongruity are introduced by Kant, who states:

'In everything that is to excite a lively convulsive laughter, there must be something absurd (in which the understanding, therefore, can find no satisfaction). Laughter is an affection arising from sudden transformation of a strained expectation into nothing.' (Kant, 1951, p. 177).

If a spectator is confronted with a character in a particular environment, a certain train of thought develops in the viewer, who expects the character to behave predictably. However, if the character acts in an unexpected way, this surprises the viewer. The important aspect is that the expectation within the viewer should be carefully developed, i.e. strained, so that the resulting and sudden surprise can be even stronger. Suddenness is important, as jokes based on surprise are either immediately funny or are not funny at all.

There are arguments suggesting that pure incongruity does not entirely explain the structure of humour (Beattie, 1776; Freud, 1960; Rothbart, 1976; Shultz, 1976; Suls, 1972). Such arguments propose a two-stage model in which incongruity becomes meaningful and appropriate through resolution. It is this second stage which conveys information that explains the sense of the joke in the light of the material that presented the joke.

4 Since the last two concern mainly language-oriented humour, they will not be considered further, as in this thesis sound is excluded.
However, the author is aware of the fact that inferences drawn on the basis of visual clues are often confirmed or disconfirmed by verbal statements, which leads to a dramatic increase in explanatory possibilities. Work concerning the automated generation of puns can be found in Binsted & Ritchie (1994a, 1994b).

Within this incongruity-resolution model it is possible to distinguish humour from sheer nonsense, as Shultz points out: 'Whereas nonsense can be characterized as pure or unresolvable incongruity, humour can be characterized as resolvable or meaningful incongruity.' (Shultz, 1976, p. 13). See also Rothbart & Pien (1977), who present an elaborate model of two types of incongruity and two types of resolution:

• Impossible incongruity: events that are unexpected and also impossible given one's current knowledge of the world. Film examples can be found in Chaplin's The Gold Rush.
• Possible incongruity: events that are unexpected or improbable but possible. A film example is the comedy Bringing Up Baby, where a series of misadventures befall a couple, including their adoption of a leopard called Baby.
• Complete resolution: the initial incongruity is completely solved by the given resolution information. An example is the emotional reaction of a person who slips on an ice-covered path.
• Incomplete resolution: the initial incongruity is solved by the given resolution information in some way, but is not made completely meaningful because the situation remains impossible.

For the enterprise of creating humour, the theory of Monro (1951) is important. Monro describes humour as consisting of different types of nonsense which serve to create a new reality, such as attitude-mixing or universe-changing, by twisting familiar material using argument, rhetoric, or exaggeration, so as to obtain an absurd conclusion.

Two examples from Chaplin movies should further demonstrate incongruity. The first is taken from the Chaplin film The Immigrant. The scene is set on a rolling boat. Chaplin is leaning over the rail and his legs are waving in the air. We automatically assume he is suffering from a heavy attack of seasickness. However, when he turns around we realise that he is actually fishing. The second example is from The Idle Class. A woman refuses to forgive her husband (Chaplin) and leaves the room. Chaplin turns his back to the camera. His shoulders heave and shake. Then he turns and is seen to be shaking a cocktail. His face is one big smile.

3.1.2 Social-Behavioural Class

The main principle of theories in the Social-Behavioural Class is detraction, which derives from Hobbes, who wrote:

'The passion of laughter is nothing else than sudden glory arising from some sudden conception of some eminence of ourselves, by comparison with the infirmity of others, or with our own formerly: for men laugh at the follies of themselves past, when they come suddenly to remembrance, except they bring with them any present dishonour.' (Hobbes, 1650, p. 45).

Although there are many social-behavioural theories of humour, e.g. disparagement theory (Suls, 1977), superiority theory (La Fave, 1972), vicarious superiority theory (La Fave, Haddad, & Maesen, 1976) or dispositional theory (Zillmann & Cantor, 1976), all describe humour as laughter arising from a sense of superiority of the perceiver to the object, person or idea being presented. Laughter, according to this view, is laughter at someone.
Thus, such theories refer to the relationship between, or the attitude held by, two parties. To distinguish comedy from spite or cruelty, it is necessary that the laughter does not cause offence, which means it must not be based on the absurdity or infirmity of the other.5

5 See also Beattie (1776), who stated that strong moral beliefs (indignation), feelings (disgust, pity or fear) or common sense (disappointment) of the perceiver can lead to unfunniness.

A feeling of superiority can also be related to a feeling of triumph. The perceiver, managing successfully to create a logical pattern from the presented material, suddenly realises that the portrayed event has twisted into an unexpected direction and starts laughing. Hence, it is not only the perception of incongruity which makes people laugh, but also the realisation which establishes a feeling of superiority to the situation (J.B. Baillie, as presented in Jordan (1975, pp. 17-18)).

An example from Mike Leigh's film Naked should exemplify the idea of superiority and triumph as the basis of a joke. A man pastes a poster on a wall (advertising a concert), while the main character constantly speaks to him. Both leave the place in the van of the posterer. They drive through the city and arrive at another place, where the concert poster is already on the wall. The posterer puts some other posters on the wall and then suddenly pastes a strip marked "cancelled" over the concert poster. We see the posterer's behaviour as stupid, and this makes us laugh.

A feeling of triumph can, of course, be turned against the perceiver. Imagine a scene where a man approaches a freshly painted bench but, shortly before sitting, realises the state of the bench and straightens up, only to turn and immediately fall over a litter bin. This example shows that it is not necessarily only the other at whom we laugh (the eventual misfortune), but that the joke can also be on us, the perceivers (the expected misfortune did not take place).

A further relevant theory in the Social-Behavioural Class is the Mechanical Theory of Bergson (1956). This theory suggests that a comic phrase, action, character or plot within a comedy is reduced to a mechanical process - a functional principle especially useful for slapstick, where the key element is actually the rapid transformation of states. In this context, people are seen as mechanical toys, because no human being could possibly bear the physical torture of the slapstick world. However, the perceiver is not at all concerned with the health or safety of these 'puppets' because they are thought of '...as machines - which might break but which can always be fixed or replaced' (Mast, 1979, p. 50). Such situations are found regularly in cartoon films. An example, taken from the film The Naked Gun, should clarify this idea. The main character, Lieutenant Drebin, has to find a brainwashed baseball player and thus takes part in a baseball game as a referee. His behaviour suggests that he is not familiar with the role of a referee, as he stands too close to the batsman. As a result, he is hit by the bat, and from this moment on he behaves like a baseball - propelled backwards into the arms of the catcher.

However, the above example also reveals a weakness of the theory, i.e. Bergson's denial of emotions, which, in his opinion, ruin laughter. Who is Lieutenant Drebin, with all his shortcomings, if not a figure of sympathy? Nevertheless, Bergson's theory applies to certain restricted cases.
3.1.3 Psychoanalytical Class

Humour theory from the Psychoanalytical Class deals with suppression and repression. Here, laughter is seen to provide relief from the great number of constraints - such as to be logical, to think clearly, to talk sense - that a human being has to obey (Freud, 1960).6 Thus, the comic allows us to return to a kind of childishness by providing an anarchic freedom from existing conventions.7

6 See also Mindess (1971, p. 28), who wrote: "In its early stages our sense of humour frees us from the chains of our perceptual, conventional, logical, linguistic, and moral system."
7 Examples in this vein are: Brats, Ferris Bueller's Day Off, Big, Bill and Ted's Excellent Adventure, Home Alone and Wayne's World.

Humour emanating from the unconscious liberation from authority finds its strongest expression in comedy related to aggressive or sexual drives (tendentious humour), two instincts that are strongly repressed by society. The controls of society mean that aggressive instincts, for example, are softened and dispelled by the intellect. Nevertheless, a deviation from social norms, such as that represented by violence, can be used to provide the audience with an emotional release for its antisocial urges, especially if the antagonism displayed exposes a higher morality. An example of tolerable violence of this type can be found in Frank Capra's Mr. Deeds Goes to Town. The film ends in a courtroom climax, in which Deeds hits several characters while simultaneously stating the superiority of his values, with the assent of the judge. However, a similar result was observed at a showing of Quentin Tarantino's Pulp Fiction to a student audience, where the accidental killing of a character by one of the philosophical hitmen produced a great deal of laughter from the audience.8 It is clear that such humour is connected with the abuse of social norms.

8 Despite its high level of violence, Pulp Fiction is often described as one of the most humorous movies of its year.

The above discussion of the sources of laughter leads us to conclude that comedy is a context-related experience of intellectual complexity, influenced by the norms, feelings and ideas of a particular social group.9 It is now time to consider the narrative structures discussed in the previous chapter, in conjunction with the approaches to humour discussed above. The aim of the remainder of this chapter is to isolate the key primitives of humour, i.e. those found in most humour theories, and to demonstrate how a formalism based on these primitives might be used to generate a visual statement that effectively provokes a humorous reaction.

9 Nevertheless, each of us has a sense of humour which is, in part, uniquely ours (see, for example, Leventhal & Safer (1977)).

In section 3.2 we present the humour primitives. In section 3.3 we describe our scheme of humour formalisms. Finally, section 3.4 defines a method to evaluate jokes.

3.2 Humour primitives

The crucial primitives of humour creation are incongruity, derision, timing, readiness and exaggeration. These primitives can be classified into two types: first, those which support the making of a joke, i.e. exaggeration, timing and readiness, and, second, those which have a constructive or intentional drive, i.e. incongruity and derision. The following discussion emphasises the functionality of each primitive.

3.2.1 Readiness

Readiness is the establishing of a frame of mind, or a comic climate.
To provide the audience with a credible situation or character requires material that is plausible in nature. Imagine a character in a hurry. He jumps into the sidecar of a motorcycle, the engine starts, and the driver disappears on the motorcycle, leaving the character and sidecar behind.10 For the best humorous result, the audience should appreciate that the character is hurrying. In other words, the emotional continuity of the impression depends on the narrated logic. Thus, plausibility applies to the feelings and attitudes of the audience as well as to the meaning of the presented material. The role of readiness is, therefore, to combine the logical and emotional elements of a joke, by applying the values of serious life (represented by stereotypical behaviour and patterns of events as well as by standards of morality) to achieve the ironies, ambiguities and inconsistencies presented, and maintain the equilibrium between the comic elements involved.

10 This example is taken from Eastman (1937, p. 343).

3.2.2 Timing

In humour, timing is crucial and serves two purposes. Firstly, much humour relies on the element of surprise (as discussed earlier), without which it becomes merely a logical juxtaposition of emphasis. Comedy must be portrayed quickly and arise from the unknown, or it loses strength, as described by Eastman:

'Therefore, not only must the mind be genuinely on the way (plausibility), and the not-getting-there a genuine surprise (suddenness), but the surprise must come at the instant when the on-the-wayness is most complete, and the surprise most unexpected.' (Eastman, 1937, p. 354).

A second temporal aspect of humour is repetition. We never laugh as hard at a joke we have seen before, no matter how delighted we were on first viewing. Hence, the same joke should not be immediately repeated. However, if the joke is paraphrased through new material, humour can be prolonged, but only until no further escalation is possible. A wonderful example of the exploitation of one particular idea and comic mechanism is the stateroom scene in the film A Night at the Opera. During a voyage, the Marx Brothers call many different service people (room service, manicurist, etc.) into their cabin. Each of them performs their duty, while the cabin gradually fills up. The scene ends with all of the occupants of the cabin tumbling out through the door towards the last person for whom the door is opened.

3.2.3 Exaggeration

The basic idea behind exaggeration is an emphasis of conflict, based on two principles:

Hyperbole: a way of describing something in order to make it seem bigger, better, worse, etc.
Miosis: a way of describing something in order to make it seem smaller, milder, etc.

The aim of humorous exaggeration is to work on the quality of something with which the perceiver is not deeply concerned, in such a way that the logical extension is beyond what can be seriously accepted. Hence, it is not the event or action itself that matters for the gravity or levity of the situation, but rather the intentional view of these.

3.2.4 Incongruity

The term incongruity is not particularly helpful in itself, since it merely defines the conflict between the expected and what actually occurs. It is more useful to look at the various concepts implied by the term:

Ambiguity: the possibility of interpreting an expression in two or more distinct ways.
Irony: the use of situation, speech, and so on, to imply the opposite of what is apparently meant.
Variations in grade are understatement, satire and sarcasm.
Absurdity: at variance with reason. Variations in grade are the ridiculous and the ludicrous.

In considering variations in the possible level of incongruity, it is apparent that incongruity can easily lead into derision. However, the essence of incongruity is an understanding of the relationship between two situations or thoughts - one reason why different people laugh at different things. It is, therefore, important that the creator of the comedy is aware of the apparently dominant information and the pattern-breaking finale, and takes all possible steps to reinforce the deception, so that the final collapse into incongruity is complete. Thus, all types of incongruous humour depend mainly on an appropriate temporal presentation of the final stages of a plot. Figure 3.1 shows the emphasis-based relationships between incongruity and the logical elements of plot construction described in Chapter 2.

[Figure 3.1: Relationship between incongruity and the narrative stages motivation, realisation and resolution.]

3.2.5 Derision

There are actually two important features of derision. First, the explicit transformation of moods or states of the portrayed character:

Mischief: behaviour that results in trouble (downgrade of mood) and possible damage (transformation of state) to the subject, but no serious harm.

A second feature of derision, which is more difficult to measure, is the implicit upgrading of the mood of the viewer:

Schadenfreude: a delight in another's misfortune, where that misfortune is unexpected by the other.
Superiority: a delight in another's misfortune, where the misfortune results from inappropriate (e.g. stupid) behaviour.

As temporality supports incongruity (discussed earlier), so it supports derision, though to a less critical extent. For derision, sympathy is far more important. If the unpleasantness is not appropriately presented, it might be that the situation is too violent, banal or pathetic; derision then leads to a lack of interest in being amused, and the material is seen as unpleasant rather than funny. Thus, the adequate motivation of the viewer and the presentation of the humorous event are particularly important for derisive jokes. The position of derision in narrative logic is described graphically in Figure 3.2.

[Figure 3.2: Relationship between derision and the narrative stages motivation, realisation and resolution.]

3.3 Humour strategies

Humour strategies combine the logic of narrative structures with the functional operators of comic primitives. The goal of the ensuing discussion is to show that the combination of the above primitives in strategies can achieve emotional reactions from viewers of automatically generated meaningful visual sequences.11 The strategies to be presented were derived from the humour theories described above, from discussions with humour theorists and from intensive analysis of the visual humour found in film comedies.

11 The author is aware of the fact that the strategies cannot be seen in isolation but require a global understanding of the video images and their wide range of possible micro- and macroscopic connotations. However, discussion of these problems is deferred until chapter 4.

As discussed in Chapter 2, every narrative begins by establishing a set of spectator expectations, based on a logical order of events, where these expectations are eventually foiled. A comic event 'can exist only within a narrative context - as a consequence of the existence of characters and a plot', in contrast to gags, which 'constitute digression within a story or story-based action' (Neale & Krutnik, 1990, pp. 44, 57).
It is thus clear that comic strategies are applicable at the level of events, as well as episodes, or even the plot, since all reflect the same logical pattern of motivation, realisation and resolution. The strategies defined below apply to the episode, an approach which can be compared with early silent movies, where a plot was replaced by a basic situation, perhaps focusing on a place (shopping centre, school), an event (a marriage), or an object (a hose), and a series of jokes was based around this central situation (see, for example, Shoulder Arms by Charles Chaplin).

Recalling Figure 2.3, which illustrates the relationship between the three logical elements and the structure of a plot, it is clear that action establishes plot. The character's actions express his or her mood or intentions to the audience. Thus, actions define the character. Moreover, actions take place in an environment, which may feature other characters and objects, and this creates a context. This leads to the following strategy:

H-Strategy 1: An action forms the most suitable subject for a joke, then an actor, then an object, and finally a location.

An action expresses an intention or mood. Given the assumption that an actor focuses his performance on one action, the following strategies may transform the performance into comedy. The comments in brackets contain the comic primitive and the expected reaction of the viewer.

H-Strategy 2: If the action portrays an intention [goal], interrupt the action in a way that is unexpected by the character, so that the goal cannot be fulfilled and the character's mood is downgraded or he or she suffers in some way. (Mischief + Schadenfreude)

An example joke of this type is when we see someone eating a cake (goal) and a piece drops onto the shirt (interruption). The narrative element here is the following of one action by an oppositional event. A variation in the level of Schadenfreude can be achieved by changing the motivation of the derisive joke. It is, for example, more harmful to the character if the situation takes place in a public setting in which he or she plays a major role, e.g. if she is the bride at a wedding, than if the same event takes place, say, in the privacy of her home.

H-Strategy 3: If the intention of the joke is derisive, a public setting is preferable to a private setting, but in both cases the character should play a major part. (enhanced Schadenfreude)

However, if the character realises that there is too much cake on the fork and yet still tries to get the cake into his or her mouth, with the same humiliating result, an equivalent narrative structure then provides a qualitatively superior joke, since the changed emphasis in the motivation leads to a stronger involvement of the viewer. From this the following two rules can be derived:

H-Strategy 4: If the action portrays an intention (goal), interrupt the action in a way that is expected by the character, so that the goal is unfulfilled and the character's mood is downgraded or he suffers in some way. (Mischief + Schadenfreude + Superiority)

H-Strategy 5: If the intention of the joke is derisive, reveal the point in advance.
(enhanced Schadenfreude)

Referring back to the last cake example, we now see that it is covered by H-Strategy 4, where eating the cake represents the goal, the awareness of the character fulfils his or her required expectation, and dropping the cake represents the interruption and the damage. Nevertheless, the intention behind, or the goal of, an action is not always obvious. The fishing joke from the Chaplin film The Immigrant (described earlier) cannot be established by the above strategies. Hence, additional strategies must be introduced for such cases:

H-Strategy 6: If the action portrays an ambiguous mood, encourage the context-related positive attitude but then lapse into a mood that portrays the inferior attitude. (Ambiguity + Superiority)

H-Strategy 7: If the action portrays an ambiguous intention [goal], encourage the context-related obvious action but then lapse into the alternative. (Ambiguity + Superiority)

H-Strategy 8: If the intention of the joke is incongruous, do not reveal the outcome until the realisation or resolution. (Superiority or enhanced Superiority)

Consider again the Chaplin "fishing joke". This can be created by H-Strategy 7. The ambiguous intention (fishing, feeling sick) is represented by the action "leaning over the rail". The impression of "feeling sick" is encouraged by the environment, a rolling boat, while the lapse into the alternative is achieved by Chaplin turning around and showing a fishing rod. Reordering the joke by first showing Chaplin with the fishing rod and then showing him leaning over the rail would not be funny; thus, this kind of order should be avoided (see H-Strategy 8).

The combination of the above strategies can result in relatively powerful jokes. However, first we introduce four more strategies to support such combinations.

H-Strategy 9: The structure of a joke can only be repeated if different content combinations are possible. The continuation should be of higher emotional intensity. (Timing + Exaggeration)

H-Strategy 10: The structure of a joke can only be repeated while the outcome of the action or event portrayed is still warranted. Progression ad absurdum. (Exaggeration)

H-Strategy 11: If the exaggeration is part of a repetition, it is easier to exaggerate attributes of objects than attributes of actions. (Exaggeration)

H-Strategy 12: Given that strategies 9 and 10 are fulfilled, a particular strategy should not be repeated more than four times, or boredom may result. (Timing)

A sequence from the film The Naked Gun 2 1/2 serves as an example of strategies 9 to 12. In this sequence, Dr. Albert Meinheimer, a kidnapped researcher, and Lieutenant Frank Drebin are held prisoner in a storehouse. Drebin, who is tied to an office chair, attempts to free himself by moving his tied hands along the corner of a steel-framed stack of shelves. In front of the same stack sits Meinheimer, tied to his wheelchair. Drebin's rhythmic movements release the following objects from the top shelf onto Meinheimer's head: a baseball bat, a number of balls (H-9), a set of pool balls (H-9 + H-11), four horseshoes (H-10 + H-12), a skittle and then a skittle-ball (H-9 + H-12), oil spilled from a can (H-10 + H-12), white flakes of polystyrene from an overturned box (H-10) and finally a small anvil (H-10). The last object, in particular, presents problems of plausibility. However, the film provides the audience with a large number of exaggerated absurdities, so this one is readily accepted.12

12 The films in this series make repeated use of H-Strategies 9-12, either in a simple way, such as the given example, or through more advanced variations, such as the repeated parodying of other films (e.g. Casablanca, Ghost, and Psycho, in The Naked Gun 2 1/2), which actually strengthens the underlying structure of the film series itself, this being the parodying of detective stories.
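Taken together, H-Strategies 9 to 12 behave like a small control loop over repetition: repeat only while a new content combination is available, escalate the emotional intensity, and stop after at most four repetitions. A minimal sketch (Python; the names and the numeric intensity scale are our own illustration, and the plausibility test demanded by H-Strategy 10 is reduced to a caller-supplied predicate):

    # Sketch of the repetition control implied by H-Strategies 9-12.
    def plan_repetitions(variants, still_warranted=lambda v: True,
                         max_repeats=4):
        """variants: (description, intensity) pairs, i.e. different content
        combinations for the same joke structure (H-9)."""
        chosen, last_intensity = [], float("-inf")
        for description, intensity in variants:
            if len(chosen) >= max_repeats:        # H-12: at most four repeats
                break
            if intensity <= last_intensity:       # H-9: must escalate
                break
            if not still_warranted(description):  # H-10: outcome still warranted
                break
            chosen.append(description)
            last_intensity = intensity
        return chosen

    # Escalating objects, as in the storehouse sequence (H-11 suggests
    # exaggerating object attributes rather than action attributes):
    falling = [("baseball bat", 1), ("pool balls", 2),
               ("horseshoes", 3), ("small anvil", 5)]
    print(plan_repetitions(falling))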
With the above rules we are now in a position to use repetition in jokes. However, we would still have problems with a joke such as the following. Recall the joke where a man approaches a freshly painted bench. The viewer expects the character to sit on it (H-5). The character notices the state of the bench just before sitting down (H-4), and manages to avoid sitting (which is a typical reaction of the character but unexpected by the viewer of the scene), after which he turns, only to fall over the litter bin (see H-2, where "falling over" interrupts "getting up"). The extra but crucial structural plot element in this case is not only the repetition of a strategy but the creation of two "butts", the character and the viewer. We therefore require an additional strategy:

H-Strategy 13: If the intention of a joke is derisive, then suggest a mishap, but do not resolve into mischief but into the expected reaction which avoids the mishap. Then repeat the strategy used on the resulting action, this time enabling a mishap. (Mischief + Frustrated Schadenfreude + Mischief + Schadenfreude + Superiority)

The strategies defined thus far reveal, firstly, that the focus on single actions puts the emphasis on derisive humour and, secondly, that strategies involving the character and the perceiver might be more complex, but are definitely more effective:

H-Strategy 14: For a single action, mischief is easier to achieve than ambiguity.

H-Strategy 15: Strategies that contradict the viewer's expectations before fulfilling them are preferable to those that present the material in a straightforward way.

However, it is usual for a character to be involved in several actions simultaneously, which has implications for both the construction and the content development of the presentation. The actions involved in an episode may be connected, in that they form a larger logical pattern, such as the different stages involved in obtaining a cup of coffee from a vending machine, or they may simply be performed at the same time for some particular reason. It should be clear that a familiar pattern more suitably meets the requirements of exposition, since the context is already given and need not be motivated.

H-Strategy 16: A sequence of actions that is meaningful is preferable for the construction of jokes to a sequence of unrelated actions.

Before investigating actions that take place within an existing context, it is pertinent to consider random actions taking place in the same temporal interval. Parallel actions give rise to the problem of creating a relationship between the actions, to place them in the same context. This can be achieved either by constructing a wider narrative context which can accommodate the actions, or by establishing a contextual relation based on conflict.

The first approach, the creation of a wider narrative context, requires a comparison of the given actions with a set of expositional actions within an expected chain of events. If a set can be found where one (or preferably more) actions match required actions, then the complete chain of events can serve as the basis for a joke.
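This matching of observed parallel actions against stored chains of events might be sketched as follows (Python; the event chains and the counting score are our own illustration of the idea, chosen to echo the example that follows):

    # Sketch: choose the exposition whose expected chain of events accounts
    # for the largest number of the observed actions (illustrative data).
    EVENT_CHAINS = {
        "obtain coffee from machine": ["approach machine", "search for money",
                                       "insert money", "take coffee"],
        "make a phone call":          ["approach phone", "search for money",
                                       "insert money", "dial number"],
    }

    def best_exposition(observed, chains=EVENT_CHAINS):
        # Ties would be broken by the suitability of the environment
        # (e.g. a visible coffee machine), which is not modelled here.
        def matches(chain):
            return len(set(observed) & set(chain))
        return max(chains, key=lambda name: matches(chains[name]))

    print(best_exposition(["search for money", "scratch head"]))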
Imagine a character standing in a large public indoor area, scratching his head and moving his other hand in a pocket of his trousers. In the far background we can see a coffee machine, and our character appears to be looking in this direction. A possible exposition is that our character would like to obtain some coffee, and is searching for money in his pocket. If several expositions are available, the most applicable (the one with the highest number of matching actions and a suitable environment) is chosen. The relevant narrative structures involved are exposition and addition, as it is most likely that several actions or objects must be added to form a suitable motivation for the joke. In the above example of the coffee machine, it would be essential to establish the character near enough to the machine to use it. The construction of jokes based on the interaction between the character and the coffee machine will be described later in this chapter.

The second alternative for creating a relationship between random actions, based on conflict, reflects the assumption that a number of a character's actions will share resources, i.e. subactions. Conflict can then be used to establish a derisive joke. Imagine, for example, a character sitting at a table for breakfast. He is reading a newspaper, and at the same time he is dipping a croissant into a cup of coffee. Both dipping and reading usually require looking. The character cannot look at the newspaper and the coffee at the same time. Thus, there is a conflict which can lead to an unexpected result for the character, e.g. dipping the croissant into a jar of mustard instead. With reference to the above, a new set of strategies can now be introduced.

H-Strategy 17: The combination of parallel actions into larger meaningful units provides more comic potential than a single action.

H-Strategy 18: The combination of parallel actions based on conflict is more straightforward than the combination of parallel actions into larger meaningful units.

H-Strategy 19: The combination of parallel actions based on conflict is preferable for the construction of derisive jokes.

H-Strategy 20: If parallel actions establish a conflict, then encourage the action with the stronger link to the shared resource, but then construct the joke on the basis of the other action, by using the existing strategies for a single action.

H-Strategy 21: The combination of parallel actions into larger meaningful units is preferable for the construction of incongruous jokes.

Referring back to the "croissant joke", we see that it can be created using H-Strategy 20. The conflict is established by using "reading" and "dipping". The use of "dipping" for the joke is justified by its looser relationship to looking. Imagine that we first see the character reading the newspaper; we can then generate the joke by using H-Strategy 2 for the action of dipping, where the mustard represents the suffering caused.

Thus far, the assumptions made are relatively restrictive, since they consider only a single character. However, actions usually involve objects, other characters, or even groups of characters or objects. A new level of complexity arises, as the idiosyncratic goals of each participant can converge or diverge. This requires the construction of a wider narrative context. We will first investigate the relationship between a character and an object, as in this case we are confronted with only one goal (in cases where the object is active we have to support two goals).
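H-Strategy 20 in particular can be read as a simple resource check over parallel actions. A minimal sketch (Python; the encoding of actions, resources and link strengths is our own illustration):

    # Sketch of H-Strategy 20: detect a resource shared by parallel actions,
    # encourage the strongly linked action, and build the joke on the other.
    def choose_joke_action(actions):
        """actions: {name: {resource: link_strength}}, higher = stronger."""
        shared = set.intersection(*(set(links) for links in actions.values()))
        if not shared:
            return None                    # no conflict, H-20 not applicable
        resource = shared.pop()            # e.g. 'looking'
        ranked = sorted(actions, key=lambda a: actions[a][resource])
        return {"shared resource": resource,
                "encourage": ranked[-1],   # the stronger link, e.g. reading
                "joke on":   ranked[0]}    # the weaker link, e.g. dipping

    breakfast = {"reading newspaper": {"looking": 0.9},
                 "dipping croissant": {"looking": 0.4}}
    print(choose_joke_action(breakfast))

The joke itself would then be generated by applying a single-action strategy, here H-Strategy 2, to the weakly linked action.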
Assume that a character approaches a coffee machine. Even before the interaction between human and machine takes place, a perceiver of the scene would have certain expectations concerning the events about to occur. These expectations form the foundation for possible gags. The important narrative elements are immediately apparent: additional repetition, deletion or twisted analogy.13 Since humans tend to empathise with other humans rather than with machines, it is necessary to work additionally on the mood of the character, i.e. we have to establish a sufficiently good mood to be downgraded. A possible sequence, therefore, could show the smiling character approaching the machine, inserting the money, receiving the coffee but no cup, and becoming angry (see H-Strategy 23 below). An example of twisted analogy might be the ejection of a chocolate bar instead of coffee.

13 Twisted analogy means that the behaviour of a participant, or the outcome of an event, is correct but for a different context.

H-Strategy 22: If the relationship between a character and an object can be established within a given sequence of actions (event), either delete actions from the character-related chain, repeat actions within the chain, or add actions from an analogous behavioural chain. (Superiority or enhanced Superiority)

H-Strategy 23: If the relationship between a character and an object can be established within a given sequence of actions (event), either delete actions from the object-related chain, repeat actions within the chain, or add actions from an analogous behavioural chain, so that the mood of the character is downgraded or he suffers in some way. (Schadenfreude + Superiority or enhanced Superiority)

The interaction between two human characters is more complicated than that between a character and an object, since for two characters we must deal with multiple intentions. As an example, consider one of the earliest films, L'Arroseur arrosé (the sprayer sprayed) by Louis Lumière. The film shows a gardener watering flowers with a hose. A boy sneaks up behind him and steps on the hose, thus blocking the flow of water. The gardener looks down the hose to see what is happening, the boy lifts his foot, and the gardener gets soaked.

In the above example it is the relationship between the two characters (the rascal and the victim) and the object that links them (the hose) that provides an extra level of complexity. Nevertheless, the comic structures involved are similar to the strategies introduced above. We continue to emphasise the action of one actor (the gardener watering the flowers, see H-1), and then interrupt the action to downgrade his mood or make him suffer (due to his unawareness, see H-2). However, this time the action is interrupted by a second character, who actually intends to make the first character suffer. Though the interruption is based on the relationship between a character and an object (gardener and hose), we cannot apply strategies of the type of H-Strategies 22 and 23, since they represent an internal relation, whereas in the above example the interruption is based on an external event (boy and hose). Thus, it is necessary to introduce a set of new strategies which cover the divergent intentions of two characters, in order to achieve a humorous scenario such as that described in the gardener example.
H-Strategy 24 represents the top-level rule for an event such as that of the gardener example, while H-Strategy 25 represents an example rule of the type "the biter will be bitten", and H-Strategy 26 describes a rule for an incongruous joke based on the relationship between two characters.

H-Strategy 24
A relationship between two oppositional characters should be established in such a way that the goal of one character is to interrupt the goal of the other in a way that is unexpected by the second character. The reaction of the second character must then be influenced by the first so that the second character's mood is downgraded or he suffers in some way. (Mischief + Schadenfreude)

H-Strategy 25
If the intention of a gag, based on the relationship between two characters, is derisive, encourage the context-related obvious chain of actions but then lapse into an alternative chain where the second character's mood is downgraded or he suffers in some way. (enhanced Schadenfreude + Superiority)

H-Strategy 26
A relation between two equally valued characters should be established in such a way that the context-related obvious chain of actions of one character is established but then lapses into an alternative chain where the reaction of the other character would make sense for a related but not explicitly provided context. (Ambiguity + Superiority)

An example of H-Strategy 25 is the typical cartoon situation, where one character tries to fool another character by smearing glue on a chair, and finally sits on the chair himself. An example of H-Strategy 26 is when a woman wants her husband, who objects to washing, to take a shower before going to bed. He agrees, only to be seen standing beneath the spray shielded with an open umbrella.

Further strategies can be imagined, e.g. the use of analogy to transform the behaviour of a character from human into automaton or vice versa, or to assign human behaviour to an animal, or vice versa. Structural complexity can be increased by introducing more characters with several assigned goals. An example can be found in the film Bringing up baby, where five characters in predefined roles (first partner, initial partner, second partner, conscience figure and blocking figure) attempt to keep the ideal couple (first partner, second partner) apart. The goals of each character are linked by a particular homogeneous narrative structure (commitment comedy) which provides all the functionality, i.e. introducing the flaw within the initial relationship, introducing the ideal partner, separating the ideal couple, reuniting the ideal couple, etc.14 Needless to say, the portrayal of such a "film world" or "genre" carries with it emotional overtones, and thus offers considerable potential for humour.

14 For a detailed analysis see Brunovska Karnick (1995, pp. 132 - 136).

The goal of this thesis is not to achieve the automatic creation of such complex structures as those described immediately above. Nevertheless, we claim that our above strategies are sufficient to enable the automatic generation of numerous meaningful and humorous film sequences. Table 3.1 represents the above strategies classified as either constructive or supportive humour strategies.

Constructive humour strategies: 2, 3, 4, 5, 6, 7, 8, 13, 20, 22, 23, 24, 25, 26
Supportive humour strategies: 1, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 21

Table 3.1 Classification of humour strategies

The classification of the humour strategies mirrors the two main layers of a narrative, i.e. form and substance (see section 2.2), where supportive strategies relate to form, while constructive strategies relate to substance.
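Read as a simple lookup, the classification of Table 3.1 might be represented as follows (again a purely illustrative sketch, not AUTEUR's data structures):

CONSTRUCTIVE = {2, 3, 4, 5, 6, 7, 8, 13, 20, 22, 23, 24, 25, 26}
SUPPORTIVE = {1, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 21}

def strategy_layer(n):
    """Map an H-Strategy number to the narrative layer it operates on."""
    if n in CONSTRUCTIVE:
        return "substance"  # constructive strategies build content
    if n in SUPPORTIVE:
        return "form"       # supportive strategies shape structure
    raise ValueError("unknown H-Strategy %d" % n)

print(strategy_layer(20))  # 'substance'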
Thus, the above classification of humour strategies reflects our assumption concerning the applicability of a planning approach to automated plot generation (see section 2.2). The influence of the strategy classes on the design of the planning system will be discussed in later chapters. The remainder of this chapter is devoted to defining a method for evaluating a joke.

3.4 Evaluation of comedy

The overall goal of the creation of a meaningful video sequence is that it should communicate the intended theme as successfully as possible, which in our case means that the sequence is as funny as possible. The measure of this achievement is usually the perceiver's reaction. Knowledge of the actual reaction of the perceiver is, of course, unavailable to an artificial system, unless that system provides an interface that permits the traditionally used mirth index to be applied.15 However, even if a system could apply the mirth index, it would still be necessary to provide certain measures during the actual generation process, as the mirth index is a post-indicator and can thus be applied only to a completed joke. Thus, indicators are required to support the choice of strategies, and to testify to the potential of the content directions to be taken. Examples of the former are meta-strategies such as H-Strategies 1 and 15. To testify to the potential of content directions, we allocate:

• a positive value for each resolved comic primitive per motivation, realisation and resolution, based on its structural importance;
• a negative value for each unresolved comic primitive per motivation, realisation and resolution, based on its structural importance;
• the degree of resolution for each comic primitive, based on the strength of the semantic relationship of the content elements involved;
• the degree of complexity (viewer / character emphasis).

Later in this thesis we will show that the degree of resolution for comic primitives, and the degree of structural complexity, are both expressed as applicability values, which are calculated, based on the particular H-Strategies used, during the generation process. The above enables the simultaneous calculation of two "joke values". One represents an ideal, and thus hypothetical, value which serves as the comparison measure for the actual joke to be created from the available material. If we calculate the relationship between the hypothetical and the real value as a percentage, and compare it with, say, a mirth index transformed into a percentile system (i.e. 90 - 100% => excellent, 70 - 89% => very good, and so forth), it is possible to actually rate the created joke. Of course, this does not guarantee that any viewer will find it humorous. A further advantage of the comparison approach described above is that it would enable a system to detect the weak parts of a joke and either alter, or at least suggest alternatives to, these parts.

15 The "mirth index" represents an ordinal scale from 0 (negative response) to 4 (audible laughter). For detailed information see Rothbart (1977, p. 93), and Gehring (1986).
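The calculation of the two "joke values" can be illustrated with a small sketch. The weights, the rating bands below "very good" and the example data are all invented for exposition; only the resolved/unresolved signs, the weighting by structural importance and degree of resolution, and the percentage comparison follow the scheme described above:

def joke_value(primitives):
    """Sum the contributions of the comic primitives: positive if resolved,
    negative if unresolved, weighted by structural importance and by the
    degree of resolution."""
    total = 0.0
    for p in primitives:
        sign = 1 if p["resolved"] else -1
        total += sign * p["importance"] * p["resolution_degree"]
    return total

def rate(real, ideal):
    """Express the real value as a percentage of the ideal value and map
    it onto mirth-index-like bands."""
    pct = round(100.0 * real / ideal, 1)
    if pct >= 90: band = "excellent"
    elif pct >= 70: band = "very good"
    elif pct >= 50: band = "good"    # bands below 'very good' are invented
    else: band = "weak"
    return pct, band

# Ideal (hypothetical) joke: every primitive fully resolved.
ideal = joke_value([
    {"resolved": True, "importance": 2.0, "resolution_degree": 1.0},  # incongruity
    {"resolved": True, "importance": 1.0, "resolution_degree": 1.0},  # timing
    {"resolved": True, "importance": 1.0, "resolution_degree": 1.0},  # readiness
])
# Actual joke assembled from the available material.
real = joke_value([
    {"resolved": True, "importance": 2.0, "resolution_degree": 0.8},
    {"resolved": True, "importance": 1.0, "resolution_degree": 0.9},
    {"resolved": False, "importance": 1.0, "resolution_degree": 0.3},
])
print(rate(real, ideal))  # (55.0, 'good')

A comparison of the per-primitive contributions of the ideal and the real joke is also what would allow a system to point at the weak parts of a joke, as noted above.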
3.5 Conclusion

Based on an investigation of the major humour theories, this chapter defined five crucial primitives for the automatic generation of humour, i.e. readiness, timing, exaggeration, incongruity and derision. We then introduced a novel method of combining these primitives with narrative structures, by introducing 26 humour strategies, which can support the generation of numerous meaningful and humorous film sequences. Moreover, we showed that each of these strategies supports the generation of either structure or content for a visual joke, and concluded that this will influence our design of the planning system (i.e. multi-dimensional), to be described in later chapters of this thesis. Finally, we described a novel mechanism to evaluate the potential funniness of a joke, which is correlated with the introduced strategies.

Chapter IV
Film

The main aim of this thesis is to investigate the potential for the automated generation of meaningful and emotionally stimulating film sequences. In dealing with such a task, it is necessary to investigate the medium itself, which obviously plays a central role in any communication. Thus, we now examine theories of film structure which will enable us to identify, and then represent, the knowledge elements and the filmic mechanisms necessary for the creation of meaning through visual material.

4.1 Cinematic meaning

We begin our analysis of film by adopting Tudor's perspective on cinematic meaning, as described in Table 4.1, which categorises the various sources of meaning in a film.

Channels \ Aspects of meaning | Cognitive | Expressive | Normative
Nature of film world | Factual nature of film world (genre) | Emotional meanings, associated with film world (genre) | Normative meanings implicit in film world
Thematic structure | Events in thematic development, e.g. plot | Emotional involvement in thematic structure (e.g. humour strategies) | Normative meanings implicit in thematic structure
Formal structure | Factual meanings conveyed by form | Emotional consequences of formal structure | Normative meanings conveyed by formal means

Table 4.1 Tudor's paradigm of cinematic meaning. Tudor (1974, p. 128). [Text in brackets added by the current author.]

Tudor's paradigm is organised around a cross-classification of the aspects of meaning and the channels through which meaning is communicated. The aspects of meaning can be subdivided into, firstly, cognitive aspects, which inform the spectator about actions, appearances, events, and so forth; secondly, expressive aspects, which evoke emotions; and, thirdly, normative aspects, which shape the ethical inferences the spectator draws from the film. Transmission channels are classified as:

Nature of film world
The characteristic body of human content or reality.

Thematic structure
The order of events and development of themes to construct the narrative.

Formal structure
The five categories of substance defined by Metz: the moving photographic image, the recorded phonic sound, the recorded musical sound, the recorded noise and the graphic tracing of written matter (Metz, 1976b, p. 586).

In chapters 2 and 3, we laid down a theoretical foundation for the cognitive and expressive aspects of the channels film world and thematic structure. It was shown that a narrative is pre-selected and arranged material, based on an intention to communicate a particular theme or idea to the perceiver. The chosen material is taken from characteristically conceptualised areas of human, social and natural content (genre). Furthermore, we considered how channels serve to evoke the viewers' emotions. In film, this is achieved through emotional overtones pertaining to the content of the material.
Within the narrational structure, emotional responses are evoked particularly by the thematic structure, e.g. humour strategies. Applying this theoretical background to film allows us to investigate the structure of film so that we can see why film "works". Meaning is always inextricably linked to the medium and its content, which form and carry the message. The goal of analysing the medium is to reveal the textual or formal aspects of film, and how these support the creation of meaning. We are specifically interested in the information carried by the visual image, its formal permutations and its capacity to communicate expressive meaning.1

1 The author is aware of the fact that the absence of sound is a strong qualitative drawback, since sound provides three out of the five standard categories of substance. Sound defines space and time, in addition to being a powerful commentative story element. Instrumental music, for example, was in use during the era of the silent movie (Güttinger, 1984). A recent example of the effective use of music is Richard Linklater's film Dazed and Confused, which offers a key to the feelings and ideas of the mid 1970s. Nevertheless, sound was excluded because its integration is too great a challenge for such a short-term project.

Normative aspects of meaning are not considered in depth in this thesis, though the author is aware of the relevance of questions relating to ideology or "world view" to a film's author. However, we leave the addressing of such questions to further research. The chapter closes with a discussion of a model of film editing that represents the process of generating meaningful film sequences.

4.2 Phenomenological Approaches to film

When Parisians filed into the Salon Indien, where the Lumières held their first film showing before a paying public, they were excited by the lifelike portrayal of motion, which is of course one of the striking features of film and the main reason behind the psychological effect of reality (Metz, 1974).2 Motion is an almost totally visual experience (a counterexample from the real world is a passing train, which can also be detected audibly and haptically) that cannot be reproduced unless one recreates the same order of reality, which means either repeating the impression of reality or repeating reality itself. The physiological phenomenon responsible for the illusion of movement is stroboscopic motion, which describes the effect of an image being retained on the retina of the eye for a short time after it is experienced. If the image is displaced by another image in quick succession, the illusion of movement is maintained (Hochberg, 1978).3

Unlike a photograph, which also confronts us with reality in the form of the objective reproduction of objects, film includes the element of motion. It is the combination of motion and the appearance of form which together "force" us to perceive a film as something real. The interaction between movement and form causes a liberation of the objects from the two-dimensional screen, so that they stand out as substantiated figures against their surroundings, or as Metz describes it: 'Movement brings us volume, and volume suggests life.' (Metz, 1974, p. 7).

2 However, the illusion of realism conveyed by photography and motion pictures is historical, because the public learned culturally to distrust them.
The photograph or film is understood as an object that has been worked on, with respect to composition and construction, under the influence of aesthetic or ideological norms (Barthes, 1977, p. 19). The analysis of the image later in this chapter provides additional evidence of this.

3 The standard projection rate for a film is 24 images per second. However, European television films use a speed of 25 frames per second, due to the frequency of 25 images per second used by European television systems (Monaco, 1981). The playback rate of a video camera is normally 30 frames per second.

It can be argued, however, that motion is not solely responsible for the creation of reality in film. Describing a still photograph as a frozen moment from the past (Bazin, 1967a), it is easy to understand why we do not take a photograph as something real, despite its objectivity. We know that this particular image once existed in front of the lens, but that we are, at the moment of observing, confronted with a mere reproduction. Although this fact is also indisputably true for images in film, i.e. the objects and characters are reproductions, we are still encouraged to believe that we are witnessing the occurrence of a real event. An example of the engagement of the film audience in a portrayed event occurred in connection with a scene from Polanski's "Rosemary's Baby" (see Picture 4.1). Ruth Gordon is sitting on a bed, talking on the telephone. One can see her back and parts of her head, but not her face. William A. Fraker, the film's cameraman, reported in an interview that he and Polanski actually observed members of the audience attempting to turn their heads to the right so that they could see around the wall and door frame to have a complete view of the character. So, reality assumes presence, which is created out of two parameters, space and time.

Picture 4.1 Ruth Gordon in Polanski's Rosemary's Baby (1968)

Metz (1974) demonstrated, in comparing film with theatre, that the latter cannot be a convincing duplication of life, because it is so obviously a part of life (e.g. social ritual, real actors, real stage, etc.). Despite the fact that the actors pretend to be in a different time and location, it is impossible for the spectator to accept the performance as reality, because he or she shares the same space as the actors. The impression of reality in the theatre is created indirectly and by virtue of a cultural awareness (Bettetini, 1973). Film, on the other hand, confronts the spectator with real people and objects, and their actions take place within a seemingly real environment.4 Moreover, there is no connection between the space of the spectator's reality (i.e. sitting in a chair) and the space of the fictional film reality (e.g. Marlon Brando sitting in a Cambodian temple during the Vietnam war). The result is a steady mixture of the reality of the ongoing film fiction and our perception of it.5 In other words, the denotative material of the film becomes real through our identifications and projections. One aspect of film reality is, therefore, our imagination, which means that film reality is created within us, the audience.

4 This is the characteristic body of 'human content', as discussed while describing the concept of genre in chapter 2, of which certain factors are constitutive for film reality. Within this reality it is narrative that develops 'anything from outer space to an inner dream-world.' (Tudor, 1974, p. 113).
5 With reference to the above, it should be clear that the reality reflected in a film is not absolute, as suggested by the theoretical approaches of Kracauer (1960) and Bazin (1967b, 1971), but rather multiple, and thus open to permutation and combination.
To intentionally stimulate the viewer's imagination, we require a functional understanding of the two formal structures within film: the content (realised through the image) and its temporal order (realised through montage). Since the relationship between the two representational systems is complex, it is useful to examine them separately to determine their relevance to the automated generation of meaning. We begin the investigation with the image.

4.2.1 Film Image

The strongest impression of a film comes from its images. After seeing a film we might not remember particular effects achieved by a skilled use of camera movements or editing techniques, but the effect of the images remains, such as the rainy, gloomy streets in Blade Runner, the expressionistic setting of Das Kabinett des Dr. Caligari, the old man in Murnau's film Der letzte Mann, who sits dejected in the lavatory after being relegated from his position as the doorman of the Hotel Atlantis to that of toilet attendant, or the impressionistic lighting in Murnau's Sunrise. Though the message of the images is perceived by virtue of visual perception, which is, in its physical aspects, the same for all perceivers, we can see from the above collection of visual memories that it is also related to the experience and knowledge of the viewer, and thus differs between people and cultures. In this respect, film perception is similar to language understanding. Hence, the language metaphor is a common feature of theoretical analyses of motion pictures.6 In fact, linguistic tools can be useful in the analysis of images. Visually literate people, those who have learned to comprehend visual material on anthropologic-cultural and psychological grounds, can understand film more readily, and so the analogy hints at shared communicational mechanisms which may thus be useful for the analysis of the medium. The analogy between film and spoken or written language is, however, inadequate (Bettetini, 1973), as it breaks down when one attempts to identify filmic equivalents to words and sentences. Given the assumption that the word is the smallest meaningful unit in spoken language, the question arises as to whether the single image represents the filmic equivalent.7 Obviously not, since each image includes an indefinite amount of visual information, which provides a continuum of meaning. Since it is this visual information that forms the minimal unit within a film, it seems more appropriate to see film as a semiotic system, where semiology is understood not as a translinguistic system for examining all sign systems in terms of linguistic principles (Barthes, 1967), but rather as a general discipline for the study of signs, in which linguistic and cinematic signs constitute a specific topic (see Eco (1977, 1985); Jakobson & Halle (1980); Peirce (1960); Saussure (1966)). Semiotic theories form the basis of the following discussion.

6 See especially Metz's article Problem of Denotation in the Fiction Film in Metz (1974, pp. 108 - 146), Barthes (1967; 1977) and Carroll (1980).
7 The current author is aware that the smallest meaningful unit within the sentence is actually the morpheme.

4.2.1.1 The sign

The content of an image is composed of objects, where each object is a sign.
According to Saussure (1966), a sign usually consists of two distinguishable components: the signifier (which carries the meaning) and the signified (the concept or idea signified). There are two significant aspects to Saussure's notion. Firstly, it is important to be aware that the signified is not a mere referent of the signifier. The relation between the two elements is not a naming-process only, as the signified resembles not a thing but a concept. Secondly, the relation between the signifier and the signified is arbitrary. Saussure states:

'The idea of "sister" is not linked by any inner relationship to the succession of sounds s-ö-r which serves as its signifier in French; that it could be represented equally by just any other sequence is proved by differences among languages and by the very existence of different languages: the signified "ox" has as its signifier b-ö-f on one side of the border and o-k-s (ochs) on the other. (...) The word arbitrary also calls for comment. The term should not imply that the choice of the signifier is left entirely to the speaker (...); I mean that it is unmotivated, i.e. arbitrary in that it actually has no natural connection with the signified.' (Saussure, 1966, pp. 67 - 69).

It is, in particular, the arbitrariness of the relationship between signifier and signified which enables the creation of higher-order sign systems and their diversity. However, the principle that the signifier has no natural connection to the signified is problematic for photographic images, where the resemblance of signifier and signified is a key factor. Unlike spoken or written language, the film image depicts, and the viewer does not usually have to struggle to identify what it shows. The denotative power of film, the optical pattern, communicates a precise knowledge, which releases the audience from the process of decision making but nevertheless leaves a problem of interpretation, as is discussed later in this chapter. Since Saussure's definition of a sign is unsatisfactory with respect to the relation between signifier and signified, we arrive at a more comprehensive definition by adopting Peirce's view:

'A sign, or representamen, is something which stands to somebody for something in some respect or capacity. It addresses somebody, that is, creates in the mind of that person an equivalent sign, or a perhaps more developed sign. That sign which it creates I call the interpretant of the first sign. That sign that stands for something, its object. It stands for this object, not in all respects, but in reference to a sort of idea, which I have sometimes called the ground of the representation.' (Peirce, 1960, p. 2228).

Eco (1977) argues that, even though Peirce's definition seems very similar to Saussure's, in that both base their view of a sign on the combination of a signifier and a signified, Peirce's definition is wider, as it actually incorporates three distinct phenomena of communication: the sign in relation to itself, in relation to the object, and in relation to the interpretant (Peirce, 1960, pp. 1529 - 1572). Thus, neither any kind of intention nor an artificial production is required. The arbitrariness of a sign is still assumed, as in Peirce's understanding it is described as the continual reference of one sign to another or to a string of signs.
This constant transformation of signs into other signs is not based on the representation of real factors with immediately motivated meaning, but rather on the presentation of effects of conventionalisation. This is highly significant, as it shows that the signification of visual material is based not only on its denotative characteristics but also, and more importantly, on its connotative units.

4.2.1.2 Sign and idea

Though the strong denotative quality of the iconic sign can be understood as that which the creator intended it to be, it should be clear at this point that different connotations can be attributed to the sign depending on the circumstances and abductive presuppositions of the receiver at the time of perception, along with the various codes and subcodes the receiver uses as interpretational channels. An image is a cultural product, and as such offers more than merely the sum of its denotative material. For example, an image of a rose in a political portrait of Richard III can be understood as that which it is, i.e. a rose. However, the connotation of the rose in this context means more than the rose itself, because, depending on its colour, it represents either the house of York (white) or the house of Lancaster (red). Furthermore, film-specific devices, such as camera angle, colours, clearness or vagueness of the image, etc., give rise to connotations that provide additional information. For example, a low-angle view of a white rose might indicate the dominance of the house of York. The same rose portrayed from an overhead view might give the opposite impression. Thus, an important feature of the signification within an image is the organisation of signs. Jakobson identifies two fundamental operations that exploit the organisation of signs:

'(1) Combination. Any sign is made up of constituent signs and/or occurs only in combination with other signs. This means that any linguistic unit at one and the same time serves as a context for simpler units and/or finds its own context in a more complex linguistic unit. Hence any actual grouping of linguistic units binds them into a superior unit. Combination and contexture are the same operations.

(2) Selection. A selection between alternatives implies the possibility of substituting one for the other, equivalent to the former in one respect and different from it in another. Actually, selection and substitution are two faces of the same operation.' (Jakobson & Halle, 1980, p. 74).

According to Jakobson, the application of the above operations to signs results in a system of meaning based on alternations and alignments.8 In the process of alternation (or choice) a sign is compared, not necessarily consciously, with potential but unrealised candidates of the substitutional space, i.e. choices such as camera angle (high - low), colour (bright - dull), appearance (fresh rose or fading), etc. The organisation of a paradigm is usually described as a vertical structure. The process of alignment (or combination) focuses on the order of signs, where the relationships between the signs resolve their meaning, or in Saussurean terms:

'Combinations supported by linearity are syntagms. The syntagm is always composed of two or more consecutive units [...]. In the syntagm a term acquires its value only because it stands in opposition to everything that precedes or follows it, or to both.' (Saussure, 1966, p. 123).
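Jakobson's two operations lend themselves to a direct, if simplistic, rendering. The sketch below is purely illustrative; the paradigms are invented, echoing the rose example above:

# Selection: a sign is chosen against the unrealised alternatives of its
# paradigm; combination: the chosen signs are ordered into a syntagm.
PARADIGMS = {
    "camera_angle": ["high", "low", "eye-level"],
    "colour": ["bright", "dull"],
    "rose": ["fresh", "fading"],
}

def select(axis, choice, paradigms=PARADIGMS):
    """Selection along the paradigmatic (vertical) axis."""
    assert choice in paradigms[axis]
    return (axis, choice)

# Combination along the syntagmatic (horizontal) axis.
syntagm = [select("camera_angle", "low"),
           select("colour", "bright"),
           select("rose", "fresh")]
print(syntagm)  # a low-angle, bright image of a fresh rose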
The paradigmatic and syntagmatic axes of meaning, as described in Figure 4.1, are the basic supporting structures for signification, for any symbolic process or system of signs. A particularly interesting point made by Jakobson is that a sign system does not consist only of the two fundamental structures (paradigmatic and syntagmatic), but that each crystallises into a rhetorical device, i.e. the paradigm into the metaphor and the syntagm into the metonym, which are opponents.9 This means that, for Jakobson, even these "free" variations deal with codes that are based on systems of opposition and difference within the language of a culture, a social group or an individual.

8 This extends Saussure's syntagmatic and associative understanding of the linguistic planes of meaning, as described in Saussure (1966, p. 123).
9 However, Whittock (1990) argues that there is no polarisation between the two rhetorical principles of organisation, as suggested by Jakobson, but rather that metonymy is one major type of filmic metaphor (the second being what Whittock calls distortion metaphors) (pp. 35 - 36).

[Figure 4.1: items of clothing (hat, scarf, sweater, pants, shorts, knickers, skirt, kilt, culottes, tights, socks, shoes) arranged along the paradigmatic and syntagmatic axes.]
Figure 4.1 Syntagmatic and paradigmatic structures of clothing (Monaco, 1981, p. 341).

This leads to the heart of signification and offers the link to Peirce's understanding of the relationship between object and sign. In his famous trichotomy, Peirce defines a sign as being either symbolic, iconic or indexical. The characteristics of each can be described as follows:

• Icon: a sign which represents its object mainly through its similarity with some properties of the object, based on the reproduction of perceptual conditions. A zebra, for example, can be identified from at least two characteristics - four-leggedness and stripes.
• Index: a sign which represents its object by an inherent relationship. Examples: a man with a rolling gait can indicate drunkenness, a sundial or clock can indicate the time of day, a wet spot on the ground indicates spilt liquid, etc.
• Symbol: a sign with an arbitrary link to its object (the representation is based on convention). Examples: the traffic sign for a forbidden direction, and the cross as an iconographic convention.

The above types are not mutually exclusive. Clearly, icons are the most prevalent within images. In the ensuing discussion, we first focus on icons, since they form the basis for the other two types of sign. Due to their "similarity" with the represented object, iconic signs are a major factor in achieving the effect of realism. However, Eco showed that the general arbitrariness of a sign is valid even for this type of sign:

'Let's look again at the frame indicated by Pasolini - a teacher talking to students in a classroom. Consider it at the level of one of its photograms, isolated synchronically from the diachronic flux of moving images. Thus we have a syntagm whose component parts we can identify as semes combined together synchronically - semes such as 'a tall, blond man stands here wearing a light suit...etc.' They can be analysed eventually into smaller iconic signs - 'human nose', 'eye', 'square surface', etc., recognizable in the context of the seme, and carrying either denotative or connotative weight. In relation to a perspective code, these signs could be analysed further into visual figures: 'angles', 'light contrasts', 'curves', 'subject-background relationships'.' (Eco, 1976, pp. 601 - 602).
Thus, an image is based on the triple articulation of photograms, iconic signs and iconic semes, and receives its expression by convention. From the above, it should be clear that an image itself should be considered to be a seme. In his pioneering structural analysis of the cinematic image, Eco (1977; 1985) further classifies the underlying code system for the triple articulation of the image described above. Eco argues that the signification of signs is based on a socially determined reticular system of small semantic systems and rules for their combination. He defines a number of such systems, of which those relevant to this thesis are:

Recognition codes
Structural blocks of perceptive qualifications (signifieds) which are transformed into semes, such as black stripes on white fur, based on which objects are recognised.

Iconic codes
Perceivable elements that can be subdivided into figures, signs and semes. A figure forms conditions for perception, such as relationships between object and background, contrast in light, or geometrical proportions.10 A sign denotes, using conventionalised graphical methods, units of understanding (nose, ear, sky, cloud), abstract models, or idealised diagrams of the object (the sun as a circle with thread-shaped beams). Semes are complex iconic phrases (the image of an object). Iconic codes change readily within the same culture, due to their contextual interlacing (a horse as part of a shop label may suggest the availability of equestrian products, while a horse on a traffic sign may suggest "beware, horses on road"). Figures represent structures on the syntagmatic axis, but only within the frame of an image. Signs and semes represent structures on the paradigmatic axis.

10 All of these codes have been developed and refined by other visual arts, i.e. painting, sculpture and photography. Arnheim (1956) proposes ten determinants: balance, shape, form, growth, space, light, colour, movement, tension and expression.

Iconographic codes
Semes of iconic codes are composed into complex and culturalised semes, e.g. "the four horsemen of the Apocalypse". Iconographic codes represent structures on the syntagmatic axis.

Rhetorical codes
Models or norms of communication which can be divided into rhetorical figures (e.g. metaphor), premises (e.g. a man riding along a never-ending prairie can connote loneliness), and arguments (which create syntagmatic connotations based on the succession or opposition of different images).

Stylistic codes
A stylistic feature, such as the mark of an author (for example, a man walking along a road tapering off into the distance suggests "Chaplin"), the typical realisation of an emotion (a woman who leans seductively against the curtain of an alcove suggests "Erotic of the Belle Époque"), or the typical realisation of an aesthetic, technical-stylistic ideal (as in cubism, where objects are portrayed in abstract, geometrical forms).

The organisational structure of the above described system of cultural units (interpretants) is based on semantic fields. Our understanding of a semantic field follows the definition of Bordwell (1989, p. 106): '...a conceptual structure which organises potential meanings in relation to others'.11 A semantic field can be built on various principles (Bordwell, 1989), such as:

clusters - terms within a semantic field that overlap semantically, e.g. synonyms;
doublets - semantic fields organised on the basis of polarities, e.g. oppositions;
proportional series - a series of oppositional doublets, e.g. female - male, passive - active, womb - phallus, etc.;
hierarchies - ordered semantic units based on relations of inclusion or exclusion, e.g. Pekinese/dog/animal/living thing.

11 See also Eco (1977, pp. 73 - 150).

The conclusion from the above discussion concerning the iconic sign is that the line between denotation and connotation is continuous, which becomes explicit when considering the sign as index. This type of sign is the most dynamic, as it is neither "identical", like the iconic sign, nor arbitrary, like the symbol. The images shown in Pictures 4.2 - 4.3 represent examples of how indexical signs create meaning. Spike Lee's movie "Do the right thing" portrays 24 hours of a stiflingly hot day in Brooklyn (see Picture 4.2). The concept of hotness is suggested in the film by showing "hot" colours (the red wall as background) for the three men sitting under the sunshade, objects like the sunshade, and aspects of appearance, such as the way in which the men are sitting.

Picture 4.2 An image taken from Spike Lee's Do the right thing (1989)
Picture 4.3 Liv Ullmann in Bergman's Shame (1968)

In Bergman's film "Shame", money is seen to be placed beside the head of the main female character (see Picture 4.3), who is lying on a bed. This creates the concept of prostitution. The above examples of indexes are metonymic, since an associated detail or notion is used to invoke an idea or represent an object. Related to this is the concept of a synecdoche, where the part stands for the whole or the whole for the part. A visual example of a synecdoche is marching feet representing an army.

At this point in our analysis of the image, it may seem that only paradigmatic choices are relevant to the signification of an image. This is not strictly true. Since an image creates a space, there are also essential syntagmatic influences on the understanding of an image. As an example of such an influence, we consider the three-dimensionality in a two-dimensional image, which works on composition planes: frame (the image), geography (bottom line to horizon) and depth.12 Example subcodes within these planes are depth perception or latent expectation. Depth perception is built upon convergence, relative size, density gradient and occlusion (overlapping). Thus, for example, the importance of an object is related to its position and size within a frame. However, social conventions dictate certain expectations about space (e.g. a text can be read from left to right, right to left, or from the top down). Other spatial expectations are, for example, that the bottom of an image is more important than the top, horizontals are more important than verticals, a diagonal line from bottom left to top right goes up, etc. Needless to say, there are more codes operating in a static image, e.g. form and line, symmetric versus slanting composition, lighting, and so forth. It should be stressed here that the diversity of the semantic system described above provides, with its combinatorial possibilities, the foundation for a subjective interpretation by each viewer, as mentioned earlier. The image, embedded in a myriad of perceptual, cognitive and cultural codes, is subjective in its accessibility. Consider Picture 4.4, which is taken from Bertolucci's "The Last Emperor".
12 Frame and image are not actually synonymous. The frame determines the limit of the image, and thus framing becomes part of the process of mise-en-scène, which takes place on the set and includes decisions as to the position of actors, placing of cameras, choice of lenses, etc. However, in film, the terms image and frame are used interchangeably.

Picture 4.4 Image from Bertolucci's The Last Emperor (1987)

One of the key structural elements in this movie is a colour code, which accompanies different stages of Pu Yi's life. For example, when he cuts his veins, he sees, for the first time in the film, red as a pure colour. This colour represents our beginning, our birth, according to Vittorio Storaro, the film's cameraman. The complexity of this image, however, derives from the added concept of suicide, which now merges with the idea of birth into that of rebirth. Codes can only realise their full potential impact if there is an awareness of them or, in other words, if they can relate to existing knowledge. The analysis of an image as an abstract element showed that it can be experienced optically (objectively, realistically) and mentally (subjectively). However, both levels are necessary for the creation of meaning. Figure 4.2 summarises the compositional and interpretational structures that enable the perceiver to understand an image.

Figure 4.2 The compositional and interpretational structures that make up the image (based on Monaco (1981, pp. 144 - 145)).

The results gained thus far from our analysis of the image system enable us to identify initial requirements for the representation of video and the architectural structure of an automated editing system. There is clearly a need for a representation of images which operates on the level of iconic signs and, perhaps, semes. The description of the images themselves should be as objective as possible, in order to facilitate the derivation of various connotative meanings of the material. This requirement dictates that the representation applied directly to the visual material should contain no connotative descriptions of that material (similar points are made by Parkes (1989a) and Davis (1995)). There should be a strict separation between the representation of cultural and visual codes in the form of semantic fields, and the mechanisms for traversing networks of these semantic fields, as sketched below. However, an exhaustive inventory of the codic, constitutive parameters of an image must always be limited by the extent of the representation used, and hence it is very doubtful that an automated system for the creation of meaningful sequences could ever operate in a completely domain-independent way.
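The separation demanded above might, purely illustratively, take the following shape (invented structures, not AUTEUR's): the annotation attached to the material stays denotative, while connotations are derived by traversing separately held semantic fields (cf. the principles listed in section 4.2.1.2):

# Strictly denotative annotation of the visual material (iconic signs and
# semes only; no connotative descriptions).
DENOTATIVE_ANNOTATION = {
    "shot": "s17",
    "signs": ["man", "coffee_machine", "hall"],
    "semes": ["man stands near coffee machine"],
}

# Connotations live apart, as semantic fields (clusters, doublets,
# hierarchies) that inference mechanisms can traverse.
SEMANTIC_FIELDS = {
    "clusters": {"hall": ["foyer", "lobby"]},
    "doublets": {"bright": "dull", "high_angle": "low_angle"},
    "hierarchies": {"pekinese": "dog", "dog": "animal",
                    "animal": "living_thing"},
}

def connote_up(term, fields=SEMANTIC_FIELDS):
    """Traverse a hierarchy towards ever more general cultural units."""
    chain = [term]
    while chain[-1] in fields["hierarchies"]:
        chain.append(fields["hierarchies"][chain[-1]])
    return chain

print(connote_up("pekinese"))  # ['pekinese', 'dog', 'animal', 'living_thing']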
4.2.2 Film Movement

Film is a dynamic medium, and therefore we must analyse not only images but the transitions between images. Of particular interest is the effect on the identifiable semantics of the single image when it appears in a shot, and how the syntagmatic behaviour of shots is to be specified.

4.2.2.1 From frame to shot

In our above analysis of the image, we discussed the relationship between image and frame. We showed that even the static image has underlying semantic structures, for example iconic codes, that transform the image into a compositional unit. Of higher complexity is the relationship between the frame and the shot, where a shot is 'a single piece of film, however long or short, without cuts, exposed continuously' (Monaco, 1981, p. 452). The additional element here is movement, which provides the basis for the understanding of action, distance and the relationship between characters, based on the relationship between images within a shot and their rhythmical variations.

In his work on visual codes, Eco (1976; 1985) showed that there is a difference between real physical actions and those represented in the cinematic medium:

'But passing from the photogram to the frame, the characters accomplish certain gestures: the icons generate kines, via a diachronic movement, and the kines are further arranged to compose kinemorphs. Except that the cinema is a little more complicated. As a matter of fact kinesics has raised the question of whether kines, as meaningful gestural units (and thus, if you like, equivalent to monemes, and definable as kinesic signs) can be decomposed into kinesic figures - i.e. discrete kine fractions having no share in the kine meaning (in the sense that a large number of meaningless units of movements can compose various meaningful units of gesture). Now kinesics has difficulty in identifying discrete units of time in the gestural continuum. But not so the camera. The camera decomposes kines precisely into a number of discrete units which still on their own mean nothing, but which have differential value with respect to other discrete units. If I subdivide two typical head gestures into a number of photograms (e.g. the signs 'yes' and 'no'), I find various positions which I can't identify as kines 'yes' or 'no'. In fact, if my head is turned to the right, this could either be a figure of a kine 'yes' combined with a kine 'nodding to the person to the right' (in which case the kinemorph would be 'I'm saying yes to the person on the right'), or the figure of a kine 'no' combined with the kine 'shaking the head' (which could have various connotations and in this case constitutes the kinemorph 'I'm saying no by shaking my head'). Thus the camera supplies us with meaningless kinesic figures which can be isolated within the synchronic field of the photogram, and can be combined with each other into kines (or kinesic signs) which in their turn generate kinemorphs (or kinesic semes, all-encompassing syntagms which can be added one to another without limit).' (Eco, 1976, pp. 602 - 603).

Eco not only shows the difference between real and cinematic action, but also provides a semiotic outline of the paradigmatic and syntagmatic dimensions of the shot, which places him unequivocally on the side of the "montage-roi", Eisenstein, whose approach to editing represents the most systematic attempt to address problems associated with the notion of fragmentation (Eisenstein (1948, 1951, 1988, 1991)).13 Eisenstein describes the shot as a fragment from which the total filmic expression is composed. He insists, at great length, on the possibility of mastering the constitutive parameters of the image for a successful shot analysis, and shows that each fragment features the same paradigmatic and syntagmatic structures as the image.14 For this reason, Eisenstein describes the frame as the cell of montage.15 The creation of a fragment through filming thus supports the specific translation of necessary actions in space into a cinematic vocabulary, based on which potential meaning can be extracted (see Eisenstein (1991, pp. 11 - 57)). Thus, elements that operate even in the static frame now take on extra connotative power due to their dynamic qualities.

13 This statement is not meant to diminish the achievements of other theoreticians such as Kuleshov, Pudovkin or Vertov.
14 Eisenstein was very much aware of the existence of these parameters, but did not describe them, except in the form of specific examples. The attempt to provide a complete collection of fragmentation rules remains, perhaps due to the nature of the material, limited to empiricism. See, for example, the excellent but restricted approaches of Arnheim (1983) and Burch (1981).
15 However, Eisenstein also described the shot as a montage cell (see, for example, Eisenstein (1988, p. 144)).
The compositional use of focus, for example, through which the foreground, middle ground or background are emphasised, guides the perception of a shot. If all planes are represented in deep focus, they are attributed the same level of importance, whereas emphasis can be achieved by the use of shallow focus. Citizen Kane, by Orson Welles, provides many well-known examples of the use of focus in these ways. Of even stronger impact than focus is camera movement. Also worthy of consideration here are the pan (from left to right or right to left around the imaginary vertical axis of the camera), the tilt (the up and down movement of the camera, rotating around the axis that runs horizontally through the camera head), and the roll, in which the camera moves about the longitudinal axis from the lens to the subject. The tilt, for example, presents the eye-level from which a scene is perceived. The tilt can affect the importance ascribed to an object (for example, high-angle shots may diminish the perceived importance of an object, as discussed earlier in the example of Richard III). Using the dynamic qualities of film, specific elements can, in one shot, directly provoke an emotional reaction. Imagine a shot in which the camera follows a character through a group of cheerful, passionate people. The appearance and disappearance of the group in itself connotes the character's sense of isolation. It should also be mentioned that the tempo of a shot can, in itself, provide information. The intense feeling of fast movement may excite, while calm movement, expressed, for example, through the slow rolling of waves filmed from a static camera position, may encourage feelings of relaxation. Related to tempo is the perceived duration of the shot. The actual duration of a long shot full of people and action may well be identical to that of a close-up of a face, and yet the latter will be perceived as being longer. Hence, the organisation of perceptible duration is more complex than the actual duration of a shot.16

16 For a discussion of the interesting relationship between rhythm and shot composition, see Eisenstein's article Vertical Montage in Eisenstein (1991), which provides diagrams in which a sequence of his film Alexander Nevsky is described in musical terms; see also the first two chapters of Burch (1981), which feature the use of analogy between serial music and montage.

It should be made clear that, in itself, the shot is an individualised unit with an independent semantics, based on the juxtaposition of the intra-frame components (for a similar argument see Parkes (1989a) and Davis (1995)). However, we must also consider the ways in which the content of a shot can be affected by other shots, which is the domain of montage.

4.2.2.2 Montage, or the semantics of fragmentation

A landmark in the understanding of film perception was the Kuleshov experiments (Kuleshov, 1974). Kuleshov found that the juxtaposition of two unrelated images would force the viewer to find a connection between the two. In one experiment, described by Pudovkin (1968), Kuleshov focused on the creation of artificial emotions.
He took a close-up of the well-known actor Iwan Moszhukin, with a vacant expression, and intercut it with shots of a bowl of soup, a dead man and a lascivious woman, to create three distinct sequences. Spectators to whom these three sequences were shown believed that the actor's facial expression showed hunger in the first sequence, sadness in the second, and passion in the last. Other experiments were concerned with the artificial creation of space and character:

'A few years later I made a more complex experiment. Khokhlova and Obolensky acted in it. We filmed them in the following way: Khokhlova is walking along Petrov Street in Moscow near the 'Mostrog' store. Obolensky is walking along the embankment of the Moscow River - at a distance of about two miles away. They see each other, smile and begin to walk toward one another. Their meeting is filmed at the Boulevard Prechistensk. This boulevard is in an entirely different section of the city. They clasp hands, with Gogol's monument in the background, and look - at the White House! - for at this point, we cut in a segment from an American film, The White House in Washington. In the next shot they are once again on the Boulevard Prechistensk. Deciding to go farther, they leave and climb up the enormous staircases of the Cathedral of Christ the Saviour. We film them, edit the film, and the result is that they are walking up the steps of the White House. [...] In the second experiment we let the background and the line of movement of the person remain the same, but we interchanged the people themselves. I shot a girl sitting before her mirror, painting her eyelashes and brows, putting on lipstick and slippers. By montage alone we were able to depict the girl, just as in nature, but in actuality she did not exist, because we shot the lips of one woman, the legs of another, the back of a third, and the eyes of a fourth. We spliced the pieces together in a predetermined relationship and created a totally new person, still retaining the complete reality of the material.' (Kuleshov, 1974, pp. 52 - 53).

The above experiments demonstrate two distinct, but mutually influential, aspects of our understanding of film:

• the meaning of a shot depends on the context in which it is situated;
• a change in the order of shots within a scene changes the meaning of the shot as well as the meaning of the scene.

Experiments investigating the "Kuleshov effect" ascertained the variability of the meaning of a shot within different contexts (Herman D. Goldberg, described in Isenhour (1975); Salomon & Cohen (1977); J.M. Foley, referenced by Isenhour (1975)). Experiments concerning contextual detail (e.g. direction of movement) were performed by Frith & Robson (1975), who showed that a film sequence has a structure that can be described through selection rules and/or combination rules. An example is the continuity of direction within movement, e.g. if a character leaves a shot to the right, we expect him to enter the next shot from the left. Gregory (1961) is responsible for some of the most significant analyses of the importance of context and order in film editing.
Gregory claims that not every combination of shots creates meaning, but that there are restricted conventions that can help to create larger meaningful entities. His key elements for creating meaning by joining shots are assertions and associative cues. An assertion is the relationship between two elements. There are many different types of such relationships. For example, the description of an attribute (such as red for a car) could be as important as a simple action (two men shaking hands). Consider a close-up of a woman who is looking down, followed by an image showing a hand holding an electric mixer directed into a bowl. The assertion made by this juxtaposition is that the woman shown in the first shot is preparing some food (Gregory, 1961, p. 39). Gregory argues that a given shot "A" can build divergent assertions with other shots by using various subsets of the information gathered from shot "A". This is especially important, as it means that the juxtaposition of shots can be analysed, in that the shot can be used as a variable collection of information rather than a fixed visual description.

Associative cues result from combinations of the indicators that make the creation of meaning possible. Gregory introduces two main groups of cues as being important in the creation of assertions. The first includes cues for the surrounding space. Most human activities, human roles or objects are associated with specific locations. The conceptualisation of space is, therefore, an elementary principle of the analysis and organisation of material in editing. The second type of cue is related to human actions, of which the above description of the woman cooking is an example. If only a small number of cues (or none) can be found, the two juxtaposed images invite a combined interpretation by virtue of their ordering, but are nevertheless perceived by the viewer as isolated units. In such a case, the resulting combination is usually meaningless and can be irritating. The main impact of Gregory's work is to show that the juxtaposition of shots is subject to a situation plan, in which the action, emotional circumstances and timing of the potential scene are defined. It is this plan which makes editing possible (Wulff, 1993).

Thus, montage makes a point. For this thesis, an important factor is the extent to which a pattern of fragmentation can add "emotional overtones" to a sequence. Put another way, we are interested in those elements of montage that add extra connotation to the emotional patterns already established by the narrative itself. However, we will limit ourselves to some types of montage, since a full exploration of montage is beyond the scope of this thesis.17 The types of montage we investigate are metric, rhythmic, tonal and polyphonic montage, all introduced by Eisenstein (1988; 1991).

17 Such a work could also pay attention to methods introduced by Arnheim (1983) (the formalistic approach), Balázs (1972) (the idea of micro-dramatics in the close-up), Pudovkin (1968) (relational editing, which supports narrative using contrast, parallelism, symbolism, simultaneity and leitmotiv) and Dziga Vertov, as described in Petric (1987) (notably interval and uninterrupted montage).

The criterion which establishes metric montage is the absolute length of the shots, where: 'Tension is achieved by the effect of mechanical acceleration through repeated shortening of the length of the shots while preserving the formula of the relationship between the lengths ('double', 'triple', 'quadruple', etc.).' (Eisenstein, 1988, p. 186). This means that shots can be combined according to particular rhythms, such as a march or a waltz. It is not necessary that the meter should be immediately recognisable, nor is it advisable to establish too complex a rhythm, but the rhythm is nevertheless a condition for the creation of feelings. An example is the Caucasian dance scene in Eisenstein's film October.
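Eisenstein's 'double', 'triple' and 'quadruple' formula can be read as simple arithmetic over shot lengths. The sketch below is illustrative only; the frame counts are invented:

def metric_cutting_list(base_frames, ratio=2, steps=4):
    """Repeatedly shorten the shot length while preserving the ratio
    between successive lengths (mechanical acceleration)."""
    length = float(base_frames)
    lengths = []
    for _ in range(steps):
        lengths.append(max(1, round(length)))
        length /= ratio
    return lengths

print(metric_cutting_list(96))        # [96, 48, 24, 12] - the 'double' formula
print(metric_cutting_list(96, 3, 3))  # [96, 32, 11]     - the 'triple' formula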
A second instance of metric montage is where the absolute length of each shot in a series is shortened, with the effect that a controlled increase in pace builds up to a climax. In short, the control of shot length and shot angle controls the tempo, and thus the emotional appreciation, of a sequence.

Rhythmic montage, the second category that provides extra connotative meaning by cinematographic means, is based on structures within the shots: 'Here it is quite possible to find a case of complete metric identity between shots and the reception of the rhythmic figures exclusively through the combination of shots in accordance with signs within the shot.' (Eisenstein, 1988, p. 187). The best, and Eisenstein's favourite, example is from Battleship Potemkin - the famous Odessa Steps sequence. This sequence is constructed through the contrast between the orderly troops and the fleeing, disorderly population. The final intensification is the switch from the marching of the soldiers to the rolling of the pram down the steps, where the relationship between the actions of pram and feet works as a direct accelerator for the rhythm (Eisenstein, 1988).

Tonal montage focuses on the emotionally dominant features of shots, represented through combinations of light, graphical elements (e.g. sharp-angled objects prevail over round objects), focus, and so on. In a narrative film, the story line dictates which features are initially likely to attract our attention; in a more abstract film the lighting may be critical, and the film may show the changes in the moonlight and its shadows. An example of tonal composition can be found at the beginning of the "Mourning for Vakulinchuk" sequence in Battleship Potemkin, where the montage builds on several foggy shots of the port of Odessa. The dominant feature here is the fog, and thus the sequence does not establish a spatial transposition but rather an emotional setting.

The conflicting relationship between metric, rhythmic and tonal montage is, for Eisenstein, one of the key elements of montage, and the aim is to resolve it. This idea of conflict leads to the idea of polyphonic montage, where shots are not mechanically joined along a dominant line, but sensitively orchestrated so that the perceiver receives a multitude of organised stimuli. It should be clear from the above discussion of montage that a shot, which is in itself a unit of composition with individualised semantics, can serve different semantic purposes when it is inserted between two other shots. New levels of meaning can be created in such a way - levels that can change or even override the individual meaning of a shot, or, as Eisenstein put it: '... the juxtaposition of two montage sequences resembles not so much their sum as their product.' (Eisenstein, 1991, p. 297).

4.2.2.3 The sequence - film relationship

Though this thesis does not deal with sequences, which represent an episode in the narrational model described in Figure 2.3, we mention them here for the sake of completeness.
Ruttmann's avant-garde documentary Berlin, die Symphonie der Großstadt is an excellent example of how montage can be used to build the structure of a whole film. However, the usual structure of film is not essayistic but narrational, and thus the relationship between sequences is based on narrational aspects. Metz (1974, pp. 108-146) describes a system of binary structures with which he attempts to synthesise various theories of montage into a logical pattern. A key product of this work is the set of binary oppositions defined for the film segment, i.e. that a film segment is either:
• autonomous or not
• chronological or not
• descriptive or narrative
• linear or not
• continuous or not
• organised or not.
However, based on the discussion of narrativity in chapter 2, and the above description of meaningful structures in visual material, we have strong reservations about the approach of describing a film through its syntax, a reservation which is partly shared by Metz, who asserted that the syntax of a film is understood because the film has been understood, and only when it has been understood. Nevertheless, there are sequence structures that can reinforce meaning based on human content and thematic structures. The temporal aspects of a film can reinforce meaning. For example, in the film High Noon, the real time of the film emphasises the structure of the sequences and thus their tempo and shifts. Tarantino's Pulp Fiction is an example of the exact opposite of this rigid structure. Here four stories are interwoven, and over time (at crucial points of the narrative) the seemingly disorganised pattern falls into place. The repetition of shot devices can also serve to reinforce meaning. A film composed mainly of close-ups excludes information about its setting and becomes claustrophobic, whereas a predominance of long shots emphasises context over characters.
The above analysis of the syntactic and semantic structures of visual material indicates that it is important to distinguish between filmic and cinematic codes. The latter codify the reproduction of reality using cinematographic devices, while the former codify communication based on narrative mechanisms. It is clearly the cinematic code which relies on the filmic code. It is essential that they are not confused. However, the work of Eco, Eisenstein, Kuleshov and Gregory tells us that film, though based on common human content and thematic structures, provides its own realities of time and space which are interwoven in the narrative structure. Figure 4.3 summarises the syntagmatic categories of film.
Figure 4.3 Syntagmatic categories of visual material (based on Monaco (1981, p. 145)): the frame lies on the synchronic axis of space; shot, sequence and film lie on the diachronic axis of time.
In order to support film editing, any direct representation of film should be restricted to “pure” content and exclude any narrative mechanism. Finally, it was suggested that the process of arranging visual material is plan-based. In the following section we investigate this process in more detail.
4.3 A model of film editing
There are two key problems that the video editing process needs to address. Firstly, the filmic material must be composed so that the film becomes perceptible in its entirety, or the effect of reality is lost. Secondly, the intended idea or theme must be communicated in such a way that the spectator can participate in the final product both emotionally and intellectually.
Our model of editing (see Figure 4.5 below) is based on a knowledge elicitation exercise which involved studying and interviewing editors at work in their own environment, and on editing theory (Beller (1993); Katz (1991); Oldham (1992); Rabinger (1989); Reisz & Millar (1969); Rosenblum & Karen (1979)).
A foremost task for an editor is the retrieval of information about the film, such as the topic, the story, the characters, the intention of the film and its target audience. The next step is to examine the available material. This is an extremely significant act, because now the editor forms a model of the different characters and the influence of different story lines. The editor retrieves information from and about the different pieces of film by browsing through the material to order and categorise it. This usually results in shots and takes being placed on separate "heaps", each heap representing a potential scene. Each heap and its elements are annotated in a list containing information such as heap identifier, shot identifier, shot length in frames and shot characteristics.
The act of observing the raw material evokes a chain of associations for the editor. He or she remembers events, persons or emotions, and these experiences may influence the created scene. At this stage, the editor, the assistant and the director often talk more about their own experiences than about the actual material. These conversations serve many purposes, the most important being to clarify the material and to compare it with other experiences in order to create a much subtler and richer concept, so that finally the audience is confronted with a theme which can be re-created by each spectator. Figure 4.4 describes the revised communication factors which influence editors' decision making and help them to predict the potential viewer's intellectual and emotional response to the film.
Figure 4.4 Influential communication factors (based on Tudor (1974, p. 31)): the editor's decisions are shaped by personality attributes, organismic attributes (e.g. male, adult, etc.), narrativity knowledge, thematic knowledge, editing knowledge, and outside cultural and social attributes; the film reaches the receiver, and produces its effects, through shared cultural and shared social structures.
The process of scene creation begins with a discussion of each scene with respect to the available material, its intention and its part in the overall story. While the image is highly important in making the visual statement, it is the specific interaction of shots (at the level of length, rhythm, graphical direction, darkness and lightness, colour, etc.) which produces meaning. Thus, every cut must support both the concept of the current scene and the overall appearance of the film. If the film is narrative in nature, then the editor pays particular attention to different forms of spatial and temporal continuity between juxtaposed shots, which may be based on the position of a character on the screen, on the location or on the actions performed. If the film is more abstract, the continuity may rather be based on compositional features such as graphical directions, or on rhythmical features such as speed of movement. Typical features of the editing process at this stage, which is known as the rough cut, are the insertion, elimination, substitution or permutation of shots within a scene, and of complete scenes within the overall structure of the film. These variations are necessary for shaping the scenes until their appearance and position within the film become stable.
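To make the bookkeeping of this stage concrete, the following sketch models the heap list and the four rough-cut operations described above. It is purely illustrative: the field and function names are hypothetical, and it is not the implementation used in AUTEUR.

# Illustrative sketch (hypothetical names): an editor's heap list and
# the rough-cut operations of insertion, elimination, substitution
# and permutation.
from dataclasses import dataclass, field

@dataclass
class Shot:
    shot_id: str
    length_frames: int
    characteristics: dict        # e.g. {"shot_type": "close-up", "light": "dark"}

@dataclass
class Heap:                      # one heap per potential scene
    heap_id: str
    shots: list = field(default_factory=list)

    def insert(self, shot, at):
        self.shots.insert(at, shot)

    def eliminate(self, at):
        del self.shots[at]

    def substitute(self, at, shot):
        self.shots[at] = shot

    def permute(self, i, j):
        self.shots[i], self.shots[j] = self.shots[j], self.shots[i]

scene = Heap("kitchen_scene")
scene.insert(Shot("s01", 120, {"shot_type": "close-up"}), at=0)
scene.insert(Shot("s02", 75, {"shot_type": "medium"}), at=1)
scene.permute(0, 1)              # reorder until the scene's appearance stabilises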
At the end of the rough cut stage, the film continues to lack a definite visual precision with respect to rhythm and technique, which it receives during the fine cut. The fine cut deals with the perception-related connection of two shots, which is given by their graphical appearance (contour, centre of sight, shared axes, etc.). In this stage, work on the overall context is replaced by a narrow field of activity, typically concerned with units of between 10 and 30 frames (Schumm, 1993).
Figure 4.5 Simplified model of the film editing process: retrieve film information (e.g. story, characters, intention); group shots and takes into heaps; register each heap in a list and set the relevant parameters; find the start point of the film; create a scene out of the related heap; control the effect of the scene (on a negative outcome, recreate the scene); control the effect on the overall story (on a negative outcome, redo the ordering); decide about the end; stop.
From the results of the above investigation of the editing process, we derived our simplified model of the editing process, as presented in Figure 4.5. This model covers only the juxtaposition of takes, shots and scenes and does not take sound editing into account. Furthermore, it emphasises only the rough cut. Finally, the complex interrelationships between different stages of the process (e.g. the influence of personal attributes on decisions, or the comparing of different solutions) are not specified in detail. However, the model serves as a workable approximation. It must be made clear that we are, at this stage, considering only the creation of a single scene where a start-shot is given. Thus, we focus on elements such as information retrieval, scene creation, the control of effects, and the reordering or recreation of scene structure. We exclude the creation of larger meaningful sequences of scenes.
4.4 Conclusion
In this chapter, we showed that there are two levels involved in the viewer's understanding of a film. Firstly, there is the optical level, which provides the perceiver with mainly denotative information, and secondly, there is a mental experience, which, based on cultural knowledge, provides predominantly connotative information. In order to allow different connotative meanings depending on the context in which the material is presented, we argued that a content-based representation of video must be as objective as possible, must not contain connotative descriptions, and must operate on the level of iconic signs and semes. As the organisational structure for signs and semes we identified semantic fields.
We pointed out that an image, in itself, is a compositional unit with individualised semantics, where the semantics may change if the image is combined with another image. The same process of overriding individual meanings through composition also arises in the juxtaposition of shots. We stated that a distinction between filmic and cinematic codes must be made, since the latter codify the reproduction of reality using cinematographic devices, while the former codify communication based on narrative mechanisms. As a result, we showed that film, though based on common human content and thematic structures, provides its own realities of time and space which are interwoven in the narrative structure. Finally, we illustrated that the process of arranging visual material is plan-based, by introducing a model of film editing which focused on narrative-oriented scene creation.
Thus, the analysis provided in this and the previous two chapters has defined the essential elements to support the automatic generation of meaningful and emotionally stimulating film sequences. It was shown that a model of film generation operates on two levels: firstly, on the surface level, which maps the concrete physical and social properties onto the visual material presented in film, and, secondly, on the underlying logic of the perceptual process.
A more detailed examination of the surface level of the generation model reveals that two main representational tasks can be identified. The first challenge is the problem of representing the optical pattern of images. The second representational task within the surface level of the generation model is concerned with an ontological representation describing the physical world and abstract mental and cultural concepts; representations that constitute the narrative "playing field". The main categories in this ontology are based on the elements described in Figure 2.4, i.e. events (actions and happenings), existence (characters and settings) and cultural codes. Obviously, both representational tasks mirror each other, in the sense that plot generation establishes the query for the retrieval of visual material, which must then be accessible on the basis of its content. However, film has its own "reality", which exerts considerable influence on the representation of common sense knowledge. The challenge for AI research is to synthesise the different representational requirements of common sense knowledge and film content, so that both can contribute to the editing of film.
The aim of the next three chapters is to describe the necessary representational structures of the surface level, with which an artificial system can then perform the task of developing an emotionally stimulating story by appropriately arranging the relevant film material. Chapter 5 is concerned with the representation of video content. Chapter 6 discusses representations of the background knowledge relevant to automated video editing. Finally, chapter 7 is concerned with the representation of narrative structures such as actions, events, and emotional codes, and also considers the representation of conceptual structures required for the construction of thematic effects such as ambiguity and mischief. The actual process of the generation of film sequences by our prototype system, AUTEUR, is described in chapter 8, with examples of created films being presented in chapter 9. However, before entering into a discussion of the above representational issues, we introduce several constraints on the ensuing investigation:
• In terms of shots, we assume short units with a restricted range of actors, actions and objects.
• In describing the representational structure of common sense knowledge, we do not intend to cover the whole spectrum of human knowledge18. In particular, the
discussion of genre in chapter 2 yielded the insight that the narrated world is a structured and stereotypical world. Hence, our aim is to provide structures for micro worlds that feature stereotypical actions, episodes and behaviour.
18 Human knowledge here describes that structured information which provides the perceiver with the ability of 'world making', which was described by Dudley Andrew as follows: 'Worlds are comprehensive systems which comprise all elements that fit together within the same horizon, including elements that are before our eyes in the foreground of experience and those which sit vaguely on the horizon forming a background. Those elements consist of objects, feelings, associations and ideas in a grand mix so rich that only the term "world" seems large enough to encompass it.' (Andrew, 1984, p. 38).
We introduce the above constraints as the time span of our research project is, of necessity, limited. Therefore, it is important to define a manageable problem space.
Chapter V
The representation of video content
The previous chapter defined the essential elements of the structure and function of video. It was shown that stills and film are representational systems with independent semantics, which may change if arranged according to the principles of montage. The aim of this chapter is to use the insights gathered from the discussion in chapter 4 to describe the necessary representational structures for video content. In particular, we wish to show how the optical pattern of images can be represented, which, as described in Figure 4.2, provides the receiver with mainly denotative information. As discussed in chapter 4, a content-based representation of video should be as objective as possible, in order to allow different connotative meanings to arise according to the context in which the material is presented.
Given the current state of the art in machine vision and image processing, we are able to automatically derive certain restricted representations of the content of video. However, the information so obtained is insufficiently rich (and will remain so for the foreseeable future) to create a representation of video content on which automated video editing can operate1. Thus, sufficiently detailed content annotations must be provided manually, to enable the artificial system to "perceive" the video material.
A scheme for the representation of video content must be concerned with more than the visual aspects of the shot. This scheme must also support the appropriate visual presentation of automatically generated, emotionally-motivated narrative sequences. Hence, we need to focus on the underlying organisational structure of the representation of video content, and its relationship to the representation of common sense knowledge (to be discussed in chapter 7), and the representation of editing knowledge (chapter 6).
1 Later in this chapter, we argue that image processing and machine vision may be used to support the creation and maintenance of structures for representing video content.
In this chapter, we first examine related work in the representation of video content. We then describe our approach for representing the structure, function and semantics of video. The chapter ends with a brief outline of how the proposed representational scheme can be applied automatically.
5.1 Related work
Film representation and automated editing is a relatively young field. Since the mid-1980s, attempts have been made to combine computer technologies and media studies to create artificial systems that embody mechanisms to interpret, manipulate or generate video. We now examine this related work.
5.1.1 Bloch and his machine for audio-visual editing
The first relevant contribution to the development of video content representation is that of Bloch (1986; 1988).
In his thesis, Bloch discussed in detail:
• the structural differences between the two media, language (spoken and written) and image (still and motion), and the implications of these for transforming written text into a visual construction;
• some important features of the juxtaposition of shots; for example, techniques for breaking down an action into shots, the principles of eye-line match and screen direction of character actions, and the importance of camera movement in relation to the movement of a character;2
• the process of film editing, which resulted in the design of a basic editing algorithm similar to that described in chapter 4 of this thesis.
2 Bloch's film analysis is based mainly on the work of Bazin (1967b; 1971), Burch (1981), Eisenstein (1948, 1951), Metz (1974, 1976b), Mitry, as described in Metz (1976a), and Jost. Since Bloch's thesis was written in French, he used the original texts. Most of these texts are available in English, except those by François Jost.
Bloch's object-oriented prototype system could generate short film sequences from a conceptual representation of two given stories.3 The first story mentioned two actors looking at each other. Bloch's system generated two visual versions: one version showed the two actors in the same shot, while the other combined two shots, where each shot contained one actor looking in the direction of the other. The second story presented a character walking down a spiral staircase, leaving a building, seeing a wallet and picking it up. The visual material used, shots stored on a video disc, contained simple actions which were specifically designed so that the problematic issues of matching point of view (camera position) and direction of actions between shots could be ignored.
In this section, we focus on Bloch's representation of video content. A detailed discussion of the importance of Bloch's work for the representation of editing techniques is given in chapter 6. Details of Bloch's approach to the actual editing process are discussed in chapter 8.
Bloch's perspective on film was strongly influenced by Metz's theory (Metz, 1974). Thus, Bloch divides film into segments (sequence shot, parallel syntagma, ordinary sequence, and so on), of which an autonomous shot forms the smallest unit. In his research, Bloch focused particularly on two aspects of the juxtaposition of shots. Firstly, he was concerned with matching two shots based on the line of sight of actors, their position within a shot, or a combination of both. The categories of the spatial transitions used were adopted from Burch (1981), i.e. inclusion, intersection and determinate or indeterminate proximity. Secondly, Bloch considered the problem of maintaining continuity, based merely on maintaining fluency of motion. Bloch divides motion into direction and speed. The direction of motion within a shot is described in terms of horizontal and vertical vectors. The speed of motion is defined as the relationship between the speed of the camera and the speed of the actions performed. The element of time, as represented by Burch's temporal taxonomy, is integrated with other meaningful elements, such as action and movement.
3 Bloch's conceptual representation of the story is influenced by the work on Natural Language Understanding of Schank (1982), Schank & Abelson (1977), Schank & Riesbeck (1981) and Schank's students Dyer (1982) and Lehnert, et al. (1983).
A brief overview of their work appears later in this thesis (section 7.1.4).
Thus, Bloch's approach to editing shots focuses on three key constraints, i.e. sight, position and motion, related to semantic units within a shot, i.e. actors, actions and location. Figure 5.1 shows the resulting representation of a shot in Bloch's scheme.

(deftrecord (#:machine:gen:plan plan)
  video      ;IMAGE DEBUT ET FIN DU PLAN
  action     [action of actors described as a triple vector (ActorId, ObjectId, Instant), e.g. (attend (actor (said)) (object gilles) (inst yeux))]
  acteurs    [list of character names]
  lieu       [location]
  ass        ;ASSOCIATION : UTILISE SEULEMENT POUR L'ORGUE
  type       [type of shot, e.g. plan autonome (PA)]
  interieur
  jour
  pos        [a list containing the ActorId-Shotposition, e.g. [said-gche, gilles-dte]]
  mvt        ;DIRECTION DU MOUVEMENT APPARENT
  vapp       ;VITESSE DU MOUVEMENT APPARENT
  regard     ;DIRECTION DU REGARD DU OU DES ACTANTS [a list containing the ActorId-Sightdirection, e.g. [said-dte, gilles-gche]]
  couleurs
  clarte     [clarity of light: value between 1 and 10]
  chaleur    [heat: value between 1 and 10]
  contraste  [contrast: value between 1 and 10]
  mca        ;DIRECTION DU MOUVEMENT CAMERA
  mpe        ;MOUVEMENT DES ACTANTS
  vpe        ;VITESSE DES ACTANTS
  effects)   ;EFFECTS SPECIAUX (POUR L'ORGUE)

Figure 5.1 Bloch's shot representation (Bloch, 1986, p. 149).4
The representation of directions is based on:
• ten elementary directions (left, up-left, up, up-right, right, down-right, down, down-left, front, back);
• circular direction, where the start position of the character is followed by the direction, e.g. cir: left > up-right;
• directions in sequence, e.g. left + cir: left > down-right + down (describes the movement of a character down a circular staircase).
4 Texts in [] were added by the current author.
The shot categories intérieur, jour, couleurs, clarté, chaleur, contraste, associations and effects were not used in the prototype. However, they give an impression of the wider shot representation considered by Bloch.
Bloch's work reflects a deep insight into the issues of the representation of video content. However, it has deficiencies in terms of the retrieval of denotative and connotative information. The distinction between character and object, for example, is blurred. The representation of space is rudimentary: for example, no distinction between foreground and background is made, and the relation between intérieur and position is left unclear. Furthermore, Bloch's representation lacks the ability to handle multiple overlapping actions and camera movements. Finally, the representation of point-of-view is not tackled, as noted by Bloch himself. Nevertheless, Bloch's approach provides an extremely useful basis for representing video content. In particular, Bloch provides semantic and syntactic structures of video content and introduces the key categories: action, character, location, context, shot type, relative position of characters and objects within a shot, movement and light (Bloch, 1986, p. 49).
5.1.2 Parkes and CLORIS
Parkes describes the CLORIS system (Parkes, 1987; Parkes & Self, 1988; Parkes, 1989a; Parkes, 1989b; Parkes, 1989c; Parkes, 1992). In CLORIS, a user can interrupt a moving film at any point (by executing a simple action with a “mouse”), and then use menu commands to pose questions to the system about the objects shown, and the events and states underway in the narrative at the interrupted point.
The video used in CLORIS was pre-shot videodisk material on the use of a micrometer. The CLORIS methodology for knowledge-based description of video sequences is derived from cognitive theories of visual information processing (Gibson, 1950; Gibson, 1971; Gregory, 1971; Kennedy, 1974), research on temporal knowledge representation (Allen, 1983), episodic memory (Lehnert, et al., 1983; Schank & Abelson, 1977; Schank & Riesbeck, 1981), conceptual graphs (Sowa, 1984), and cinema theory (Arnheim, 1983; Balázs, 1972; Bettetini, 1973; Carroll, 1980; Eisenstein, 1970; Metz, 1974; Monaco, 1981; Pudovkin, 1968; Spottiswode, 1955). Three overall dimensions of the representation of video material are defined, and are realised in the CLORIS system as the following components:
Domain Representation Inference System (DORIS)
In contrast to other multimedia researchers (Clark & Sandford, 1986; McAleese, 1985), Parkes understood that, even though a visual medium is specific, it is necessary to make a distinction between the actual portrayed object (e.g. Jim's house) and the class the object belongs to (e.g. houses). He states the following requirements of video representation:
'A representation language for photographic material needs to be able to maintain the distinction (for physical objects, at the very least) between concepts and instances of concepts.' (Parkes, 1989a, p. 73).
'A representation language for photographic material needs to be able to maintain the distinction between, and commonalities of, information about concepts and information about specific instances of those concepts.' (Parkes, 1989a, p. 75).
'A representation language for photographic material needs to be able to maintain a definitional type hierarchy which includes the concepts which have instances depicted in the visual material.' (Parkes, 1989a, p. 76).
'Objects do not exist merely on film, but have an independent existence in the real-world. [...], even if all the visual details of an object are presented, discussion may be required about the non-visual characteristics of that entity, its relationships to other objects in a domain, etc.' (Parkes, 1989a, p. 78).
In his representation for the background knowledge required by a viewer of a film, Parkes rigorously distinguishes between classes and instances of classes. CLORIS uses scripts (based on those described by Schank & Abelson (1977)) to represent stereotypical sequences of events. Type labels within the scripts feature in a type hierarchy, and some types are defined by means of type definitions and schemata. Events within scripts are represented by event-frame-rules, which capture the pre- and post-conditions for an event, and also represent information about the states that facilitate, accompany and result from an event. The constructs used to provide the descriptions of scripts were graph structures and event relations adopted from Sowa (1984). The description of temporal relations between events draws on interval-based temporal logic (Allen, 1983).
Image State Descriptions (BORIS)
A fundamental insight of Parkes' research is the understanding that the image has a 'continuum of meaning', depending on whether the image is displayed in a dynamic context, or simply as a still image (Parkes, 1989a, pp. 68 - 70).
Consequently, he rejects the keyword approach to the representation of images as insufficient, since keywords can neither provide the necessary relations between the objects within an image nor support the relations between images; moreover, keywords do not support inheritance. Additionally, Parkes develops a new conception of the basic unit for the representation of film content, the setting:
'Definition: a setting in a moving film is the unit of film associated with the longest time-interval over which the visual content of the film can be objectively described by using the same conjunction of formulæ. A setting description is such a conjunction of formulæ.' (Parkes, 1989a, p. 44)
Thus, a film setting could be a single frame or a collection of several hundred frames. Furthermore, Parkes states:
'The setting is the minimal described unit of film sequence at the level of events i.e. the constituent below which descriptions, at the level of events, are not attached.' (Parkes, 1989a, p. 44).
Settings have their own descriptions (so that the system can discuss the content of frames from the setting as pictures in their own right). If a sequence of events described by some script is depicted in a moving film, the script is specialised to refer to the particular objects and actors within that film, and the events within the script are associated with the settings over which those events are realised. As the system has access to the original, “abstract” script, it can infer which events are “present” in the narrative but have not been displayed in the film (what Carroll (1980) calls “linear deletion”). The concept of the setting enables Parkes to describe relations between images in terms of relations between settings, where such relations represent camera movements, such as pan, tilt and zoom. The major problem of Parkes' approach is that it can cope neither with multiple overlapping camera movements nor with overlapping setting structures.
Film Structure Knowledge (MORISS)
When answering questions, CLORIS would, if possible, use sequences of the film to support the text generated as the answer. However, this facility was rudimentary, and CLORIS would merely piece together sequences of film featuring a relevant concept or event, without concern for the rules of editing or montage.
Parkes' system was the first to demonstrate that content-based descriptions of moving film could be used by a system to intelligently discuss those films. His major contributions to the representation of video content are the identification of the setting and its context-dependent behaviour; the strict separation between instances and classes of objects within an image; and the emphasis on the objective description of objects in an image. As a result of these insights, Parkes describes the settings structure, which offers a facility for browsing images on a spatial basis. However, as with Bloch's scheme, Parkes' approach to content representation is deficient when it comes to the retrieval of denotative and connotative information within a setting. Furthermore, Parkes' scheme contains no knowledge of editing and montage.
5.1.3 Aguierre-Smith and the Stratification System
The research carried out in the Interactive Cinema Group at MIT is concerned with exploring the use of digital technologies to support the collection of, and access to, nonlinear media materials.
The aim of such research is to provide tools for the design or use of media units, such as interfaces for the annotation of video material or the tailoring of video news stories. Much of the research focuses on the indexing problem for video (Davenport, Aguierre Smith, & Pincever, 1991; Mackay & Davenport, 1989). Results of this work are systems for the annotation of video material which consider stream-based/keyword methods of representation (Aguierre Smith & Davenport, 1992; Aguierre Smith & Pincever, 1991), or stream-based/iconic approaches (Davis, 1993; Davis, 1995).
Aguierre-Smith designed the Stratification System to support an anthropological video study in the state of Chiapas, Mexico (Aguierre Smith & Davenport, 1992). The idea was to provide a number of researchers with random access to a video archive, in which video could be annotated with complementary or even contradictory descriptions. The video material was stored on a laserdisc. The annotations used in the Stratification System were keywords organised in hierarchical classes, which were implemented as directory trees in UNIX. The novel feature introduced by Aguierre-Smith was the multiple partially overlapping annotation, where each annotation is related to a precise time index (begin and end frame). Figure 5.2 illustrates the idea behind such a layered context representation, for a shot consisting of one hundred frames.
Figure 5.2 Layers of annotations for a 100-frame shot: the annotations "bike", "pepsi", "praying" and "garden" each span their own interval of frames between 0 and 100.
To provide a visual representation of the distinct layers of the representation, the Stratification System used a histogram, where the keyword classes are displayed as buttons along the y-axis and the time code (frame numbers on the laserdisc) forms the x-axis. Aguierre-Smith's stream-based content representation for video enables the dynamic development of context while maintaining the completeness of the original footage. The notion of multiple partially overlapping annotations establishes the Stratification System as a breakthrough in the effective representation of video content, despite its weaknesses, i.e. the keyword approach and the lack of a true representation of the semantics of the video.
5.1.4 Semantic and conceptual indexing for video
The work of Chakravarthy (Chakravarthy, Haase, & Weitzman, 1992; Chakravarthy, 1994) is in the tradition of standard record retrieval and deductive retrieval in databases. Chakravarthy's scheme provides computer access to a large database of semantic knowledge and rules that manage background knowledge to match user queries to the representation of stored pictures or video clips. The concepts described in the semantic network use the organisational structures of WordNet (Miller, Beckwith, Fellbaum, Gross, & Miller, 1993). We return to WordNet later.
Chakravarthy's representational scheme for visual material is based on a set of actions, including information about the agent, object, location, etc. If the picture or video clip does not show actions, then the annotated description describes only people, location or objects. However, the content representation of video does not provide information about the temporal relationships between different actions, nor does it represent cinematic features. The matching rules are of three classes (object, action, semantic relations), each providing heuristics for finding pictures or video clips that match the user's query.
The relations provided for each class are:
• object: A-KIND-OF, HAS-MEMBER, ASSOCIATED-WITH, PLAYS-ROLE-IN
• action: ENTAILS, CAUSES, TYPICAL-ACTION
• semantic relation: rules describing relations between entities, where matching rules make use of additional contextual information; e.g. combinations of LOCATION-OF and PART-OF may be used to create a rule that can retrieve visual material showing a part of an object in a particular location.
Chakravarthy's system enables obvious matches, such as presenting a Basset hound in response to a query for a dog. However, it can also perform more sophisticated matches, such as answering a query for the action "riding" by providing a clip of an astronaut driving a lunar buggy, or satisfying a query for an action in a hospital with a video clip of a doctor positioning a microscope for microsurgery in an operating room.
A related approach is taken by Lenat and Guha for their OPIAM system (Lenat & Guha, 1994). In this research project, the large semantic knowledge base and inference mechanisms of the Cyc system (Lenat & Guha, 1990) (also discussed later, in section 7.1.2) are applied to the representation and retrieval of still and moving images. The goal of OPIAM is that the captioner provides a fairly neutral statement of the content. This, together with the domain knowledge, is then used to generate indices on demand, at query time, to support the retrieval process. For example, the system satisfies a user query "Find images of shirtless young men in good physical health" by presenting clips annotated as "Pablo Morales winning the men's 1992 Olympic 100-meter Butterfly event" and "Three blond men holding surf boards on the beach".
However, there are several problems in the approaches of both Chakravarthy and Lenat & Guha. Firstly, inferences that are based on the indexing statement may lead to inappropriate retrieval results. Lenat's example of the three blond men holding surf boards on a beach may show men in good physical health, but this may not necessarily be the case. Secondly, neither system specifically orients its representation to the requirements of still and moving visual material in the ways we specified as necessary in chapter 4. This means that the systems do not take video-specific ontological properties and constraints concerning semantics and syntax into consideration. Thirdly, both systems represent the image or video content in an explicitly determined way, which reduces the possibility of exploiting such semantic and syntactic issues as those raised by the Kuleshov experiments (section 4.2.2.2). Nevertheless, the approach of using semantic background knowledge to represent and retrieve images and video enables the derivation of different connotative meanings depending on the context in which the material is presented (discussed in sections 4.2.1.2, 4.2.2.1 and 4.2.2.2). Furthermore, semantic networks add to an automated editing system by facilitating the drawing of inferences regarding which shots may be substituted for others, which in turn increases the fullness and accuracy of the representation of context.
As an alternative to natural language based query and search approaches, Domeshek and Gordon (Domeshek & Gordon (1995); Gordon & Domeshek (1995)) propose conceptual indexing, organised around cases in memory, that supports browse- and zoom-oriented retrieval.
Domeshek and Gordon base their work on Domeshek & Kolodner (1994), Kolodner (1993), Schank (1982), Schank & Abelson (1977), Schank, Kass, & Riesbeck (1994), and Riesbeck & Schank (1989), and have developed a stock video library for Andersen Telemedia, to promote video production for training purposes. The conceptual indexes are based on a canonical representation of concepts in particular cases (here for the domain of the everyday social world) and a specific vocabulary. Six types of indexes are suggested (Gordon & Domeshek, 1995)5:
• the scene content, based on abstract organisational schemes for people, activities and locations,
• the points illustrated by a clip, i.e. an abstract idea or concept communicated by the clip,
• the composition and camera work in a clip,
• the likely function of a clip in a larger narrative, i.e. as part of an interlude or prologue,
• information concerning the source of the clip,
• the relationship to other clips in the library.
5 The version seen by the current author during a visit to the ILS in August 1995 supported only indexing for the scene content and the points illustrated by a clip.
The semantic network of the system is composed of single nodes, where each node represents a set of disjoint concepts, and each concept contains the indexes for a case. The indexes for the scene content, for example, include information about the location, the events happening in the clip, the people and their roles, and objects. It must be stressed that the concepts forming the indexes of a case are unstructured, which allows the creation of simple frameworks and basic matching algorithms. The domain-dependent vocabulary for each index type is organised into several hierarchical categories. For the component "location", the relevant categories are specific places (organised by contained-in relationships), the function of individual places (e.g. library, submarine, etc.) and the type of place (e.g. natural place, man-made place).
The retrieval mechanism used is based on the zoom and browse approach developed for Ask Systems (Schank, 1994). Domeshek and Gordon adopt this approach by offering the user case indexes that allow the user either to return to the beginning of the retrieval process (zooming), or to navigate through related case indexes by following the system-provided links to other conceptual indexes related to the types of indexes on which the user is currently focused. The form of search suggested is, therefore, an incremental case discrimination based on the availability of indexes.
The importance of the case-based approach presented by Domeshek and Gordon is that a parallel can be drawn between retrieval in case-based reasoning systems and visually oriented storage systems. However, there are two drawbacks to the approach. Firstly, the visual material is understood as a text that communicates a particular idea or concept, and is thus described as if the idea directly coincides with the content. This is not the case, as our discussion of the connotative information provided in an image showed (sections 4.2.1.1, 4.2.1.2 and 4.2.2.1). Thus, the case-based approach presents serious problems when visual material needs to be resequenced or repurposed. Secondly, the structures in a case-based approach are based upon indexes for particular concepts in particular cases.
This means that a specific indexing vocabulary must be introduced, on which the assignment of concepts to cases can be based, which in turn restricts the indexes to those domains that have been analysed. As we will show later in this chapter, the case-based approach presented by Domeshek and Gordon not only serves the task of retrieval, but can also be valuable for the creation of meaningful narrative film sequences, as the collection of cases can, for example, represent a particular genre or theme (discussed in section 2.1.1). We will discuss this point later, in chapters 7 and 9.
5.1.5 Davis and Media Streams
A five-year programme of collaboration between BT (British Telecom) and MIT has attempted to develop automatic and semi-automatic tools for the construction, interaction and distribution of images and sequences of images (Pentland, Picard, Davenport, & Haase, 1993). The main emphasis of the research is the development of database-oriented mechanisms for storing (image representation) and retrieving (including browsing) images on the basis of their semantic content. A second aim is the development of user-friendly tools for the recording, annotation and presentation of images. An important result of the research is Davis' system Media Streams (Davis, 1993; Davis, 1995).
The main challenge in designing systems to promote the use or manipulation of video, such as interactive TV or video on demand, is defined by Davis to be the mastery of the video content representation problem. In his thesis, Davis states:
'Signal-based parsing and segmentation technologies must be combined with representations of the higher level structure and function of video data in order to support annotation, browsing, retrieval and resequencing of video according to its content. [...] The challenge is to develop usable technologies for the representation of video content that can leverage off of what machines can currently offer us and what humans can achieve with computational support. We are in need of technologies which add structure to the signal such that video data becomes a structured data type which can more effectively support current functionality and uses, and more importantly, enable new uses and functionality.' (Davis, 1993, p. 26).
Media Streams is the result of a number of influences, such as:
• dynamic memory, case-based reasoning and ontological and analogical knowledge representation (Haase, 1994; Lenat & Guha, 1990; Schank, 1982; Schank & Riesbeck, 1981),
• text and film analysis based on reader response theory (Bordwell, 1985; Bordwell, 1989; Iser, 1989; Iser, 1993),
• formalist, structuralist and semiotic approaches to film theory (Eco, 1976; Eco, 1977; Eisenstein, 1948; Eisenstein, 1951; Eisenstein, 1970; Kuleshov, 1974; Metz, 1974),
• the aesthetics and practice of the reuse of TV material by fans of TV series (Jenkins, 1992).6
Media Streams is an advanced system for the annotation, retrieval and browsing of video and audio7. Furthermore, it supports the repurposing of video. A key feature of Media Streams is an iconic visual language used to create temporally indexed, multi-layered content annotations that support the retrieval and repurposing of video descriptions. The design of Media Streams' annotation language is based on Davis' application of the distinction between the sequence-independent and sequence-dependent meaning of an image, and the necessity for an objective description of video content (see also the discussion of Parkes' work in section 5.1.2).
Davis' principal categories for the description of video content include the actions of humans and objects in spatial and temporal locations, also taking account of weather and lighting conditions. Furthermore, he highlights the important problem of representing cinematographic properties, such as camera movement and framing, or properties of the recording medium (colour, granularity, etc.), which also carry denotative and connotative meaning. To support the paradigmatic as well as the syntactic features of video, Davis introduces:
• A representation for actions based on Eco's triple articulation of codes of action (kinesic figure, kinesic sign and kinesic semes). The spatial decomposition of actions is organised around the body parts that participate in the action. The temporal decomposition of actions is based on a hierarchical organisation that describes longer sequences of actions as being composed of temporal sub-abstractions (Lenat & Guha, 1990). The direction of actions is related to the object or the screen position at which an action is oriented.
• A representation of character, which is oriented not towards identity but towards continuity. The description distinguishes between the actor and the role. Actor contains distinguishable descriptive elements such as sex, age, body type, skin colour, etc. The role of a character is based on his or her appearance. The uniform of a general, for example, identifies the character as such.
• A representation for objects, oriented to form and function.
• A representation of screen geometry, capturing spatial relations between objects in symbolic terms, such as "in front of", "on top of", "inside" etc., and the position of an object on the screen.
• A representation of location, stating the actual location of filming, and descriptions that distinguish between geographical and functional space.
• A representation of time, containing the actual time of filming and details of temporal aspects of the portrayed event (historical period, time of year, time of day, etc.).
• A representation of cinematographic devices, containing descriptions relating to the camera, the recording medium, and spatial and temporal transitions of the shot. The camera is represented in terms of descriptions of lens actions (framing, focus, exposure), tripod actions (angle, canting, motion) and truck actions (height and motion). The recording medium is described in terms of stock type, colour quality and colour grain. The spatial and temporal transitions within a shot are adopted from Burch (1981).
6 The list of references here contains only those which can be found in the current bibliography. For certain authors Davis refers to additional material. The references for Iser and Jenkins were added as they are of significant importance to Davis' approach to the representation of video content.
7 At the time of writing, Media Streams runs on an Apple Macintosh Quadra 950. The database contains 17 different videos with a total length of 24.07 minutes and 2090 annotations. Media Streams' visual annotation language is based on 3500 iconic primitives.
The above categories are organised in a cascading semantic hierarchy with increasing specificity of primitives on subordinate levels. The relationships between levels are represented as class/instance (adult/male/Paul), class/subclass (dog/Pekinese), whole/part (lamp/electric bulb) and term/co-occurring term (toothpaste/toothbrush).
The hierarchy is implemented in FRAMER, a knowledge representation database language developed by Haase (1994). The basic unit in FRAMER is called a frame, which can have other frames (called annotations) subordinate to it. Figure 5.3 shows a typical FRAMER structure. Inheritance, to be understood here as the basis for the paradigmatic ordering of units, is established as a relationship between a prototype (animals) and its spin-offs (Fido the Wonder dog).

animals
  fish
  birds
  amphibians
  mammals
    primates
    canines
      Fido the Wonder dog
        legs (ground: 4)

Figure 5.3 FRAMER structure for Fido the Wonder Dog's legs (taken from Davis (1995, p. 137))
Key elements of Media Streams' icon-based interface are the Media Time lines, where the iconic annotations for a particular piece of video are given temporal boundaries (in and out points) and semantic relations (spatial location, character, character action, object, object action) which connect the annotations in episodic structures, such as 'Mava lying on the beach' or 'A wave crashing into Mava' (Davis, 1995, p. 147).
Davis shows, with the above structures, that user queries can be mapped directly onto the concepts represented in the semantic hierarchy and matched against the indexes for each case. The strategies used are based on three types of similarity: semantic similarity, relational similarity and temporal similarity. The final result of a query is valued on the basis of:
• an exact match between the query and the retrieved video material,
• the hierarchical distance between the prototype of the query and the spin-offs in the match (the lower the better),
• the hierarchical distance between the prototype of the match and the spin-offs of the query (the lower the better),
• a match where the retrieved material and the query both form immediate spin-offs of a common prototype (the higher the better).
Media Streams can use inheritance inferences in the matching process. For example, in response to the query 'adult male eating food', Media Streams returns a shot of 'Steve Martin eating pizza', 'an elderly male eating food' and 'Charlie Chaplin eating a shoe' (Davis, 1995, pp. 193 - 194). Davis also provides examples where the user query is not satisfied by a matching sequence of the annotated video material but is composed out of parts from various video fragments. Davis refers to this retrieval strategy as retrieval-by-composition. The mechanisms behind Davis' retrieval strategy are based on the continuity of actor, role, location and/or action. Take the following query as an example:
'An adult female at a beach rotating her body clockwise and then a medium shot of an adult male waving his right arm with a boat in the shot'. (Davis, 1995, p. 188).
Media Streams retrieves two different sequences. The first shows a shot of Mava on the beach, turning to look off screen, followed by a shot of John waving from a boat. The second version contains the same shot of Mava, this time followed by a shot of a male sitting on a horse, waving a gun. In the background of the shot is a boat. A second example, which is of particular interest for our purposes, describes a query requesting an elderly female with mud on her head, followed by a shot of a laughing character of indeterminate sex. Davis points out that the retrieved result is not particularly successful in terms of spatial continuity, but that it works due to the presented action-reaction pair (Davis, 1995, p. 190).
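The hierarchy-based valuation just described can be made concrete with a small sketch. The following is illustrative only, with hypothetical names and an invented toy hierarchy; it is not Davis' FRAMER implementation.

# Illustrative sketch (hypothetical names; not Davis' FRAMER code):
# matches are valued by the hierarchical distance between query and
# annotation in a prototype/spin-off tree; smaller distances rank higher.

HIERARCHY = {                     # spin-off -> prototype (toy example)
    "Steve Martin": "male",
    "Charlie Chaplin": "male",
    "male": "adult",
    "elderly male": "adult",
    "adult": "person",
}

def prototypes(concept):
    """The chain of prototypes from a concept up to the root."""
    chain = []
    while concept in HIERARCHY:
        concept = HIERARCHY[concept]
        chain.append(concept)
    return chain

def match_value(query, annotation):
    """0 for an exact match; otherwise the prototype/spin-off distance."""
    if query == annotation:
        return 0
    if query in prototypes(annotation):      # annotation is a spin-off of the query
        return prototypes(annotation).index(query) + 1
    if annotation in prototypes(query):      # query is a spin-off of the annotation
        return prototypes(query).index(annotation) + 1
    return float("inf")                      # related only via a common prototype

print(match_value("male", "Steve Martin"))   # -> 1 (immediate spin-off)
print(match_value("adult", "Steve Martin"))  # -> 2 (more distant, ranked lower)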
Though Davis describes the above sequences, and other example sequences, in cinematic terms, it is obvious that Media Streams contains no explicit knowledge of concepts such as:
• methods for shot juxtaposition based on narrative-related editing strategies (e.g. action match),
• object specification based on shot transformation (e.g. long shot -> close-up),
• constructive strategies for the creation of emotional reactions, as assumed in the above example of the elderly woman.
The ability of Media Streams lies not in the creative combination of shots, but rather in the retrieval of video material that is similar or analogous to a given query, and is then presented in the order specified by the query. Thus, the narrative is explicitly specified by the query. The composition of the given examples is not based on knowledge of sub-narrative structures, organising principles for video sequences (such as continuity of graphical appearance between objects), or thematic intentions.
In conclusion, Media Streams is a useful system for the annotation and retrieval of digital video. Davis' stream-based ontology for video is an important development in the representation of video, with respect to its denotative, connotative and semantic features. Media Streams' strongest asset is that it demonstrates how an intelligent interface can facilitate the rapid annotation of large quantities of video. However, with respect to the repurposing of video, Media Streams plays a merely supportive role, since any requirements for mise-en-scène, ordering, structure, cinematography, and so on, must be explicitly stated in the query created by the user. Thus, strictly speaking, Media Streams is not a system for generating video sequences automatically.
5.2 An ontology for the representation of film content
We now combine the results of our discussion of the semantics and semiotics of film (chapter 4) with elements of the various approaches discussed above, to specify in detail our approach to the representation of video content.
An important point concerning the use of an iconic visual language for video annotation and retrieval is made by Davis, i.e. that there exists an analogy between the two visual systems of icons and video8. Though icons are not identical to video, they share the same parallel legibility in terms of 'gestalt view of features, foregrounding and backgrounding and spatial relationships' (Davis, 1995, p. 258). From this, Davis concludes that representations of visual media ought themselves to be visual. We share Davis' reservations about language-oriented representations of video, as far as the interface is concerned. However, here we are concerned not with the interface but rather with the underlying representational structures and units needed to support essential tasks in the editing process, i.e. the maintenance of continuity and temporal clipping. Thus, we need to develop descriptive computational structures for video content that can also be synchronised with structures representing the physical world and abstract mental and cultural concepts. Our proposed solution is based on a textually oriented representation that describes the semantic, temporal and relational features of video in a structured way, and uses a vocabulary based on a subset of natural language.
8 See also Eco (1977; 1985).
Nevertheless, the previous chapters emphasised the need to describe a communication system in its own terms, which may differ from those provided by written and spoken language. What should be noted, however, is that we use the connotative features of textual language for description, without claiming that the resulting meaning is directly linguistic. In other words, we use textual terms to express the salient features of video in a representational system whose structure is designed to match visual requirements. The advantage of such an approach becomes apparent when we later discuss representations of background knowledge (chapter 7), where the analytical nature of natural language, i.e. its capacity to generalise the abstract idea of a mental or cultural concept, is fully exploited.
We now address our proposed framework for the representation of video content. First, we state the assumptions made in our approach.
5.2.1 General concepts and assumptions
The following video representation formalism pays specific attention to the maintenance of objectivity in the description of shots, so that given shots can be used for a variety of purposes. The shot is a complex combinatorial system for the visual presentation of location, lighting, costume and the behaviour of subjects (mise-en-scène). An important and influential source for the representation of such primary visible properties is Marr's theory of vision (Marr, 1982). Two important points arise from Marr's work. First, the content of visual representation is always individuated by reference to the physical subjects, their properties and the relations among the subjects that are seen. Second, there is in principle the possibility that a person's visual interpretation based on objective and physical objects and properties might be mistaken, and require other modalities to rectify the mistake (for Marr, this might be achieved by using another sense, e.g. touch).
The preceding chapter discussed in some detail the two important and fundamental structures for the signification of any sign system, i.e. the paradigmatic and syntagmatic axes of meaning. The basic organisation of these structures is hierarchical, as is the organisation of the description of a shot itself, in that the description descends from general features to detailed specifications. This enables us to reduce redundancy, since valid relationships between descriptional units can be implicitly expressed (inheritance). A similar approach is provided by Parkes' setting structure (Parkes, 1989a).
Though the signification of film is strongly based on common human content and thematic structures, film clearly provides its own communicational mechanisms, i.e. cinematic and filmic codes, such as the use of fades and wipes as punctuation devices, or the use of the spatial relationship between camera and subject to create a three-dimensional space. Hence, our shot description features two main structures, cinematographic devices and denotative aspects. The denotative part of the shot representation is subdivided into two nested structures, the foreground and background, each containing information about the essential categories in the proposed ontology: action, character, object, relative position, screen position, geographical space, functional space and time.
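As a preview of the structures developed in the remainder of this chapter, the sketch below illustrates the two-part shot description just outlined. It is a minimal sketch with hypothetical field names, not the actual AUTEUR representation.

# Illustrative sketch (hypothetical field names): a shot description
# split into cinematographic devices and denotative aspects, with the
# denotative part nested into foreground and background layers.
from dataclasses import dataclass, field

@dataclass
class DenotativeLayer:
    actions: list = field(default_factory=list)
    characters: list = field(default_factory=list)
    objects: list = field(default_factory=list)
    relative_position: dict = field(default_factory=dict)
    screen_position: dict = field(default_factory=dict)
    geographical_space: str = ""
    functional_space: str = ""
    time: str = ""

@dataclass
class ShotDescription:
    shot_id: str
    cinematographic: dict                 # e.g. {"shot_type": "close-up", "camera": "pan left"}
    foreground: DenotativeLayer = field(default_factory=DenotativeLayer)
    background: DenotativeLayer = field(default_factory=DenotativeLayer)

shot = ShotDescription("shot-1", {"shot_type": "medium"})
shot.foreground.characters.append({"id": "actor-1", "role": "doctor"})
shot.background.geographical_space = "Paris"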
It can thus be said that the proposed structure for the content representation of a shot is decontextualised, as inspired by Eisenstein's concept of attractions (Eisenstein, 1948, p. 231). It is the decontextualisation of a shot that enables the selection of objective information, which can then be used as a basis for rearranging or constructing new connotative combinations, and thus new meanings. The aim of our content representation is to identify the visual aspects of the video which can be seen in the shot, rather than those one might infer. As our representation is language oriented, we face two major and related problems concerning the primary level of representation: objectivity and continuity.

As we are using textual terms, we must avoid any overly directive choice of labelling. Therefore, we introduce generic terms. For example, instead of instantiating the action of an actor as gorge, which implies greed, only the most general term, eat, is recommended. It is then the task of the system's inference mechanisms to use the wider context established through background knowledge, to either conclude greed, or reject this interpretation. Though the information relevant to the essential categories in the denotative dimension of the ontology should be sufficient for drawing inferences, it is not necessary to slavishly apply the principle of objectivity. To do so may, in any case, lead directly to a single interpretation. Consider, for example, a character wearing a white coat, white trousers, and white shoes, with a stethoscope around his or her neck. He or she appears to be a doctor, and should be labelled so. Further examples relate to the description of geographical or functional space. An image showing the Eiffel Tower can acceptably be labelled as Location: "Paris", or one showing four huge heads of American Presidents as Location: "Mount Rushmore" or "USA". Moreover, in the Paris example, there is yet scope for interpretation, to clarify whether Eiffel Tower should, for example, be understood in terms of the semantic tags "Eiffel Tower - Paris France - Life Style" or "Eiffel Tower - Engineering - Intelligence". Such interpretations depend on the wider context and the intended concepts to be communicated. In conclusion, the use of generic terms taken from natural language supports, but does not uniquely determine, the interpretation of a shot.

Related to the problem of objectivity is the problem of continuity. Kuleshov's experiments (section 4.2.2.2) and Eco's semantic systems (section 4.2.1.1) reveal that cinematic continuity can be achieved through different content categories:

actor       e.g. Charlie Chaplin
appearance  features of a character are shown without identifying the character
action      continuity can be achieved through reaction to, chronology of, or
            direction of, the action
location    e.g. to show a house and then a room

The essential aim of cinematic continuity is to hold constant the distinguishing details of a character or object over a number of shots, unless there is a reason to change the appearance of the character or object. Thus, continuity leads to the problem of identity. The representation of identity is a complex problem, particularly in film, a medium that does not suggest, as does language, but rather states. Imagine a number of shots to be joined, each of which shows a character described with the same values for various attributes (e.g. Race = "black", Role = "doctor", etc.).
For an artificial system which depends on the given information, the person in each shot would be assumed to be the same, even though each shot may, in fact, show an obviously different human being. The problem here is one of descriptional depth and its maintenance over time; a problem clearly related to the frame problem.9 Expedience dictates the need to reduce the descriptive richness to a manageable level. For this, we introduce an identifier, which is always used when a character or object is objectively distinguishable, e.g. we see the face of a character.10

To facilitate dynamic use of video material, we follow the stream-oriented approach, inspired by Aguierre Smith & Davenport (1992), in combination with Parkes' concept of the setting as outlined above (Parkes, 1989a).11 The usefulness of combining these two approaches lies in gaining the temporality of the multi-layered approach without the disadvantage of using keywords, as keywords have been replaced by a structured content representation. For each objectively described unit of video associated with a time-interval, it can be stated that the visual content of this video holds constant over the time interval, and what is invisible does not exist.

Hence, we associate each descriptive unit in both the content and the cinematographic section with a particular frame sequence. The connection between the different layers of a shot is realised by applying a triple identifier to each layer, which indicates the shot identifier, the start frame and the end frame. Thus, multilayered descriptive structures of video content are created, where the multiple aspects of description are held together by time (sameness of frames) and logical space (shot-id). For example, an actor may perform a number of actions in the same time span. The temporal relation between them can be identified using the start and end points with which those actions are associated. For example, they may all share the same start and end point, and may be performed simultaneously. In this way, complex structured human behaviour can be represented, and hence the video retrieved on this basis.

9 Some of the more important works discussing the frame problem are Hayes (1990), McCarthy & Hayes (1990), Raphael (1971), and Sandewall (1972).
10 This approach is in accordance with that of both Bloch and Parkes. A different opinion is expressed by Davis.
11 A similar approach can be found in research by Butler et al. (1996), who realises filmic principles as a "film grammar", so that the system can generate films of a certain type. The nonterminal symbols of the grammar represent groups of video segments (e.g. a close-up fragment), which are derived from rules based on filmic principles, such as parallelism, subjective shot or repetition. Setting descriptions (see above) are used as terminal symbols that offer the opportunity to realise event structures, which are then used to fire the rules of film principles. Butler's notation, like that of Parkes (Parkes, 1989a; Parkes, 1989b), is based on conceptual structures derived from conceptual graphs (Sowa, 1984). At the time of writing, Butler is completing his Ph.D. research.
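The triple identifier and the temporal comparison of layers might be realised along the following lines (a simplified Python sketch under our own naming assumptions, not the system's actual code; the same interval arithmetic also isolates the shot segment discussed next):

from dataclasses import dataclass

@dataclass
class Layer:
    # One descriptive unit of a shot, anchored by the triple identifier.
    shot_id: str       # logical space: the shot the layer belongs to
    start_frame: int   # first frame for which the description holds
    end_frame: int     # last frame for which the description holds
    category: str      # e.g. "action", "character", "object"
    value: str         # e.g. "eat", "sit", "talk"

def simultaneous(a, b):
    # Layers of the same shot describe simultaneous content if their
    # frame intervals overlap (sameness of frames, same shot-id).
    return (a.shot_id == b.shot_id
            and a.start_frame <= b.end_frame
            and b.start_frame <= a.end_frame)

def common_segment(layers):
    # The frame interval over which all given layers hold, or None if
    # there is none; this isolates the segment relevant to a query.
    start = max(l.start_frame for l in layers)
    end = min(l.end_frame for l in layers)
    return (start, end) if start <= end else None

# A hypothetical annotation of a 100 frame shot:
eat = Layer("s1", 0, 100, "action", "eat")
sit = Layer("s1", 0, 100, "action", "sit")
talk = Layer("s1", 30, 70, "action", "talk")
assert simultaneous(eat, talk)
assert common_segment([eat, sit, talk]) == (30, 70)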
Figure 5.4 shows a layered description of a shot consisting of a hundred frames, featuring the actions of a single character.

Figure 5.4 Actions annotated in layers in a 100 frame shot

The horizontal lines in Figure 5.4 denote actions, whereas the vertical lines delimit the various content-based layers that can be extracted from this shot. Applying this schema to all descriptive units enables the retrieval of particular material with no restrictions on the complexity level of a query. Take the simple example described in Figure 5.4. If there is a need for a character who eats, sits and talks simultaneously, we are now in a position to isolate the essential part of the shot, as shown in Figure 5.5.

Figure 5.5 Relevant shot segment for a query for all three actions

The relevant procedures for performing such retrieval and cutting are described in detail in chapter 6. Having introduced the overall structure of our shot representation, the next stage is to specify the units in the different representational categories. The discussion begins with cinematographic devices.

5.2.2 Cinematographic devices

The representation of cinematographic devices, as presented in Table 5.1, is derived from discussions with film editors and the analysis of film theory, as presented in chapter 4. Our aim is to facilitate the application of those cinematic codes that are related to the medium-specific technology, i.e. camera, lens, filmstock, as it is this technology which manifests itself in the medium's unique expressiveness (see Figure 2.4).

Name              Description
Shot ID           Identifier
Shotlength        in frames (25 frames per second)
Startframe
Endframe
Shot kind         a structure including:
                  lens movement    zoom-in [start camera dist., end camera dist.],
                                   zoom-out [start camera dist., end camera dist.]
                  masking          left, middle, right
                  lens state       deep focus, foreground-focus, background-focus
                  camera distance  extreme close-up, close-up, medium, medium long,
                                   long, extreme long
                  camera movement  pan_left, pan_right, tilt_up, tilt_down,
                                   roll_left, roll_right
                  camera position  left, middle, right
                  camera angle     overhead, high-angle, eye-level, low-angle
                  film speed       slow motion, normal, fast motion
Shot colour       list of the dominant colours; colour or black & white
Shot granularity  fine, medium, strong
Shot contrast     high, medium, low

Table 5.1 Representational structure for cinematographic devices

The apparent redundancy of representing both lens movement and camera distance is due to the fact that only one structure can be active at a given time. It should also be stressed that the unit Shot colour is not necessarily cinematographic, but is rather a shared feature of other codes. Shot colour has been designated as a cinematographic device because it is strongly related to shot granularity and contrast. Other refinements and extensions to the representation of cinematographic devices may be possible. For example, the representation does not deal with the stylistic device of a split screen. However, the representation is sufficiently rich to describe complex film-specific movements and expressive features, without unduly restricting possible connotative combinations of the described material.
5.2.3 Denotative aspects

The structures discussed in this section enable the description of visual information that supports modifications to the meaning of a shot, based on common human content, such as actions, characters and settings.12 The essential aim of the proposed representation is, therefore, to describe the complex actions of characters or objects in a geographical space within the three-dimensional space of a frame, but without ruling out the exploitation of the dynamic qualities of the medium. Additionally, we need to represent such features as colour, lighting, oblique versus symmetric composition, or depth perception, which, despite their physical denotative appearance, offer a suitable basis for the use of codes in the creation of meaning by automatically generated film sequences. Since a shot possesses features from a large number of categories, it is useful to separate the discussion of its components into two parts. We start with the description of representations of character, object and action, and then provide descriptions of the representations of space, lighting and time.

5.2.3.1 Character, object and action

Representing a character on the basis of his or her physical appearance is difficult, due to the large number of features involved. We have already discussed the compelling requirement to provide an objective character description and support essential continuity factors, such as identity or appearance. Our approach to representing character appearance (shown in Table 5.2) attempts to maintain a balance between the divergent aims of continuity, objectivity and computational efficiency, by establishing the essential distinguishable physical aspects of a human being.13

12 The term "setting" is henceforth used in the Blochian sense of "location" and not in the sense used by Parkes (see earlier in this chapter).
13 It must be stressed here that our representation is mainly intended for human beings, though we introduce the gender artificial, which hints at the possibility of extraterrestrials being described. However, most of these beings, as represented in films, possess many human features and are thus likely to conform to the presented structure.

Name        Description
Shot ID
Startframe
Endframe
Identifier  Identifier for a character, e.g. a name or a number
Gender      male, female, hermaphrodite, artificial
Age         e.g. young, old, 25, etc.
Race        e.g. black, white, Asian, etc.
Appearance  a structure including:
            role     e.g. lawyer, plumber, stewardess, etc.
            costume  kind    e.g. business suit, apron dress, overall, etc.
                     colour  a doublet list providing the major colour for the
                             top and bottom part, e.g. [black, white]
            appeal   e.g. casual, formal, etc.

Table 5.2 Substructure "character appearance"

The most critical of the above attributes is Appearance. The redundancy it reflects (e.g. role and costume may reveal the same information) may appear to be problematic. However, defining appearance in this way promotes computational efficiency, as the need for inferences about identity and continuity is reduced. The above representation provides only the meagrest details of a character.
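As an illustration, the substructure of Table 5.2 might be rendered as a record type (a sketch only; the fields follow the table, but the Python typing is our own assumption):

from dataclasses import dataclass
from typing import List

@dataclass
class Costume:
    kind: str          # e.g. "business suit", "apron dress", "overall"
    colour: List[str]  # major colours for top and bottom part, e.g. ["black", "white"]

@dataclass
class Appearance:
    role: str          # e.g. "lawyer", "plumber", "stewardess"
    costume: Costume
    appeal: str        # e.g. "casual", "formal"

@dataclass
class CharacterAppearance:
    shot_id: str
    start_frame: int
    end_frame: int
    identifier: str    # a name or a number
    gender: str        # male, female, hermaphrodite, artificial
    age: str           # e.g. "young", "old", "25"
    race: str          # e.g. "black", "white", "Asian"
    appearance: Appearance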
A character is, however, a dynamic, acting entity. As discussed in chapter 2, it is through particular actions that he or she is defined, especially since actions provide cues to the character's mental state or intentions. The problem is that an objective content representation of actions should only contain descriptions of objectively visible motion (see also Parkes (1989a)). These motions need not be represented down to their atomic units, such as representing 'walking' as a cyclic repetition of 'taking a step', as proposed by Davis (1995). By introducing temporally related annotations, we can simply represent complex patterns of human motion as single actions, examples being eating, reading, sitting, sleeping, and so forth. However, such a temporal-symbolic description of an action does not represent its emotional connotations. Moreover, some human actions specifically suggest emotions or intentions, i.e. gestures (Kendon, 1981; Ortony, Clore, & Collins, 1988; Wolff, 1972). Actions of the body, face, hands and limbs provide significant information either through motion (deliberately intended and expressed in some accepted code, such as winking, smiling, nodding or pointing) or statically (e.g. frowning and having the arms folded). By introducing shape-based representations of the body parts full body, head, hands and feet, and linking these to the temporal-symbolic representations of actions, we gain access to the indexical, metonymic system of gestures. The representation of gestures conforms to the principle of objectivity, because we specify that the meaning of a gesture is not explicitly stated in its representation but must be interpreted by the related inference rule for cultural behaviour (for example, shaking hands for greeting in western societies but bowing in eastern cultures).14 It must be stressed, however, that our representation of hand gestures is currently rudimentary, and should merely be understood as the first step towards a more complete representational scheme, in which detailed gestures of hand and fingers can be described.15

In addition to gestures as emotional indicators, we also include the speed of an action in the representational scheme, since this may provide information concerning the mood of a character. For example, an action which is performed slowly might indicate that the character is not in a hurry, and thus might either be relaxed and in a good mood, or bored and in an ambivalent or bad mood. Table 5.3 describes our action- and gesture-centred approach, which covers a sufficiently wide variation of human behaviour. The approach is inspired by Davis' body-centered structure for the representation of action (Davis, 1995, pp. 107-111). The attributes of the substructure Direction of action are taken from Bloch's representation (Bloch, 1986, pp. 140-141).

It may appear to be inefficient to relate information about body gestures to every single action. However, through the use of temporal multilayered representational structures, it is possible to automatically compare the time span for a newly annotated action with existing actions for a particular character. In the case of a match, only a link to the existing annotation for the body gestures must be established. If the result of the comparison is partially overlapping, either the existing gesture annotation must be temporally expanded or the undescribed gesture part of the new action must be annotated. Cases of total temporal mismatch need complete annotation of the actor's action, of course. In such a way, we ensure that given information in a shot description is not duplicated or altered, unless necessary.

14 For a discussion of cultural codes relating to gestures see also Bremmer & Roodenburg (1991) and Efron (1972). For a description of gestures for actors see Siddons (1968).
15 Approaches to the generation of gestures for automated agents are described in Russel, Starner, & Pentland (1995), Strassmann (1994) and Tosa et al. (1995). See also the work of the gesture and narrative language group at the MIT: (http://gn.www.media.mit.edu/groups/gn/)
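The linking strategy just described might be sketched as follows (hypothetical Python; the returned labels merely name the three cases discussed above):

def gesture_annotation_plan(new_span, existing_spans):
    # Decide how to annotate body gestures for a newly annotated action.
    # Spans are (start_frame, end_frame) pairs of actions already
    # annotated for the same character.
    s, e = new_span
    for es, ee in existing_spans:
        if (es, ee) == (s, e):
            return "link"                     # reuse the existing gesture layers
        if es <= e and s <= ee:
            return "extend-or-annotate-rest"  # expand the existing annotation,
                                              # or describe the uncovered part
    return "annotate-fully"                   # total temporal mismatch

assert gesture_annotation_plan((0, 100), [(0, 100)]) == "link"
assert gesture_annotation_plan((50, 150), [(0, 100)]) == "extend-or-annotate-rest"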
Name                 Description
Shot ID
Startframe
Endframe
Identifier           Identifier for a character, e.g. name or number
Relative Position    (Screen position first frame, screen position last frame),
                     e.g. (left, right), (left, middle), (right, right), etc.
Action               e.g. eat, drink, walk, read, etc.
Speed of action      e.g. slow, medium, fast
Direction of action  left, up-left, up, up-right, right, down-right, down,
                     down-left, front, back, circular
Bodygesture          a structure containing:
                     full body  horizontal, vertical, left-diagonal, right-diagonal
                     head       profile        right, left, half-left, half-right
                                movement       up-down, left-right, up, down,
                                               left, right, circle
                                eyebrows       up, down, straight, etc.
                                line of sight  left, right, straight, up, down, etc.
                                mouth          up, down, straight, open
                     hand       left   action/related object, e.g. (tap/table)
                                right  action/related object, e.g. (holding/head)
                     foot       left   e.g. tap, lift, etc.
                                right  e.g. tap, lift, etc.

Table 5.3 Substructure "actor action"

The representation of objects is based on similar structures to those defined for characters, but is much simpler, since for objects emotional reactions need not be considered. Nevertheless, objects possess shape, and can feature in events (i.e. have a function). Table 5.4 describes the structure for the representation of objects.

Name               Description
Shot ID
Startframe
Endframe
Identifier         Identifier for an object, e.g. a name or a number
Type
Shape              a structure containing form, colour, size
Relative Position  (Screen position at the start, screen position at the end),
                   e.g. (left, right), (left, middle), (right, right), etc.
Action
Speed of action    e.g. static, fast, slow, etc.

Table 5.4 Substructure "object"

We are aware of the need to represent groups (e.g. in mass scenes, such as an infantry offensive or a demonstration), but we have yet to address this issue. However, a group description of characters and objects would most likely focus on the size of the group (small, middle, crowd), on its constituents (male, female, extraterrestrials, mixed), its appearance (uniform, leisurely, etc.), its relative position on the screen and with respect to other objects or actors, its function, and its direction of movement.

5.2.3.2 Settings: space, time and lighting

In the original French, mise-en-scène means "putting in the scene" and refers to the compositional arrangement of subjects in a setting. Thus, mise-en-scène is mainly concerned with the creation of screen space. As described earlier, the syntax of cinematic space has two dimensions, the screen space and the space being portrayed (the location). The former is concerned with the limitations of the frame, and the latter with composition within the frame. If the camera tends to follow the movement of a subject (character or object), then the form of the frame is usually called "closed", whereas if the character leaves the frame and re-enters, then the form is considered open.
The correspondence between camera movement and movement within the frame forms one of the more sophisticated cinematic codes. The attributes Relative Position and Speed of action within the representation of character and object express the subject side, whereas the sub-structures camera movement, camera direction, camera angle and film speed represent the equivalent options for the cinematographic side. Furthermore, a frame provides a compositional balance that distributes masses and points of interest. The balance of a frame can either be strictly symmetric or provide a loose balancing of the frame's left, middle and right areas. Moreover, as discussed earlier, a frame provides depth. In chapter 4, we described the three compositional planes representing the three-dimensionality of the two-dimensional image: the frame itself, the geography (bottom line to horizon) and depth. Our representational ontology deals with the different planes within a frame, and the problems of compositional balance, by using the structures described in Tables 5.5 and 5.6 in combination with the attribute Relative Position, from the representational structure for character and object.

Name              Description
Shot ID
Startframe
Endframe
Identifier        Identifier for an object or a character, e.g. a name or a number
Spatial relation  e.g. above, under, behind, in front, etc.
Identifier        Identifier for an object or a character, e.g. a name or a number

Table 5.5 Substructure "relations"

Name        Description
Shot ID
Startframe
Endframe
Foreground  a structure containing:
            actors       list of identifiers for single characters
            agroup       list of identifiers for groups of characters
            objects      list of identifiers for single objects
            ogroup       list of identifiers for groups of objects
            composition  vertical, horizontal, canted-left, canted-right, neutral
Background  a structure containing:
            actors       list of identifiers for single characters
            agroup       list of identifiers for groups of characters
            objects      list of identifiers for single objects
            ogroup       list of identifiers for groups of objects
            composition  vertical, horizontal, canted-left, canted-right, neutral

Table 5.6 Substructure "deep-space composition"

The representational structures for depth and horizontal space provide sufficient information to support connotational inferences, such as the importance of frame sides based on character load, or the importance of a character related to his or her position within the frame or in relation to other subjects. Since it is possible to locate subjects in the horizontal frame space, as well as in the "imaginary" three-dimensional depth plane, our structures provide the essential information to establish continuity of common space between shots (as discussed in section 4.3), of which the 180˚ system is the most prominent example. The 180˚ system will be discussed in more detail in chapter 6.

A further critical feature of spatial content is the actual location, or setting, as it appears in the shot. Location is more than a simple identifier, as André Bazin states:

'The drama on the screen can exist without actors. A banging door, a leaf in the wind, waves beating on the shore can heighten the dramatic effect. Some film masterpieces use man only as an accessory, like an extra, or in counterpoint to nature which is the true leading character.' (Bazin, 1967b, p. 102).
One might think that an important component of the representation of spatial content would be the identification of the actual location where the material has been recorded (see, for example, Davis, 1995). This may well be important information for interpretational purposes, as outlined in section 2.1.3. However, considering the dual semantics of a shot, as demonstrated in the Kuleshov experiments described in chapter 4, it is apparent that this is not the case, unless the shot contains explicit cues which determine location. If a shot shows a stretch of sand, only the context can make clear whether the portrayed location is a dune in the Sahara or part of the beach on the island of Sylt. Nevertheless, there are critical aspects of location to be represented, such as formal characterisations of the geography or the functionality of the location.

However, a setting not only provides information about location, but also temporal cues. In discussing temporal cues, we do not mean those related to the duration of the portrayed event or action, since these can be deduced from the time span defined from the start to the end frame of the annotated unit. The temporal information we have in mind is content oriented. The costumes may suggest the epoch, or give cues concerning season or time of day. Representing such information explicitly may result in redundancy, especially when the costumes are described using the structures for object annotations described earlier. On the other hand, having the particular temporal information explicitly stated again reduces the need for inferencing.

Thus far, we have said much about the representation of objectively describable components within a shot with respect to their compositional impact. The sole essential category we have omitted is that of lighting. The importance of lighting is precisely described by Bordwell and Thompson, who write:

'In cinema, lighting is more than just illumination that permits us to see the action. Lighter and darker areas within the frame help create the overall composition of each shot and thus guide our attention to certain objects and actions. A brightly illuminated patch may draw our eye to a key gesture, while a shadow may conceal a detail or build up suspense about what may be present.' (Bordwell & Thompson, 1993, p. 152).

Thus, lighting is significant and needs to form a part of the representational structure for a setting, as seen in Table 5.7.

Name      Description
Shot ID
Startframe
Endframe
Time      a structure containing:
          epoch    e.g. middle-ages, 1994, 5000, etc.
          season   spring, summer, autumn, winter
          daytime  e.g. dawn, noon, afternoon, midnight, etc.
Location  a structure containing:
          geography   rural, populated, land, sea, outer space
          identifier
          function    indoor, outdoor, transparent16
Lighting  a structure containing:
          direction  front, back, side-left, side-right, high, low, overhead
          quality    e.g. soft, hard, light, opaque, etc.
          source
          physics    only applicable to outdoors, and featuring atmospheric
                     conditions, e.g. sunny, windy, etc.

Table 5.7 Substructure "setting"

16 This represents a function in between that of indoor and outdoor, such as a carriage or a room with a view.

5.2.4 Conclusion

We believe that the structured representation presented in Tables 5.1 - 5.7 is sufficient to describe the denotative aspects of film, in addition to its time and space dependencies, without restricting possible connotative combinations - the latter is especially a limitation of keyword-based or unstructured free-text annotations. However, our work on shot representation is but a first stage.
There are a number of problems yet to be solved, such as linking subjects to the information provided by the setting substructure lighting, which, at the moment, can be applied only to the setting in general. A further problem is spatial in nature. Imagine a shot in which the foreground of the left side of the frame shows half of a character's face, and in the background on the right hand side, a group of people sit around a table and gamble. The face is definitely a close-up, whereas the gambling scene is a long shot. Though the representation is able to distinguish between shot types during the process of juxtaposing shots, it is not possible so far to apply the same precision within the border of the frame - unless particular inference mechanisms are provided that use spatial and size information to establish such compositional relations automatically. Further research is needed to address these problematic areas.

It should be stressed that the proposed organisational structure for the representation of video content constitutes but a framework. Not every suggested attribute must be annotated - though it is apparent that the "vision" of an autonomous editing system depends entirely on the amount of information provided by the content annotations. Hence, the remaining question of particular interest is how much of the presented representation of video content can be provided automatically.

5.3 Technical environments for content annotation

The achievements of current research in video processing are far from being sufficiently sophisticated to produce representational structures such as those described above. In particular, the automatic parsing of high-level semantic and cinematic categories has yet to move beyond the investigation stage. However, there are relevant developments that can contribute to the automated process of video annotation with respect to:

• identifying camera motions, such as pans and zooms (Tonomura, Akutsu, Taniguchi, & Suzuki, 1994; Ueda, Miyatake, & Yoshizawa, 1991)
• detecting fades, wipes and dissolves (Aigrain & Joly, 1994; Zhang, Kankanhalli, & Smoliar, 1993)
• recognising scene boundaries for news (Zhang, Gong, & Smoliar, 1994), which is achieved by using a model designed to recognise particular types of shots
• performing macro-segmentation for documentaries (Aigrain, Joly, & Longueville, 1995), based on transition effect rules, shot repetition and similarity rules, editing rhythm and soundtrack rules
• segmenting digital video by using explicit models of video production to design feature extractors (Hampapur, et al., 1995a; 1995b)
• structuring video based on the correlation of colour between two adjacent frames in an image stream (Nagasaka & Tanaka, 1992)
• recognising gestures based on the segmentation of characteristic silhouette features, a priori knowledge of people and knowledge of the human body (Pinhanez & Bobick, 1995; Russel, et al., 1995)
• identifying object motion in constrained video (Herzog & Wazinski, 1994; Ueda, Miyatake, Sumino, & Nagasaka, 1993)
• semi-automated annotation of sets of images by using several vision-based texture models (Picard & Minka, 1995)
• grouping structural image features, such as brightness, edges, and texture features, which can then be transformed into a description of the most important attributes of a set of frames.
Detailed relationships between things, i.e. the geometry of a scene or a human face, are captured by the Karhunen-Loeve transform, the Wold transform being used for textural properties, e.g. orientation (Pentland, Picard, Davenport, & Haase, 1994; Picard & Liu, 1994; Ashley, et al., 1995).

For the foreseeable future, the bulk of the annotation, at the semantic level, will rely on human activity, supported by intelligent and semi-automated annotation systems. In this thesis, we are not concerned with environments to facilitate the annotation of video content. For now, we have omitted this area from our research, and simply refer to ongoing research on such interfaces by Davis (1995), Gordon & Domeshek (1995), Mills, Cohen, & Wong (1992), Oomoto & Tanaka (1993), Tonomura, et al. (1994), Ueda, et al. (1991), and Yeung et al. (1995), among others.

Chapter VI
The representation of knowledge for automated editing

The task of representing editing knowledge may, on first impression, appear to be simple, since at the physical level there are only two ways of joining shots. One can either overlap them or put them end to end. The editing model presented in chapter 4 showed that editing is a much more complex process, which encapsulates the retrieval, shaping and ordering of appropriate shots to support a cinematographically coherent and clear relating of the story, where the actions of characters are portrayed in an undistracting way. Moreover, the three essential processes of editing, i.e. the retrieval, ordering and shaping of shots, are all simultaneously subject to the narrational constraints of space, time and cause-effect.

The aim of this chapter is to present the representations of editing structures and mechanisms that are required to establish the link between the available video material and the narrative specification. It is important that the reader is aware that the presented editing mechanisms and related structures do not alter the logic of the narrative, but rather provide the knowledge for an appropriate presentation related to the content and the narrative intention of the scene. It must be reiterated that, as already mentioned in section 4.3, we do not intend to achieve an automated "fine cut" editing, but rather a joining of shots at the "rough cut" level, and that our approach is not directed towards the production of "art". Finally, the following investigation focuses merely on the creation of meaningful sequences. The combination of shots at higher narrative levels than the event, e.g. the episodic level, is not considered.

We begin our investigation with an analysis of the nature of joins between shots. In section 6.2 we describe one system for dealing with the problem of spatial and temporal continuity in video editing. Section 6.3 discusses related research into automated video editing. The chapter closes with a detailed description of the representations and strategies we have developed to facilitate automated editing.

6.1 Shot editing: Mixage and Cut

Joins between shots are of two main types. Firstly, shots can be overlapped (i.e. double exposure, dissolves or wipes). As described in chapter 4, these joins serve as punctuation devices within the syntax of larger narrative units, usually as end points. Since such devices relate to a narrative level that we are not considering, such joins do not feature in the ensuing discussion.
A further reason for excluding such joins from our investigation is that fade-outs, fade-ins, dissolves and wipes are optical effects, and are usually achieved in the laboratory. These technical devices are regarded by the current author as too complex to be considered at the current time, but should feature in future developments. The second, and more common, way of combining shots is the cut, which means juxtaposing the last frame of a shot with the first frame of the shot to be joined. However, there is a further option: the insert. An insert occurs when a shot or a chain of shots is spliced into another shot. Thus, the intention of an insert is not to support the continuous flow of information, as performed by a cut, but rather to introduce a temporal transformation, i.e. the expansion of information (as discussed in section 2.1.3).

Despite their importance as the primary means for editing, cuts are problematic in that they constitute a physical break, which might reduce the viewer's involvement in the presentation on spatial and temporal grounds. Thus, we need to perform cuts so that a smooth information flow from shot to shot results. In the early stages of the film industry, an editing system was established which uses strategies of mise-en-scène and cinematography to ensure visual continuity between shots. This system is called continuity editing, a style of editing used particularly in narrative-oriented film, and thus most relevant to our research.1 The following section gives a brief introduction to the basic underlying structure (the 180˚ system) and related cinematographic strategies.

1 There are of course other styles of editing, which might also suit the presentation of narrative film sequences, such as spatial and temporal discontinuity (e.g. the 360˚ space system, the jump cut or the nondiegetic insert, where a metaphorical or symbolic shot is inserted which is not part of the space and time of the narrative). Another editing style can be found in abstract films, where graphic and rhythmic dimensions have a much more substantial impact. However, these alternative models of editing are not investigated further in this thesis.

6.2 Spatial and temporal continuity in editing: the 180˚ system

Bordwell and Thompson describe the 180˚ system as follows:

'The scene's action - a person walking, two people conversing, a car racing along a road - is assumed to take place along a discernible, predictable line. This axis of action determines a half circle, or a 180˚ area, where the camera can be placed to present the action. Consequently, the filmmaker will plan, film, and edit the shots so as to respect this centre line. The camera work and mise-en-scène in each shot will be manipulated to establish and reiterate the 180˚ space.' (Bordwell & Thompson, 1993, p. 262).

The aim of the 180˚ system is to provide the viewer of a scene with a clear understanding of the position of characters, i.e. the spatial relationship between characters and the spatial relationship between each character and the setting. Figure 6.1 graphically describes the 180˚ system.

Figure 6.1 The 180˚ system (based on Bordwell & Thompson (1993, p. 263))

Imagine that A and B in Figure 6.1 are two conversing characters.
The simplest way of establishing the axis of action between A and B would be to use the shot provided from camera position 1, because both characters are present in the scene, and their spatial relationship need not be inferred by the viewer. For the viewer of the scene, it is clear that the spatial relationship between A and B is oppositional, and that A is located in the setting space to the left of B. Combining shot 1 with that taken from camera position 3, the viewer can see from the background that some common parts of the shot taken in position 1 appear, i.e. character A and parts of the scenery. Thus, the viewer becomes spatially oriented with respect to the scene and understands that the second shot presents the same space but from a different angle. However, if we now joined the shot taken from camera position 4, B would suddenly be surrounded by a different background, and, even more importantly, would have changed sides with A. The result would be a visual distraction, which should be avoided, unless such an exchange of character positions is motivated.

Now consider a similar situation, except that now both characters are moving, e.g. two people meet in a street. Assume that A moves from left to right and B approaches from right to left. Now imagine that the screen direction of character A changes, which means that he or she is now walking from right to left. Did the character turn around while the walking of B was shown, maybe because A did not wish to meet B? This may or may not be the case, but the important thing is that such a break in continuity can cause confusion. Thus, the 180˚ system allows the creation of a continuous space from autonomous shots, but constrains the order of shots, based on content attributes of spatial importance, such as the direction of characters' glances or their direction of movement and spatial relationships between character and setting.
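The core screen-side constraint of the 180˚ system can be made concrete with a small sketch (in Python, with invented structures: each shot description simply maps character identifiers to screen positions):

def respects_axis(shot_a, shot_b):
    # Reject a cut if any character visible in both shots changes sides
    # of the screen, the basic constraint of the 180-degree system.
    for character, position in shot_a.items():
        if character in shot_b and shot_b[character] != position:
            return False  # the character changed sides: visually distracting
    return True

# Joining the shots from camera positions 1 and 3 of Figure 6.1 keeps A on
# the left; the shot from position 4 swaps A and B and would be rejected:
assert respects_axis({"A": "left", "B": "right"}, {"A": "left"})
assert not respects_axis({"A": "left", "B": "right"}, {"A": "right", "B": "left"})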
Figure 6.1 not only demonstrates the importance of camera position, but also that the distance between the camera and the key event in the scene provides crucial information. In chapter 4, we showed that there are two cinematographic devices for controlling the distance, and thus the awareness space of the visual plot: the camera distance and the lens movement. Since there is no universal measure of camera distance, we use the classification system developed by Dziga Vertov (described in Petric, 1987), which divides camera distance into seven shot types, as described in Table 6.1.

Camera distance   Covered content space
extreme close-up  This shot isolates details, such as the lips or eyes of a face.
close-up          This type of shot typically exposes the head, hands, feet and
                  smaller objects. The intention is usually to highlight facial
                  expressions, gestures or particular objects.
medium close-up   A human body is shown from the head down to the chest. Gestures
                  and expressions are distinguishable.
medium            A human body is shown from the head to around the waist. Gestures
                  and expressions become more distinguishable.
medium long       Taking a human body as the measure, the subject is framed from
                  around the knees upwards.
long              Includes at least the full figure of subjects but the background
                  dominates.
extreme long      The human figure is almost invisible. Used for landscapes,
                  bird's-eye views (e.g. of cities).

Table 6.1 Relationship between camera distance and size of presented content space

Finally, there are a number of techniques for shot juxtaposition that support a smooth flow of content space. The following presentation, adapted from Bordwell & Thompson (1993, pp. 256-275), discusses only those strategies which are of importance for automated editing at the level of the sequence:

establishing shot
This is the first shot of a sequence, and describes the general location. The type of shot usually varies, depending on the functionality of the location. For an indoor location, medium or medium-long shots are preferred, whereas for outdoor locations, long shots are usually more successful. However, the establishing shot can also be a composed shot sequence [as demonstrated by Kuleshov (usually featuring shots of types between medium-long and medium close-up)]. Important analytical factors are camera position, camera angle, camera movement and, for the composed version, a memory for already established spatial relations. It is important that the spatial relationship between subjects is kept constant from this shot onwards, unless their movement motivates a re-establishment.

shot/reverse shot
A repetitive sequence of similar shot types, where each shot shows the opposite end of the established axis of action. Shots used for such a pattern are usually taken from behind the subject that forms the opposite end of the established axis (overhead angle). This pattern is used for action-reaction situations and is usually used for the visual breakdown of a scene.

re-establishing shot
This shot is usually used when subjects are added to, or removed from, a setting to re-establish the overall space. Thus, a pattern such as establishing shot - scene breakdown - re-establishing shot is common. The mechanisms involved are the same as for the establishing shot, except that now there is an additional subject introduced (either a character or a group of characters).

eyeline match
This tactic is used to combine two shots where one shot presents a character or group of characters looking at something, and the other shot presents what is being looked at. The shots can also appear in reverse order. The important feature here is that in neither shot are object and looker simultaneously present (see also Bloch, 1986).

action match
This is an editing device that shows the beginning of an action and continues the same action in the following shot. The important aspect here is to maintain a constant on-screen direction of movement. This is one of the most powerful editing devices for providing continuity, because it focuses the perceiver's attention on the motion of the action, and thus lowers his or her attention to differences resulting from the cut (see also Bloch, 1986).

point of view shot
This technique combines shot/reverse shot with eyeline match. It is normally used to show a scene observed by one or more characters. This tactic, as applied to one particular human action, is described in Figure 6.2, where the establishing shot of B is taken from camera position 1, whereas camera position 2 establishes the axis for the object of B's gaze. The essential analytical factors are the comparison of the camera angle for position 2, and the line of sight of the character portrayed from camera 1, which must be equal or related.
Figure 6.2 Schematic description of the POV shot (based on Bordwell & Thompson (1993, p. 273))

cheat cut
A perfect match between two shots, with respect to action or graphical pattern, cannot always be ensured. A cheat cut tackles the problem by using the power of narrative causality. The idea is to emphasise overall similarities of graphical pattern, i.e. constant screen position, constant direction of action and eyeline match, as major control devices which need to be fulfilled. Any extra accomplished constraint then adds to the viewer's acceptance of the cut.

The above techniques indicate that visual, content-oriented continuity between shots is mainly based on the direction of action, the relation between subjects (characters and objects) and the position of subjects in a setting. Moreover, there is a need to control the particular information presented, so that the viewer can be visually guided towards the intended understanding of the sequence. Thus, there is a need to combine the narrative logic (point of view, intention of an action in the given context, intention of the sequence, etc.) with the representational structures and editing mechanisms necessary for the automated generation of meaningful film sequences directed towards a particular emotional outcome, i.e. humour.

Before introducing our approach to the representation of the features discussed in the preceding paragraph, we provide a short review of existing systems that support content-based automated editing.

6.3 Related work

Little work has been carried out on content-based automated editing. The two major approaches are discussed in this section.

6.3.1 Splicer

Sack & Don (1993) describe a prototype video resequencing system called Splicer. The system consists of two main components:

• a knowledge base of around 50 video clips dealing with the Iran-Contra hearings. The clips are annotated on the basis of Sack's representation of point of view and bias in the news (Sack, 1993) and Don's work on narrative construction and point-of-view (Don, 1990). The annotations contain information about the speaker in the clip, the topic, and other features.
• montage rules, created by the user. These rules are written in a Prolog-oriented language, and represent, for example, the strategy point-counterpoint. The rules are used to compose sequences.

Splicer offers a spreadsheet-oriented interface, where the user chooses or creates strategies to establish relations for rows (Group_of_speaker) and columns (Topic_of_dialog). Given a query such as "Group_of_speaker = Topic_of_dialog = Contra-Issues", the system fills the relevant cells with clips. When the user selects one of the clips and applies one of the (rhetorical) Montage rules (e.g. point-counterpoint), the system starts with the selected clip and adds the related clips according to the given rule.

However, the order of shots used by Splicer does not reflect any cutting constraints. The Montage rules used are not editing rules in the sense that they create a visually coherent composition (there being no representation of cinematographic devices or spatial characteristics of actors, and so on), but rather create a coherent intellectual space. Thus, Splicer's contribution is to represent video material in such a way that rhetorical rules can be used to create micro-documentaries expressing distinct ideological points-of-view.
This is a similar endeavour to our own, except that we are interested in provoking an emotional reaction in the viewer, and we intend to achieve this by presenting narratives.

6.3.2 Bloch's machine for audio-visual editing

Bloch (1986) bases his approach to automated editing on the following two assumptions:

• Specific narrational conditions will enforce specific constraints on cuts.
• The generation process is based on knowledge of the number of cuts that are needed to create a sequence.

Bloch's research focuses on continuity editing (especially the maintenance of fluency of motion between shots), and specifically considers the constraints position, motion and glance:

Position
The formalisation is based on Burch's taxonomy for joining positions (Burch, 1981), e.g. when two characters are together in a relatively close shot, following shots should respect the established positions of characters (A on the left, B on the right). The relevant attributes are the character's physical position in terms of both the location and the screen.

Glance
Bloch distinguishes between shots where two characters or groups of characters are facing each other, and shots where one character looks at nothing in particular. His concern is with the eye-match of two characters. The constraints he introduces are that two characters facing each other shown in different shots must look in opposite directions, where the opposition is based on the line of sight of the character in the first shot. The directions of a character's sight are detected using vectors in the plane of the screen (discussed in section 5.1.1).

Motion
For motion, Bloch identifies as essential control attributes the speed and direction of actions performed by the characters. For speed, Bloch states that this should be the same between two joined shots. For direction, he states that changes should be avoided, and if necessary should be introduced by a shot in which the direction of the action is unidentifiable, directions he describes as front and back directions (discussed in section 5.1.1).

The above constraints are ordered in terms of importance, giving motion precedence over the other two, which are attributed with equal importance. Bloch uses the related constraints for each control attribute as the basis of a guide for construction, where the construction of the video sequence covers three main tasks. First, the construction process separates the given story into appropriate segments (usually an autonomous sequence, as described by Metz (1974)), based on parsing the punctuation and linguistic temporal forms (e.g. and, and then, etc.) in the story text. The second task within the process of sequence construction is to translate each segment into conceptual dependencies and to determine the number of necessary shots for the segment. The translation of an action into a CD is based on work by Schank (1982), Schank & Abelson (1977) and Schank & Riesbeck (1981). The representation of a CD contains the action to be performed, the id of the performer, the object the action is performed on, and a marker, stating if the action is an interaction (inst = yeux), a movement (inst = direction of movement) or of any other kind (inst = ' '). The instance of an action is provided by a pattern matcher that can associate particular actions with interactions or movement.
The decision concerning the number of shots used for generating the sequence is based on the number of CDs (one or several), and on the action type. For example, a story such as 'Gilles and Said speak with each other' can be represented by one conceptual dependency, but since the instance of the action is 'yeux', the action can be presented in one or two shots, which allows the following representation (text in * * added by the current author):

* described action to be shown in one shot *
(and-simul
  (attend (actor ("said")) (object "gilles") (inst yeux))
  (attend (actor ("gilles")) (object "said") (inst yeux)))
Type du segment -> ordinaire

* described action to be shown in two shots *
((attend (actor ("said")) (object "gilles") (inst yeux))
 (attend (actor ("gilles")) (object "said") (inst yeux)))
En 2 plans
Type du segment -> ordinaire

The third part of the construction process is to match the established CD against the content representation as described in Figure 5.1, based on rules that provide the relevant constraints concerning action, movement and glance, such as:

if a segment is to be constructed from two shots
and the number of characters is 2
and the instance = "yeux"
then directions of sight must be opposite each other
and screen positions of characters must be opposite each other.

Bloch's approach is useful in that it introduces a distinction between guided construction and case-dependent constraints. Furthermore, his approach provides a practical solution to the problem of representing the essential elements for continuity editing, i.e. direction and speed of movement, direction of sight and the position of characters on the screen and in relation to each other. Finally, he ranks editing strategies according to importance, e.g. the action match is more important than the eyeline match.

However, Bloch's approach to the automated juxtaposition of shots suffers from shortcomings that are partly related to problems in his scheme for representing video content, which was discussed in section 5.1.1. Bloch ignores several continuity problems on the graphical and spatial levels. An example of the former is the problem of light and colour changes, which is avoided by providing only black and white material, featuring the same level of light, which performs adequately in any possible combination. An important omission at the spatial level is that, in Bloch's scheme, there is no comparison of the content space of the two shots to be juxtaposed. In his thesis, Bloch shows that he is aware of this problem, by pointing out that his scheme cannot support decisions for cases where the number and identity of characters might be correct, but the location in both shots is different. Bloch's system would simply join the shots. Bloch mentions that he cannot provide a solution to such problems due to the rudimentary state of the proposed representation of content space (Figure 5.1 shows Bloch's shot representation). He therefore suggests that the background is kept single-coloured and free of objects, which renders the comparison of backgrounds unnecessary.

Though Bloch demonstrates an understanding of the process of presenting an event in various ways, by providing mechanisms for decomposing actions based on their type (looking, moving, other), he overlooks the problem that not only the visual presentation of an action can be decomposed, but also the presentation of a character.
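The two-shot rule quoted above might be operationalised as in the following sketch (Python with invented field names; it covers Bloch's glance and position constraints, while his motion constraint - equal speed, constant direction - would be checked analogously):

def yeux_constraints_ok(shot1, shot2):
    # For an interaction (inst = yeux) presented in two shots, the
    # characters' lines of sight and screen positions must be opposite.
    opposite = {"left": "right", "right": "left"}
    return (shot2["line_of_sight"] == opposite.get(shot1["line_of_sight"])
            and shot2["screen_position"] == opposite.get(shot1["screen_position"]))

# Said looks right from the left of the screen, so Gilles must look left
# from the right of the screen:
assert yeux_constraints_ok(
    {"line_of_sight": "right", "screen_position": "left"},
    {"line_of_sight": "left", "screen_position": "right"})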
This means that his system is in a position to present the action of "X looking at Y" in two shots, but could not find solutions for "A walks towards B", where the first shot shows feet walking to the right, and the second shot shows feet walking to the left. Related to the decompositional shortcoming of Bloch's approach is his apparent lack of attention to the relationship between shot kinds and their expressional influence on the presentation of a narrative-oriented video sequence. Both problems, the decomposition of characters and the relationship between different shot kinds, are resolved by providing video material of a particular style (e.g. shots where an eyeline match is required are only available as close-ups). Finally, Bloch himself acknowledges a problem relating to a deficiency of his scheme with respect to the connotative aspects of cinematographic devices in narratives, i.e. the inability to shape a single shot, or to order shots, according to their duration. Shots can only be used in their entire length, which means that trimming a single shot, or rhythmically structuring a sequence, are not possible. However, these are essential features of editing, particularly for emotionally stimulating sequences.

Nevertheless, Bloch's work certainly exerts an influence on our approach to automated continuity editing, which we present in the following sections. We use features of Bloch's work, such as guided construction and case-based constraints, and the use of action, position and sight as major control constraints, in deciding on the suitability of juxtaposing two shots. However, we also attempt to overcome a number of the shortcomings of Bloch's approach.

6.4 A novel approach to automated video editing

The first assumption reflected in the ensuing discussion is that plot structure provides the logical relationships between characters and their actions within a setting, but that automated editing visually supports the storytelling, which depends on the available material. However, the available material cannot be predicted in advance, and so Bloch is mistaken in assuming that the number of shots necessary for a presentation can be predefined. We are interested in the ability of an automated editing system to react flexibly to requirements for the visual composition of characters, their actions and the surrounding location, by making use of available video material, without being restricted to a predefined number of combinatorial possibilities. Moreover, it is possible that a narrative-based request for visual material cannot be fulfilled. In such cases, the logic of the narration must be altered, which is done not by the editing process, but by the procedures through which the narrative is constructed.

Our second assumption relates to the editing model that was described in section 4.3. We assume that, as the created visual material is experienced linearly, every item introduced, i.e. subject, action or setting, holds true for the sequence in which it is shown, and its parameters serve as the basis for logical relations within the causal chain of the narrative structure. For a character, this means, for example, that introduced features of his or her appearance remain constant until changed by the logic of the narrative.
The effect of this assumption is that the temporal and logical framework of the presentation is based on the narrative world, which may not coincide with the temporal or logical dimensions of the real world. In other words, the presented event is taken to be real, and that which we do not see does not exist. Related to the linear aspect of our second assumption is our third assumption, i.e. that whenever we juxtapose two shots A and B, we merely compare the last frame of shot A with the first frame of shot B, in the sense that we consider only those features of the shot representation that apply to the end of one shot and the beginning of the next.

We now present our scheme for continuity editing. We first describe the plot structure, which provides the basis for the editing process. This structure represents the result of constructive processes performed on narrative structures to be presented later in this thesis (chapter 7). The constructive processes themselves are described in chapter 8. We then present our approach to the control of spatial and graphical relations during the juxtaposition of two shots, with respect to both the shot content and the created awareness of space. Finally, we discuss the temporal and rhythmical constraints applicable in the joining of two shots. It must be stressed, however, that this separation of the constraints on continuity editing into distinct types is done merely to promote ease of presentation. As we will see in chapter 8, each of the different types of constraint applies simultaneously during the relevant editing processes of retrieving, ordering and shaping shots.

6.4.1 Plot requirements

In the editing model presented in chapter 4, the process of scene creation begins with a discussion of the scene with respect to the available material, its intention and its part in the overall story. In technical terms, this means that editing is based on a structured framework, which provides the narrative intentions of the scene and the required events, characters and locations for a particular stage in the event development (plot order was discussed in section 2.1.2). Thus, an event can be constructed from one or more sequences, depending on the event phases involved, i.e. motivation, realisation or resolution. The representational form of a sequence is presented in Figure 6.3.

Sequence-Structure
  Kind        motivation, realisation or resolution
  Intention   relations, details, action, interaction (and/or combinations are possible)
  Form        e.g. H-Strategy X, internal, external, single, composed, first-person, third-person
  Appearance  accelerate, steady, slacken
  Setting     substructure "setting", as described in Table 5.7
  Subjects    a structured set of descriptions for each subject required in the sequence. The description usually contains only the subject ID. However, if a subject is newly introduced, or its appearance changes (e.g. the age of a character), the set is enlarged. Note that this set also contains information concerning the mood of a character.
  Action      the actions performed in relation to subjects in this sequence, represented by the structure [tempform, single action, parallel actions, serial actions].

Figure 6.3 Plot requirements for the editing process

The parameter Kind represents the construction phase of the overall event to be portrayed (i.e. motivation, realisation or resolution, as described in section 2.1.2.1). The parameter Intention represents the main goal of the sequence.
The parameter Form provides information about the chosen H-Strategy, the point of view for the scene (viewer-oriented, i.e. third-person, or character-oriented, i.e. first-person), and how the overall composition is to be arranged (e.g. a single shot or a composed sequence). The parameter Appearance contains information about the visual rhythm of a sequence. A humorous sequence, for example, is likely to be of accelerated speed (as discussed in section 3.2.1.1). The parameters Setting, Subjects and Action are self-explanatory. Setting, subjects and actions provide much of the information needed to support the retrieval of relevant visual material. Following retrieval, a pool of shots is available, which forms the raw material for the editing process. Decisions as to the selection and ordering of shots are case-based, and derived from the actual information provided by Kind, Intention, Form and Appearance. Editing strategies for each control area are applied, if necessary, to provide the most suitable presentation of the narrative request. Thus, we extend Bloch's notion of case-based editing by introducing the consideration of relevant stylistic features into the editing process. The necessary representations for supporting such sequence structures are described in section 7.2, and the mechanisms for sequence construction are presented in chapter 8. For the moment, it is necessary to know only that the editing process can apply this structure to the related pool of shots (i.e. their relevant content descriptions) to order the shots in a cinematographically acceptable way.

6.4.2 Shot intention and the shape of the awareness space

We now investigate the shaping of the viewer's awareness of space, which is based on the logical relationship between camera distance and lens movement for the two shots to be joined. Earlier in this chapter, we noted that this relationship between two shots plays an important role in the ordering and visual presentation of the video material, as it strongly constrains the retrieval of potential shots for juxtaposition; this was not considered by Bloch. Take the establishing shot, as described in section 6.2, as a first example. The establishing shot usually provides a certain frame space, depending on the functionality of the location. This leads to the following editing strategy:

E-Strategy 1
If sequence.kind = motivation
then the camera distance of the shot to be chosen is
     long => location.function = outdoor
     medium long or medium => location.function = indoor

This means that one criterion for choosing an establishing shot from the pool of relevant shots is covered by the constraints described in E-Strategy 1. The above strategy enables us to determine the appropriate camera distance for a single start shot, but we cannot predict how the system should react if the pool of available shots does not provide a single shot which contains exactly the visual information required by the particular narrative situation. The problem is then to select shots which together provide the same establishing effect. To look at the problem from a different angle, assume that an alternative representation of content space is required, i.e. the creation of spatial relations from component parts shown in sequence, as examined by Kuleshov (1974, pp. 52 - 53), which might be applicable for story structures requiring suspense. It becomes clear that the establishing shot, as described in E-Strategy 1, is no more than a useful exception.
In fact, we need a representation that enables us to specify which types of shots can be joined to others. Vertov's classification of camera distances, previously introduced in Table 6.1, is of particular value for the creation of clearly perceptible scenes, since its description of the logical relationships between different camera distances provides many possibilities for shot juxtapositions, as described in Table 6.2.

[Table 6.2 is a 7 x 7 matrix relating the camera distance of shot A (rows) to that of the following shot B (columns), over the distances (1) extreme close-up, (2) close-up, (3) medium close-up, (4) medium, (5) medium long, (6) long and (7) extreme long; each entry is X (acceptable join), O (acceptable join under fall-back constraints) or blank (impossible join).]

Table 6.2 Spatial relationships between shots A and B in terms of camera distance

The information provided in Table 6.2 can be represented as a matrix of the form SDij, for all i, j where 1 ≤ i ≤ 7 and 1 ≤ j ≤ 7, and
   i represents the camera distance of a given shot
   j represents the camera distance in the shot to be joined
   Dij = X if the shots can be acceptably joined
   Dij = O if the shots can be acceptably joined, with the constraints that
        a) no bridge can be created, and
        b) the complexity of the shot with the longer camera distance is low (fall-back rule)
   Dij = ' ' if the shot combination is impossible

Such a matrix serves not only to specify the acceptable direct joining of shots, but also enables the calculation of suitable shot patterns when shots cannot be joined directly. Assume a sequence is required which represents the joke of a man who approaches a freshly painted bench, avoids sitting on it and, in doing so, falls over a litter bin. Assume that the establishing shot of the man walking is a long shot. The next thing to be done, according to the humour strategies described in chapter 3, is to motivate the mishap. A favourable shot type might be a close-up. Locating the relevant field in the matrix SD with a vector Vij, where i represents the shot type of the given shot and j the type of the shot to be joined, it is determined that the direct join is not recommended. Thus, a bridge must be created between the two shots. E-Strategy 2 describes the algorithm for this operation. The bracketed numbers represent an example, where a long shot (6) is to be joined with a close-up (2), resulting in a bridge described as a join between the long shot and a medium shot (4), and a join between the medium shot and the close-up. [2]

E-Strategy 2
If a bridge between two shots must be created                     [6,2]
then fill the vector Vij up so that i..j form a progression       [6,5,4,3,2]
     shorten the new vector V2 depending on the timing_strategy   [6,4,2]
     transform the vector V2 into vectors of the form of Vij      [6,4], [4,2]

[2] The timing_strategy is related to the ongoing narrative. The strategy might be an equivalence or a contraction. In the given example, the strategy is contraction.

The resulting list represents the bridge. However, the required shot types may be unavailable. In such cases, the system can use the fall-back rules, indicated by "O" in Table 6.2. It should be appreciated that this mechanism provides only one decision concerning the applicability of a join of two shots. It may be that the control mechanisms for the continuity of content space, as described later in this chapter, reject the suggested join. If that happens, the system must change the plot structure, so that it can be realised by the available visual material. The representation of spatial relationships between two shots in terms of camera distance supports decisions concerning the next shot type to be joined.
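For concreteness, the bridging procedure of E-Strategy 2 might be sketched as follows (Python is used purely for illustration; the function name, the thinning step and the sample matrix are our own, and the authoritative join entries are those of Table 6.2):

   # Sketch of E-Strategy 2. Camera distances are indexed 1..7 as in
   # Table 6.1; sd[a][b] stands for the matrix entry of Table 6.2.
   def build_bridge(i, j, timing_strategy, sd):
       step = 1 if j > i else -1
       # Fill the vector up so that it forms a progression, e.g. [6,5,4,3,2].
       progression = list(range(i, j + step, step))
       # Shorten the vector depending on the timing strategy; under
       # contraction intermediate steps are thinned out, e.g. [6,4,2].
       if timing_strategy == "contraction":
           thinned = progression[::2]
           if thinned[-1] != j:
               thinned.append(j)        # the target distance is always kept
           progression = thinned
       # Transform the vector into pairwise joins, e.g. [6,4] and [4,2],
       # each of which must be acceptable according to Table 6.2.
       pairs = list(zip(progression, progression[1:]))
       if all(sd[a][b] in ("X", "O") for a, b in pairs):
           return pairs
       return None                      # no bridge available; fall back or replan

   # Illustrative matrix with every join marked acceptable; the real
   # entries are those of Table 6.2.
   SD = {a: {b: "X" for b in range(1, 8)} for a in range(1, 8)}
   print(build_bridge(6, 2, "contraction", SD))   # [(6, 4), (4, 2)]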
In addition, the system is now also in a position to provide zoom-ins (i > j) or zoom-outs (i < j) automatically, in cases where a zoom is stylistically required but cannot be provided by the available material. However, Vertov's classification not only demonstrates logical relations between camera distances, but also hints at how camera distances can aid decisions taken to guide the interest of the viewer. The aim is to encourage the viewer to understand, through visual means, which aspect of an event is important and why. Thus, at this point, we are concerned with the awareness space of the viewer, rather than the presentation of the content space, which is discussed later in this chapter. Imagine that a medium shot showing two actors is to be combined with a shot portraying a particular emotion of one of the actors, where the emotion is of importance. A problem arises when the system must decide between two shots, both of which provide the appropriate content, but which vary in their camera distances (say one is a medium shot, and the other a close-up). The representation discussed thus far would derive no solution, as the joining of either of these shots is defined as acceptable. However, for a human, it is obvious that the close-up more suitably fulfils the requirement, since it highlights the relevant detail through a decrease of visual space. Thus, there is obviously a close relationship between the visual relevance of a scene, which can be determined by the nearness of the viewer to the mise-en-scène, and the narrative functionality of the particular sequence, which is totally dependent on the visual content. Figure 6.4 describes these relationships, according to their influence on the process of generating and interpreting video sequences.

Increase visual space     => first shot space < second shot space    = favour relation            <= Generalisation
Increase visual space     => zoom-out                                = favour relation            <= Generalisation
Descriptive visual space  => (first shot space = second shot space)  = favour action/interaction  <= Description
Decrease visual space     => first shot space > second shot space    = favour detail              <= Specification
Decrease visual space     => zoom-in                                 = favour detail              <= Specification
Decrease visual space     => masking                                 = favour detail              <= Specification
(read from left to right for generation, from right to left for interpretation)

Figure 6.4 Conceptual relationship between the space of visual awareness and narrative functionality

Based on this description of the relationship between camera distance and narrative functionality, an automated system is now in a position to compare the effects on space development of each join, by using editing strategies of the type described next.

E-Strategy 3
If sequence.intention = details
   and camera distance of shot A ≠ extreme close-up
   and camera distance of shot A > camera distance of shot B
then favour this shot

A more suitable strategy arises if shot B contains a zoom-in, as, in this case, the inter-shot transition is much smoother.
E-Strategy 4 if sequence.intention = details and camera distance of shot A ÿ extreme close-up and camera distance shot A start camera distance shot B and lens movement = zoom_in then favour this shot For the above example, it would turn out that a join of two medium shots would not fulfil E-Strategy 4, whereas the join between the medium and close-up would. As a result of this, a book-keeping mechanism is triggered, which is represented in EStrategy 3 and E-Strategy 4 by the "then" clause favour this shot. This mechanism is required because differing control attributes, e.g. Intention, Form and Appearance, with differing constraints, are queried before two shots are juxtaposed. 6: The representation of knowledge for automated editing 126 The "favour shot" book-keeping mechanism adds an applicability value to the evaluation value assigned to each relevant shot. The same evaluation value is used by all strategies applied to assess the applicability of shots for juxtaposition, whether or not they are concerned with the awareness space, the content space or temporal aspects. The evaluation value of a shot becomes important in cases where a choice between a number of applicable shots must be made. Since the presentation should be as faithful to the narrative request as possible, the system can then choose the shot with the highest evaluation value, as described in E-Strategy 5. E-Strategy 5 Compare the evaluation values for all relevant shots and choose the one with Evaluation_value = MAX. If there are several shots fulfilling this constraint, then use the first. It is desirable for a list containing the alternative candidates to be kept, for cases where a re-editing of this particular join is required due to unexpected subsequent plot changes. It must be stressed that relationships similar to the one between camera distance and narrative functionality can easily be established between further particular narrative requirements and other cinematographic devices, though the mechanisms (e.g. the "favour shot" book-keeping) would remain. For examples of such relationships see our description of tonal montage in section 4.2.2.2. By using the representational structures and editing strategies described in this section, an editing system will be able to decide which shot types can be juxtaposed with which others. Moreover, in cases where alternative candidate shots can be identified, inferences can be drawn on the basis of the formal and stylistic suitability of each shot introduced. The spatial relationship between two shots, based on the logical relationship between their respective frame size, is an important feature of continuity editing. However, it is, of course, essential to provide a consistent content space, as discussed next. 6.4.3 Automated establishment and maintenance of content space over several shots Imagine that a sequence structure, as described in Figure 6.3, requires the introduction of a new location featuring a number of interacting characters. 
It must be stressed that relationships similar to the one between camera distance and narrative functionality can easily be established between further particular narrative requirements and other cinematographic devices, though the mechanisms (e.g. the "favour shot" book-keeping) would remain the same. For examples of such relationships, see our description of tonal montage in section 4.2.2.2. By using the representational structures and editing strategies described in this section, an editing system will be able to decide which shot types can be juxtaposed with which others. Moreover, in cases where alternative candidate shots can be identified, inferences can be drawn on the basis of the formal and stylistic suitability of each shot introduced. The spatial relationship between two shots, based on the logical relationship between their respective frame sizes, is an important feature of continuity editing. However, it is, of course, essential to provide a consistent content space, as discussed next.

6.4.3 Automated establishment and maintenance of content space over several shots

Imagine that a sequence structure, as described in Figure 6.3, requires the introduction of a new location featuring a number of interacting characters. Referring to the editing strategy Establishing shot, as described in section 6.2, the establishing shot can be presented as:

E-Strategy 6
If a sequence is to be established where location of shot A ≠ location of shot B,
   or the sequence is the first sequence to be established
then create a memory structure of the spatial relations between all characters of shot B

The memory structure mentioned in E-Strategy 6 represents a hierarchically organised structure for describing the current location and the relationships between subjects in that location, as presented in Figure 6.5.

Location-Memory-Structure
  Start                          shot ID of the start of the location
  End                            shot ID of the end of the location
  List of Stable_position structures, each of which contains
    List_of_content_relations    spatial relations between subjects
    List_of_used_shots           list of shots used for this particular configuration of spatial relations

Figure 6.5 Memory structure for spatial relationships between subjects over a number of shots

The highest level of the Location-Memory-Structure contains the ID of the shot in which the current location first features, and the ID of the last shot in which the current location is found. The structure Stable_position holds details of the spatial relationships between subjects in that location until changes occur. The organisation of the different stable positions is sequential, where the ordering represents the narrative structure as provided by one or several sequences. The List_of_used_shots portrays the acceptable visual representation of events, based on the specified spatial relationships between subjects, as described in the List_of_content_relations. Thus, the List_of_used_shots represents the end result of the editing process. All other structures, i.e. the list of content relations and the top level of the memory structure for a particular location, serve a merely supportive role.
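In a modern notation, the structure of Figure 6.5 might be realised roughly as follows (a sketch under our own naming; the thesis does not prescribe this encoding):

   from dataclasses import dataclass, field
   from typing import List, Optional, Tuple

   @dataclass
   class StablePosition:
       # Spatial relations between subjects, e.g. ("said", "left-of", "gilles").
       content_relations: List[Tuple[str, str, str]] = field(default_factory=list)
       # Shots used for this particular configuration of spatial relations.
       used_shots: List[str] = field(default_factory=list)

   @dataclass
   class LocationMemory:
       start_shot: str                  # shot ID where the location first features
       end_shot: Optional[str] = None   # shot ID of the last shot in the location
       # Sequentially ordered stable positions, mirroring the narrative.
       stable_positions: List[StablePosition] = field(default_factory=list)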
We now return to the decomposition of character relationships. E-Strategy 6 presents an ideal case for the introduction of content space, this being when the narrative request can be satisfied by one shot. However, we have already discussed the possibilities of a composed alternative, either because no shot of the required type is available, or because the marked and abrupt shifts produced by cuts are required to direct attention. Figure 6.6 describes our approach to the decomposition of content space represented by a sequence of shots, which is based on:

• the number of required characters
• the spatial position of characters in a shot
• the spatial relationship between the characters.

[Figure 6.6 lays out, for 1, 2, 3, 4 and n subjects (where n > 4), the number and size of the characters to be portrayed and the spatial relationships involved; groups of more than four subjects are broken up into sub-groups of the smaller kinds, whose configurations are then reused. For a decomposition into one shot, the decomposition is based on the hierarchical knowledge representation of subjects and starts on the parts level; for a decomposition into two or more shots, the content of each shot should present the same hierarchical level within the knowledge representation, e.g. Instance - Instance, or Parts - Parts (the levels being Class, Subclass, Instance, Parts and Subparts). The legend distinguishes shots and cuts, and the types of spatial relations between subjects (line, triangle or half circle, square or circle, cross), both within a shot and created via a cut.]

Figure 6.6 Influence of sequence decomposition on the number and order of shots

The dimmed (red) text in Figure 6.6 outlines our representation of narrative structures, which is described in detail later in this chapter. From Figure 6.6, a number of decompositional editing strategies can be derived. E-Strategy 6 above is an example of the composition of the Location-Memory-Structure; E-Strategies 7, 8 and 9, which follow, are examples of decomposition strategies.

E-Strategy 7
If the relation between characters must be decomposed
then keep the number of characters for shot A as high as possible;
     then create a memory structure for the spatial relationships between the characters in shot A;
     then establish spatial relationships with the characters of the remaining shots, which all become part of the memory structure.

E-Strategy 8
If the number of characters = 2
then these combinations of screen positions of the characters are possible:
     shot A ([right]) with shot B ([middle]) => line
     shot A ([right]) with shot B ([left]) => line
or their permutations.

E-Strategy 9
If the number of characters = 3
   and the camera distance of both shots ≥ medium long
then these combinations of screen positions of the characters are possible:
     shot A ([left | right]) with shot B ([middle]) => circle / triangle
     shot A ([left | middle]) with shot B ([right]) => circle / triangle
     shot A ([middle | right]) with shot B ([left]) => circle / triangle
     shot A ([left]) with shot B ([middle]) with shot C ([left]) => line
or their permutations.

The editing strategies and memory structures introduced so far enable us to model the 180˚ system, though the location may need to be created over several shots. The number, or the positions, of characters may change within a location. Assume, for example, that three characters are sitting at a table. If one of them leaves the room, we can use the already established spatial relationship between the remaining two, but we must ensure that the relationship between the remaining two characters now represents a line. Hence, additional strategies must be introduced to cover cases such as adding or removing characters from a scene, or rearranging the relationship between characters due to their movement. The following strategy is an example of the changes made due to the disappearance of a character.

E-Strategy 10
If sequence = realisation or resolution
   and setting of sequence A = setting of sequence B
   and number of characters in sequence A < number of characters in sequence B
then introduce a new structure of stable position, where the spatial relationship between the remaining characters is downgraded, and the relations between the remaining characters and the removed characters are deleted.

Thus far, we have assumed that the location remains constant. However, this may not be the case. Imagine that the plot establishes a situation where two characters decide to leave a cafe. The current location would no longer be applicable, and a new location would need to be introduced. A further example might be a change in time, e.g. the same location but in a different era, which would also necessitate the re-establishing of spatial relationships.
The relevant editing strategy is as follows:

E-Strategy 11
If setting of shot A ≠ setting of shot B
then assign the shot ID of A to the parameter End of the Location-Memory-Structure,
     and use E-Strategy 6 with the ID of shot B.

We are now in a position to control the spatial relationships between subjects within a location. As stressed before, film is a dynamic medium. The position of an actor at the point of a juxtaposition between two shots is, therefore, part of a temporal scheme, usually based on actions. Thus, the influence of actions on continuity editing must be considered. Of particular interest are cases where there is a change in the awareness space between two shots to be joined, or where the narrative content needs to be decomposed.

6.4.4 The influence of action on continuity editing

If two shots are to be joined and an action is involved then, referring to the sequence structure as presented in Figure 6.3, there are more applicable constraints than the direction of movement suggested by Bloch. The first continuity problem for an action is related to the functionality of the action. Imagine a situation where an action such as tapping the fingers on a table is to be highlighted. Assume a medium shot of a character sitting at a table. Using the strategies introduced so far, an automated editing system could decide which shot type it should choose in order for the action to be highlighted, i.e. a close-up. However, the problem that arises is how to relate the different subjects of the two shots, i.e. the character in the medium shot and the hand in the close-up, since so far the system can only infer the spatial relations between the identified character and the table. Thus, a structure needs to be introduced to enable the system to establish relationships between the different detail and identification levels of subjects, as presented in the shot content. Referring back to Figure 6.6, we are now ready to consider the structure represented by the dimmed text. This knowledge structure describes a subject (character or object) as a hierarchical tree structure, where the root represents the class (e.g. human) and the leaves represent subparts (e.g. finger). Such representational structures are standard, and can be found in Davis (1995), Lenat & Guha (1990) and Parkes (1989a). The inheritance provided by such a structure allows the drawing of inferences based on the relationship between a detail and the whole (e.g. finger and character). Moreover, the different levels of detail provided by the hierarchical representation of subjects, as discussed immediately above, correspond to the content space of the shot types, as described in Table 6.1. This correspondence is particularly useful, since it allows the establishment of relationships between a camera distance (e.g. extreme close-up) and a level in the hierarchical representation of subjects (e.g. the leaves), as presented in Table 6.3. Such relationships enable the system to restructure the presentation of the narrative request at the action level, without changing the required logic. In other words, the introduction of relationships between a filmic device and conceptual structures of the "narrative world" supports a decomposed presentation of actions through continuity editing.
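By way of illustration, such a subject hierarchy, and the detail-to-whole inference it licenses, might be sketched as follows (the class and part names are mere examples, not data from the thesis):

   # Sketch of a hierarchical subject representation; each level can be
   # related to a camera distance, as laid out in Table 6.3 below.
   SUBJECT = {
       "class": "human",
       "instance": "character-1",
       "parts": ["head", "hand", "foot"],
       "subparts": {"hand": ["finger"], "head": ["eye", "mouth"]},
   }

   def whole_for_detail(subject, detail):
       # Infer the part to which a detail belongs (e.g. finger -> hand),
       # which in turn relates a close-up of the detail to the character
       # shown at a larger camera distance.
       for part, subparts in subject["subparts"].items():
           if detail in subparts:
               return part
       return subject["instance"]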
Value of camera device    Level of content detailness
extreme close-up          object: subparts [form, colour]
close-up                  character: subparts, one detail of either head, hand or feet; object: Instance shape
medium close-up           character: Instance, or parts of either Head+Id, Hand or Foot; object: Instance shape
medium                    character: Instance Appearance, Head; object: Instance shape
medium long               character: Instance Appearance; setting: Time, Location, Lighting; object: Instance shape
long, extreme long        setting: Time, Location, Lighting

Table 6.3 Relationship between camera distance and the hierarchical representation level of subjects

Thus, the system is now also in a position to create the action of a particular character, even if the required visual material is not directly available. Consider the above example of the character tapping his or her fingers on a table, but assume this time that there is no request for highlighting and no reference to fingers. Now suppose that there is no single shot showing the particular character doing the tapping. The system is still able to create the required action, by splitting the character and the action into separate sequences, where one presents the character and the second the action, as described below in E-Strategy 12. It must be stressed that E-Strategy 12 represents a case where the previous shot showed several actors, so that the system has to focus on the relevant actor first. Thus, it becomes apparent that the knowledge structures provided enable the system to react more flexibly than would be possible with only a two-shot solution.

E-Strategy 12
If an action for a character is required
   and there is no shot available to portray that action
then isolate the character in a shot;
     retrieve the body part related to the action;
     retrieve a suitable shot where a body part performs the required action;
     build a bridge into or out of this sequence if necessary;
     continue with the sequence which was interrupted by this subsequence.

E-Strategy 12 guides us directly into a further spatial continuity problem concerning actions and the juxtaposition of shots. When discussing the representation of shot content earlier, in chapter 5, we argued that some actions are static in nature, meaning that the subject is not moving significantly within the content space (e.g. gestures), and some are moving, meaning that the subject is changing position within the content space. If a subject only ever performed one action per sequence, we could ignore this distinction. However, the sequence structure introduced in Figure 6.3 makes it clear that this is not necessarily the case. Each sequence provides a number of options concerning the actions performed by each character, such as:

single action     The narrative requires that a character perform only this particular action, e.g. to walk. Since a single action can be either moving or static, there are no compositional problems here.

parallel actions  The narrative requires that a number of actions be performed at the same time (e.g. scratching one's head while drinking coffee). As described in sections 2.1.2.1 and 5.2.3, such a combination of several actions defines other functional elements, such as the intentions or moods of the characters. Here, the order of actions is of no importance as long as the impression of simultaneity is provided.

serial actions    The narrative requires that a number of actions be performed in a particular logical order, e.g.
the events involved when a character obtains coffee from a coffee machine. Here, the proper order of events should be maintained.

The combination of static and moving actions presents a problem only in cases where the automated system must decompose several actions into a chain of single actions, but should still provide the impression of simultaneity. E-Strategy 12 shows that a sequence can be decomposed into sub-sequences in cases where a particular visual event must be created. We use the same mechanism for the decomposition of actions performed in parallel, and introduce the following strategies:

E-Strategy 13
If several actions for a character are required
   and the actions should be performed simultaneously
then use the actions of type moving first;
     insert the actions of type static, which should be presented in the same shot type;
     finish with the actions of type moving;
     build a bridge into or out of this sequence if necessary;
     continue with the sequence which was interrupted by this subsequence.

E-Strategy 14
If several actions for a character are required
   and the actions should be performed simultaneously
   and there are no actions of type moving
then order the actions of type static, which should be of the same shot type, sequentially;
     build a bridge into or out of this sequence if necessary;
     continue with the sequence which was interrupted by this subsequence.

A similar approach can be applied in cases where a particular single action is required, but the retrieval process can provide only one shot in which several actions are performed simultaneously, among which the required action is but one. In this case, the difference from the above strategies is that the required action should be highlighted, which means that all other actions are to be ignored. The decomposition of serial actions can be achieved by using strategies of the type of E-Strategy 11.

Actions imply an actor, which leads to the fourth problem concerning actions and the juxtaposition of shots. Earlier, we demonstrated that the 180˚ principle supports the continuity system, so that a smoothly flowing space for the narrative action can be achieved. However, most actions are performed for a reason, often to provoke a reaction, or as a reaction. Editing techniques such as the shot/reverse shot, or the POV shot, embody editing principles for presenting an action as an action or as a reaction. These techniques are related to the point of view of a particular character in the film. The viewer, on the other hand, is an onlooker, and does not usually see things from the point of view of a particular character. It is therefore essential for an automatic editing system to distinguish between these two cases, so that the appropriate editing technique, with its related spatial constraints for the visual presentation, can be chosen. Referring to the sequence structure, as described in Figure 6.3, this distinction of point of view is provided by the structural element Form. If a sequence focuses on a presentation assuming the viewer as an external observer, the form will be external. If, on the other hand, the presentation requires that the viewer's perspective coincide with that of one of the characters, the form will be internal. It must be stressed, however, that though we are aware of the importance of this feature, our approach to it is but a first step towards a true representation. At the current stage of our research, we have dealt with only the external type of presentation.
Though we will subsequently refer to the POV shot, we do not refer to the merging of the point of view of the viewer and the portrayed character. Rather, we intend the POV shot to describe a reaction to the action "look". Since the system is now in a position to distinguish particular styles of presentation, a set of new editing strategies can be introduced to specify the spatial constraints on actions, based on their motivation and the number of characters involved. These strategies represent a number of editing techniques; the following examples are the shot/reverse shot (E-Strategies 15 and 16), the eyeline match (E-Strategy 17) and the eyeline match for a moving character, i.e. the POV shot (E-Strategy 18).

E-Strategy 15
If sequence.kind is realisation or resolution
   and sequence.intention = interaction
   and number of actors = 2
   and shot A contains both characters
   and the actions are of type static
   and camera angle of shot A is overhead
then camera distance of shot A = camera distance of shot B
   and character relation of shot A opposite to character relation of shot B
   and camera angle of shot B = overhead
   => favour this shot as shot B

E-Strategy 16
If sequence.kind is realisation or resolution
   and sequence.intention = interaction
   and number of actors = 2
   and shot A contains both characters
   and the actions are of type static
   and camera angle of shot A is overhead
   and the available shot B contains only the character for the reaction
then camera distance of shot A ≥ camera distance of shot B
   and camera position of shot B is middle
   and camera angle of shot B = Head.line_of_sight of the acting character in shot A
   => favour this shot as shot B

E-Strategy 17
If sequence.intention = relation
   and number of actors = 2
   and shot A contains 1 character
then camera distance of shot A = camera distance of shot B
   and line of sight of the character in shot A = camera position in shot B
   and camera angle of shot A = camera angle of shot B
   => favour this shot as shot B

E-Strategy 18
If sequence.intention = relation and action and interaction
   and number of actors = 1
   and the action is of type moving
then camera distance of shot A ≥ camera distance of shot B
   and line of sight of the character in shot A = camera position in shot B
   and camera angle of shot A = camera angle of shot B
   and direction of action in shot A = camera movement in shot B
   => favour this shot as shot B

The number of editing rules such as E-Strategies 15 - 18 is limited, as the number of characters involved is assumed to be no more than 3. The strategies and representational structures introduced so far would enable an automated editing system to react flexibly to narrative demands by providing visual presentations that show a constant content space with respect to the position, sight and direction of movement of subjects. The remaining problem in juxtaposing two shots, while preserving continuity of content, is the comparison of the surrounding space of the characters in both shots. This is discussed in the next section.

6.4.5 The comparison of surrounding content space and graphical pattern

Though the above structures and strategies provide a constant content space over various shot juxtapositions, a problem that remains is to compare the overall appearance of two shots to be joined, with respect to the similarity of the surrounding space and the overall graphical pattern. Earlier in this chapter, we discussed the influence of different camera distances on the shape of the awareness space of the viewer. We also mentioned that the comparison between shots is based on the last frame of shot A and the first frame of shot B.
These two schemes form the basis for our comparison mechanisms. Camera distances are either identical for two shots, or they differ, i.e. the camera distance in shot A is larger or smaller than that of shot B. Furthermore, in cases of a change of location, comparisons need only be performed on the IDs of the relevant subjects appearing in both shots. However, it is also advisable to compare the characters' screen positions (to keep the viewer's attention on the same spot) and the directions of actions, to fulfil the viewer's expectations of continuity (action match). These mechanisms are described in detail by Bloch (1986). A different situation applies if the location remains constant over two shots. Then it becomes necessary to compare the relationships between subjects and their positions in the setting. However, it is not always essential to compare every single relationship, due to the difference in camera distances. For example, if the order of the shots is a medium shot followed by a long shot, it is advisable that all the elements within the medium shot are represented in the long shot. This is not the case if the long shot is followed by the medium shot. With reference to the above, a new set of E-Strategies can be introduced. Again, not all possible strategies are expressed, but rather examples of the different types of strategy.

E-Strategy 19
If camera distance of shot A = camera distance of shot B
   and location of shot A ≠ location of shot B
   and subject(s) of shot A appear in shot B
then position of subject(s) in shot A = position of subject(s) in shot B
   and action direction of shot A = action direction of shot B
   => favour this shot as shot B

E-Strategy 20
If camera distance of shot A = camera distance of shot B
   and location of shot A ≠ location of shot B
   and character(s) of shot A do not appear in shot B
then action direction of shot A = camera direction of shot B
   => favour this shot as shot B

E-Strategy 21
If camera distance of shot B = medium close-up
   and location of shot A = location of shot B
   and sequence.intention is interactive
   and number of characters in shot B = 1
   and action of the character in shot B is of type static
then verify that all objects of the Location-Memory-Structure located around the character of shot B appear in shot B
   => favour this shot as shot B

E-Strategy 22
If camera distance of shot A is long
   and camera distance of shot B is medium long
   and location of shot A = location of shot B
   and character(s) of shot A appear in shot B
   and character action in shot A is of type moving
then verify that the objects that are in front of the character(s)' action direction in shot A appear behind the character(s)' action direction in shot B
   => favour this shot as shot B

E-Strategy 23
If camera distance of shot A is medium
   and camera distance of shot B is close-up
   and location of shot A = location of shot B
   and the subject of shot A appears in shot B
   and the action in shot A is of type static
then verify that the background of shot A = background of shot B
   => favour this shot as shot B

E-Strategy 24
If camera distance of shot A < camera distance of shot B
   and location of shot A = location of shot B
   and character(s) of shot A appear in shot B
   and character action in shot A is of type moving
   and position of character(s) in shot B is opposite to the position of character(s) in shot A
then no content comparison is necessary.
The need to individually specify the comparison mechanisms for different camera distances is unavoidable in cases where the camera distance is decreased, as the different levels of detail provided by different types of shots require particular mechanisms for comparing content. In cases where the camera distance between shots increases and the actions involve movement (E-Strategy 24), no comparison of content is necessary, since the action itself is sufficiently powerful to provide continuity. In such cases, however, exactly the right direction of movement and position of characters must be maintained.

Related to the comparisons made by E-Strategies 19 - 24 is the evaluation of the general graphical pattern (e.g. lighting, colours, composition). The link between two shots on the basis of patterns can be motivated by two requirements: first, to achieve contrast, and second, to establish a similarity. It may initially seem strange to assume that contrast can contribute to a smooth narrative flow. Within a particular sequence of a film it might be necessary to emphasise the spatial opposition of two characters by showing each of them against different light and colour backgrounds. However, it is usually similarities in graphical pattern which serve as a continuity control. It is not necessary for the match between colours, compositional features or directions of movement (camera and subject) to be identical across the cut, but the precision should be appropriate. An example of an editing strategy related to such problems is described below.

E-Strategy 25
If setting.location of shot A = setting.location of shot B
   and setting.time of shot A = setting.time of shot B
   and setting.lighting of shot A = setting.lighting of shot B
then shot granularity of shot A = shot granularity of shot B
   and shot colour of shot A = shot colour of shot B
   and shot contrast of shot A = shot contrast of shot B

The knowledge structures and their analysis mechanisms presented so far provide an automated editing system with the ability to combine shots in ways that satisfy the required narrative logic, even if the exact material required is not available. The ability to shape the viewer's awareness space enables the system to provide a kind of rhythmical structure to the sequence, even though, at the current stage of the discussion, this structure is solely content oriented. The remaining problem to be discussed is how this rhythm can be controlled in its temporal appearance. This issue is discussed in the next section.

6.4.6 Temporal and rhythmical relations between shot A and shot B

In chapter 2, we defined the three dimensions through which plot time leads the viewer to construct the story time, i.e. the order, frequency and duration of actions. The preceding sections of this chapter discussed the implications of the ordering of actions for the creation of visual space. Furthermore, we stated that continuity editing merely supports the intelligent sequencing and orchestration of the narrative chain of causality by providing appropriate visual material. Changes in the logic of the plot are a feature of story generation. The principal ordering mechanism for continuity editing is sequential. However, E-Strategies 13 and 14 show how editing can intelligently order shots of actions sequentially, and yet provide the impression of simultaneity.
A common violation of the temporal order of events found in film is the flashback or the flashforward, which are signalled either by a cut or by cinematographic devices such as dissolves. Since we exclude the use of dissolves at the current stage of our research, our approach supports only the cut, and we have shown above how the system can react to environmental changes (see E-Strategies 6 and 10). Though these mechanisms describe spatial changes, it would not be difficult, by simply focusing on the temporal attributes of a setting, i.e. epoch, season, daytime, to provide similar strategies for the appropriate visual presentation of a temporal change. Moreover, if emphasis is placed not on the creation, but rather on the interpretation of a sequence, we can imagine how such strategies for the detection of temporal differences may be used to interpret a cut between two shots as a temporal switch. Such results can then be used to evaluate, and thus prefer, certain cuts, to establish the appropriate presentation of a flashback or flashforward.

The second feature controlling the temporal shape of a sequence is the frequency of actions or events. While describing the relationship between events in section 2.1.2.1, we showed that repetition and heaping are common ways of intensifying significant information. Again, we face a problem similar to that of the control of temporal ordering: it is the plot structure which requires a repetition. However, having included the substructure Form in the sequence structure (see Figure 6.3), we are in a position to outline how an automated editing system can cope with the structural repetition of an event. Due to the sequential order of shots in the Location-Memory-Structure, it is possible to detect the sequence to be repeated, and along with it, all of the stylistic elements used for its presentation (e.g. camera distances, camera angles, colours, etc.). These stylistic and structural features can now either be copied or used for a similar scene with slight structural differences. [3]

[3] An example of such a relationship between events is the film Groundhog Day, by Harold Ramis, where the same day is repeated again and again, but with slight changes in the event structure.

The remaining feature of the temporal control of actions or events, i.e. duration, is more relevant to shaping the established order of a sequence. The shaping of duration by the editing process can be applied to the actual length of a shot, which exerts an influence on the content, and to the overall appearance of the whole sequence, which, in turn, influences the temporal rhythm. Both features are obviously correlated, since both are related to the shot length. We now discuss editing strategies for automated physical clipping that support the narrative flow of information with respect to temporal structure, paying particular attention to the transformational classes related to duration, i.e. expansion, equivalence and contraction (as discussed in section 2.1.3). Furthermore, we discuss the problem of creating an overall appearance for a sequence.

6.4.6.1 Preliminary remarks

Referring to the sequence structure (see Figure 6.3), we can identify two structures that hold information relevant to making decisions on the temporal portrayal of an action or event.
One is the sequence substructure Action, where the attribute tempform indicates whether the requested order of actions represents a screen time greater than the story time (expansion), a screen time equal to the story time (equivalence) or a screen time shorter than the story time (contraction). The second temporal indicator is provided by the sequence substructure Appearance, which provides the editing process with information concerning the overall rhythmical appearance of the sequence, i.e. a steady pace (shots are approximately of the same length), a dynamically slowing pace (steadily lengthening shots), or a dynamically accelerating pace (steadily shortening shots).

The problem with these temporal control attributes is that they are not only responsible for shaping the duration of the shot, but will already have influenced the previous steps, i.e. retrieval and ordering. Take the expansion of an action as an example. An expansion requires that an action be decomposed for presentation purposes, which obviously influences the retrieval and order of shots. However, if the required decomposition cannot be achieved, the related temporal clipping cannot be performed, and must be replaced with the now appropriate temporal control mechanisms, say those related to equivalence. As we see editing as a task which supports the automated generation of meaningful film sequences, our aim is to fulfil the content request, based on the temporal requirement if possible, and to perform the clipping necessary for the temporal requirement once a sequence is completely specified, and then only if the required material can be provided. The necessary information concerning the shots used for the presentation of an action can be retrieved from the Location-Memory-Structure.

The above indicates that there are two stages to the editing: firstly, the retrieval and ordering of shots and, secondly, the shaping of shots and the overall appearance of the scene. This corresponds to the description of our editing model in section 4.3. All forms of temporal clipping discussed in the following sections conform to this scheme.

6.4.6.2 Temporal clipping for action expansion

The effect of expansion is usually valid for a single action and can be achieved by repetition. The repetition is performed either through the use of an action match, or through the overlapping portrayal of the same event in shots taken from different camera positions. Since we do not support overlapping, we discuss only the former case. Figure 5.5 shows how our representation of video content, based on time intervals, supports the isolation of particular actions within a shot. The expansion of an action uses the same mechanism of temporal clipping, which means that an action is presented twice, where the repetition is either shown from a different camera position or as a detailed continuation, i.e. a zoom-in. Since it is confusing to show the same action twice, it is necessary to shorten the first presentation, which can be achieved through temporal clipping, i.e. a certain number of frames must be removed from the first shot. Two additional strategies are required to expand an action through automated editing, where E-Strategy 26 represents the constraints for retrieval and ordering, and E-Strategy 27 describes the temporal clipping.
E-Strategy 26
If sequence.action.tempform = expansion
   and the action is a single action
then favour decomposed forms of presentation where
     the camera distance of shot A > camera distance of shot B
     or the camera position of shot A ≠ camera position of shot B

E-Strategy 27
If sequence.action.tempform = expansion
   and the action is portrayed decomposed
then clip the last half of shot A
     and show the complete action in shot B.

6.4.6.3 Temporal clipping for the temporal equivalence of actions

A temporal equivalence between the progression of the narrative and the presentation time actually means that every action by characters or objects is shown without gaps. For editing, this means that the actions are shown in the required order and in full length. The standard indicator for temporal continuity based on visual presentation is the action match, as already described above. [4]

[4] The reader is reminded that we exclude sound, which is the other indicator of temporal continuity between shots.

The relevant strategy is as follows:

E-Strategy 28
If sequence.action.tempform = equivalence
then order parallel or serial actions as required
     and favour strategies based on the action match.

The problem which may arise here relates to our content representation of actions, which is based on a temporal-symbolic description. This information states only at which frame a certain action begins and at which frame it ends. Naturally, a system can calculate how long that performance takes (24 frames = 1 sec). However, provided with the sequence structure and the content description, the system is not yet in a position to detect how long the presentation should be. This problem is addressed later, when we consider physical cutting.

6.4.6.4 Temporal clipping for action contraction

As described in section 2.1.3, temporal contraction is a form of ellipsis. For editing, this means that the time taken by the portrayal will be less than that suggested by the story. For the most part, narrative-oriented ellipsis is provided in advance by the sequence. Imagine the following situations:

• A sequence shows a character wash his hands, button his shirt, drink coffee and leave a house. The unwanted time is eliminated here by the content. The appropriate juxtaposition of shots can be achieved by the strategies introduced above.

• A sequence shows stages in the career of a writer. The unwanted time is eliminated here by the content, and the appropriate juxtaposition of shots can again be managed by the above strategies.

• There are some clichéd presentations of time elimination, e.g. calendar leaves fluttering away, the shadow of a static object moving along a wall, clouds moving, clocks ticking, etc., which also represent content-based elimination of time.

All the above sequences embody the elimination of a considerable period of time. However, if the aim is to reduce a short period of time for a single action, then physical, elliptical editing can support the presentation. Imagine a sequence where a character must go from one location to another. Without using dissolves, there are two major ways of eliminating time:

• insert an event into the action which shows something different, and then return to the previous event. For the above example, this may be portrayed by showing the character walking, then a door, and then the character already doing something in the flat.
• show the character leave the frame, hold on the empty location, cut to the empty location of the new frame, and let the character enter. As an example, imagine a shot where the character begins to climb a flight of stairs. The shot ends with the character disappearing and a few frames of the empty stairs. The next shot first shows the staircase on a higher level, and then the character coming up.

The first example is again content related and can be covered by the editing strategies introduced thus far. The second of the above examples, however, requires the introduction of a new set of editing strategies, where E-Strategy 29 represents the constraint elements for the retrieval and ordering process, and E-Strategy 30 describes the temporal clipping.

E-Strategy 29
If sequence.action.tempform = contraction
   and the action is a single action
then favour decomposed forms of presentation where
     the camera distance of shot A = camera distance of shot B

E-Strategy 30
If sequence.action.tempform = contraction
   and the action is portrayed decomposed
   and the character leaves the frame
then clip not later than 24 frames after the first frame without the character in shot A
   and clip not later than 24 frames before the first frame where the character is in shot B.

6.4.6.5 Rhythmical shaping of a sequence

So far, we have examined various aspects of physical editing to support the visual presentation of temporal continuity. There are two more representational requirements of temporal clipping, both of which usually support the meaning of a video sequence. The first is related to the length of a particular shot. When we introduced our scheme for shaping the viewer's awareness space, we explained in detail the relationship between camera distance and shot content. A similar relationship applies between the camera distance and the viewer's ability to take in the information provided by a shot. It is possible to take in the image of a close-up shot in a relatively short time (approximately 2 - 3 seconds), whereas the full perception of a long shot requires more time. [5] Moreover, we explained above that the composition of shots may vary in the number of subjects, the number and speed of actions, and so on, which also influences the time taken to perceive the image in its entirety. Finally, the stage of a sequence in which a shot features also influences the time taken to perceive the entire image. For example, a long shot used in the motivation phase takes longer to appreciate, since the location and subjects need to be recognised, whereas in the resolution phase the same shot type may be shorter in duration, since in this case the viewer can orient himself or herself much more quickly.

[5] The time values used in all the following examples of editing strategies are based on estimates provided by the editors at the WDR.

Based on the preceding discussion, the following editing strategies can be derived. These apply, like the previous strategies concerning temporal cutting, during the second phase of the editing process:

E-Strategy 31
If camera distance of a shot ≤ close-up
then clip it to a length ≤ 60 frames.

E-Strategy 32
If close-up < camera distance of a shot < long
   and sequence.kind = motivation
then clip it to a length ≤ 108 frames.

E-Strategy 33
If close-up < camera distance of a shot < long
   and sequence.kind = motivation
   and the number of characters is > 2
then clip it to a length ≤ 12 * (number of characters - 2) + 108 frames.

E-Strategy 34
If camera distance of a shot > medium long
   and sequence.kind ≠ realisation or resolution
then clip it to a length ≤ 136 frames.

Strategies such as E-Strategies 31 - 34 indicate the need to trim a shot.
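Taken together, E-Strategies 31 - 34 amount to a small lookup from shot type, sequence kind and character count to a maximal shot length, which might be sketched as follows (the frame counts are those given above; the function name, the use of the Table 6.1 distance indices, and our reading of the damaged condition of E-Strategy 34 as sequence.kind ≠ realisation or resolution are assumptions of ours):

   # Sketch of E-Strategies 31-34: an upper bound, in frames, on the
   # length of a shot (24 frames = 1 second). Distances are indexed as
   # in Table 6.1: 1 = extreme close-up ... 7 = extreme long.
   def max_shot_length(distance, kind, characters=1):
       if distance <= 2:                           # E-Strategy 31
           return 60
       if 3 <= distance <= 5 and kind == "motivation":
           if characters > 2:                      # E-Strategy 33
               return 12 * (characters - 2) + 108
           return 108                              # E-Strategy 32
       if distance > 5 and kind not in ("realisation", "resolution"):
           return 136                              # E-Strategy 34
       return None          # no perception-based bound applies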
However, referring to the discussion of temporal equivalence, the above strategies may cause problems, as the actions portrayed may simply require more time than the rules suggest. Imagine, for example, a long shot during the motivation phase, which shows a character walking down a path, picking a flower, continuing to walk, and finally sitting down on a bench. If all of these actions are required by the narrative, it is not possible to trim the shot exactly to a length of, say, 120 frames, since we would then lose some of the actions. However, it is still possible to clip a number of frames from the beginning or the end of the shot. In Figure 5.4, we demonstrated how actions can overlap. In Figure 5.5, we showed how the start and end frame of an action can be used to fulfil a particular narrative request for actions, even if larger parts of the shot must be removed. The same mechanism can be used to trim a shot. The necessary steps are to identify the startframe of the first relevant action, and to detect any overlap with the second relevant action. If such an overlap exists, then it is possible to cut away the section of the shot in which the first action is performed in isolation. The same mechanism applies to the end of the shot. It is, of course, important that the established spatial and temporal continuity between the shot and its predecessor and successor remains valid. Figure 6.7 describes the application of edge trimming for a shot of 140 frames. The shot should portray an actor walking and then sitting, but should not be longer than 108 frames.

Figure 6.7 Trimming of a shot from 140 to 108 frames (the shot of frames 0 - 140 carries the action annotations eat, walk, sit and talk)

E-Strategies 35 and 36 represent examples of temporal clipping as discussed immediately above. E-Strategy 35 focuses on the elimination of frames from shots that feature close camera work, and are involved in an action match.

E-Strategy 35
If camera distance of a shot = extreme close-up or close-up
and the type of action is moving
and the number of frames in the shot is > 60
then start at the last frame of the shot,
count 60 frames backwards,
and cut the remaining frames back to the startframe of the shot.

E-Strategy 36 focuses on frame elimination from a single shot that sequentially portrays a number of actions.

E-Strategy 36
If close-up < camera distance < long
and sequence.action.tempform = equivalence
and number of frames > the length calculated in E-Strategy 32 or 33
and number of performed actions in the shot > 3
then verify the frame overlap of the first and second actions as X,
verify the frame overlap of the last and last-but-one actions as Y,
cut away the frames in which the first action is performed in isolation if X > 36,
and cut away the frames in which the last action is performed in isolation if Y > 36.

Thus, a system is now in a position to adjust the length of a shot to provide an appropriate perception time, without removing essential narrative elements. This means that the automated editing process, as described thus far, shapes the appearance of a sequence into a temporal rhythm which is merely content related. The remaining problem to be addressed is the shaping of the overall temporal appearance of a sequence with respect to a steady or dynamic pace, i.e. shaping the stylistic intention of a sequence by means of temporal clipping.
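The edge-trimming mechanism of E-Strategy 36 can be summarised in a short sketch. The interval layout is our own illustrative assumption: each annotated action is taken to be a (name, startframe, endframe) triple, ordered by start frame, and solo frames of the first or last action are only removed if the overlap with the neighbouring action exceeds the threshold.

    # A minimal sketch of edge trimming (assumed layout: (name, start, end)).
    def trim_edges(actions, min_overlap=36):
        """Cut away frames in which the first or last annotated action runs in
        isolation, provided it overlaps its neighbour by more than min_overlap
        frames, so that the action remains recognisable in the overlap."""
        first, second = actions[0], actions[1]
        before_last, last = actions[-2], actions[-1]
        start, end = first[1], last[2]
        if first[2] - second[1] > min_overlap:       # X: overlap of first two actions
            start = second[1]                        # drop the first action's solo frames
        if before_last[2] - last[1] > min_overlap:   # Y: overlap of last two actions
            end = before_last[2]                     # drop the last action's solo frames
        return start, end

    # A Figure 6.7-style example (hypothetical frame values):
    shot = [("eat", 0, 60), ("walk", 20, 110), ("sit", 80, 128), ("talk", 90, 140)]
    print(trim_edges(shot))   # -> (20, 128): 108 of the original 140 frames remain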
To provide the system with mechanisms that can achieve the above task, it is necessary to adjust the shot length designed for the temporal perception of a shot to the pattern required for a dynamic slowing or accelerating of pace. However, at the current stage of our research, we merely state that this should be done, as we are not yet in a position to provide a usable scale to direct the slowing or accelerating of the pace. It is therefore not possible to describe the temporal influence of these mechanisms on the shaping of single shots. For a more complete scheme for the temporal shaping of sequences, further research is required.

6.4.7 Conclusion

Based on the assumption that the editing process supports narrative through appropriate visual and cinematographic presentation, we have introduced a novel scheme for automated video editing. In this, we were influenced by structures introduced by Bloch (1986) and Parkes (1989). We have shown how cinematic constraints for one editing style (continuity editing) can automatically be applied to narrative structure, by supporting the essential narrative aspects of context and form (see Figure 2.4). We have introduced a method of relating the intention of a narrative sequence to the shot presentation, and demonstrated how the knowledge structures and analysis mechanisms introduced can provide an automated editing system with the ability to visually guide the viewer of a video sequence through a narrative, so that he or she is in a position to classify information as relevant or purely descriptive. Furthermore, we introduced a scheme for the establishing and maintaining of content space over several shots, based partly on strategies for the visual decomposition of a character and/or related actions, and partly on the decomposition of spatial relationships between characters and other characters, and between characters and their screen position. Using this scheme, we showed how an automated editing system can react flexibly to narrative requests, by making use of the available visual material, even if such material does not directly represent the narrative specification. In addition, we briefly discussed how the same mechanisms support not only the creation, but also the interpretation, of film sequences. Moreover, we presented a novel approach for comparing the content space of different shots, which is important for providing an overall spatial continuity within a sequence of juxtaposed shots. Finally, we demonstrated how a two-stage editing process, as presented in section 4.3, provides the ability to shape the visual presentation of a plot sequence by means of physical clipping, so that the temporal rhythm of the sequence supports the narrative intention.

However, the above approach is but a small step towards intelligent automated editing. The majority of the strategies can be refined in various ways, and, as outlined in the previous sections, there are a number of problems yet to be solved, e.g. the influence of the introduction of additional cinematic codes on the plot realisation, as described in Figure 6.3. Other problems relate to the temporal aspects of automated editing. Earlier, we discussed the two-way relationship between the overall rhythmical structure of a sequence and the temporal shape of single shots, but we did not adequately address the influence of the speed of an action on the rhythmical structure.
Finally, it is yet to be determined how successful the editing scheme would be in producing higher order narrative structures, i.e. episodes. Despite these shortcomings, the scheme presented is a small step towards capturing the conflicting relationship between metric, rhythmic and tonal montage as described by Eisenstein.

Chapter VII
The representation of narrative and thematic knowledge

The two preceding chapters discussed how film material and techniques can be represented so that they can be used in the cinematic portrayal of narrative events. We showed how various denotative aspects of video content can be described and thus made accessible. We presented particular mechanisms for the manipulation and physical shaping of video, and showed how these mechanisms can support the spatial and temporal credibility of the assembled video material.

The editing model in section 4.3, and the analysis of the story generation process described in chapters 2 and 3, make it clear that the development of a simplified, linearly structured narrative is part of the production stage described as post-production. This means that the linear flow of the plot, i.e. the introduction of characters and settings, and the development of the plot through various situations and causally linked events, exists before the film itself can be created. Referring to Figure 2.4, it can be stated that the description of the plot is designed in such a way that it supports the actual portrayal of the narrative, i.e. in the cases considered by this thesis, the plot must be designed in the form of a screenplay (as discussed in section 6.4.1). The major components from which a narrative is built (previously specified in Figure 2.4) are events (actions and happenings), physical entities (characters and settings) and cultural codes. Thus, a narrative depends on knowledge of the world, which is processed and arranged in a particular way to provoke perceptual activity within the audience. A key requirement for the effective automatic generation of a meaningful and emotionally stimulating portrayal of narrative events is the suspension of the viewer's disbelief (see also sections 2.1.2 and 3.2.1). That is, the viewer must be able to imagine that the logical relationships between characters and their actions within a setting are real.

The aim of this chapter is to present an ontological representation describing the physical world and abstract mental and cultural concepts that can support an intelligent system in creating credible narrative sequences. The intention is not to equip the system with extensive common sense knowledge. Rather, we intend to create an intelligent system that generates film sequences that are credible for the narrative world they present. Referring to the genre discussion in chapter 2, it is justified, we believe, to argue that the narrated world is a stereotypical world, which only resembles the world as human beings know it. Thus, a shallow level of knowledge of intentions, emotions and human activity, the physics of human beings and the physics of the micro-world in which the characters act will be sufficient for our purposes.1 Furthermore, the underlying representations and semantics of the "narrative world knowledge" must be related to the cinematographic structures and mechanisms introduced earlier, so that the translation between one representation and the other does not result in the loss of salient features of either representation.
The theoretical background for the ensuing discussion has two main features. Firstly, we use the results of our discussion, in section 4.2.1.2, of the relationship between the iconic sign and the idea it represents according to the creator's intention. In that discussion, we argued that differing connotations can be attributed to the sign, depending on the circumstances and abductive presuppositions of the receiver at the time of perception, along with the various legitimated codes and subcodes the receiver uses as interpretational channels. At the same time, we introduced the paradigmatic and syntagmatic axes of meaning (Figure 4.1), Peirce's trichotomy of a sign as being either symbolic, iconic or indexical (Peirce, 1960), and Eco's structural analysis of the cinematic image (Eco, 1977; Eco, 1985), together with his classification of the underlying code system for the triple articulation of an image. As the organisational structure for the above systems of cultural units we identified semantic fields, as described by Bordwell (1989, p. 106): '...a conceptual structure which organises potential meanings in relation to others'.2 As discussed in section 4.2.1.2, a semantic field can be constructed according to various principles, i.e. clusters, doublets, proportional series and hierarchies. These structural elements form one part of the ontology.

1 For similar approaches to the representation of credible animated characters in interactive virtual worlds see Bates (1992, 1994), Bates et al. (1992), Hayes-Roth (1995), Hayes-Roth et al. (1995), and Hayes-Roth et al. (1994).
2 See also Eco (1977, pp. 73 - 150).

The second theoretical influence on our approach to the representation of narrative knowledge arises from the discussions of narrative in chapter 2 and humour in chapter 3, which provide a basis for our approaches to temporality and causality. Before presenting our techniques for representing the physical characteristics of human beings and micro-worlds, and of action and event structures, including the representation of human intentions and emotions, we first discuss approaches to knowledge representation that influence our own approach.

7.1 Approaches to knowledge representation

7.1.1 Quillian and semantic networks

The notion of using semantic relations to provide background knowledge about the world was introduced by Quillian (1966; 1985).3 He presented a memory model that consists mainly of a large number of nodes and tokens interconnected by different types of associative links. In Quillian's model, a node is defined as an English noun, where the associative link refers directly to a configuration of other nodes that represent the meaning of the noun. A token, on the other hand, refers indirectly to another word concept, by having one special type of associative link that points to a concept's type node. By combining these individual semantic relations, Quillian's memory model establishes a network of semantic relations referring to a common word, which enables the system to draw inferences from word pairs such as "plant" and "man", represented by statements such as "A plant is not an animal structure." (Quillian, 1966, p. 253).

3 See also Woods (1985).

7.1.2 Miller, Bateman, Lenat and large databases of semantic relations

More recently, the creation of large databases of semantic relations has been considered by Miller et al. (1993), Bateman et al. (1996) and Lenat & Guha (1990).
Miller and his colleagues (Beckwith, Miller, & Tengi (1993), Fellbaum (1993), Fellbaum, Gross, & Miller (1993), Miller (1993), Miller et al. (1993)) describe WordNet, a manually constructed semantic network that serves as an on-line lexical reference system, the design of which is inspired by psycholinguistic theories of human lexical memory. The vocabulary used in WordNet represents approximately 95,600 different word forms (51,500 words and 44,100 collocations) describing some 70,100 word meanings. WordNet is organised around categories of nouns, verbs, adjectives and adverbs. The basic unit of WordNet is a synset, which represents a set of synonyms; a word form with several distinct meanings appears in several synsets. The most important lexical relationships between word forms within WordNet are synonymy and antonymy, which are available for all categories. Other semantic relations between word meanings are hypernymy and hyponymy (e.g. maple is a hyponym of tree, and tree is a hyponym of plant). The semantic system of hypernyms and hyponyms supports the lexical inheritance system. Other semantic relations provided by WordNet are, for example, substance-of, part-of and member-of for nouns, causes and entails for verbs, and pertains-to for adjectives.

The Generalized Upper Model developed by Bateman et al. (1996; 1994) is intended to be a domain- and task-independent general organisation of information in the context of text generation (English, German, Italian), but its linguistic model also suits Natural Language Processing applications. The Generalized Upper Model occupies a level of abstraction between surface linguistic realisations and conceptual or contextual representations. It enables abstraction beyond the concrete details of syntactic and lexical representations, while still enabling linguistic realisations to be solidly founded on objective criteria. The upper model is organised in terms of generalised linguistically-motivated ontological categories, such as:

• abstract specifications of process-types/relations and configurations of participants and circumstances (e.g., NONDIRECTED-ACTION, ADDRESSEE-ORIENTED-VERBAL-PROCESS, ACTOR, SENSER, RECIPIENT, SPATIO-TEMPORAL, CAUSAL-RELATION, GENERALIZED-MEANS),
• abstract specifications of object types, e.g., for semantic selection restrictions (e.g., DECOMPOSABLE-OBJECT, ABSTRACTION, PERSON, SPATIAL-TEMPORAL),
• abstract specifications of quality types, and the types of entities to which they may relate (e.g., BEHAVIOURAL-QUALITY, SENSE-AND-MEASURE-QUALITY, STATUS-QUALITY),
• abstract specifications of combinations of events (e.g., DISJUNCTION, EXEMPLIFICATION, RESTATEMENT).

Given the detail and consistency of both WordNet and the Generalized Upper Model, their organisation appears appropriate for the enforcing of ontological consistency in general domain modelling.

The goal of the Cyc system, developed by Lenat & Guha (1990), is to capture the common sense knowledge shared by most people, or in Lenat's words: 'The Cyc knowledge base is to be the repository of the bulk of the factual and heuristic knowledge, much of it usually left unstated, that comprises "consensus reality": the things we assume everybody already knows.' (Lenat & Guha, 1990, p. 28). Cyc uses a very large database of units, each of which describes a real-world object, a type of process, a particular event or an abstract idea.
Each unit is represented by a frame-based data structure composed of slots, each of which has a corresponding value, where the value of a unit is always a list of individual entries. This large, fixed ontology supports rule-based inference mechanisms, described as predicate-calculus-like constraints. The combination of a semantic ontology structure and formal logical inference mechanisms requires a consistent and correct representation of all knowledge units, which excludes the possibility of contradictory knowledge representations; logically coherent microtheories were therefore introduced, which can translate a representation from one context into another.4 A microtheory is internally consistent but can contradict other microtheories.

Cyc is a key project in knowledge representation, as it represents an attempt to overcome the brittleness and domain specificity of other approaches, through extensive representation of general knowledge (i.e. about people, objects, substances, events, sets, ideas, relationships, etc.), with which the system can analogise. However, the belief that first-order logical mechanisms solve the problem of translating between different representational structures seems inadequate to the current author (in section 5.1.4 we addressed the shortcomings of Cyc representations and inference mechanisms when used to represent and retrieve images).5 Nevertheless, the Cyc system serves as a helpful guide as to how to represent people, objects and settings.

4 As argued by Minsky (1988), a feature of intelligent behaviour is the ability to manage contradictory representations.
5 See also Bobrow & Winograd (1985), and Winograd (1985).

7.1.3 Haase's approach to memory-based representations

FRAMER, developed by Haase (1994), is a knowledge representation library designed to provide a platform-independent persistent object facility to support large database functionality. The organisational structure of FRAMER is a non-deterministic prototype-based inheritance mechanism. The basic descriptive element in FRAMER is a frame, a class object which can have other frames (called annotations) subordinate to it (Figure 5.3 showed a typical FRAMER representation). Frames are grounded by pointing to other frames or to domain objects, such as numbers, strings, vectors, procedures, bitmaps, etc. The annotation hierarchy and the prototype network allow the creation of semantic and episodic memory structures, where the behaviour of particular prototypes and their derivatives can be easily specified, which in turn eliminates the need for a second, independent constraint language such as CycL (Lenat & Guha, 1990). One component of FRAMER is FRAXL, an extension language that allows simple expression of common search and iteration schemata (e.g. A is the inverse of B, or C is a generalisation of D). Built on top of FRAMER is Mnemosyme (Chakravarthy, Haase, & Weitzman, 1992), an analogical knowledge representation system. Mnemosyme uses FRAMER's organisational structures, i.e. the annotation hierarchy, the prototype network and the ground pointer, to index and match examples of relations between descriptors under their common prototypes. In doing so, Mnemosyme offers a base ontology of differences and similarities between descriptors (prototypes) whose relational indices form the basis of matching and generalisation.
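The flavour of such prototype-based frame structures can be conveyed by a small sketch. This is not FRAMER's actual interface, merely a hypothetical rendering of its main organisational ideas: frames carry subordinate annotation frames and a ground value, and inherit from a prototype (FRAMER's prototype inheritance is non-deterministic; the sketch simplifies it to a single prototype chain).

    # A hypothetical, much simplified FRAMER-like frame (illustrative only).
    class Frame:
        def __init__(self, name, prototype=None, ground=None):
            self.name = name            # frame identifier
            self.prototype = prototype  # frame whose behaviour is inherited
            self.ground = ground        # pointer to a domain object: number, string, ...
            self.annotations = {}       # subordinate frames

        def annotate(self, name, **kwargs):
            self.annotations[name] = Frame(name, **kwargs)
            return self.annotations[name]

        def lookup(self, name):
            # Search own annotations first, then fall back on the prototype chain.
            if name in self.annotations:
                return self.annotations[name]
            return self.prototype.lookup(name) if self.prototype else None

    # e.g. a "walk" frame inheriting an annotation from a "motion" prototype:
    motion = Frame("motion")
    motion.annotate("domain", ground="movement of a body")
    walk = Frame("walk", prototype=motion)
    walk.annotate("body-part", ground="feet")
    print(walk.lookup("domain").ground)   # -> "movement of a body"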
Haase's method of developing semantic memory structures from episodic memory representations enables the creation of contradictory representations, through the indexing of variant examples under prototypes, and these do not, in contrast to Cyc (Lenat & Guha, 1990), suffer from loss of information in the translation between different representational structures. The potential of FRAMER and Mnemosyme for the representation and retrieval of video is demonstrated in Davis' work (Davis, 1995), as described in section 5.1.5. However, it is the use of prototypes and the focus on differences and similarities which makes Haase's work relevant to our knowledge representation scheme.

7.1.4 Schank's conceptual dependencies and dynamic memory

One of the major contributions to artificial intelligence research into the representation of action, story understanding and generation, and memory-based representation has been made by Schank and colleagues. The Schankian understanding of creativity in a cognitive system is centred on the ability to analyse stories effectively. The advantage of stories as a primary organisational structure of human cognition is that they can be indexed in multiple ways, due to their numerous references, which in turn makes them multiply accessible and relevant. Initial attempts to implement Schank's insight into human cognition focused on the problem of programming computers to understand stories. Major achievements of this research are the introduction of:

• conceptual dependencies, which describe human action through a small set of 12 composable primitives (ATRANS, PTRANS, MOVE, MTRANS, etc.) (Schank, 1972),
• scripts, which describe stereotyped sequences of events in a particular context (situational, personal or instrumental) (Cullingford (1978), DeJong (1983), Schank & Abelson (1977)),
• goals and plans, which represent high level structures that control understanding, and in particular story understanding (Carbonell (1978), Schank & Abelson (1977), Wilensky (1978)).

Extensions and revisions of the script formalism have led to the development of dynamic memory theory (Schank, 1982), in which reminding, indexing and retrieval are understood as basic cognitive processes, each of them represented as a memory structure which itself functions as a representational and retrieval mechanism. Schank's theory of dynamic memory introduces memory concepts such as MOPs (Memory Organization Packets), TOPs (Thematic Organization Points) and TAUs (Thematic Abstraction Units), which gave rise to a number of new approaches to story understanding and generation (Dyer (1982), Kolodner (1984), Lebowitz (1980), Lehnert (1983), Lehnert et al. (1983), Schank (1982), Schank & Riesbeck (1981), Wilensky (1983a, 1983b, 1983c)).

From dynamic memory theory, Schank developed the notion of case-based reasoning, an alternative to rule-based reasoning (Kolodner (1993), Riesbeck & Schank (1989), Schank (1982), Schank, Kass, & Riesbeck (1994)). The aim of the case-based reasoner is to solve 'new problems by adapting solutions that were used to solve old problems' (Riesbeck & Schank, 1989, p. 25). The task of generating a visually oriented narrative sequence can itself be thought of as a case-based problem (see the genre discussion in chapter 2), which involves tasks such as reasoning about, composition of, and adaptation of, the narrative sequence. Though Schank et al.
have developed video databases for interactive training and education, they have merely used film in an accompanying role. Their systems react to the user's state of interest by using internal story strategies to decide what type of story should be told, which then triggers a simple indexing scheme for the retrieval of the appropriate video (Burke (1993), Edelson (1993), Schank (1994), Smith et al. (1995)). A case-based reasoning approach designed particularly for the retrieval of video is described by Domeshek & Gordon (1995), Domeshek & Kolodner (1994), and Gordon & Domeshek (1995), and was discussed in section 5.1.4.

Our proposed knowledge representation is influenced particularly by scripts, TOPs, and TAUs, these being augmented in a network of detailed semantic fields of actions and subjects that is designed for the requirements and properties of the visual medium. Furthermore, we adopt certain features of case-based reasoning, though in a rudimentary way, since the advantages of case-based reasoning apply to higher narrational levels than those considered in this thesis.

7.2 Knowledge representation to support the creation of emotion provoking narrative sequences

The semantic and episodic memory structures introduced in this section are designed mainly to support the generation of narrative structures at the sequence level, based on the results of chapter 2. However, the structures to be described must meet the particular requirements of a visual presentation, and thus serve to create the plot requirements as presented in section 6.4.1 (see also Figure 6.3), and the humour strategies presented in chapter 3. Finally, the design of the semantic and episodic memory structures must mirror the ontology of the representation of film content, as presented in section 5.2, so that retrieval of appropriate video material can be achieved.

Our proposed structures will describe events only to the level of the sequence. We will but outline the connection of events to larger narrative structures, e.g. episodes. Furthermore, there will be no indication as to how the initial intention (external reason) for a story can be created from the structures to be introduced, since this thesis does not deal with this problem. In fact, it will be shown in chapters 8 and 9 that the starting point for the generation of visual stories is provided to the system in the form of a start shot.

We begin the description of our approach to the representation of background knowledge by considering actions, since actions are the core functional elements within an event. Actions define other narrative elements, such as the intentions, or the moods, of characters, and the importance of objects or locations. We then enhance the resulting semantic network for actions with event structures and with descriptions of abstract mental and cultural concepts.

7.2.1 Actions

Our ontology for the representation of video content, as described in section 5.2, is a structured textual representation of the semantic, temporal and relational features of video. The textual terms used in the ontology are generic, as any overly directive choice of labelling is to be avoided, as discussed in section 5.2.1. Thus, we also use generic terms for the representation of actions within common sense knowledge.
Our organisation of actions adopts the approach taken in WordNet (Fellbaum, 1993), discussed above, which classifies verbs mainly on the basis of semantic criteria in the following domains: bodily care and function, change, cognition, communication, competition, consumption, contact, creation, emotion, motion, perception, possession and social interaction. In the ensuing discussion, we will focus on motion, in order to exemplify the conceptual structure action and its role within the semantic net.

7.2.1.1 Conceptual structure

In describing the representation of denotative aspects of video (see section 5.2.3.1), we pointed out that there is no need to represent motions down to their atomic units, such as representing "walking" as a cyclic repetition of "taking a step". Rather, we represent such a complex pattern of human motion by the term walking. However, during the discussion of the influence of action on continuity editing (section 6.4.4), we showed that the mere generic term for a pattern (e.g. walking) is not sufficient to solve the problem of decomposing a narrative sequence into different shot types. As a solution to the problem of identifying different levels of detail in relation to the subject performing the action, we suggested a tree structure, where the root represents the class (e.g. human) and the leaves represent subparts (e.g. fingers).

Subpart and action are related. One feature of an action is the body part involved. For "walking" these are the feet, which are, due to the inheritance provided by the tree structure, automatically related to the legs and the body. However, there is not always a direct relationship between a body part and the performed action. Take, for example, the action "gliding", which is most likely to be related to skates rather than feet, or "eating", which can be performed with the fingers but usually involves cutlery. Since these objects are related to particular body parts in a performed_by relation, they also inherit the same feature.

Some motion actions, such as "slipping", require objects of a particular physical appearance so that the action can happen. For example, to slip on something, the something must either be slippery (e.g. ice) or it must be round (e.g. marbles). In our attempt to support such divergent aims as objectivity and computational efficiency, we suggest the feature related objects for the conceptual structure, which allows the designer to consider the retrieval requirements set by the representation of the video content. We are aware that the introduction of such a simplifying feature reduces the extent to which inferences can be drawn. On the other hand, the inference chain via the substance or shape, passing through size and finally concluding with the object, would lead to a similar result. We believe that complicated structures should only be introduced if their necessity becomes apparent during system development.

Actions are usually performed in locations. The set of objects relevant to an action may alter according to the location. We can slip, for example, on a number of things, e.g. a banana peel, marbles, soap, or ice. However, if the location is indoors we would, in a stereotypical case, not expect to slip on ice. Thus, the nature of the location is important to the conceptual structure of an action.
Referring to the discussion of spatial continuity editing (sections 6.4.4 and 6.4.5, and Tables 5.5 and 5.6), we can further conclude that the conceptual structure of an action must provide information concerning the spatial relationship between body part and objects, as well as the relationship between objects and location.

The representation described thus far covers only the physical aspects of an action. However, an action provides more information. We usually expect an action to be performed for a reason, and for it to lead to a certain result. In other words, the perception of a behavioural pattern triggers assumptions concerning the intention or goal behind an action, and guides our expectations, depending on the context, towards a specific outcome. Moreover, we also assume that the performance of an action leads to a stabilisation of, or a change in, the character's emotional state, e.g. the achievement of a goal may make the character happy. The humour strategies described in chapter 3 manipulate these expectation patterns to encourage the perceiver of a scene to laugh (see 3.2.4 on incongruity and 3.2.5 on derision). The gesture-action centred approach illustrated in section 5.2.3.1 relies on the viewer's expectations in making emotional states and intentions visible. The conceptual structure of an action therefore requires representational features describing the intention or goal of an action, a set containing actions which might be performed as a result of the action, and a description of the emotional state of an actor after the action is performed.

Referring to the above requirements for action representation, Table 7.1 introduces a simplified conceptual structure for the action "slip", which describes physical and mental features of the action.

Name                           slip
Domain                         motion
Nature of location             outdoors
Set of objects                 [banana_peel, dog_shit, soap, ice]
Body part / related object     [shoe]
Location                       [road]
Relation Location -> Object    under
Relation Object -> Body part   under
Intention                      [unintentional]
Result actions                 [sit, lie, kneel, shake, look_back]
Result mood                    [anger, rage, astonishment]

Table 7.1 Conceptual structure for a representation of the action "slip"

It must be stressed that the parameters of such a conceptual structure represent causal links to other conceptual structures of actions (e.g. Result actions), or conceptual links to objects (e.g. Set of objects), emotions (e.g. Result mood) or definitions (e.g. Nature of location).

A number of parameters of the conceptual structure presented in Table 7.1 have lists as their values. These lists represent the cultural and subjective influence of the designer on the conceptual structures (subjectivity in narrativity and humour is discussed in chapters 2 and 3), since the order represents the importance of a link (left to right representing decreasing importance).6 Associating a value with a link or unit in a structure is important, since the generation of meaningful narrative sequences relies on the detection of dominant information and, at least in cases of generating humour, on pattern-breaking information.

6 The current author is aware of the fact that lists represent but one way of representing valued links. Another would be to introduce numerical values or additional hierarchically oriented text tags, as used in the semantic relation Subaction described in section 7.2.1.2.
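A conceptual structure such as that of Table 7.1 maps naturally onto a simple record. The following sketch is one possible encoding; the field names follow the table, while the dictionary layout is our own illustrative choice.

    # Illustrative encoding of the conceptual structure in Table 7.1.
    # List order encodes decreasing importance of a link, as discussed above.
    slip = {
        "name": "slip",
        "domain": "motion",
        "nature_of_location": "outdoors",
        "set_of_objects": ["banana_peel", "dog_shit", "soap", "ice"],
        "body_part_related_object": ["shoe"],
        "location": ["road"],
        "relation_location_object": "under",
        "relation_object_body_part": "under",
        "intention": ["unintentional"],
        "result_actions": ["sit", "lie", "kneel", "shake", "look_back"],
        "result_mood": ["anger", "rage", "astonishment"],
    }

    # The most important result mood is simply the leftmost entry:
    dominant_mood = slip["result_mood"][0]   # -> "anger"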
The valuing of links is also used in the hierarchical structures provided in the conceptual representation of objects and of abstract mental and cultural concepts, where, firstly, the highest level in the hierarchy represents the most general and thus least important structure, and, secondly, the first item within a hierarchical level represents a more important link than the following items. Thus, retrieval of relevant material for narrative story generation operates on the basis of the difference between link values within conceptual substructures, or on the hierarchical distance between prototypes and their subtypes, where the importance of a subtype is detected by its position within the hierarchical level.

7.2.1.2 Semantic relations

Conceptual structures for actions, as described in the previous section, form but one type of node in our proposed semantic network. Other nodes, such as conceptual structures for subjects and emotional states, will be described later in this section. The semantic relations between conceptual structures of actions support the application of the humour strategies described in chapter 3, and the retrieval mechanisms related to the editing strategies presented in chapter 6. Each action concept is part of a network that supports the following links:

Synonym links
During the discussion of the influence of action on continuity editing (section 6.4.4), we discussed problems associated with the different levels of detail that might occur while decomposing a sequence based on different shot types. Since our ontology for the representation of video content pays specific attention to the maintenance of objectivity, a close-up of a moving head cannot be described with a term like "walking", because the actual action is not visible. Thus, the description of the action in this case is likely to be "moving". However, providing a synonym relation between "walking" and "moving" would make the shot accessible.

Subaction links
In discussing the establishing of derisive gags based on several actions performed by one character in the same temporal interval (represented by H-Strategies 16 - 21 in chapter 3), we pointed out that the crucial element is that of a conflict of shared resources, i.e. subactions. The example we gave was of the man who is reading the newspaper and at the same time is dipping a croissant into a cup of coffee. At a certain point, he dips the croissant into a jar of mustard. The conflict in this case is based on the shared subaction "looking". Hence, it is necessary to provide subaction links that point to actions performed concurrently to, and that form a conceptual unity with, a main action. However, it is not sufficient to place these types of links in order, to indicate their value to selection mechanisms. Since a conflict is only a conflict if the particular subaction is relevant for both main actions, subaction links are given tags from an arbitrary qualitative modal scale, e.g. necessary, non-essential, etc. Again, it is the decision of the implementor of the semantic network to allocate these tags.

Opposition links
Oppositions are the key elements in the generation of humour.
They are required for the generation of exaggeration, incongruity and derision. Thus, opposition links point to antonymous actions, and their functional form is that of doublets.

Ambiguous links
The possibility of interpreting an action or expression in two or more distinct ways provides an excellent source for the generation of meaningful narrative sequences, particularly those of humorous inclination (H-Strategies 6 - 8 in chapter 3). Hence, it is necessary to provide links that point to actions serving as subactions for other actions, e.g. for the actions "cry" and "laugh" a subaction might be "shake".

Association links
To increase the fullness and accuracy of our contextual representations we introduce this type of link, which points to actions that may be associated with the current action, e.g. "sit" and "listen".

We predict that the five types of semantic links between conceptual action structures defined above, in addition to the causal and conceptual links defined, will be sufficient to meet the representational requirements of a substantial portion of narrative generation strategies, beyond those we have described in chapter 3. However, further research into the representation of semantic knowledge-based story generation must be carried out to verify this. Figure 7.1 describes a simplified part of a semantic network for the motion "walk".

Figure 7.1 Semantic subnet for the action "walk" (conceptual structures for actions such as move, look, listen, glide, slip, collide and meet, and other conceptual structures such as Road, Shoe, Pleasure and Motion, connected by opposition, synonym, subaction, association, conceptual and causal links)

7.2.2 Abstract concepts

The conceptual structures for the physical representation of characters and objects were described previously (in section 7.1.3) as being located in a hierarchical prototype-based structure, following the analogical knowledge representation developed by Haase (1994). The relevance of analogical reasoning based on differences and similarities for the retrieval of appropriate video material is described by Davis (1995), as outlined in section 5.1.1.5. The significance of analogy in the narrative generation process was discussed in section 3.3.

In the preceding paragraph, the term "object" is meant in two ways: firstly, physical objects (e.g. a house or a car) and, secondly, abstract objects (e.g. time or justice). We make this distinction to allow for two different types of visualisation. If physical objects or characters appear in the causal chain of a plot, their attributes provide the necessary information to establish and maintain the sequence structure as described in Figure 6.3. If, on the other hand, an abstract object is required, then the internal cases for visualisations are activated. These form a collection of cases for visual clichés, such as montage ellipsis (passing time), e.g. leaves fluttering away, newspaper presses churning out an extra edition, shadows passing over a wall, and so forth. The form of such a conceptual structure is presented in Table 7.2, showing a simplified example for the conceptual structure of "time".

Abstract concept name    Representation structure [character/object, action]
time                     [[shadows], [passing]]

Table 7.2 Simplified conceptual structure for the abstract object "time"
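As a minimal sketch, such visual clichés could be stored as a lookup table from abstract concepts to character/object-action pairs. The structure below restates Table 7.2 in executable form; the additional cases are taken from the montage ellipsis examples mentioned above, and the layout itself is our own illustrative choice.

    # Illustrative store of visual clichés for abstract concepts (cf. Table 7.2).
    VISUAL_CLICHES = {
        "time": [
            (["shadows"], ["passing"]),
            (["leaves"], ["fluttering_away"]),
            (["newspaper_presses"], ["churning"]),
        ],
    }

    def visualise(concept):
        """Return the most important (leftmost) visualisation case for a concept."""
        cases = VISUAL_CLICHES.get(concept)
        return cases[0] if cases else None

    print(visualise("time"))   # -> (['shadows'], ['passing'])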
A conceptual structure for an abstract object can also be used as an example case for generating new visual representations for the abstract object, by using either the character/object slot or the action slot as the basis for analogical inferencing. While we have not investigated the introduction of learning strategies into our system, we are aware of the potential of such representations in this area.

To support the task of generating emotionally stimulating narrative sequences, there are two remaining abstract concepts to be described: emotional and visual concepts. Both types of concept are related through actions. Since actions are the main way in which the perceiver can infer the actor's mood or intentions (as discussed in section 5.2.3.1), both concepts are strongly related. However, for the sake of clarity, each will be presented separately.

7.2.2.1 Emotions

In chapter 3, we described how particular narrative strategies can arouse distinctive emotional responses in the perceiver of a scene, i.e. laughter. To support the generation of emotionally stimulating narrative sequences, it is necessary to provide a conceptual representation of emotions, such as rage, grief, love or happiness. In describing the representation of an action as a conceptual structure, we showed how the action is related to an emotion (see the "Result mood" slot in Table 7.1). The emotional concept related to the emotional type by the conceptual link takes the form of a doublet, as described in Table 7.3.7

Type specification    Emotional token
Rage                  [Displeasure, Distress]

Table 7.3 Representation of an emotional doublet

7 Our model of emotions is inspired by the theory of Ortony, Clore, & Collins (1988), and Wolff's psychological analysis of gestures (Wolff, 1972).

The type specification contains descriptions of an emotional result state related to a particular action. The emotional token represents an emotional classification, based on two hierarchically ordered classes, i.e. pleasure and displeasure, where each class in itself denotes a hierarchical representation of emotional states, i.e. for pleasure (delight, ecstasy, euphoria) and for displeasure (unease, dissatisfaction, distress). Each of the states represents a link to visual concepts, which are described in the next section. The representation of emotions suggested here enables us to distinguish between moods, not only within a class (due to the hierarchical ordering of emotional states) but also between classes, which provides a means of detecting whether an emotional state should be up- or downgraded, an important factor in the creation of humour (as, for example, described for derision in section 3.2.5).

7.2.2.2 Visualisations

The following approach to the visualisation of emotions is similar to that of the visualisation of abstract objects described earlier. However, for emotions, the aim is slightly different, because we actually use a hierarchical descriptive set of positive and negative emotions as the basis for gesture or action oriented visualisations. From our representation of emotions as described in Table 7.3, we can create structures to represent a wide range of emotional types. Table 7.4 shows a visual concept for the emotional class "pleasure".
Emotional class name    Body part          Action
pleasure                Head [lip, up]     whistle

Table 7.4 Simplified representation of the emotion class "pleasure"

The structure in Table 7.4 reveals that an emotion can be represented as a gesture, which in its form mirrors the representation of actions in the video content (Table 5.3), or, alternatively, simply by using an action as an index to that state.

7.2.3 Events

Earlier in this thesis, we discussed narrative principles (sections 2.1 and 3.1) and showed that a plot is a causal-chronological entity of related events within a given temporal and spatial framework. As a basis for the mechanisms that create or interpret such a dynamic structure, we identified three types of schemata: prototypes (identifying types of person, actions, localities, etc.), templates (common story formats) and procedures (to organise the search for appropriate relations of causality, time and space). Furthermore, in section 2.1.2.1 we argued that an event is a discrete structure of archetypal behaviour patterns and cultural stereotypes, consisting of preconditions (motivation), main-conditions (realisation) and post-conditions (resolution), where each condition comprises a number of dominant or free functional elements. Thus, it is the event structure which provides, at the sequence level at least, the logical and temporal order for the process of generating a narrative sequence.

Referring to our comments concerning the representation of the temporal aspects of video content (section 5.1.2.1), our remarks concerning the cause and effect chain of the narrative structure in the automated editing process (section 6.4), and deriving from Schank's representation of events, we are now in a position to present our approach to representing event structure. Our event structure representation names the event, the number of actors or objects involved and their gender (the description "any" in Table 7.5 represents the fact that the gender can be either male or female), the intention behind the event, the main actions of the actors involved, and a link to the next higher element within the story structure. The main action is divided into the three event stages, motivation, realisation and resolution, each containing a sequence of actions for each actor. Table 7.5 shows a simplified event representation for a "meeting".

Name           meeting
Actor number   2
Gender         any            any
Intention      meet
Motivation     [walk]         [wait]
Realisation    [look at]      [look at]
Resolution     [shake_hand]   [shake_hand]
Episode        date

Table 7.5 Structure of an event, i.e. "meeting"

The overall description of the mental situation in the event is provided by the parameter Intention, which establishes a conceptual link to the described action and thus inherits the related information concerning moods and emotions. The relevant actions for each actor in an event stage are sequentially organised. This means that there is either one action, or a list of actions which are causally related. The relation between actions performed by different characters in each event stage is oriented towards the crosscutting technique in film editing, which alternates shots from one line of action in one place with other events in another place, and thus creates a sense of cause and effect by binding the actions together. The same technique is used here, but instead of binding shots we combine lists or single actions.
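In executable form, the event structure of Table 7.5 might look as follows. The dictionary layout and key names are our own illustrative rendering; per-actor action lists are kept in parallel, mirroring the crosscutting organisation just described.

    # Illustrative encoding of the "meeting" event structure (cf. Table 7.5).
    meeting = {
        "name": "meeting",
        "actor_number": 2,
        "gender": ["any", "any"],                        # one entry per actor
        "intention": "meet",                             # conceptual link into the action net
        "motivation":  [["walk"], ["wait"]],             # actions of actor 1 and actor 2
        "realisation": [["look_at"], ["look_at"]],
        "resolution":  [["shake_hand"], ["shake_hand"]],
        "episode": "date",                               # link to the next higher story element
    }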
For example, imagine the event of a character attempting to obtain coffee from a coffee machine. Table 7.6 describes a simplified action representation for the three event stages of the "getting coffee" event.

           Motivation   Realisation                                Resolution
Actor      [approach]   [[search_money, insert_money+], [wait]]   [look_change, take_cup+, leave]
Machine    []           [[process], [provide_cup+]]               []

Table 7.6 Actions in the event "getting coffee"

During the motivation stage, the actor approaches the machine. In the realisation stage of the event, we may see the actor searching for, and then inserting, the change. Then we see the machine operating, then the actor waiting, and finally the machine providing the cup and the coffee. In the resolution stage of the event, we may see the character looking for change from the machine, then taking the coffee, and then finally leaving. Some of the actions performed by the actor and the machine are essential for plot understanding and others are not. Those of importance are tagged (i.e. marked with "+" in Table 7.6), indicating that these actions must be presented for an unmistakable perception of the scene.

We now see that it is the information gathered in the event structures which serves as the material for the sub-structures Characters and Actions of the Sequence structure, as described in Figure 6.3. However, which event stage, or which actions related to an event stage, are required depends on the external reason for telling the story, i.e. on the thematic and narrative strategy in use at the time. Assume, for example, that the above event of the character and the coffee machine is to be transformed into a gag. Depending on the context, we might need only the realisation part, and from the actions only those tagged with a +, to fire some of the rules described in section 3.3, so that a humorous sequence can be created (see the example jokes created by H-Strategies 22 and 23, described in chapter 3).

Referring to our earlier discussion of the temporal and rhythmical relationships between shots, we can now also describe how the creation of a narrative-oriented ellipsis (as described in sections 2.3.1 and 6.4.6.4) can be supported. In chapter 3, we asserted that timing is essential for the creation of a visual joke, which usually means that the introduction and punch line should be kept short. Alternatively, imagine a sequence where a character washes his hands, buttons his shirt and leaves the room. The event structure, as described in Table 7.5, supports such aims, since it provides an indexing scheme for essential causal event elements. Thus, only the barest level of information can be selected, and, if necessary, depending on the narrative strategy, even a sub-selection of those elements can be provided.

It should be made clear at this point in the discussion that the different event stages, in combination with the related semantic network for actions, not only influence the logic of the ongoing story but also affect the choice of appropriate visual material during the generation process, as schematically represented in Figure 7.2.

Figure 7.2 The relation between the narrative logic (motivation => realisation => resolution) and the choice of the visual material used to represent it (the available video material is narrowed down to the scene-relevant material)

Finally, the parameter Episode of the event structure establishes a link to prototypical episodes.
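The "+" tagging gives the generator a simple way to compress an event to its causally essential actions, e.g. when only the realisation stage of the coffee-machine event is needed for a gag. A sketch of such a selection, under the same illustrative encoding as before:

    # Illustrative selection of dominant ('+'-tagged) actions from an event stage.
    getting_coffee = {
        "motivation":  {"actor": [["approach"]], "machine": [[]]},
        "realisation": {"actor": [["search_money", "insert_money+"], ["wait"]],
                        "machine": [["process"], ["provide_cup+"]]},
        "resolution":  {"actor": [["look_change", "take_cup+", "leave"]], "machine": [[]]},
    }

    def dominant_actions(event, stage):
        """Return only the actions tagged as dominant ('+') in one event stage."""
        selected = {}
        for role, groups in event[stage].items():
            selected[role] = [a.rstrip("+") for group in groups
                              for a in group if a.endswith("+")]
        return selected

    print(dominant_actions(getting_coffee, "realisation"))
    # -> {'actor': ['insert_money'], 'machine': ['provide_cup']}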
Episodes represent descriptions of stereotypical chains of events. They are described by scripts that feature two parts, resources and cases. Resources are represented as descriptions of the number and type of characters involved, their appearance and the environment in which they act. The structure parameters characters, appearance and environment serve as pointers to related conceptual structures, which provide the detailed representations. Cases comprise a collection of events, organised according to the narrative stages motivation, realisation and resolution. Note that events in each narrative stage are also indexed as dominant or free. Episodes are grouped into four classes: social (date, party, etc.), public (e.g. a demonstration), business (e.g. an appointment, or a conference) and private (e.g. having sex). The classes and related prototypes of episodes support the application of story templates (genres), which determine:

• The theme of a plot
• The relevant narrative primitives and their potential presentation. For example, a typical detective story introduces the characters (introduction - setting), then shows the murder (conflict) and then introduces the detective (introduction - point of view).

The discussion of episodes, and of their relationship to story templates, is included here for the sake of completeness, since this thesis does not deal with the creation of such complex structures. However, it is important to suggest how the established event structure could be accommodated into a larger development environment, since it is events that form the building blocks of episodes and thus of stories. Further research is needed to establish if this is indeed the case.8

8 For interesting and related approaches see Brooks (1995, 1996), Davenport (1994), Davenport & Murtaugh (1995), Galyean (1995), and Ryan (1991). Examples of advanced story development software are Dramatica (1996) and Storyline Pro (1993).

Figure 7.3 describes a simplified subnet containing the structures discussed above.

Figure 7.3 Semantic subnet (conceptual structures for actions, e.g. walk, move, look, listen, glide, slip, collide and meet; conceptual structures for physical objects, e.g. Road and Shoe; emotional concepts (doublet and class), e.g. Rage; and events and episodes, e.g. Meeting and Date; connected by opposition, synonym, subaction, association, conceptual and causal links)

7.2.4 Conclusion

The above scheme for the representation of events provides a coherent action-reaction dynamic, which supports both the external and internal reasons for telling a story (see section 2.1.1). Due to the introduction of the event stages motivation, realisation and resolution, the indexing of dominant action functions, and the ability to provide a number of cases for a particular event, the scheme is also sufficiently flexible to serve different types of narrative strategies.

Throughout this thesis, we have argued that film, though based on common human content, provides its own reality. At the current stage of our research, we believe that our approach provides sufficient causal structure to support the automatic generation of credible narrative sequences. In terms of the editing strategies described in chapter 6, we also showed how actions of different event stages can be ordered in such a way that a sequence of actions can be created, where certain actions are inferred but not shown (as demonstrated by the Kuleshov examples in section 4.2.2.2).
Thus, our approach to representing the knowledge of the media world (genres), in combination with related presentational and cultural knowledge (e.g. editing and emotional codes), seems more appropriate for supporting the viewer's cognitive process of sequence construction (inferential activity) than traditional AI representations of the world (common sense). Hence, we have not introduced spatio-temporal logic (Allen (1990, 1991), Del Bimbo et al. (1992, 1993), McDermott (1978, 1990), Parkes (1989a), and Shoham (1987)). However, we concede that our approach is specifically designed for the needs of the current task, and it may prove inadequate for other tasks, such as interpreting a video sequence. Further research on the inferencing schemes required to interpret video sequences would be worthwhile.

Earlier in this thesis, we discussed the nature of film editing and narrativity, and drew attention to the cultural and subjective influences on both of these memory-based processes. Stories and images are vehicles by which thoughts are transmitted from one mind into another. Thus, how we understand is affected by what is in our mind. That is, image and scene understanding is different when different memory structures control the process. In a small way, the influence of subjectivity can be found in our use of valued links and in the ways in which our semantic nets are organised, which also reflects the cultural influence.9

9 However, a much stronger cultural influence can be found in the design of the editing and humour strategies, which at the current stage of the research clearly reflect European culture.

A significant drawback of the above representation scheme could be argued to be the complexity of the representation and its implications for the cost of implementation, as any concept or event structure (which represents a concept in an episode) must be created before it can be assigned to a case in memory. However, the structures introduced above could, we believe, be understood more readily by the user than could schemes based on logical equations. At the current stage of our research, we have no experimental evidence as to how long it would take to teach users to create and maintain the knowledge structures we have specified. The design and implementation of environments to facilitate users in providing the necessary representations is a problem for future research, and is not considered further in this thesis.

Chapters 5, 6 and 7 have introduced the structures and operational mechanisms for representing video content, editing knowledge and narrative-related "common sense" knowledge. The next chapter describes the architecture of AUTEUR, our implemented prototype based on the proposed representational structures. In chapter 9 we present example film sequences edited by AUTEUR, and describe how the system created those sequences.

Chapter VIII
AUTEUR: An architecture for automated video story generation

The previous chapters provided the representational structures, and the narrative and editing strategies, which will enable an automated system to generate emotionally stimulating video stories at the level of events. The aim of this chapter is to demonstrate the applicability of the proposed ontologies and methodologies, by introducing the architecture of a prototypical system that can generate humorous visual stories.
However, before we introduce our architecture, we first give a brief overview of existing systems for the automated generation of visual stories.

8.1 Related work

Despite recent rapid developments in multimedia research, little attention has been given to the development of architectures to support the automated generation of visual stories.1 Two important approaches, by Sack & Davis (1994) and Bloch (1986), are briefly discussed in this section.

8.1.1 Sack & Davis' video generator IDIC

Sack & Davis (1994) describe a video generator called IDIC, which creates new nonverbal "Star Trek: The Next Generation" (STTNG) trailers from an archive of existing trailers for STTNG episodes. The theme of the trailer is specified by the user, who can choose from four basic narrative structures: threat, negotiation, fight and rescue.

1 AI research on multimedia, with respect to the automatic generation of presentations, has been more concerned with media such as text (Riesbeck & Schank, 1989) and graphics (André, 1995; Feiner & McKeown, 1991). Some contemporary AI research investigates the generation problem for video (Brooks, 1995; Brooks, 1996) or animation (Butz, 1995; Strassmann, 1994).

IDIC's video generating architecture consists of two main components: a modified version of Newell and Simon's General Problem Solver (GPS) (Newell & Simon, 1990; Norvig, 1992) for planning the new trailer, and Media Streams (Davis, 1995), which is used to annotate and retrieve the appropriate footage. The GPS operators used by Sack and Davis are based on four goals (threaten, negotiate, fight and rescue). The structure of a GPS operator (a list of preconditions, an add-list and a delete-list) is used to represent the content requirements of the first scene of a sequence (preconditions), the content requirements of the next scene of the trailer (add-list), and the type of action that can be omitted (the action of the operator). The action of the GPS operator is assumed to be inferred by the viewer in the cut between two scenes of the sequence. Figure 8.1 illustrates the structure of a GPS operator used in IDIC.

    ;fight -> threaten
    (make-op :action 'threaten-renewed-violence
             :preconds '(fight)
             :add-list '(threaten)
             :del-list '(fight))

Figure 8.1 GPS operator "threaten renewed violence" in IDIC (Sack & Davis, 1994, p.5)

In order to satisfy a narrative goal, the GPS created by Sack and Davis applies sequences of operators. However, unlike a conventional planner, which is oriented towards short solutions, the story GPS selects the operator with the most unsolved preconditions, which provides more complex stories with unexpected turns. The story board produced by the GPS then serves as the basis for IDIC to retrieve the appropriate visual material. The available STTNG trailers, which serve as the visual source, are annotated in such a manner that sequences of a trailer are indexed with appropriate GPS goals.

IDIC's major contribution is to show that GPS-style planners are adequate tools for the automated generation of video stories. Due to their preconditions, post-conditions and the goal-oriented action that serves as a linkage between the different operators, GPS planners function in a way very similar to the shot transitions provided by cuts in a video sequence. Moreover, IDIC demonstrates that the goal satisfaction of the planner itself must be adjusted to the task, which in the case of story telling is to provide longer and more convoluted event sequences.
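To make this selection heuristic concrete, the following Prolog sketch (our illustration, not IDIC's code, which is written in Lisp; the operator facts are invented) shows GPS-style operators as facts, together with a selection rule that prefers the operator with the most unsolved preconditions:

    :- use_module(library(lists)).   % member/2, last/2

    % gps_op(Name, Preconditions, AddList, DeleteList) - invented examples
    gps_op(threaten_renewed_violence, [fight], [threaten], [fight]).
    gps_op(negotiate_truce, [threaten, negotiate], [rescue], [threaten]).

    % Number of preconditions not yet satisfied in the current state.
    unsolved(Preconds, State, N) :-
        findall(P, (member(P, Preconds), \+ member(P, State)), Ps),
        length(Ps, N).

    % Select the operator with the most unsolved preconditions, so that
    % satisfying it forces further planning and a more convoluted story.
    select_op(State, Name) :-
        findall(N-Op, (gps_op(Op, Pre, _, _), unsolved(Pre, State, N)), Pairs),
        msort(Pairs, Ascending),
        last(Ascending, _-Name).

With State = [fight], select_op/2 binds Name to negotiate_truce, since both of its preconditions are unsolved, whereas threaten_renewed_violence has none; a conventional planner would prefer the latter.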
Though the re-implementation of GPS in IDIC yields an appropriate approach to generating coherent trailers, IDIC is unable to deal adequately with narrative continuity and complexity. For a trailer, it is not necessary to tackle issues such as spatial or temporal continuity (discussed in section 6.4), since the viewer accepts that the scenes presented are taken out of context. Hence, IDIC's matching mechanism of linking one GPS goal to each video segment is insufficiently powerful for creating credible sequences. Moreover, the keyword-oriented approach of IDIC's video annotation does not allow modifications based on connotative combinations, since an action labelled "fight" or "negotiate" completely specifies a single meaning. Thus, more complex techniques than matching goals to video sequences are required, in order to relate multiple video annotations to the automatically generated story board.

Related to the problem of matching video annotations to a story board is the problem of creating an optimal presentation. Since IDIC's aim is to create event structures which are based on the fulfilment of one single goal, it is of no importance if several video segments can be matched against the goal. The system can use any of them. The structure of stories is more complex than that dealt with by IDIC. As argued in the preceding chapters, the appropriate match for the next video sequence not only requires an analysis of the intention of the current narrative state, in terms of the internal and external reason of the story, but also requires an analysis of the important stylistic features of preceding shots. The constant switching between the forward chaining development of the story and the backward chaining required to specify the stylistic presentation of the sequence cannot be achieved by a single-dimensional planner such as IDIC's. Thus, while IDIC's approach is promising with respect to establishing the framework of a story (i.e. supporting the establishment of genres and themes), we argue, as in previous chapters, that planning a story and presenting it are two different, though related, tasks, which require distinct strategies. Therefore, the aim should be to develop a multi-planner based architecture which combines mechanisms for managing narrative continuity and complexity on the action, event and episode level, along with mechanisms for controlling stylistic features related to narrative issues (e.g. themes and genres) and medium-oriented presentational aspects (e.g. editing techniques).

8.1.2 Bloch's machine for audio-visual editing

In section 5.1.1, we discussed Bloch's approach to the representation of video content. In section 6.3.2, we discussed his approach to automated editing. We now discuss the architecture of Bloch's audio-visual editing machine (Bloch, 1986).

Bloch's audio-visual editing machine consists of two videodisc players, a video monitor and an SM190 minicomputer. Both videodisc players hold identical copies of the same videodisc, which contains sixty purpose-filmed shots of an actor engaged in simple physical actions, such as walking, picking up an object or looking at another actor. Visual continuity between shots is achieved by using the principle of "virtual editing", where one videodisc player displays a shot, while the other searches for the first frame of the next shot.
Following the first shot, the second videodisc player plays the next. The control program for Bloch's editing machine is object-oriented, and distinguishes between five operators: choice, construction, appreciation, correction and projection. Each operator, effectively an object, consists of a collection of specialists for particular tasks in the class. The generation process for the video story, as described in section 6.3.2, uses the conceptual representation of a given simple story as a starting point, and then steps through the stages of the editing model. Figure 8.2 shows the main stages of Bloch's editing model.2

• Choice
• Construction
• If Appreciation is bad then Correction
• Projection

Figure 8.2 Bloch's editing model (Bloch, 1986, p. 133)

2 Translation by the current author

The tasks of the different operators from Figure 8.2 are:

Choice: This operator determines the type of segment needed and establishes the style of the visual presentation (e.g. portray Gilles and Said looking in two shots, in the form "Gilles is looking to the right" and "Said is looking to the left").

Construction: Depending on the type of segment, different types of specialist are activated. For example, one specialist task is to present a simple action by juxtaposing a large number of shots with the aim of providing the impression of fluid motion.

Appreciation: This operator provides particular functions for evaluating motion, position and looks in shots. For example, two shots may be juxtaposed on the basis of the positions of characters. This triggers an attempt to establish the shared context of both shots, such as whether the shots contain the same actors, and in the appropriate positions. If this is not the case, the join is indexed as faulty.

Correction: This operator provides specialists in the form of heuristics to improve faulty joins. For example, a juxtaposition judged to be faulty due to lack of continuity in motion may be corrected by inserting an undetermined motion shot.

Projection: This operator organises the "virtual editing" of the videodiscs.

The relevance of Bloch's architecture is twofold. Firstly, he introduces specialised procedures for the control of the editing process, which embody an editing model similar to our own, as described in section 4.3. However, Bloch's aim is neither to provide an automated story generator (the narrative is given), nor to provide mechanisms to react to cases where the required material is not directly available. Rather, Bloch's aim is to create the optimal presentation for the given story from the available video material. Hence, there is no need in his top level algorithm to iterate through choice, construction, appreciation and correction, as our model suggests. Bloch's algorithm applies correction only when the visual appreciation for a particular join is determined to be low.3

The second important feature of Bloch's architecture is the object-oriented nature of his editing model. This approach is of particular importance, as it enables the handling of the divergent narrative mechanisms, presentation mechanisms and representational structures, on the basis of communication between object classes. The object-oriented implementation of Bloch's architecture provides an extremely useful basis for tackling the problem of synchronising the story planning process with the generation of an acceptable visual presentation.
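To summarise this control regime, the following Prolog fragment sketches our reading of Figure 8.2 (the join data and the correction heuristic are invented for illustration; Bloch's implementation is object-oriented and not in Prolog): correction is invoked only for joins whose appreciation is judged bad, and there is no global iteration back through choice and construction.

    % Illustrative stubs: appreciation verdicts and one correction heuristic.
    appreciation(join(shot_a, shot_b), good).
    appreciation(join(shot_b, shot_c), bad).        % discontinuity in motion
    correction(join(shot_b, shot_c),
               join(shot_b, undetermined_motion_shot, shot_c)).

    % Keep good joins as they are; repair faulty ones before projection.
    appreciate_and_correct([], []).
    appreciate_and_correct([J|Js], [J|Ks]) :-
        appreciation(J, good), !,
        appreciate_and_correct(Js, Ks).
    appreciate_and_correct([J|Js], [K|Ks]) :-
        correction(J, K),
        appreciate_and_correct(Js, Ks).

    % ?- appreciate_and_correct([join(shot_a,shot_b), join(shot_b,shot_c)], S).
    % S = [join(shot_a,shot_b), join(shot_b,undetermined_motion_shot,shot_c)]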
3 It has now become apparent that, to Bloch, the generation of film sequences is seen in terms of montage rather than editing. This is why his representation of video content does not allow the further editing of the shot itself (for a detailed discussion of Bloch's representation of video content see section 5.1.1), which also influences his approach to automated video editing, as described in section 6.3.2.

8.2 A proposed architecture for the editing of theme oriented video stories

The aim of our proposed architecture is to establish a system that synchronises automatic story generation for visual media with the stylistic requirements of narrative and medium related presentation.

In section 2.2, we discussed the relationship between the two main layers of a story, i.e. structure and content, and showed that both simultaneously serve the narrative purposes of form and substance, as described in Figure 2.4. In order to achieve the required interaction between the structure and content levels, we identified a two layer planning system as an appropriate approach to generating both the plot and its visual presentation. This was contrasted with a grammar-oriented approach, which could support only plot generation. Furthermore, we showed that it is the substance and expression of content which establishes the impression of a well-formed plot and also enables the creation of distinguishable plots. The subsequent investigation of plot content, and its formal expression for the example of humour in chapter 3, led to strategies for the automated generation of such content. These strategies were underpinned by the ontological representation, described in chapter 7, of the physical story-world and of abstract mental and cultural concepts. Moreover, our investigation of the film realisation of narrative led to the introduction, in section 4.3, of a procedural model of the editing process, our representation of video content and, in chapter 6, knowledge structures and related processing mechanisms to enable automated video editing. Our analysis of the editing process suggested that the process of arranging visual material is best considered to be a planning problem.

In order to demonstrate the applicability of our described representational structures and narrative and editing strategies, we now present an architecture for a prototype. The architecture is realised in the current version of our experimental system AUTEUR (Artificial Intelligence Utilities for Thematic Film Editing using Context Understanding and Editing Rules), implemented in Sicstus Prolog on a SUN Sparc workstation.

We first provide a general overview of the architecture, and then describe the components in detail. Chapter 9 describes humorous film scenes actually generated by AUTEUR.

8.2.1. Overview

The aim of AUTEUR is to automatically generate a video story. The user provides AUTEUR with the identifier of a startshot and the thematic orientation of the event to be created. The thematic orientation for the presented examples is humour. A possible query might be go(10, humour), where 10 is the id of the startshot, and humour represents the theme. The overall tasks carried out by AUTEUR are described in Figure 8.3.

1. Analysis of the startshot, in terms of establishing the appropriate thematic strategy for the event to be established.
2. Development of the story in accordance with the thematic strategy chosen.
3. Establishing of the appropriate form of presentation for the thematic orientation and story content.
4. Retrieval of the appropriate visual material.
5. Editing of the material.
6. Presentation of the visual story.

Figure 8.3 Tasks performed by AUTEUR

Referring to the editing model described in section 4.3, all steps between task 1 and 6 are performed repeatedly, since any of the steps may fail, and restructuring must then be carried out. Since the aim of AUTEUR is to perform the task of story generation and its visual presentation as autonomously as possible, the design of the architecture is inspired by work on planning (see André (1995), Cawsey (1990), Hayes-Roth (1985), Hayes-Roth & Hayes-Roth (1990), Korf (1990), Newell & Simon (1990), Sacerdoti (1977, 1990, 1990b), Schank (1981), Smith & Witten (1991), Tate et al. (1990), Wilensky (1983a, 1983b, 1990), Wilkins (1990)), and ideas propagated by the research on Autonomous Agents (see Bates (1994), Hayes-Roth (1995), Maes (1991, 1994)).

The proposed architecture, as described in Figure 8.4, consists of three major units: one unit embodies the DB for video material, the DB for video representation, and the Knowledge Base (the resource unit); the second unit covers the Editor and Retrieval System (the construction unit); and the third contains the Development Tools and the Interface (the development unit).

The resource unit reflects, in its structure, our assumptions concerning representation (see sections 5.2, 7.2 and 6.4) and contains the actual visual material (i.e. video clips), the content annotations for each video clip, and a knowledge base holding the ontological representation of the "media-world", mental and cultural concepts, and some knowledge related to abstract filmic and editing representations (e.g. the spatial relationships between two shots, represented as a two dimensional array - see Table 6.2).

[Figure 8.4 Proposed architecture for the creation of a visual story of emotional impact: the Editor (Structure Planner and Content Planner) and the Retrieval System, the DB of video material, the DB of video representations (semantic content and visual information), the Knowledge Base (filmic and conceptual knowledge), the Interface, and the Development tools.]

The construction unit of the architecture embodies a controller module (the Editor in Figure 8.4), which consists of two separate planning systems. The Structure Planner deals with structural details of the plot and its visual presentation. The Content Planner deals with the actual plot content and its stylistic presentation. The Retrieval System serves as the link between the controller module, the resource unit and the user interface. The development unit reflects our discussion of semi-automated video content annotation systems (see section 5.3) and the creation and maintenance of representational knowledge structures (i.e. the editing rules and the semantic net). From the discussion in previous chapters, it should be clear that this part of the architecture is merely theoretical. We include it only for completeness.

We now discuss the components of the architecture and their respective functionality. We also discuss the interfaces between the components. We begin with the resource unit, and then describe the construction unit.

8.2.2. The video database

The video database is a collection of digitised video material.
Our approach to the representation of video content allows for the use of video clips of arbitrary content and length. However, it is expected that the clips will usually refer to a single domain. Furthermore, due to the lack of appropriate tools for automated video annotation (as discussed in section 5.3), we assume that the shots will usually be short in duration and restricted in their range of actors, actions and objects. Which of the standard formats for digital video is chosen depends on the capturing tools provided by the development unit. Our current implementation of the video database uses the MPEG format. For experimental purposes, the video database is currently restricted to approximately 40 shots, each between 1 and 15 seconds in length. The shots are purpose-filmed for our research and feature simple physical actions of an actor or object, performed in a normal physical environment.

8.2.3. The video representation

The content representation for the available video clips uses the structured textual approach for the categories cinematographic devices, character, object, action, composition and setting, as described in section 5.2. The annotations were created in MicroEmacs. An intelligent, semi-automated graphical interface, providing a simple video editing suite and an intelligent text editor that supports the user with appropriate, and preferably pre-filled, forms for the category to be annotated, would have been useful. This is not so much because the annotation process is time consuming, but rather because it is tedious. Hence, it is not surprising that quite some time was spent on finding misspellings, or values attached to the wrong attribute. Nevertheless, the representational content structures proved to be manageable in this form.

8.2.4. The Knowledge Base

The Knowledge Base contains the conceptual structures for characters, objects, actions, events, episodes, and abstract concepts such as emotions and visualisations, in the form of a semantic net, as described in section 7.2. Since AUTEUR is implemented in Sicstus Prolog, the value mechanisms for links in the semantic net are held in the form of lists, as described in sections 7.2.1.1 and 7.2.1.2.

8.2.5. The Editor - a controller module

The architecture of the editor embodies the separation of the two main story layers, defined in section 2.2, i.e. structure and content. Each layer is provided with its own planning system. The content planner is additionally assisted by two specialised planners, i.e. the Visual Designer and the Visual Constructor. The communication between the different planning systems is based on the memory structures Sequence-Structure (shown in Figure 6.3) and Location-Memory-Structure (shown in Figure 6.5).

8.2.5.1 The Structure Planner

In the discussion of themes in chapter 2, we showed that the external point influences the scene structure. The task of the Structure Planner (Figure 8.4) is to organise the strategies for realising the required theme. The Structure Planner is involved in the creation process from the outset, by providing the analysis of the given startshot, which in turn is used to establish the first Sequence-Structure. This analysis, based on the startshot description in the video representation, provides information about the number of actors, groups of actors, objects and groups of objects. Each of these units is related to particular information about their actions, i.e.
sequences of actions, actions performed concurrently (i.e. performed during the same sequence of frames), and single actions. The information acquired is stored in the Setting, Subject, and Action fields of the first Sequence-Structure.

Since the mood of characters is important in visual humour, the startshot analysis also attempts to derive information about this. The analysis uses heuristics referring to visual expressions or the speed of an action (discussed in sections 7.2.2.1 and 7.2.2.2), e.g. a smile supports the assumption of pleasure. The result is a mood description, such as "pleasure", combined with a numeric value that represents the system's certainty that this mood is suggested by the chosen material.

The next task of the Structure Planner is to use the results of the startshot analysis to establish the appropriate thematic strategy (see task 1 in Figure 8.3). The discussion of humour in chapter 3 revealed that two of the five identified humour primitives, i.e. incongruity and derision, have implications for the construction of humorous events. Recall that each of these primitives covers various humorous concepts (see section 3.2.4 for incongruity and section 3.2.5 for derision). The process of providing the general direction of the visual gag is based on strategies which are described in chapter 3 as supportive (Table 3.1). As the first example of a supportive strategy, take H-Strategy 1, which says: An action forms the most suitable subject for a joke, then an actor, then an object and finally a location. Hence, the system will first investigate options concerning actions. As a further example of a supportive strategy, consider H-Strategy 14, which states: For a single action, mischief is easier to achieve than ambiguity. If the available shot portrays a single action, the overall aim of the story construction will be to generate a derisive joke.

The preparation process ends with the completion of the Sequence-Structure, by instantiating the fields Kind, Intention, Form and Appearance. The field Kind is instantiated with the current narrative stage of the event, which is either motivation, realisation or resolution (see section 2.1.2.1). The choice of narrative stage is driven by the relevant supporting humour strategies, in combination with the results of the startshot analysis. In the preparation phase it may be the case that the startshot provides all the necessary information to set the scene. For example, the relevant strategies for derisive jokes focus on single actions and the mood of a character. Assume that, firstly, only one action is represented in the video representation of the startshot, and, secondly, that the certainty factor for the mood is sufficiently high (e.g. the value is equal to 1). Such a case would cause the system to instantiate Kind with the value Realisation, rather than Motivation. It must be stressed that the same evaluation mechanisms are used to determine the use of the remaining narrative stages, except that in those cases it is the previously used shots which are analysed, rather than the startshot. The instantiation of Intention and Form is based on the actor, action and location analysis of the startshot. For example, if one of the results of the startshot analysis is that the joke is based on an action, Intention will be instantiated with action, and Form will be instantiated with the chosen strategy type (e.g. misfortune) and the strategy name (e.g. H-Strategy 4).
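Since AUTEUR is implemented in Prolog, such a structure can be pictured as a Prolog term. The following rendering is a sketch only (the functor and the field encodings are ours, not necessarily AUTEUR's internal format); the values anticipate the banana skin example of chapter 9, including the Appearance value discussed next:

    % Hypothetical rendering of a completed motivation Sequence-Structure.
    sequence_structure(
        kind(motivation),                      % current narrative stage
        intention(mood+event),                 % what this phase must establish
        form(misfortune, h_strategy_4),        % strategy type and name
        appearance(accelerate),                % presentation pace (see below)
        setting(single(2, [path, meadow]), group(1, [trees])),
        subjects([mood(frank, pleasure, 0.5), mood(frank, hurry, 0.5)]),
        action([frank, [walk]])).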
Since the overall theme for the sequence is humour, the system automatically attaches the value "accelerate" to Appearance (the reason is described in sections 3.2.1 and 3.2.2, which discuss the humour primitives readiness and timing). Other themes, e.g. tragedy, might result in different values being chosen for the Appearance slot.

Once the essential structural elements for the event to be created have been declared, the Structure Planner supports the construction process by providing the Content Planner with additional information concerning the current event phase. The decision process of the Structure Planner is based on feedback from the Content Planner. The Content Planner provides information about the meaning of the story, the motivated event, mood, action, actors involved, etc. The Structure Planner compares the actual event provided by the Content Planner with the suggested content in the original Sequence-Structure. The results of this comparison serve as a guide to establishing the Sequence-Structure for the following event phase. For example, a new character may have been introduced due to motivational content requirements, and must now be included in the storyline. This might affect the presentational structure in terms of changing from a single event presentation to a parallel event presentation. A further example is when the Content Planner produces a storyline based on the given Sequence-Structure, but there is insufficient visual material to portray the story. A way out of this dilemma is to apply an alternative strategy. The Structure Planner may, for example, switch from misfortune to stupidity, because the existing material supports such a change of strategy. In such cases, it is the responsibility of the Structure Planner to initiate the reorganisation of the material, for example by revisiting an earlier creation phase (e.g. moving from realisation to motivation), which may cause the re-establishment of the startshot, insofar as it has already been edited.

As the aim of our research, at the current stage, is not to achieve the automatic generation of complex story structures such as episodes or features, the tasks of the Structure Planner are fairly limited. In particular, the preparation phase is where this planner does most of its work. However, we assume that the introduction of more complex story structures would require additional meta-compositional rules, similar to the supportive strategies for humour. The establishment of genre structures, which are highly related to the development of a theme, as described in section 2.1.1, would be particularly suitable. In this case, the Structure Planner would need to be developed further, and may need components such as a Theme Analyser and a Genre Analyser (analogous to the Visual Designer and Visual Constructor which support the Content Planner).

8.2.5.2 The Content Planner

While the Structure Planner is concerned with the external point, or theme, of the video story, the Content Planner is concerned with its internal point. The Content Planner specifies the content of the relevant event phase, and is therefore responsible for the actual application of particular strategies. Hence, the Content Planner uses the humour strategies that were described in chapter 3 as constructive.
Depending on the strategy and its specification provided by the Structure Planner, the Content Planner uses the conceptual structures gathered from the semantic net of the Knowledge Base (shown in Figure 8.4) to attempt to construct a coherent scene. Take H-Strategy 4 as an example:

H-Strategy 4: If the action portrays an intention (goal), interrupt the action, in a way that is expected by the character, so that the goal is unfulfilled and the character's mood is downgraded or he suffers in some way. (Mischief + Schadenfreude + Superiority)

The strategy is designed to create a derisive joke (Mischief + Schadenfreude + Superiority). For its motivation phase it requires a goal-oriented action and a mood which can be downgraded. The realisation requires an interruption of the action, which hinders the success of the goal. The resolution requires the representation of a downgraded mood, or an event or action that represents the suffering of the actor. However, recall from Figure 7.2 that as the story progresses through the time phases, the choice of content elements decreases.

To fulfil such a strategy, the Content Planner consists of three independent planners, one for each creation phase. The motivation, realisation and resolution planners are essentially specialists for the content generation of their respective event phase. Each planner uses the event-related Sequence-Structure and the semantic net of the Knowledge Base as the primary sources for its story planning process. Additionally, each collaborates with the Visual Designer and the Visual Constructor to establish the visual presentation of its respective part of the scene.

For example, assume that a favourable mood and a necessary action (e.g. "walk") are available from the startshot. The first event phase to be created is thus the realisation. The goal of the realisation planner is to interrupt an action in a way which is expected by the character. An interruption requires the planner to find an oppositional action for the action actually being performed. Hence, the planner traverses the oppositional links (as described in section 7.2.1.2) of the semantic net, from the conceptual structure of the current action (i.e. "walk"), in such a way that the highest valued link (e.g. "slip") is chosen first. The connected conceptual structure serves as the content material for the realisation phase. The Content Planner constructs a query which may request a story element showing the character slipping. This query is sent to the Visual Designer, which may, or may not, arrive at a suitable visual solution. If the Visual Designer does not succeed, the Content Planner continues the investigation of the semantic net, by first processing any unexplored oppositional links. If this also proves unsuccessful, the synonym links from the actual action are traversed, so that the oppositional links from the synonymous actions can be processed in turn. In cases where no solution exists, the Structure Planner is informed that a change of strategy is required. However, recall that there are usually several versions of a particular strategy, which provide the system with the ability to create stronger or weaker jokes (compare, for example, H-Strategy 2 and H-Strategy 4 in chapter 3).
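This descending traversal has a direct Prolog reading. The sketch below is illustrative only (the link facts and their values are invented; AUTEUR's actual net and value mechanism are those of section 7.2): direct opposition links are tried highest value first, followed by oppositions reached via synonymous actions.

    :- use_module(library(lists)).   % member/2, reverse/2

    % opposition(Action, OpposedAction, LinkValue) - invented examples
    opposition(walk, slip,    0.9).
    opposition(walk, stumble, 0.7).
    opposition(walk, collide, 0.6).
    synonym(walk, stroll, 0.8).
    opposition(stroll, trip, 0.5).

    % Candidate interruptions: direct oppositions first, best value first...
    interruption(Action, Opposed) :-
        findall(V-O, opposition(Action, O, V), Pairs),
        msort(Pairs, Ascending),
        reverse(Ascending, Descending),
        member(_-Opposed, Descending).
    % ...then the oppositions of synonymous actions.
    interruption(Action, Opposed) :-
        synonym(Action, Synonym, _),
        interruption(Synonym, Opposed).

    % ?- interruption(walk, X).
    % X = slip ; X = stumble ; X = collide ; X = trip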
Analogous to the descending order of the constraints for the content retrieval query, the first applicable event phase planner will attempt to establish the strongest content direction for a joke (i.e. using the content strategy with the highest number of associated content primitives) and then gradually weaken the visual gag as necessary. Thus, H-Strategy 4 would be chosen first, followed by H-Strategy 2. This strategic behaviour reflects the aim of producing the best possible joke in the circumstances.

A further important task of the Content Planner is to determine the point of view from which the story is seen to be taking place. For example, the Content Planner must decide whether it is appropriate to tell the story from a particular character's perspective. Such a decision is influenced by the Form field in the Sequence-Structure of the related event phase. The aim of AUTEUR is to stimulate an emotional reaction in the viewer of the story, so the general goal is to tell the story from the viewer's perspective. However, H-Strategy 4 indicates a change in point of view, i.e. the expectation by the character that the attempted action may not succeed in its aims. Such cases will cause a point of view change from the third person narrative style to a character reaction. The influence is manifested in a change to the Form field (from third-person to first-person) and will have consequences for the work performed by the Visual Designer, which now has to search for shots that succeed as eye-line matches, instead of shots that provide continuity on the basis of motion.

Finally, it is the Content Planner which evaluates the constructed visual joke. While establishing the event phases, the Content Planner calculates a numeric value that represents the extent to which the content and stylistic requirements in the different event phases have been fulfilled. The evaluation heuristic used is based on the evaluation function described in section 3.2.3. The calculated evaluation index is then compared with a percentile system, resulting in the rating of the visual gag.4

8.2.5.3 The Visual Designer

The Visual Designer supervises the retrieval of the required video material. The content based query (order of shots) is provided by the motivation, realisation or resolution planner of the Content Planner, depending on the current generation phase. The aim of the Visual Designer is to collect the most appropriate video material in terms of content and style. The architecture of the Visual Designer is based on our discussion of the space, action and content requirements of continuity editing (section 6.4). The mechanisms used are based on the representation of the permissible relationships between shots (shown in Figure 6.2), the relationships between camera distances, the conceptual structures for subjects, i.e. characters and objects (shown in Figure 6.3), and editing strategies, as described in sections 6.4.1 - 6.4.5. The structures representing the relationship between shots, and the relationship between particular denotative aspects of the video content and structural elements of the conceptual structures of subjects (i.e. characters and objects), are part of the Knowledge Base (shown in the overview of the architecture given in Figure 8.4). The editing rules are part of the Visual Designer.
Thus, the resources used by the Visual Designer during the process of establishing a visual representation for a particular event phase are:

• the related Sequence-Structure and, connected with it, the Location-Memory-Structure (see Figure 6.5),
• the Knowledge Base of editing knowledge (e.g. the spatial relationships between two shots, represented as a two dimensional array, as described in Table 6.2) and the semantic net,
• the data base containing the video representation.

An example of the use of these resources is the reaction of the Visual Designer when it is unable to retrieve the visual material specified by the Content Planner. In such cases the Visual Designer uses the strategy of query decomposition, as described in sections 6.4.3 and 6.4.4. The output from the Visual Designer is a shot list for the related event phase, representing the content query in the most appropriate visual terms, which is then transferred via the Content Planner to the Visual Constructor, as described in the next section.

Note that, since the Sequence-Structures for all event phases and the Location-Memory-Structure are kept in memory, each joke is remembered as a case, represented by the steps involved in its generation. The Visual Designer can therefore use these as example cases, or to avoid recreating the same joke.

4 The percentile system is defined as:
90 - 100 % => good
80 - 89 % => reasonable
70 - 79 % => weak
60 - 69 % => very weak
0 - 59 % => completely unfunny

8.2.5.4 The Visual Constructor

The Visual Constructor receives an annotated shot list from the Visual Designer, and actually performs the detailed joining of the specified shots. The Visual Constructor operates at the cutting level. This can mean that a shot is truncated, if it is too long for the required purpose (e.g. two seconds exposure to a close-up is often sufficient for the viewer to appreciate what is being shown). Cuts may also be motivated if only part of a particular shot is required. This can be particularly important for maintaining continuity, for example in the case of inserts that break the flow of action, where the actual screening time of the insert must be considered. The strategies used for such operations are based on our discussion of the temporal and rhythmic relationships between two shots, in section 6.4.6. The output from the Visual Constructor is an ordered list of the shot identifiers, along with frame numbers for each shot. This list specifies the scene that is to be displayed.

8.2.6 The Retrieval System and Interface

The Retrieval System converts the final stream of shot ids and frame numbers (stored in the List_of_used_shots, from Figure 6.5) into a file specifying the actual presentation of the video story at the event level. The file lists the appropriate MPEG files along with associated start and end frames, which can be shown in a small "projection" window on the workstation. The presentation environment is written in X and uses SUN Video Technology/XI Library. At its current stage of development, the interface features two windows: an X window, where the user enters the query to start the story generation process, e.g. go(10, humour), and a window in which the video is actually displayed. The simplicity of the interface reflects the status of AUTEUR as being purely a research platform.
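As a small implementation aside, the percentile verdict of footnote 4 maps directly onto a Prolog predicate (the mapping is the footnote's; the formulation as a predicate is ours):

    % rating(+EvaluationIndex, -Verdict): footnote 4 as a lookup.
    rating(V, good)               :- V >= 90, V =< 100.
    rating(V, reasonable)         :- V >= 80, V < 90.
    rating(V, weak)               :- V >= 70, V < 80.
    rating(V, very_weak)          :- V >= 60, V < 70.
    rating(V, completely_unfunny) :- V >= 0,  V < 60.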
During the above description of the architecture, however, we outlined some tools that would be required to make AUTEUR user friendly. Our research project is necessarily focused on the representational and structural aspects of the automated generation of video stories. We thus defer considerations of the user interface to a later date.

8.3 Conclusion

Based on the representational knowledge structures for video content and for narrative and thematic knowledge, the narrative strategies for humorous stories at the event level, and the techniques for automated editing discussed in the previous chapters of this thesis, we have introduced the overall architecture of AUTEUR, a system for the automated generation of emotionally stimulating visual stories at the level of events. We introduced the three major units and their components, i.e. construction (Retrieval System and Editor), resources (Data Base for video material, Data Base of video representation and the semantic Knowledge Base) and development (Development Tools and User Interface). Furthermore, we described the tasks of each component and the interaction between them. Finally, we showed how the representation structures generated by AUTEUR, i.e. the Sequence-Structures for the different event phases, and the Location-Memory-Structure, can be used to assist the generation process by serving as example cases, which help to avoid producing the same visual gag twice.

The next chapter discusses examples of humorous film stories actually created by AUTEUR.

Chapter IX
The operation of AUTEUR: Show me a joke

In this chapter, we consider three examples of humorous films actually produced by AUTEUR, to demonstrate the operation of the architecture described in the preceding chapter. We begin with an example of visual humour that is based on a single action by one character. The major feature of this example is the decomposition of action and character appearance. The second example highlights the generation of a visual gag that involves addressing the problem of parallel actions performed by a character, and also shows how AUTEUR avoids producing the same joke twice. The third example describes the development of a joke based on the interaction between an object and a character. Each example is discussed according to the relevant generation phases, i.e. preparation, motivation, realisation and resolution. A detailed description of example 1 is given, while examples 2 and 3 are described in less detail.

It is extremely difficult to present a sufficiently detailed description of the operation of a system such as AUTEUR. The following examples have therefore been simplified. For example, we provide only a sketch of the content representations involved (see Tables 5.1 - 5.6). The Appendix contains an illustration of a typical generation trace produced by AUTEUR, which represents the analytical development of a story. Note also that actual shots of film are used by AUTEUR, but that these are described below in the form of a single representative image from each shot.

9.1 The banana skin joke

Suppose that the query by the user is go(12, humour), where 12 represents the ID of the shot shown in Figure 9.1.

Figure 9.1 Startshot for the banana skin joke

9.1.1 Preparation phase

As described in the previous chapter, the Structure Planner first analyses the content representation of the given start shot, in terms of the number of characters, objects or groups, related actions, moods and location.
The construction of the mood is based on visual expressions and actions (type, speed, order) related to an actor (see also section 7.2.2). The result is a list of possible moods, such as pleasure, hurry, etc., each tagged with a certainty value, which must be higher than 0.35 for the mood to be considered as an element of the list of assumed moods. Figure 9.2 presents an example of the type of information which may be extracted by the Structure Planner.

Shotid: 12
Shotkind: long
Actors: Single (1, [Frank]), Group (0, [])
Objects: Single (2, [path, meadow]), Group (1, [trees])
Actions: sequence: no, parallel: no, single action: walk
Mood: [Frank [pleasure+0.5, hurry+0.5]]

Figure 9.2 Result of startshot analysis for the banana skin joke

The Structure Planner uses the information described in Figure 9.2 to establish the Sequence-Structure (for an example of such a structure, see Figure 6.3) for the current generation phase. The first decision made concerns the selection of the most suitable humour strategy. As there is only one person, one action, and nothing ambiguous about the action (one action in a long shot), AUTEUR will suggest misfortune as the humour type for the joke (see H-Strategy 14). The next step for the Structure Planner is to determine the phase of construction. Since the concept of misfortune requires a mood deterioration, the Structure Planner evaluates the mood of the relevant character. Neither of the moods "pleasure" and "hurry" indicated by Figure 9.2 reaches the certainty value of 0.75 that would indicate that the mood of the character is clearly perceptible from the shot; thus, the mood is indexed as "to be motivated". Since there is but one character, performing one action only, there is no need to motivate the action. However, since the conceptual structure of "walk" can be associated with a larger logical sequence, e.g. a meeting, the motivation of an event is also suggested (according to H-Strategy 16 of chapter 3). On completion of the preparation phase, the first Sequence-Structure may be instantiated as shown in Figure 9.3.

Sequence-Structure
Kind: motivation
Intention: mood+event
Form: misfortune
Appearance: accelerate
Setting: Single (2, [path, meadow]), Group (1, [trees])
Subjects: [Frank [pleasure+0.5, hurry+0.5]]
Action: [Frank, [walk]]

Figure 9.3 Motivation Sequence-Structure for the banana skin joke

The Structure Planner now instructs the Content Planner to proceed, by sending a list of supportive humour strategies (e.g. H-Strategy 15).

9.1.2 Motivation phase

The first task of the Content Planner is to identify the appropriate strategy for the humour type, which is misfortune (see the field Form in Figure 9.3). Based on the set of supportive humour strategies provided by the Structure Planner, the Content Planner evaluates the available information concerning the visual material. H-Strategy 15 suggests that a more complex strategy will increase the chance of producing a good joke. For a single character performing a single action, the appropriate strategy is H-Strategy 4.¹

In the light of the chosen strategy, the Content Planner attempts to create a motivation for a mood and an event. The first aim is to create a visual representation that suggests that the character either feels pleasure, or is in a hurry. The second aim is to establish an event that corresponds with the chosen mood. AUTEUR first attempts to establish a suitable mood.
The Knowledge Base contains a number of mood concepts, along with related actions (such as those shown in Tables 7.3 and 7.4). For example, pleasure may be associated with smiling, whistling, and picking flowers. Since the first action in the list of associations represents the strongest visualisation of the mood, AUTEUR uses smiling for its first attempt to establish a visual representation of "pleasure". Traversing the associative links of "smiling" and "walk", AUTEUR infers that a person can walk and smile at the same time. As a result, a query is sent to the Visual Designer to retrieve an appropriate visual representation for "Frank walks and smiles". The query also specifies the underlying intention for the material (i.e. highlighting a mood) and the ID of the shot to which the motivation is related (i.e. 12).

To solve the task of finding appropriate visual material for the query, the Visual Designer uses two knowledge structures: the Knowledge Base, i.e. the spatial relationships between two shots, represented as a two dimensional array, as shown in Table 6.2, and the conceptual relationships between visual space and narrative functionality, represented in the form of clauses (see Figure 6.4). From the representational structure for the cinematic devices in shot 12 (see also Table 5.1), which is stored in the DB of video representations, the Visual Designer detects that the start shot is of type "long". From the conceptual relationship between visual space and narrative functionality, the Visual Designer infers that motivation favours detail, which is related to a decrease of space. This information results in the Visual Designer exploring the array of spatial relationships between shots. Suppose that the Visual Designer could neither retrieve shots of the most appropriate type (e.g. medium), nor build a bridge (according to E-Strategy 2, in section 6.4.2). The Visual Designer must now rely on a fallback option, which allows the join of a "long" shot and a "close-up" shot for purposes such as the motivation of a mood. Now, assume that the Visual Designer can retrieve a number of shots from the DB of video representations which are annotated with the required action, "walk", for the given character (Frank), and which also represent a facial expression of a smile, as shown in Figures 9.4 - 9.6.

Figures 9.4 - 9.6 Three possible motivation shots for the banana skin joke

1 A more appropriate strategy is actually H-Strategy 13. However, we have chosen H-Strategy 4 to promote ease of presentation.

The representation of each of the shots is compared with that of the shot to be joined (i.e. 12) on the basis of spatial continuity (appearance of character, spatial relations between character and object, and appearance of setting, including such features as lighting, location, season and time of day), action match (e.g. comparison of direction), and temporal continuity (e.g. speed of action). Furthermore, each of the shots is analysed on the basis of stylistic features. Since the intention of the join is a motivation, a zoom-in is stylistically desirable. Based on the evaluation process, the Visual Designer chooses the shot represented by Figure 9.5, as its content description indicates that it most obviously presents the mood, is consistent with the direction of action, and represents a zoom-in.
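The choice among the candidate shots can be pictured as a simple scoring over the continuity and style criteria just listed. The sketch below is ours (the shot facts, criteria and weights are invented for the example; AUTEUR's actual evaluation uses the richer representations of chapters 5 and 6):

    :- use_module(library(lists)).   % last/2

    % shot(Id, FacialExpression, ActionDirection, StylisticDevice) - invented
    shot(shot_9_4, smile,   left,  none).
    shot(shot_9_5, smile,   right, zoom_in).
    shot(shot_9_6, neutral, right, none).

    % Score a candidate against the query: mood clearly visible, action
    % direction consistent with the startshot (here: right), zoom-in present.
    score(Id, Score) :-
        shot(Id, Expression, Direction, Device),
        (Expression == smile -> E = 2 ; E = 0),
        (Direction == right  -> D = 1 ; D = 0),
        (Device == zoom_in   -> Z = 1 ; Z = 0),
        Score is E + D + Z.

    best_shot(Best) :-
        findall(S-Id, score(Id, S), Pairs),
        msort(Pairs, Ascending),
        last(Ascending, _-Best).     % shot_9_5 for the facts above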
The successful establishment of the mood "pleasure" enables the Content Planner to specify an event, by constructing an appropriate goal for the character. Thus, the Content Planner traverses the causal links provided by the attribute Intention of the conceptual structure for "walk", to detect usable event structures (see Table 9.1). One such goal might be to meet another person. Thus, the Content Planner attempts to construct the representation for a meeting by using the relevant conceptual structure. Possible event structures might be a male meeting a female for a date, and so on.

Name: meeting
Actor number: 2
Gender: any, any
Intention: meet
Motivation: [walk], [wait]
Realisation: [look at], [look at]
Resolution: [shake_hand], [shake_hand]
Episode: date

Table 9.1 Conceptual structure for the event "meeting"

Since the Content Planner is currently operating in the "motivation phase", it also targets the motivation parts of the event structure "meeting", such as one person waiting and the other walking, or, perhaps, both characters walking. In our example, the action of one character is already specified, i.e. "walk", so AUTEUR searches in the motivation field of the conceptual structures for "meeting" for a corresponding action performed by the other character. Assume that the conceptual structure described in Table 9.1 is the one chosen by the Content Planner. Thus, the corresponding action is "wait", which can be performed by a character of either sex. The above information is used by the Content Planner to send a query to the Visual Designer, which this time has to retrieve a shot of either a female or a male character who waits (or performs a synonymous action) in surroundings similar to those provided by the start shot (see Figure 9.1). The Visual Designer may suggest the shot represented by Figure 9.7.

Figure 9.7 Event shot for the banana skin joke

Due to the chosen strategy (H-Strategy 4), the Content Planner orders the actions according to their appearance in the shot sequence (event action, character action (startshot and motivation)) and transfers this information to the Visual Designer. The task of the Visual Designer is to decide how the shots should be joined. The juxtaposition of "event shot" and "start shot" is a simple join of two long shots in the given order. The juxtaposition of "start shot" and "motivation shot" is to be performed as an insert, since the motivational intention of the join, in combination with the particular pairing of shot types (a close-up motivating a long shot), highlights this solution. The actual juxtaposition is performed by the Visual Constructor. The insertion and related trimming for the motivation shot is carried out according to E-Strategies 31 - 36.

The Content Planner now generates a Location-Memory-Structure (see Figure 6.5), in which the ids of the shots are stored in their established order (9.7, 9.1, 9.5, 9.1, where the numbers refer to the Figures above). Since all content and stylistic requirements for the visual presentation of the motivation phase have been achieved, the Content Planner marks the evaluation value for the motivation as successful. Finally, the Content Planner indicates the status of the ongoing story to the Structure Planner, by providing a list such as that shown in Figure 9.8.
Strategy used: H-Strategy 4
meaning of the story: date
motivated event: meeting
mood of main character: [Frank, pleasure+1.0]
action main character: walk
number of characters: 2

Figure 9.8 Status of the banana skin joke after the end of the motivation phase

The Structure Planner uses the story status, as described in Figure 9.8, to decide on the next phase in the generation process. The comparison between the Sequence-Structure for the motivation phase (see Figure 9.3) and the status of the motivation phase (see Figure 9.8) reveals that a new character, the waiting man, has been introduced. Since both characters are still apart, and the new character is passive (i.e. waiting), the inference drawn by the Structure Planner is that the joke is still to be based on the action of the main character. However, the introduction of the new character changes the single-person environment into a person-person environment, which is indicated by a change of the Intention field of the motivation Sequence-Structure from event to parallel_event. Moreover, the fields Setting, Subjects and Action in the motivation Sequence-Structure need to be updated with information about the second character. Since no change in strategy is indicated, the humour type (misfortune) remains, and the established strategy (H-Strategy 4) is added.

The next activity of the Structure Planner is to decide on the next phase of the generation process. Since the emphasis continues to be on one character, there is no change in action, and there is no change in the strategy type, the Structure Planner suggests a realisation phase and then instantiates the relevant Sequence-Structure, as specified in Figure 9.9.

Sequence-Structure
Kind: realisation
Intention: action
Form: misfortune, H-Strategy 4
Appearance: accelerate
Setting: Single (2, [path, meadow]), Group (1, [trees])
Subjects: [Frank [pleasure+1.0]]
Action: (uninstantiated)

Figure 9.9 Realisation Sequence-Structure for the banana skin joke

Finally, the Structure Planner instructs the Content Planner to continue the generation process.

9.1.3 Realisation phase

The first task of the Content Planner is to retrieve the name of the action or event which forms the basis of the joke. This is indicated by the uninstantiated Action field of the realisation Sequence-Structure (see Figure 9.9). To retrieve the required action, the Content Planner uses the realisation requirements provided by H-Strategy 4. The aim of H-Strategy 4 is to violate a character's goal under two constraints: firstly, that the mishap should be simple, and secondly, that the mishap should be expected by the character.

Thus, the Content Planner first retrieves the action (walk) of the main character (Frank) from the motivation Sequence-Structure. The violation requirement of H-Strategy 4 causes the Content Planner to traverse the outgoing opposition links of the conceptual structure "walk". The oppositional links are chosen because the concept of "interrupt" is related to the concept of "perform opposition". The links lead to conceptual structures for oppositional actions for "walk", such as fall, slip, stumble, or collide with. In decreasing order of the importance of the opposition links, the Content Planner attempts to instantiate an oppositional concept. Take the conceptual structure of slip as an example, which is described in Table 9.2.
Name: slip
Domain: motion
Nature of location: outdoors
Set of objects: [banana_peel, dog_shit, soap, ice]
Body part / related object: [shoe]
Location: [road]
Relation Location -> Object: under
Relation Object -> Body part: under
Intention: [unintentional]
Result actions: [sit, lie, kneel, shake, look_back]
Result mood: [anger, rage, astonishment]

Table 9.2 Conceptual structure for a representation of the action "slip"

An important task for the Content Planner is to detect whether the result mood of "slip" can fulfil the required mood deterioration. Since anger, rage and astonishment are related to the emotional token of the opposite classification type to pleasure, i.e. displeasure (as discussed in section 7.2.2.1), the mood deterioration can be established. Thus, "slip" is a feasible action for foiling the action "walk". The Content Planner can now present the Visual Designer with a number of queries, considered in descending order of suitability, such as:

• find a shot where the actor slips, where the object the character slips on is found in the startshot
• find a shot where a body part slips on an object, where the object is found in the startshot
• find a shot where the actor performs an action that is associated with slip, and a shot showing an object that is also associated with slip
• find a shot where the actor performs a slip-related action.

Suppose that the first query above is not satisfied, but that the second is successful. The next step for the Content Planner is to satisfy the expectation constraint of H-Strategy 4, which requires that the character is aware of the object. Since none of the retrieved shots portrays the line of sight of the character, the Content Planner uses the substructure "Bodygesture" from the action representation of the actor (see Table 5.3), taken from the last shot of the previous Sequence-Structure in which the character appears. The comparison of the Relation Object -> Body part (see Table 9.2) with the line of sight of the Bodygesture (see Table 5.3) reveals that the character, Frank, is not actually looking at the ground. There is thus no reason for the Content Planner to assume that the character is aware that he is about to slip on an object, and so H-Strategy 4, which requires an expected mishap, is not applicable.

The Content Planner could now continue the search for another suitable opposition action. However, assume that the Content Planner investigates weaker mishap strategies, such as H-Strategy 2, which requires an unexpected mishap. This strategy corresponds with the material of the motivation phase and would fulfil the constraint of unexpectedness, since in the shots of the motivation phase the character is not looking towards the ground. As a result, the Content Planner changes the strategy ID in the realisation Sequence-Structure from H-Strategy 4 to H-Strategy 2, but marks the switch as "imp-up-strat", which indicates that the joke could be improved by using a higher order strategy. This index is not particularly significant for the generation process, but may become important when, at later stages, the system is asked to account for its generation and evaluation process. It may be the case, for example, that the user does not approve of the supplied joke, which may further require that the system can suggest improvements that could be made.
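The mood-deterioration test just described can be sketched in Prolog as follows (the emotional classifications below are simplified stand-ins for the doublet and class mechanism of section 7.2.2.1):

    :- use_module(library(lists)).   % member/2

    % Simplified emotional classification - invented for the example.
    emotion_class(pleasure, positive).
    emotion_class(anger, negative).
    emotion_class(rage, negative).
    emotion_class(astonishment, negative).
    opposite(positive, negative).
    opposite(negative, positive).

    % An action deteriorates the current mood if one of its result moods
    % belongs to the class opposite to that of the current mood.
    deteriorates(CurrentMood, ResultMoods) :-
        emotion_class(CurrentMood, Class),
        opposite(Class, Opposite),
        member(Result, ResultMoods),
        emotion_class(Result, Opposite).

    % ?- deteriorates(pleasure, [anger, rage, astonishment]).   % succeeds,
    % so "slip" can foil "walk" for a character in the mood "pleasure".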
Additionally, the Content Planner attempts to apply the constructive H-Strategies related to misfortune, which might improve the joke. An example is H-Strategy 5, which states: "If the intention of the joke is derisive, reveal the point in advance (enhanced Schadenfreude)". As a result, the Content Planner sends a query to the Visual Designer, requesting a detail shot of the object the character is to slip on; the spectator will then anticipate the mishap, which is predicted to increase the potential success of the joke. Assume that this query is successful, and that the retrieved shot is that shown on the left of Figure 9.10. The Visual Designer then analyses the content and style of the potential material (in terms of spatial and temporal continuity), following which the Visual Constructor specifies the detailed joining of the material. The final outcome is a two-shot scene, as suggested by the stills in Figure 9.10.

Figure 9.10 Realisation part for the banana skin joke, generated out of two shots

The Content Planner now evaluates the realisation part of the joke. Since all requirements of the strategy could be fulfilled, the evaluation value is in the range "good". Finally, the Content Planner updates the Location-Memory-Structure, and then indicates the status of the realisation phase to the Structure Planner, by providing a structure such as that shown in Figure 9.11.

Strategy used: H-Strategy 2+(imp-up-strat)
meaning of the story: mishap
motivated event: slip on banana_peel
mood of main character: [Frank, pleasure+1.0]
action main character: slip
number of characters: 1

Figure 9.11 Status of the realisation phase of the banana skin joke

The Structure Planner once again compares the status of the generation phase (realisation) with the related Sequence-Structure (see Figure 9.9). Since the action for the main character is not indicated in the Sequence-Structure, the Structure Planner updates it with the action provided by the status information (i.e. "slip"). Furthermore, the strategy type must be updated, since the strategy has changed. The next step for the Structure Planner is to decide if a resolution stage is required. Following the concept of misfortune, the Structure Planner investigates the content representation of the shots generated in the realisation phase with respect to the portrayal of a mood change, either by showing a reaction or a gesture. Since such information cannot be found in the constructed material, the Structure Planner suggests a resolution phase, and then instantiates the relevant Sequence-Structure, resulting in a structure such as that shown in Figure 9.12.

Sequence-Structure
Kind: resolution
Intention: action
Form: misfortune, H-Strategy 2
Appearance: accelerate
Setting: [Single (2, [path, meadow]), Group (1, [trees])]
Subjects: [Frank [pleasure+1.0]]
Action: slip

Figure 9.12 Resolution Sequence-Structure for the banana skin joke

Finally, the Structure Planner instructs the Content Planner to continue with the generation process.
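A minimal sketch of this update step, under assumed field names, might look as follows: the uninstantiated Action field is filled from the phase status, and a resolution phase is requested whenever no shot of the realisation portrays a reaction or a gesture.

    def update_sequence_structure(seq, status):
        """Fill the uninstantiated Action field and refresh the strategy."""
        if seq.get("Action") is None:
            seq["Action"] = status["action"]
        seq["Form"] = "misfortune, " + status["strategy"]
        return seq

    def resolution_required(shot_contents):
        """Misfortune needs a visible mood change: a reaction or a gesture."""
        return not any(s.get("reaction") or s.get("gesture") for s in shot_contents)

    seq = {"Kind": "realisation", "Action": None, "Form": "misfortune, H-Strategy 4"}
    status = {"action": "slip", "strategy": "H-Strategy 2+(imp-up-strat)"}
    update_sequence_structure(seq, status)
    print(seq["Action"])                              # slip
    print(resolution_required([{"action": "slip"}]))  # True -> add a resolution phase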
9.1.4 Resolution phase

Using the conceptual structure for slip (see Table 9.2), the Content Planner constructs a request for video material that portrays an appropriate reaction by the character. The reaction is composed by considering possible resulting states and, if pertinent, the relevant object. For the case of slip and banana peel, one possibility is to request a shot of the character lying on the ground, or a shot of the character looking angrily down at either the banana peel or something not represented in the shot, or a shot of the character simply looking angry, and so on (see Result actions and Result mood in Table 9.2). The strategy for choosing between alternatives is based on a preference for reactions over moods, and, among competing potential results, the choice is based on the value of the relevant link. Assume that the Content Planner sends the following request to the Visual Designer: provide a shot where the character "Frank" looks back at the object (banana_peel).

In order to ensure a consistent filmic style throughout the scene, the Visual Designer attempts to satisfy the request of the Content Planner in terms of stylistic aims and the shot history (Location-Memory-Structure). This can be important, for example, if earlier decisions concerning editing techniques are to be repeated (examples being the consistent use of close-ups for highlighting, or cutting without bridging). Since the realisation phase of the generation predominantly makes use of shot types between "medium" and "close-up", the Visual Designer attempts to answer the query with a shot from within this range of types. Figure 9.13 shows a frame from a shot that realises this aim with respect to the existing chosen material.

Figure 9.13 Retrieved shot for the resolution phase of the banana skin joke

Once the Visual Constructor has established the join, the Content Planner evaluates the resolution part of the joke. Since a reaction, rather than a gesture, is shown, the resolution is valued as "good", even though the reaction itself is the weakest within the set of reaction links, i.e. the last element of the list (see Result actions in Table 9.2).

The Content Planner then applies the evaluation values gathered for each generation phase to evaluate the humour level of the joke. Since each of the generation phases produces the required content, though with varying degrees of success, the value of the stylistic units is high enough for a good ranking. The overall verdict is therefore "good". However, due to the need to downgrade H-Strategy 4 to the simpler H-Strategy 2, the originality is assessed as "average".

Following the Content Planner's indication of the successful termination of the joke generation process, the Structure Planner seeks options for developing further jokes from the existing story line, or attempts to provide the specification of an appropriate conclusion to the scene. Thus, the Structure Planner will first suggest the repeated application of the misfortune strategy. For our meeting example, this would mean that events should be generated in which the main character continues to attempt to reach the meeting after slipping on the banana skin, and is subjected to further mishaps, such as falling over a bench, missing a bus, failing to hail a taxi, and so on. Let us assume, for simplicity, that no visual material corresponding to such situations is available. The Structure Planner will now perform a cross-comparison of the established Sequence-Structures and the Location-Memory-Structure, to ensure that all generation goals have been satisfied.
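Before following that comparison through for the example, here is a hedged sketch of such a goal check: the phases covered by the generated Sequence-Structures are compared with the phases the event concept demands. The record shapes and names are assumptions made for illustration.

    EVENT_PHASES = ("motivation", "realisation", "resolution")

    def missing_phases(event, sequence_structures):
        """Return the event phases not yet covered for `event`."""
        covered = {s["Kind"] for s in sequence_structures if s.get("event") == event}
        return [p for p in EVENT_PHASES if p not in covered]

    seqs = [{"Kind": "motivation", "event": "meeting"},
            {"Kind": "realisation", "event": "slip"},
            {"Kind": "resolution", "event": "slip"}]
    print(missing_phases("meeting", seqs))  # ['realisation', 'resolution']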
For the above example, the comparison between the Sequence-Structures and the conceptual structure of the event "meeting" reveals that the generated video material provides neither the realisation nor the resolution stage for a meeting. Thus the Structure Planner instructs the Content Planner to generate:

a) a realisation and resolution phase for a meeting of the characters; or
b) a resolution phase for a meeting of the characters; or
c) a decomposed version of either of cases (a) or (b).

Assume that each of the above requests is unsuccessful. The Structure Planner then determines that the shot in Figure 9.7 is superfluous material. This initiates the sending of a re-editing plan to the Content Planner. The first step for the Content Planner is to instruct the Visual Constructor to erase the shot from the Location-Memory-Structure. The Content Planner then adjusts the fields Setting, Subjects and Action in the relevant Sequence-Structures. In the given example, only the motivation Sequence-Structure is affected, since no other Sequence-Structure contains information about the character to be removed. Finally, the Content Planner re-calculates the evaluation value of the joke, which, for the given example, remains the same. The final version of the banana skin joke, which is roughly 20 seconds long, is suggested by the stills in Figure 9.14 (read from left to right, top to bottom).

Figure 9.14 The banana skin joke generated by AUTEUR

9.2 The lamp post joke

For the second example, assume that the Content Planner has indicated the successful completion of the banana skin gag, as described above. As already mentioned, the Structure Planner next pursues options for developing further jokes from the existing story line. Thus, the Structure Planner uses the banana skin joke as a case basis for the creation of other jokes, guided by the repetition and exaggeration strategies presented in chapter 3 (H-Strategies 9 - 12).

9.2.1 Preparation phase

The first task for the Structure Planner, in the preparation phase of a repetition, is to find an appropriate startshot. Hence, an analysis of the motivation Sequence-Structure of the banana skin joke is performed, to retrieve information about the main actor, his or her actions, and the relevant location. The Structure Planner uses the results of this information retrieval process to generate a find-startshot query, which, for our example, may be: "Find a shot with Frank walking, or performing a similar or related action, in a setting similar to the one of shot 12, without returning shot 12". This query is sent to the Content Planner, which instructs the Visual Designer to retrieve a shot based on the provided content requirements. Figure 9.15 shows a frame from a shot that meets the content requirements.

Figure 9.15 Startshot for the lamp post joke

Based on the retrieved startshot for the new sequence, the Structure Planner performs the content analysis. The result of this is a structure such as that shown in Figure 9.16.

Shotid: 25
Shotkind: medium
Actors: Single (1, [Frank]), Group (0, [])
Objects: Single (2, [newspaper]), Group (0, [])
Actions: sequence: no; parallel: [read, move]; single action: [read, move]
Mood: [Frank [pleasure+1.0, hurry+0.5]]

Figure 9.16 Result of startshot analysis for the lamp post joke
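Recast as a small data type, purely as an illustrative reading of Figure 9.16 rather than AUTEUR's internal format, the startshot analysis might look like this:

    from dataclasses import dataclass

    @dataclass
    class ShotAnalysis:
        shotid: int
        shotkind: str   # e.g. "medium", "close-up"
        actors: list    # (grouping, count, [names]) entries
        objects: list
        actions: dict   # sequence / parallel / single-action readings
        mood: dict      # per-character emotional tokens with certainty values

    shot25 = ShotAnalysis(
        shotid=25, shotkind="medium",
        actors=[("Single", 1, ["Frank"]), ("Group", 0, [])],
        objects=[("Single", 2, ["newspaper"]), ("Group", 0, [])],
        actions={"sequence": None, "parallel": ["read", "move"],
                 "single": ["read", "move"]},
        mood={"Frank": {"pleasure": 1.0, "hurry": 0.5}},
    )
    print(shot25.actions["parallel"])  # ['read', 'move']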
The task of the Structure Planner is now to establish the first Sequence-Structure for the current generation phase, and a crucial choice is that of the next appropriate humour strategy to be applied. Since there is already a case representing a joke (the banana skin joke) stored in memory, the Structure Planner first of all attempts to analyse the applicability of the previously used strategy. However, a comparison of the results from the current startshot analysis (see Figure 9.16) with the motivation Sequence-Structure of the banana skin joke reveals that the two startshots differ in the type of actions shown, i.e. the current startshot offers parallel actions. Thus, the Structure Planner infers that the humour type "mishap" is, in the given circumstances, inapplicable.

The Structure Planner next attempts to establish the appropriate humour strategy, based on supportive strategies such as H-Strategies 17 - 19, which suggest the creation of relationships between parallel actions, either by constructing a wider context which can accommodate the actions, or by establishing a contextual relationship based on conflict. Following H-Strategy 18, which is devoted to the construction of a conflict-oriented relation, the Structure Planner launches a request to the Content Planner to investigate whether the actions (i.e. walk and read) performed by Frank in shot 25 (Figure 9.15) are mutually conflicting. A conflict-oriented relation between random actions is based on the assumption that the related actions share resources (e.g. subactions), and that these shared resources are of importance for all of the actions involved, which is indicated by qualitative modal scales such as necessary, useful, etc. (as discussed in sections 3.3 and 7.2.1.2). Hence, the Content Planner investigates the subaction links leaving the conceptual structures of "reading" and "moving" (or the subaction links of their synonyms), to detect whether both actions share the same subaction(s), and whether the potential links are tagged with a qualitative modal scale greater than or equal to "useful". An outcome of this investigation might be that both "read" and "walk" share the same subaction, i.e. "looking", where the link from "read" is tagged with "necessary", and the link from "walk" is tagged with "useful". The Content Planner orders the actions according to the importance of the subaction (i.e. [read, walk]) and returns this list to the Structure Planner. Since the list is non-empty, the Structure Planner assumes that a conflict can be established, and can now apply H-Strategy 19, which suggests that a conflict relation between actions indicates derisive humour. Thus, the humour type for the joke to be created is misfortune.
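The conflict test just described can be sketched as follows; the modal scale, the subaction tables and all names are invented for the example.

    MODAL_SCALE = {"useful": 1, "necessary": 2}  # qualitative modal scale

    SUBACTIONS = {  # action -> {subaction: modal tag on the link}
        "read": {"looking": "necessary", "holding": "useful"},
        "walk": {"looking": "useful", "stepping": "necessary"},
    }

    def conflict_list(a, b, threshold="useful"):
        """Return [a, b] ordered by the importance of a shared subaction,
        or [] if no sufficiently important subaction is shared."""
        t = MODAL_SCALE[threshold]
        shared = [s for s in SUBACTIONS[a]
                  if s in SUBACTIONS[b]
                  and MODAL_SCALE[SUBACTIONS[a][s]] >= t
                  and MODAL_SCALE[SUBACTIONS[b][s]] >= t]
        if not shared:
            return []
        s = shared[0]
        return sorted([a, b], key=lambda x: -MODAL_SCALE[SUBACTIONS[x][s]])

    print(conflict_list("read", "walk"))  # ['read', 'walk'] -> conflict established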
The next task for the Structure Planner is to determine the construction phase. As misfortune requires a resulting mood deterioration, the Structure Planner evaluates the mood of the relevant character. As "pleasure" is already marked with a certainty value of 1.0 with respect to the content of the shot (25 - Figure 9.15), the Structure Planner rejects the necessity to motivate the mood. Since the shot in Figure 9.15 contains parallel performed actions, it may be necessary to highlight one of these actions. This is investigated by the Structure Planner. The analysis of the current startshot (see Figure 9.16) indicates that the shot type is "medium". The Structure Planner investigates the conceptual relationship between camera distance and the hierarchical representation of subjects (see Table 6.3), and so determines that a "medium" shot favours the appearance of particular body parts. Hence, the Structure Planner compares the list of actions provided by the startshot analysis with the conflict list provided by the Content Planner. The result of this comparison is that "read" features in both lists, whereas the other two actions ("move" and "walk") each appear in only one list. Hence, the Structure Planner concludes that "read" is already highlighted. As conflict is used to establish the joke, no further event motivation is required. This means that neither the action, the mood nor the event needs to be motivated. Thus, the Structure Planner decides that the motivation phase can be omitted, and the generation phase for the Sequence-Structure will be "realisation".

However, constructing a realisation Sequence-Structure requires the identification of the appropriate H-Strategy. H-Strategy 20 suggests that the realisation phase for jokes based on conflict should emphasise the stronger of the parallel actions (i.e. read), but base the joke on the weaker one (i.e. walk). Thus, the joke to be created is single-action oriented. To transform the action status, i.e. parallel into single, the Structure Planner once again attempts to analyse the applicability of previously used humour strategies. This time, it turns out that H-Strategy 2 is indeed applicable. According to H-Strategy 12, a repetition strategy, H-Strategy 2 can be repeated, as so far only one joke has been generated.

On completion of the preparation phase, the Structure Planner first generates a Location-Memory-Structure, so that the startshot can be stored, then sets the evaluation value for the motivation phase to "good", and finally generates the first Sequence-Structure for the current joke, which may result in a structure such as that shown in Figure 9.17.

Sequence-Structure
Kind: realisation
Intention: action
Form: misfortune + H-Strategy 2
Appearance: accelerate
Setting: [Single (1, [newspaper]), Group (0, [])]
Subjects: [Frank [pleasure+1.0]]
Action: walk

Figure 9.17 Realisation Sequence-Structure for the lamp post joke

The Structure Planner then instructs the Content Planner to proceed with the generation of the realisation phase.

9.2.2 Realisation phase

In discussing the realisation phase of the banana skin joke, we explained the major mechanisms for generating the core parts of a joke. However, there is a slight difference in the behaviour of the Content Planner in the case of example 2, as the system is in repetition mode. In general, the Content Planner performs as described in section 9.1.3. The Content Planner uses the realisation requirements provided by H-Strategy 2 on the action "walk", which means traversing the opposition links of the conceptual structure for "walk", to retrieve the action or event which forms the basis of the joke. However, since the system is now in repetition mode, the Content Planner also considers already generated jokes, so that they will not be repeated.
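A minimal sketch of this duplicate check, under an assumed case-memory layout (the precise rule is given next):

    def novel_candidates(candidates, case_memory):
        """Filter out oppositional concepts already used to resolve a stored joke."""
        used = {case.get("resolution_action") for case in case_memory}
        return [c for c in candidates if c not in used]

    cases = [{"joke": "banana skin", "resolution_action": "slip"}]
    print(novel_candidates(["slip", "collide_with"], cases))  # ['collide_with']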
In concrete terms, before the Content Planner attempts to instantiate an oppositional concept, it compares the potential action of the oppositional concept with the action of the resolution phase of similarly structured jokes; only if both actions differ is the potential concept investigated further.² For our example, this means that the conceptual structure for "slip" will be ignored, because a similar joke already exists. Let us assume that, in this case, the conceptual structure for "collide with" represents the retrieved oppositional action for "walk". Assume also that no visual material exists to show the character, or a part of the character, colliding with a suitable object. Finally, assume that the system can retrieve an image of an object which accords with the location and the intention of the requested action (e.g. a lamp post). The result of the generation process for the realisation phase may then be "character Frank collides with a lamp post while walking", represented by a medium shot which zooms into a lamp post, as suggested by the still in Figure 9.18.

² In cases where no resolution is available, because the realisation already includes the required information, the Content Planner retrieves the information from the content representation of the relevant shot.

Figure 9.18 Realisation part for the lamp post joke

Though the system is in a position to provide a credible visual presentation, the actual action (i.e. colliding) could not be represented visually. Thus, the realisation requires that the viewer infers the collision. This fails to fulfil the requirements of timing and readiness (discussed in sections 3.2.1 and 3.2.2) for the punch line. This is the basis of the Content Planner's evaluation of the realisation part of the joke, as shown by the following inference: no presentation of the action, but a visual presentation of a related object => evaluate the event phase as poor.

The technical details of the construction of the remainder of the realisation phase for the lamp post joke are identical to those described for the realisation of the banana skin joke, such as the comparison between Sequence-Structure and joke status, and the generation of the resolution Sequence-Structure.

9.2.3 Resolution phase

For the resolution phase, assume that the same generation mechanisms can be applied as described for the banana skin joke. Note that the system retrieves a resolution shot similar to the one for the banana skin joke (see Figure 9.13), except that, in this case, the shot realises the different body-object relation by showing the character looking upwards while turning round. The final version of the lamp post joke is suggested by the stills in Figure 9.19.

Figure 9.19 The lamp post joke generated by AUTEUR

The Content Planner evaluates the resolution part of the joke as "good", since a reaction is shown, rather than a gesture. However, because the most important event phase, i.e. the realisation phase, failed to generate an acceptable visualisation, the overall verdict on the above joke is "poor". Nevertheless, the stylistic impression is evaluated as better than average, mainly due to the compact presentation of the motivational aspects in one shot.

9.3 The bus joke

For our final example, assume that the Content Planner has indicated the successful completion of the lamp post joke.
Furthermore, suppose that the system allows only one repetition of a joke based on the same action, which means that from now on jokes on "walking" are not supported, unless they are related to a different humour strategy. Finally, recall that the system continues to be in repetition mode.

9.3.1 Preparation phase

The first task for the Structure Planner is to find an appropriate startshot. The analysis of the motivation Sequence-Structures of the previously generated jokes provides usable information for the potential character (i.e. Frank) and the location (i.e. outdoors), but fails for the retrieved action, which is, in each case, "walk" and, as stated above, cannot be used. In response to the need to explore alternative links, the Structure Planner uses the conceptual link "domain" of the conceptual structure "walk" ("domain" is discussed in section 7.2.1), which leads to the conceptual structure of "motion". This concept embodies a set of links to actions or events which share an abstract criterion (e.g. motion = the change of location of a body with regard to another body or a reference system). A simplified set of actions for motion might be [fly, drive, swim, using_transport]. The first three elements of this list represent actions. AUTEUR initially attempts to construct a joke based on one of these actions, using the processes described above (i.e. try to establish a startshot featuring the existing main character and a related action, analyse the strategies used for their re-usability, and so forth). Let us assume that each of these attempts is unsuccessful. When the Structure Planner is confronted with the event using_transport, a request is sent to the Content Planner to retrieve an event structure for using_transport which agrees with the information gathered about the main character and location. The Content Planner may retrieve an event structure such as that described in Table 9.3.

Name: using_transport
Actor number: 2
Gender: male/female
Intention: move

            Motivation   Realisation   Resolution   Episode
character   [stand]      [catch]       [sit in]     []
bus         [come]       [stop]        [leave]

Table 9.3 Structure of the event "using_transport"

Since the event structure provides a motivation, a realisation and a resolution phase, it provides the Structure Planner with a template for the generation process. The Structure Planner is once again in the preparation phase, and in repetition mode, so the first task is to find an appropriate startshot. The Structure Planner uses the sequentially organised actions of the motivation phase (see Table 9.3) to generate a number of retrieval queries, such as:

• Find a shot of the character Frank standing, or performing a similar or related action, in a setting related to a bus, with a bus approaching; or
• Find a shot of an approaching bus, or of a bus performing a similar or related action, and a shot of the character Frank standing, or performing a similar or related action, in a setting related to a bus.

These queries are sent, in descending order of importance, to the Content Planner, which in each case first retrieves and adds, if possible, descriptions of appropriate objects for the environment (e.g. [bus stop]), by analysing the conceptual structure for "bus", before instructing the Visual Designer to retrieve a shot based on the given content requirements. Figure 9.20 shows two frames of a sequence the Content Planner may return.

Figure 9.20 Startshot sequence for the bus example
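As a sketch, the translation of the event template's motivation phase into ordered retrieval queries might look as follows; the template shape and the query encoding are illustrative only.

    EVENT = {"name": "using_transport",
             "motivation": {"character": "stand", "bus": "come"}}

    def motivation_queries(event, actor):
        """Build retrieval queries from the motivation phase, most specific first."""
        m = event["motivation"]
        return [
            f"shot: {actor} {m['character']}s (or similar/related action) "
            f"in a bus-related setting, with a bus approaching",
            f"shot: bus {m['bus']}s (or similar/related action); plus "
            f"shot: {actor} {m['character']}s in a bus-related setting",
        ]

    for query in motivation_queries(EVENT, "Frank"):
        print(query)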
The Structure Planner performs a content analysis for each of the startshots retrieved for the new sequence. The task of the Structure Planner is now to establish the first Sequence-Structure for the generation of the current joke. The choice of an appropriate humour strategy is based on the comparison of the current analysis of the startshot sequence with the motivation Sequence-Structures of existing jokes. The result is that, due to the similarity in action and character number between the two current startshots and previously used startshots, H-Strategy 2 can be reused. The repetition is directed according to H-Strategy 12. Since the same character features as the butt in each of the previously generated jokes, the Structure Planner decides that, for the current joke, it is also this character's action on which the joke is to be based. However, since the Structure Planner is actually using the conceptual structure for the event "using_transport" as the generation template, the action is embedded in an event and thus the Intention of the joke is "action+event". Since two shots have been provided, both causally linked through the logic of the event-phase motivation, and each shows one subject performing one action, the Structure Planner decides that the Form of the joke must be "composed". The required actions are provided, the shot type for both shots is "medium", and the event is established, so the Structure Planner infers that no motivation phase is necessary. On completion of the preparation phase, the Structure Planner generates the Location-Memory-Structure, instantiates the evaluation value for the motivation phase as "good", and finally generates the Sequence-Structure for the current joke, as shown in Figure 9.21.

Sequence-Structure
Kind: realisation
Intention: action+event
Form: misfortune + H-Strategy 2
Appearance: accelerate
Setting: [Single (3, [bus-stop, street, tree]), Group (0, [])]
Subjects: [Frank [pleasure+1.0]]
Action: catch

Figure 9.21 Realisation Sequence-Structure for the bus joke

The Structure Planner then instructs the Content Planner to proceed with the generation of the realisation phase.

9.3.2 Realisation phase

The task of the Content Planner is essentially to establish a misfortune related to the action of "catch". The Content Planner therefore attempts to find an opposition action for "catch", which may be "miss". Analysing the conceptual structure of "miss", the Content Planner detects that one meaning of "missing" requires an object which moves away from a character. The object may be of the type [taxi, bus, plane], and the related motion may be of the type "drive" or "leave". The Content Planner uses the above information for the action "miss" to request from the Visual Designer a shot showing a bus moving away, in an environment similar to the one provided in the startshot sequence. Assume that the Visual Designer retrieves an appropriate shot.

The next step for the Content Planner is to establish the action of the character. The result of traversing the opposition links of "miss" might be the action "look after". The Content Planner sends a query to the Visual Designer, asking for a shot showing the character in the same location as in the startshot sequence, performing a similar action, and where the direction of the character's sight matches the direction of the bus.
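The two retrieval steps just described, the object moving away and then the actor looking after it, might be sketched as follows, with all names and shapes assumed for the example:

    MISS = {"object_types": ["taxi", "bus", "plane"],
            "object_motion": ["drive", "leave"]}

    def realisation_queries(actor, obj, environment):
        """Queries for the 'miss' realisation: the object moving away,
        then the actor looking after it with a matching sight direction."""
        if obj not in MISS["object_types"]:
            return []
        return [
            {"object": obj, "motion": MISS["object_motion"],
             "environment_like": environment},
            {"actor": actor, "action": "look after",
             "sight_direction": "matches object motion",
             "environment_like": environment},
        ]

    for query in realisation_queries("Frank", "bus", ["bus_stop", "street", "tree"]):
        print(query)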
Assume that such a shot can be retrieved. On completion of the realisation phase, a possible outcome may be the shot sequence suggested by the stills in Figure 9.22.

Figure 9.22 Realisation sequence for the bus joke

9.3.3 Resolution phase

For the resolution, assume that the same generation mechanisms as described for the previous jokes are applied. However, in this example, the system retrieves a shot showing a gesture (i.e. the facial expression of anger), rather than the performance of an action, for the resolution. The final version of the bus joke is suggested by the stills in Figure 9.23 (read from left to right, top to bottom).

Figure 9.23 The bus joke as generated by AUTEUR

As the content of each of the three phases (motivation, realisation and resolution) is, in itself, exactly as required, the joke is valued as "good" by the Content Planner. Note that the event of missing a bus is not, in itself, necessarily funny. However, AUTEUR generates the joke because the system is in repetition mode; the joke is that, once again, the same character experiences a mishap.

Assume that no further repetitions can be generated. The final task of the Structure Planner is to instruct the Retrieval System to present the generated streams of shot ids to the user, in the order of their generation, i.e. the banana skin joke, followed by the lamp post joke, and finishing with the bus joke.

9.4 Conclusion

The examples presented above should provide the reader with a suitable impression of the operation of AUTEUR. Though AUTEUR is but a prototypical system, we have demonstrated here that it is capable of generating visual jokes based on single and parallel actions for one character, or on single-action interaction between two active characters. We have also demonstrated the interaction between the multiple planners involved in the editing process. Finally, we have demonstrated the automatic evaluation of a joke according to a humour scale.

The AUTEUR system, and its theoretical foundations, should be regarded as a platform that demonstrates the potential for automated thematic film editing in restricted, yet complex, domains. Despite its complexity, AUTEUR is capable of generating only a restricted range of humorous film sequences. However, this complexity is a necessary basis for improving the system so that it can provide more sophisticated humour and, eventually, deal with themes additional to that of humour. We refer to these points in the next chapter.

Chapter 10
Achievements and conclusions

10.1 Achievements

This thesis has presented a planner-based approach to the application of video semantics and theme representation in the automated editing of visual stories at the level of events. In order to understand the dynamics of editing, the current author carried out a "knowledge elicitation" exercise that involved studying and interviewing editors at work in their own environment, i.e. the cutting rooms of the WDR¹. In common with those involved in the enterprise of knowledge elicitation in other domains, I found that the expertise of the editors could not be codified in the form of simple rules. The complexity involved in the video editing process is obscured by the fact that those involved appear to manage it effortlessly, though many different influences, and knowledge at varying levels of detail, are involved.

¹ WDR (Westdeutscher Rundfunk - Köln) is the largest television broadcasting centre in Germany.
Some of the complexity of the task is reflected in the different influences to be found in this thesis, and in the architecture of AUTEUR. A central contribution of my research is the novel application of theories of cinema, humour and narrativity. In particular, these theories have informed my development of:

• A textual representation which describes semantic, temporal and relational features of video in hierarchically organised structures, and which overcomes the limitations of keyword-based approaches. The essential categories for the proposed ontology are action, character, object, relative position, screen position, geographical space, functional space and time.

• A set of 26 humour strategies, which combine the logic of narrative structures with functional operators of comic primitives. The identified primitives within humour are grouped into two classes: supportive primitives (exaggeration, timing and readiness) and constructive primitives (incongruity and derision). Furthermore, I described a simple, but novel, mechanism to evaluate jokes.

• A simplified model of the editing process, which covers the juxtaposition of takes, shots and scenes, for the rough-cut stage.

• A set of 37 editing strategies, which introduce novel schemes for the appropriate visual and cinematographic presentation of a narrative event. The strategies support continuity editing, and focus on the essential narrative aspects, context and form. I described a mechanism for relating the intention of a narrative to its presentation in shot form. Furthermore, I introduced an analytic mechanism with which a system can visually guide the viewer of a video sequence, so that he or she can identify information as relevant or purely descriptive. Moreover, I introduced a scheme to establish and maintain spatial and temporal continuity over several shots, based partly on the visual decomposition of a character and/or actions, and partly on the decomposition of spatial relationships between characters, and between characters and their screen positions. Finally, I demonstrated strategies which shape the temporal rhythm of a sequence by means of physical clipping.

• An ontological representation of narrative elements such as actions, events, and emotional and visual codes, based on a semantic net of conceptual structures related via six types of semantic links (e.g. synonym, subaction, opposition, ambiguity, association, conceptual). A coherent action-reaction dynamic is provided by the introduction of three event phases, i.e. motivation, realisation and resolution.

In order to demonstrate the applicability of the above representations, I implemented an experimental system, AUTEUR, which achieves a limited degree of success in producing humorous film sequences. The architecture of AUTEUR embodies the synchronisation of structural and stylistic requirements for automatically generating visual stories, by providing a planning system for each overall development level (i.e. structure and content).

10.2 Conclusions

The AUTEUR system and its theoretical foundation are best regarded as a platform that demonstrates the feasibility of automated thematic film editing in restricted, yet complex, domains.
We showed how automated film editing can support automated help and, despite the implementational complexity of the system, that the extracted principles can also provide a source of help to designers and workers in other related fields. However, a number of problems are yet to be solved, and though most of them were touched upon in this thesis, it is useful to revisit the major issues and reflect on the strengths and weaknesses of this work.

The structured textual approach to the representation of video content provides an objective representation and thus does not restrict the possible connotative meanings of the material. At this moment in time, the structures provided appear to be sufficiently rich to describe complex film-specific features and the denotative aspects of film, though some of the structures are rudimentary, such as the representation of gestures and the representation of groups of subjects. In theory, the current representation enables the annotation of video material of arbitrary length and content. It should be noted that the notion of arbitrary re-use of video material is illusory, since, in general, the computational effort involved in comparing large numbers of highly varied content-based descriptions to establish continuity would result in an unacceptable degradation of system performance. However, if we focus on domain-dependent applications, a suitable selection of material would reduce this complexity. For example, we may ensure that our database contains only shots in which:

• a small set of characters are found in similar locations,
• the locations are simple,
• the actions are available from different angles, points of view, etc.

The strongest qualitative drawback of the representation is the absence of sound. At the current stage of my research, I cannot predict the extent to which the introduction of sound would necessitate the re-design of representational structures for video content, but it is obvious that at least the structures for "actor action", "object action" and "setting" would require modification. Nevertheless, sound would enhance AUTEUR's abilities as a generator of meaningful sequences.

AUTEUR is a research platform, and as such achieves limited success in automatically generating video material to suggest a given theme. The current version of AUTEUR produces only a restricted range of humorous scenes, predominantly of the so-called slapstick style, e.g. "slipping on a banana skin". However, AUTEUR achieves this in ways that take account of knowledge of filmic acceptability.

The discussion of the cinematic image showed that a very large number of codes are involved in the meaning of images. The current version of AUTEUR makes use of but a few of these, such as:

• emotional codes, which are represented as conceptual structures (e.g. for rage or pleasure), and their visualisation in the form of gestures of body, face, hands and limbs (e.g. smiling, nodding, pointing), or actions (e.g. whistling as an indication of pleasure);

• cinematic codes, as exemplified by the relationship between camera distance and hierarchical representations of subjects, in combination with conceptual relationships between filmic devices and narrative functionality. Additionally, there are the spatial relationships between shot distances;

• cultural codes, in that our representations of jokes, editing and story structures reflect the humour, film and narrative schemes specific to European culture.
Each of the above code systems could be dealt with more extensively. For example, the emotional codes could be improved by extending our action-gesture-centred approach, used in the representation of video content, with a more complete representational scheme for hand and body gestures. A useful augmentation to the cinematic codes would be colour codes (e.g. bright colours support the impression of a good mood, and there are relationships between colours and certain abstract concepts). Further refinements of the cinematic codes could be achieved by including additional cinematographic aspects of video content (e.g. camera angle, or shot contrast) or denotative aspects of video content (e.g. season, or structures of objects - form, colour, size, etc.), and linking these to conceptual structures of abstract concepts. It must be stressed that these amendments would not necessitate great change to the existing representation structures, but would greatly improve AUTEUR's inferencing capabilities.

Spatial and temporal continuity editing is relatively well provided for by our editing scheme. There is a need, however, to improve the temporal aspects of physical editing and, in particular, to adequately address the influence of the speed of action on the rhythmical structure of a sequence, which is important in slowing and accelerating the pace of a sequence.

At this moment in time, I make no claim as to the general applicability of the humour techniques embodied in AUTEUR, particularly since so far they cater only for the interaction between, at most, two characters. In order to enhance AUTEUR's ability to automatically generate a meaningful humorous sequence, further work should be devoted to the definition of richer strategies for particular styles of visual humour (e.g. incongruous humour), or of more sophisticated humour strategies, such as transforming the behaviour of a character from human into automaton, or vice versa. With a larger set of humour strategies, it would then be necessary to perform more experiments involving many more conceptual descriptions in a more extensive semantic net, along with a much larger video database. This would enable us to assess the performance of our system in a more realistic situation.

The real challenge, however, is the consideration of narrative structures larger than single events. The generation of higher-order narrative structures, e.g. episodes, requires more consideration of such stylistic features as genre, and of different themes, such as "tragedy", and I assume that this will also lead to a need for an increase in the number of supportive and constructive thematic strategies. Thus, to use AUTEUR as the basis of a truly flexible system for the generation of meaningful thematic sequences at any narrative level, major amendments to the Structure Planner and Content Planner would be needed. Future work should also focus on the incorporation of tools to support video annotation, the provision of conceptual representations, and the editing of thematic and film editing strategies.

10.3 Postscript

This thesis has shown that an artificial system can, in some respects, autonomously create emotionally stimulating visual events, and it is my hope that the work undertaken will provide input into research into the automated generation of video scenes, and into the interpretation and analysis of video.
The future may see the emergence of systems that can actively influence our creative work with, and improve our understanding of, still and moving images. If this thesis helps to solve some of the problems associated with the development of such systems, then it will have achieved much more than the author dared to hope.
First Cut: Conversations with Film Editors. Berkeley: University of California Press. Olson, E. (1968). The Theory of Comedy. Bloomington: Indiana University Press. Oomoto, E., & Tanaka, K. (1993). OVID: Design and Implementation of a VideoObject Database System. IEEE Transactions On Knowledge And Data Engineering, 5(4), 629-643. Ortony, A., Clore, G. L., & Collins, A. (1988). The Cognitive Structure of Emotions. New York: Cambridge University Press. Parkes, A. P. (1987). Towards a Script-Based Representation Language for Educational Films. Programmed Learning and Educational Technology, 24(3), 234 246. Parkes, A. P. (1989a) An Artificial Intelligence Approach to the Conceptual Description of Videodisc Images. Ph.D. Thesis, Lancaster University. Parkes, A. P. (1989b). The Prototype CLORIS system: Describing, Retrieving and Discussing Videodisc Stills and Sequences. Information Processing and Management, 25(2), 171 - 186. Parkes, A. P. (1989c). Settings and the Settings Structure: The Description and Automated Propagation of Networks for Perusing Videodisk Image States. In N. J. Belkin & C. J. van Rijsbergen (Ed.), SIGIR ’89, (pp. 229 - 238). Cambridge, MA: Parkes, A. P. (1992). Computer-controlled video for intelligent interactive use: a description methodology. In A. D. N. Edwards &. S.Holland (Eds.), Mulimedia Interface Design in Education (pp. 97 - 116). New York: Springer-Verlag. Bibliography 227 Parkes, A. P., & Self, J. (1988). Video-Based Intelligent Tutoring Of Procedural Skills. In ITS-88, (pp. 454 - 461). Montréal: June 1 -3, 1988. Peirce, C. S. (1960). The Collected Papers of Charles Sanders Peirce - 1 Principles of Philosophy and 2 Elements of Logic, Edited by Charles Hartshorne and Paul Weiss. Cambridge, Massachusetts: The Belknap Press of Harvard University Press. Pentland, A., Picard, R., Davenport, G., & Welsh, B. (1993). The BT/MIT Project on Advanced Tools for Telecommunications: An Overview (Perceptual Computing Technical Report No. 212). MIT. Pentland, A. P., Picard, R., Davenport, G., & Haase, K. (1994). Video and Image Semantics: Advanced Tools for Telecommunications (Technical Report No. 283). MIT. Petric, V. (1987). Constructivism in Film. Cambridge: Cambridge University Press. Piaget, J. (1970). Structuralism - edited by Chaninah Maschler. New York: Basic Books Inc. Picard, R. W., & Liu, F. (1994). A new Wold ordering for image similarity (Technical Report No. 237). MIT. Picard, R. W., & Minka, T. P. (1995). Vision texture for annotation. Multimedia Systems, 3(1), 3 - 14. Pinhanez, C. S., & Bobick, A. F. (1995). Intelligent Studios: Using Computer Vision to Control TV Cameras. In J. Bates., B. Hayes-Roth & H. Kitano (Ed.), IJCAI-95 Workshop on AI and Entertainment and AI/Alife, (pp. 69 - 76). Montréal, Canada: August 19. Price, G. (1973). A grammar of story: An introduction. The Hague: Mouton. Propp, V. W. (1968). Morphology of the Folktale. University of Texas Press. Pudovkin, V. I. (1968). Film Technique And Film Acting. London: Vision Press Limited. Quillian, M. R. (1966). Semantic Memory. In M. Minsky (Eds.), Semantic Information Processing (pp. 227 - 270). Cambridge, Mass.: MIT Press. Quillian, M. R. (1985). Word Concepts: A Theory and Simulation of Some Basic Semantic Capabilities. In R. J. Brachman & H. J. Levesque (Eds.), Readings in Knowledge Representation (pp. 97 - 118). Los Altos: Morgan Kaufmann. Rabinger, M. (1989). Directing - Film Techniques and Aesthetics. Boston: Focal Press. Raphael, B. (1971). The Frame Problem in Problem Solving Systems. In N. 
Findler & B. Meltzer (Eds.), Artificial Intelligence and Heuristic Programming (pp. 159 169). New York: American Elsevier. Raskin, V. (1985). Semantic Mechanisms of Humor. Dordrecht: D. Reidel Publishing Company. Bibliography 228 Reisz, K., & Millar, G. (1969). The Technique of Film Editing. New York: Focal/Hastings House. Ricoeur, P. (1985). Time and Narrative. Chicago: The University of Chicago Press. Riesbeck, C. K., & Schank, R. C. (1989). Inside case-based reasoning. Hillsdale, New Jersey: Lawrence Erlbaum Associates. Rosenblum, R., & Karen, R. (1979). When The Shooting Stops, The Cutting Begins. New York: Da Capo Press, Inc. Rothbart, M. K. (1976). Incongruity, Problem-Solving and Laughter. In T. Chapman & H. Foot (Eds.), Humor and Laughter: Theory, Research and Applications (pp. 37 54). New York: John Wiley & Sons. Rothbart, M. K., & Pien, D. (1977). Elephants and Marshmallows: A Theoretical Synthesis of Incongruity-Resolution and Arousal Theories of Humour. In A. J. Chapman & H. C. Foot (Eds.), It’s a funny Thing, Humour (pp. 37 - 40). Oxford: Pergamon. Rumelhart, D. E. (1975). Notes on a schema for stories. In D. G. Bobrow & A. Collins (Eds.), Representation and Understanding (pp. 211 - 236). New York: Academic Press. Rumelhart, D. E. (1977). Understanding and summarizing brief stories. In D. Laberge & S. J. Samuels (Eds.), Basic processes in reading: Perception and comprehension (pp. 265 - 303). Hillsdale, N.J.: Lawrence Erlbaum Associates. Russel, K., Starner, T., & Pentland, A. (1995). Unencumbered Virtual Environments. In J. Bates., B. Hayes-Roth & H. Kitano (Ed.), IJCAI-95 Workshop on AI and Entertainment and AI/Alife, (pp. 58 - 62). Montréal, Canada: August 19, 1995. Ryan, M.-L. (1991). Possible Worlds, Artificial Intelligence and Narrative Theory. Bloomington: Indiana University Press. Sacerdoti, E. D. (1977). A Structure for Plans and Behaviour. New York: Elsevier. Sacerdoti, E. D. (1990a). The Nonlinear Nature of Plans. In J. Allen, J. Hendler, & A. Tate (Eds.), Readings in Planning (pp. 162 - 170). San Mateo: Morgan Kaufmann Publishers. Sacerdoti, E. D. (1990b). Planning in a Hierarchy of Abstraction Space. In J. Allen, J. Hendler, & A. Tate (Eds.), Readings in Planning (pp. 98 - 108). San Mateo: Morgan Kaufmann Publishers. Sack, W. (1993). Coding News And Popular Culture. In The International Joint Conference on Artificial Intelligence (IJCA93) Workshop on Models of Teaching and Models of Learning. Chambery, Savoie, France. Sack, W., & Davis, M. (1994). IDIC: Assembling Video Sequences from Story Plans and Content Annotations. In IEEE International Conference on Multimedia Computing and Systems. Boston, Ma: May 14 - 19, 1994. Sack, W., & Don, A. (1993). Splicer: An Intelligent Video Editor (Unpublished Working Paper). Bibliography 229 Salomon, G., & Cohen, A. A. (1977). Television formats, mastery of mental skills, and the acquisition of knowledge. Journal of Educational Psychology, 69, 612 - 619. Sandewall, E. (1972). An Approach to the Frame Problem and Its Implementation. In B. Meltzer & D. Mitchie (Eds.), Machine Intelligence 7. Edinburgh: Edinburgh University Press. Saussure, F. d. (1966). Course in General Linguistics - edited by Charles Balley, Albert Sechehaye and Albert Riedlinger. New York: McGraw-Hill. Schank, R. C. (1972). Conceptual Dependency: A theory of natural language understanding. Cognitive Psychology, 3, 552 - 631. Schank, R. C. (1982). Dynamic memory. New York: Cambridge University Press. Schank, R. C. (1991). 
Case-based teaching: Four experiences in educational Software Design. (Technical Report No. 7). Institute for Learning Sciences, Northwestern University. Schank, R. C. (1994). Active Learning through Multimedia. IEEE MultiMedia, 1(1), 69 - 78. Schank, R. C., & Abelson, R. (1977). Scripts, Plans, Goals And Understanding. Hillsdale, New Jersey: Lawrence Earlbaum Associates. Schank, R. C., Kass, A., & Riesbeck, C. (1994). Inside Case-Based Explanation. Hillsdale, N.J.: Lawrence Erlbaum Associates. Schank, R. C., & Riesbeck, C. K. (1981). Inside Computer Understanding. Hillsdale, New Jersey: Lawrence Erlbaum Associates. Schopenhauer, A. (1966). The World As Will And Representation. New York: Dover Publications, Inc. Schumm, G. (1993). Feinschnitt - die verborgene Arbeit an der Blickregie. In H. Beller (Eds.), Handbuch der Filmmontage - Praxis und Prinzipien des Filmschnitts (pp. 224 - 225). München: TR-Verlagsunion. Segre, C. (1979). Structure and Time - Narration, Poetry, Models. Chicago: The University of Chicago Press. Shoham, Y. (1987). Temporal Logics in AI: Semantical and Ontological Considerations. Artificial Intelligence, 33(1), 89 - 104. Shultz, T. R. (1976). A Cognitive-Developmental Analysis of Humor. In T. Chapman & H. Foot (Eds.), Humor and Laughter: Theory, Research and Applications (pp. 11 36). New York: John Wiley & Sons. Siddons, H. (1968). Practical Illustrations of Rhetorical Gesture and Action. New York: Benjamin Blom, Inc. Smith, B. K., Agganis, A., & Reiser, B. J. (1995). Children and Artificial Life Revisited. In J. Bates., B. Hayes-Roth & H. Kitano (Ed.), IJCAI-95 Workshop on AI and Entertainment and AI/Alife, (pp. 6 - 13). Montréal, Canada: August 19, 1995. Bibliography 230 Smith, T. C., & Witten, I. H. (1991). A Planning Mechanism for Generating Story Text. Literary and Linguistic Computing, 6(2), 119 - 126. Sowa, J. F. (1984). Conceptual Structures: Information Processing in Mind and Machine. Reading, MA: Addison-Wesley Publishing Company. Spottiswode, R. J. (1955). A Grammar Of The Film - an analysis of film technique. London: Faber & Faber. Stein, N. L., & Glenn, C. G. (1979). An analysis of story comprehension in elementary school children. In R. O. Freedle (Eds.), New directions in discourse processing Norwood, N.J.: Ablex Pub. Corp.. Sternberg, M. (1978). Expositional Modes and Temporal Ordering in Fiction. Baltimore: The Johns Hopkins University Press. Storyline Pro (1993). Truby’s Writer Studio. Developed by John Truby. http://hollywoodnetwork.com/hn/shopping/kiosk/wcs40.htm Strassmann, S. (1994,). Semi-Autonomous Animated Actors. In National Conference on Artificial Intelligence (July 31 - August 4),(pp. 128 - 134). Seattle, Washington: AAAI Press. Striedter, J. (1971). Russischer Formalismus: Texete zur allgemeinen Literaturtheorie und zur Theorie der Prosa. München: Fink. Suleiman, S. R. (1983). Authoritarian Fictions: The Ideological Novel As a Literary Genre. New York: Columbia University Press. Suls, J. (1977). Cognitive and Disparagement Theories of Humour: A Theoretical and Empirical Synthesis. In A. J. Chapman & H. C. Foot (Eds.), It’s a funny Thing, Humour (pp. 41 - 45). Oxford: Pergamon. Suls, J. M. (1972). A two-stage model for the appreciation of jokes and cartoons: An information-processing analysis. In J. H. Goldstein & P. E. McGhee (Eds.), The Psychology of Humour (pp. 81 - 100). New York, London: Academic Press. Tate, A., Hendler, J., & Drummond, M. (1990). A Review of AI Planning Techniques. In J. Allen, J. Hendler, & A. 
Tate (Eds.), Readings in Planning (pp. 26 49). San Mateo: Morgan Kaufmann Publishers. Thorndyke, P. W. (1977). Cognitive structures in comprehension and memory of narrative discourse. Cognitive Psychology, 9, 77 - 100. Tonomura, Y., Akutsu, A., Taniguchi, Y., & Suzuki, G. (1994). Structured Video Computing. IEEE MultiMedia, 1(3), 34 - 43. Tosa, N., & et al. (1995). Network Neuro-Baby with robotic hand. J. Bates., B. Hayes-Roth & H. Kitano (Ed.), IJCAI-95 Workshop on AI and Entertainment and AI/Alife, (pp. 48 - 53). Montréal, Canada: August 19, 1995. Tudor, A. (1974). Image And Influence. London: George Allen & Unwin Ltd. Ueda, H., Miyatake, T., Sumino, S., & Nagasaka, A. (1993). Automatic Structure Visualization for Video Editing. In ACM & IFIP INTERCHI ’93, (pp. 137 - 141). Bibliography 231 Ueda, H., Miyatake, T., & Yoshizawa, S. (1991). IMPACT: An Interactive NaturalMotion-Picture Dedicated Multimedia Authoring System. In Proc ACM CHI ’91 Conference on Human Factors In Computing Systems, (pp. 343-450). van Dijk, T. (1972). Some aspects of text grammars: a study in theoretical linguistics and poetics. The Hague: Mouton. Wilensky, R. (1978) Understanding goal-based stories. Ph.D., Yale University. Wilensky, R. (1983a). Planing and Understanding - A Computational Approach to Human Reasoning. Reading, Massachusetts: Addison-Wesley Publishing Company. Wilensky, R. (1983b). Points: A Theory of the Structure of Stories in Memory. In W. G. Lehnert & M. H. Ringle (Eds.), Strategies for Natural Language Processing (pp. 345 - 376). Hillsdale, New Jersey: Lawrence Erlbaum Associates. Wilensky, R. (1983c). Story grammars versus story points. The Behavioral and Brain Sciences, 6(4), 579 - 623. Wilensky, R. (1990). A Model for Planning in Complex Situations. In J. Allen, J. Hendler, & A. Tate (Eds.), Readings in Planning (pp. 263 - 274). San Mateo: Morgan Kaufmann Publishers. Wilkins, D. E. (1990). Domain-independent Planning: Representation and Plan Generation. In J. Allen, J. Hendler, & A. Tate (Eds.), Readings in Planning (pp. 319 335). San Mateo: Morgan Kaufmann Publishers. Winograd, T. (1985). Frame Representations and the Declarative/Procedural Controversy. In R. J. Brachman & H. J. Levesque (Eds.), Readings in Knowledge Representation (pp. 357 - 370). San Mateo, California: Morgan Kaufmann Publishers. Wolff, C. (1972). A Psychology of Gesture. New York: Arno Press. Woods, W. A. (1985). What’s in a link: Foundations for Semantic Networks. In R. J. Brachman & H. J. Levesque (Eds.), Readings in Knowledge Representation (pp. 218 - 241). San Mateo, California: Morgan Kaufmann Publishers. Wulff, H. J. (1993). Der Plan macht’s. In H. Beller (Eds.), Handbuch der Filmmontage - Praxis und Prinzipien des Filmschnitts (pp. 178 - 189). München: TRVerlagsunion. Yeung, M. M., Yeo, B., Wolf, W. & Liu, B. (1995). Video Browsing using Clustering and Scene Transitions on Compressed Sequences. In Proceedings IS&T/SPIE ’95 Multimedia Computing and Networking, San Jose. SPIE (2417), 399 413. Zhang, H., Gong, Y., & Smoliar, S. W. (1994). Automated parsing of news video. In IEEE International Conference on Multimedia Computing and Systems, (pp. 45 54). Boston: IEEE Computer Society Press. Zhang, H., Kankanhalli, A., & Smoliar, S. W. (1993). Automatic Partitioning of FullMotion Video. Multimedia Systems, 1, 10 - 28. Zillmann, D., & Cantor, J. R. (1976). A Disposition Theory of Humour and Mirth. In T. Chapman & H. Foot (Eds.), Humor and Laughter: Theory, Research and Applications (pp. 97 - 115). 
Filmography

Abbott and Costello meet the Mummy (Charles Lamont, USA - 1955)
Airplane! (Jim Abrahams, David Zucker & Jerry Zucker, USA - 1980)
Alexander Nevsky (Sergei M. Eisenstein, USSR - 1938)
Arroseur arrosé, Le (Louis Lumière, France - 1895)
Back to the Future III (Robert Zemeckis, USA - 1990)
Bananas (Woody Allen, USA - 1971)
Battleship Potemkin (Sergei M. Eisenstein, USSR - 1925)
Berlin, die Symphonie der Großstadt (Walter Ruttmann, Germany - 1927)
Big (Penny Marshall, USA - 1988)
Bill and Ted's Excellent Adventure (Stephen Herek, USA - 1989)
Blade Runner (Ridley Scott, USA - 1982)
Brats (James Parrott, USA - 1930)
Bringing up Baby (Howard Hawks, USA - 1938)
Citizen Kane (Orson Welles, USA - 1941)
Dazed and Confused (Richard Linklater, USA - 1993)
Do the right thing (Spike Lee, USA - 1989)
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (Stanley Kubrick, GB - 1963)
Ferris Bueller's Day Off (John Hughes, USA - 1986)
Gold Rush, The (Charles Chaplin, USA - 1925)
Groundhog Day (Harold Ramis, USA - 1993)
Home Alone (Chris Columbus, USA - 1990)
Idle Class, The (Charles Chaplin, USA - 1921)
Immigrant, The (Charles Chaplin, USA - 1917)
Kabinett des Dr. Caligari, Das (Robert Wiene, Germany - 1919)
Last Emperor, The (Bernardo Bertolucci, China, Italy, UK - 1987)
Letzte Mann, Der (Friedrich Wilhelm Murnau, Germany - 1924)
Modern Times (Charles Chaplin, USA - 1936)
Monty Python's the Meaning of Life (Terry Gilliam & Terry Jones, GB - 1983)
Mr. Deeds Goes to Town (Frank Capra, USA - 1936)
Naked (Mike Leigh, UK - 1993)
Naked Gun, The (David Zucker, USA - 1988)
Naked Gun 2 1/2, The (David Zucker, USA - 1991)
Night at the Opera, A (Sam Wood, USA - 1935)
October (Sergei M. Eisenstein, USSR - 1927)
Pulp Fiction (Quentin Tarantino, USA - 1994)
Rear Window (Alfred Hitchcock, USA - 1954)
Rosemary's Baby (Roman Polanski, USA - 1968)
Shame (Ingmar Bergman, Sweden - 1968)
Shoulder Arms (Charles Chaplin, USA - 1918)
Snow White and the Seven Dwarfs (Walt Disney & David Hand, USA - 1937)
Sunrise (Friedrich Wilhelm Murnau, USA - 1927)
Tin Toy (John Lasseter, USA - 1988)
Toy Story (John Lasseter, USA - 1995)
Trainspotting (Danny Boyle, GB - 1995)
Take the money and run (Woody Allen, USA - 1969)
Wayne's World (Penelope Spheeris, USA - 1992)

Appendix

The following is a generation trace produced by AUTEUR (see chapter 9).

(dolphin)~/auteur4> sicstus
SICStus 2.1 #7: Tue Mar 16 09:53:11 GMT 1993
?- [start].
{consulting /tmp_mnt/home/fn/auteur4/start.pl...}
{/tmp_mnt/home/fn/auteur4/start.pl consulted, 170 msec 7184 bytes}
yes
?- system.
yes
?- go(10,humour).
Start instantiation :
==============
Instantiate databases:
{consulting /tmp_mnt/home/fn/auteur4/db_shot.pl...}
{/tmp_mnt/home/fn/auteur4/db_shot.pl consulted, 870 msec 40800 bytes}
{consulting /tmp_mnt/home/fn/auteur4/dicsem.pl...}
{/tmp_mnt/home/fn/auteur4/dicsem.pl consulted, 140 msec 10048 bytes}
{consulting /tmp_mnt/home/fn/auteur4/diccon.pl...}
{/tmp_mnt/home/fn/auteur4/diccon.pl consulted, 160 msec 11152 bytes}
{consulting /tmp_mnt/home/fn/auteur4/dicvisual.pl...}
{/tmp_mnt/home/fn/auteur4/dicvisual.pl consulted, 90 msec 7488 bytes}
Load modules:
{consulting /tmp_mnt/home/fn/auteur4/startshot_analysis.pl...}
{/tmp_mnt/home/fn/auteur4/startshot_analysis.pl consulted, 650 msec 21008 bytes}
{consulting /tmp_mnt/home/fn/auteur4/story_planner.pl...}
{/tmp_mnt/home/fn/auteur4/story_planner.pl consulted, 460 msec 14400 bytes}
{consulting /tmp_mnt/home/fn/auteur4/scene_planner.pl...}
{/tmp_mnt/home/fn/auteur4/scene_planner.pl consulted, 520 msec 22496 bytes}
{consulting /tmp_mnt/home/fn/auteur4/scene_analyser.pl...}
{/tmp_mnt/home/fn/auteur4/scene_analyser.pl consulted, 380 msec 11232 bytes}
{consulting /tmp_mnt/home/fn/auteur4/scene_creator.pl...}
{/tmp_mnt/home/fn/auteur4/scene_creator.pl consulted, 340 msec 6096 bytes}
{consulting /tmp_mnt/home/fn/auteur4/motivation.pl...}
{/tmp_mnt/home/fn/auteur4/motivation.pl consulted, 2020 msec 51200 bytes}
{consulting /tmp_mnt/home/fn/auteur4/realisation.pl...}
{/tmp_mnt/home/fn/auteur4/realisation.pl consulted, 1040 msec 27744 bytes}
{consulting /tmp_mnt/home/fn/auteur4/resolution.pl...}
{/tmp_mnt/home/fn/auteur4/resolution.pl consulted, 700 msec 14448 bytes}
{consulting /tmp_mnt/home/fn/auteur4/repetition.pl...}
{/tmp_mnt/home/fn/auteur4/repetition.pl consulted, 350 msec 19696 bytes}

Start instantiation :
==============
Instantiate constants / variables :
Startshot : 10
Theme : humour
perform : []
Used jokes : []
filenumber : 1

Start material organiser :
=================
[walk+1+32]
Analysisset : [1,0,1,0,[frank,[[],[],[walk+1+32]]],[pleasure/1/32+0.5],[[path/1/32],[],[]]],[],[],[]]
Successfully finished: Material organiser.

Start scene creation :
===============

MOTIVATION
Analysisset : [1,0,1,0,[frank,[[],[],[walk+1+32]]],[pleasure/1/32+0.5],[[path/1/32],[],[]]],[],[],[]]
Interpretation : []
Planlist : [event+s_action+[misfortune,ambiguty,stupidity]]
Shotlist : []
Plan : event+s_action+misfortune
Event : meeting
Overall idea is : date
Possible mood shots : pleasure -> [13/29/37,14/47/58,16/118/140,18/1/32,21/1/32,22/15/27,22/1/32,23/1/27,24/1/27,25/1/27,29/1/18,6/63/72]
Possible action shots : search not necessary since it is a s_action.
Shotlist-Action / action / editing kind : [10/1/32] walk [no]
The mood shot to be added : 22/1/27
Shotlist-Mood / mood / editing kind : [10/1/18,22/1/27,10/29/32] pleasure insert
The shot to be added for the event : 25/1/27
Shotlist-Event / event / editing kind : [25/1/27,10/1/18,22/15/27,10/29/32] meeting+1 join1

REALISATION
Analysis set : [1,0,1,0,[frank,[[],[],[walk+1+32]],[pleasure/1/32+0.5],[[path/1/32],[],[]]],[],[],[]]
Interpretation : [[walk,[no],[10/1/32]],[pleasure,[insert],[10/1/18,22/15/27,10/29/32]],[meeting+1,2,[join1],[25/1/27,10/1/18,22/15/27,10/29/32],25]
Planlist : event+s_action_actors+misfortune+unexpectedness
Shotlist : [25/1/27,10/1/18,22/15/27,10/29/32]
try to create event person-person oriented joke :
Could not use any of the known strategies for realisation plan. Try something else.
REALISATION
Analysis set : [1,0,1,0,[frank,[[],[],[walk+1+32]],[pleasure/1/32+0.5],[[path/1/32],[],[]]],[],[],[]]
Interpretation : [[walk,[no],[10/1/32]],[pleasure,[insert],[10/1/18,22/15/27,10/29/32]],[meeting+1,2,[join1],[25/1/27,10/1/18,22/15/27,10/29/32]],25]
Planlist : event+s_action+misfortune+unexpectedness
Shotlist : [25/1/27,10/1/18,22/15/27,10/29/32]
try to create an event single action person joke :
Could not use any of the known strategies for realisation plan. Try something else.

REALISATION
Analysis set : [1,0,1,0,[frank,[[],[],[walk+1+32]],[pleasure/1/32+0.5],[[path/1/32],[],[]]],[],[],[]]
Interpretation : [[walk,[no],[10/1/32]],[pleasure,[insert],[10/1/18,22/15/27,10/29/32]],[meeting+1,2,[join1],[25/1/27,10/1/18,22/15/27,10/29/32]],25]
Planlist : event+s_action+misfortune+unexpectedness
Shotlist : [25/1/27,10/1/18,22/15/27,10/29/32]
try to create a single action person joke :
Actionshotlist : [[],[4/49/56,5/16/34],[]]
Possible shots for highlighting the object : [1/7/43,2/51/74]
The shot to be added : 1/7/43
The realisation sequence is : [1/13/43,4/49/56]
The main part is Shot / Shotkind / Joinkind : 4/49/56 medium join1
It reads / action / bodypart / object / location : slip+1 shoe banana path
Realisationlist : [10/1/18,22/15/27,10/29/32,1/13/43,4/49/56]

RESOLUTION
Analysis set : [1,0,1,0,[frank,[[],[],[walk+1+32]],[pleasure/1/32+0.5],[[path/1/32],[],[]]],[],[],[]]
Interpretation : [[walk,[no,[10/1/32]],[pleasure,[insert],[10/1/18,22/15/27,10/29/32]],[no,1,[join1],[10/1/18,22/15/27,10/29/32]],[slip+1,shoe,banana,path,join1,[1/13/43,4/49/56]],75]
Planlist : s_action+misfortune+unexpectedness
Shotlist : [10/1/18,22/1/27,10/29/32,1/13/43,4/49/56]
Possible shots for the resolution : [6/63/72]
The shot to be added : 6/63/72
The final sequence is : [10/1/18,22/15/27,10/29/32,1/13/43,4/49/56,6/63/72]
AUTEUR values the created joke as good.

REPETITION
===========
[walk+1+32,read+1+32,scratch+1+25,poke+1+25]
The new startshot is : 24
[[walk,read]+1+32,[scratch,poke]+1+25],[walk+1+32,read+1+32,scratch+1+25,poke+1+25][[read,walk]+1+32]]

MOTIVATION
Analysis set : [1,0,2,0,[frank,[[],[[read,walk]+1+32],[walk+1+32,read+1+32,scratch+1+25,poke+1+25]],[pleasure/1/32+0.5,hurry/1/32+0.5],[[newspaper/1/32,path/1/32],[],[]]],[],[],[]]
Interpretation : []
Planlist : [parallel+[misfortune,stupidity]]
Shotlist : []
Plan : parallel+misfortune
Chosen action : read
Possible mood shots : pleasure -> search not necessary since mood is defined
Possible action shots : search not necessary since it is a s_action.
REALISATION
Analysis set : [1,0,2,0,[frank,[[],[[read,walk]+1+32],[walk+1+32,read+1+32,scratch+1+25,poke+1+25]],[pleasure/1/32+0.5,hurry/1/32+0.5],[[newspaper/1/32,path/1/32],[],[]]],[],[],[]]
Interpretation : [[walk,[no,[24/1/27]],[pleasure,[no,[24/1/27]],[no,1,[[no]],[24/1/27]],25]
Planlist : s_action+misfortune+unexpectedness
Shotlist : [24/1/27]
try to create a single action person joke :
Objectshotlist : [[],[],[3/1/11]]
Possible shots for the object : [3/1/11]
The realisation sequence is : 3/1/11
The main part is Shot / Shotkind / Joinkind : 3/1/11 medium join1
It reads / action / bodypart / object / location : collide+2 body lamppost path
Realisationlist : [24/1/27,3/1/11]

RESOLUTION
Analysis set : [1,0,2,0,[frank,[[],[[read,walk]+1+32],[walk+1+32,read+1+32,scratch+1+25,poke+1+25]],[pleasure/1/32+0.5,hurry/1/32+0.5],[[newspaper/1/32,path/1/32],[],[]]],[],[],[]]
Interpretation : [[walk,[no,[24/1/27]],[pleasure,[no,[24/1/27]],[no,1,[[no]],[24/1/27]],[collide+2,lamppost,body,no,path,join1,[3/1/11]],45]
Planlist : s_action+misfortune+unexpectedness
Shotlist : [24/1/27,3/1/11]
Possible shots for the resolution : [29/1/18,30/1/18]
Possible shots for the resolution : [29/1/18,30/1/18]
Possible shots for the resolution : [29/1/18,30/1/18]
Possible shots for the resolution : [29/1/18,30/1/18]
Possible shots for the resolution : [29/1/18,30/1/18]
Possible shots for the resolution : [29/1/18,30/1/18]
Possible shots for the resolution : [29/1/18,30/1/18]
Possible shots for the resolution : [7/76/86]
The shot to be added : 7/76/86
The final sequence is : [24/1/27,3/1/11,7/76/86]
AUTEUR values the created joke as poor.

REPETITION
===========
Try to create something using the higher concept : movement. The action used is : fly.
Try to create something using the higher concept : movement. The action used is : drive.
Try to create something using the higher concept : movement. The action used is : swim.
Try to create something using the higher concept : movement. The action used is : using_transport.
[stand+1+27]
The new startshot is now : 25.
The plan is : scen+using_transport+s_action+[misfortune,ambiguty,stupidity]

MOTIVATION
Analysis set : [1,0,1,0,[frank,[[],[],[stand+1+27]],[pleasure/1/27+1.0],[[busstop/1/27],[],[]]],[],[],[]]
Interpretation : []
Planlist : [scen+using_transport+s_action+[misfortune,ambiguty,stupidity]]
Shotlist : []
Plan : scen+using_transport+s_action+misfortune
Possible mood shots : pleasure -> search is not necessary since mood is defined.
The first scenario shot to be added : 25/1/27
The second scenario shot to be added : 26/1/27
Shotlist-Scenario / actor1 / action / actor2 / action / Shotlist : frank stand bus come [25/1/27,26/1/27]

REALISATION
Analysis set : [1,0,1,0,[frank,[[],[],[stand+1+27]],[pleasure/1/27+1.0],[[busstop/1/27],[],[]]],[],[],[]]
Interpretation : [using_transport,[frank+[stand],[25/1/27]],[bus+[come],[26/1/27]],30]
Planlist : scen+using_transport+s_action+misfortune+schar
Shotlist : [25/1/27,26/1/27]
The shotlist for the realisation : 28/1/27
Scenario reads : actor1 / action : bus leave

RESOLUTION
Analysis set : [1,0,1,0,[frank,[[],[],[stand+1+27]],[pleasure/1/27+1.0],[[busstop/1/27],[],[]]],[],[],[]]
Interpretation : [using_transport,[frank+[stand],[25/1/27]],[bus+[come],[26/1/27]],[bus+[leave],[28/1/27]],70]
Planlist : scen+using_transport+s_action+misfortune+schar
Shotlist : [25/1/27,26/1/27,28/1/27]
The shotlists (action/mood) for the resolution : [29/1/18] [30/1/18]
Scenario reads : actor1 / action / mood : frank look anger
AUTEUR values the created joke as good.

REPETITION
===========
No further repetition possible.

The executable file is called auteurfile1. The content is :
#!/bin/sh
/usr/local/video/bin/edit_movie /usr/local/video/data/fn-bus1.Mpeg1 Mpeg1 1 27 /usr/local/video/data/fn-bus2 1 27 /usr/local/video/data/fn-bus3.Mpeg1 Mpeg1 1 27 /usr/local/video/data/fn-bus4.Mpeg1 Mpeg1 1 18 /usr/local/video/data/fn-bus5.Mpeg1 Mpeg1 1 18
That's all folks.

The executable file is called auteurfile2. The content is :
#!/bin/sh
/usr/local/video/bin/edit_movie /usr/local/video/data/fn-walk2.Mpeg1 Mpeg1 1 27 /usr/local/video/data/fn-lamppost.Mpeg1 Mpeg1 1 11 /usr/local/video/data/fn-lamppost.Mpeg1 Mpeg1 76 86 /usr/local/video/data/fn-bus1.Mpeg1 Mpeg1 1 27 /usr/local/video/data/fn-bus2 1 27 /usr/local/video/data/fn-bus3.Mpeg1 Mpeg1 1 27 /usr/local/video/data/fn-bus4.Mpeg1 Mpeg1 1 18 /usr/local/video/data/fn-bus5.Mpeg1 Mpeg1 1 18
That's all folks.

The executable file is called auteurfile3. The content is :
#!/bin/sh
/usr/local/video/bin/edit_movie /usr/local/video/data/fn-walk1.Mpeg1 Mpeg1 1 18 /usr/local/video/data/walk1.Mpeg1 Mpeg1 15 27 /usr/local/video/data/fn-walk1.Mpeg1 Mpeg1 29 32 /usr/local/video/data/fn-banana1.Mpeg1 Mpeg1 13 43 /usr/local/video/data/fn-lamppost.Mpeg1 Mpeg1 49 56 /usr/local/video/data/fn-walk2.Mpeg1 Mpeg1 1 27 /usr/local/video/data/fn-lamppost.Mpeg1 Mpeg1 1 11 /usr/local/video/data/fn-lamppost.Mpeg1 Mpeg1 76 86 /usr/local/video/data/fn-bus1.Mpeg1 Mpeg1 1 27 /usr/local/video/data/fn-bus2 1 27 /usr/local/video/data/fn-bus3.Mpeg1 Mpeg1 1 27 /usr/local/video/data/fn-bus4.Mpeg1 Mpeg1 1 18 /usr/local/video/data/fn-bus5.Mpeg1 Mpeg1 1 18
That's all folks.

Successfully finished: scene creation.
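For orientation, two conventions in the trace are worth noting: shot descriptors have the form FileNumber/StartFrame/EndFrame (so 10/1/18 denotes frames 1 to 18 of shot file 10), and each completed joke is finally written out as a shell script that realises the chosen sequence through the edit_movie tool. The following Prolog fragment is a minimal illustrative sketch, not part of AUTEUR itself, of how a final shot list could be mapped onto such a command line; the shot_file/2 facts are hypothetical stand-ins, inferred from the generated scripts above, for entries in the system's shot database (db_shot.pl).

% Sketch only: print an edit_movie command line for a final
% sequence such as [10/1/18,22/15/27,10/29/32].

% Hypothetical mapping from shot file numbers to MPEG files,
% inferred from the auteurfile scripts in the trace above.
shot_file(10, '/usr/local/video/data/fn-walk1.Mpeg1').
shot_file(22, '/usr/local/video/data/walk1.Mpeg1').

% write_edit_command(+Shots): emit one invocation of the
% edit_movie tool for a list of File/Start/End descriptors.
write_edit_command(Shots) :-
    write('/usr/local/video/bin/edit_movie'),
    write_shots(Shots),
    nl.

% write_shots(+Shots): append " Path Mpeg1 Start End" for each
% shot descriptor; / is left-associative, so File/Start/End
% destructures a term such as 10/1/18 into its three parts.
write_shots([]).
write_shots([File/Start/End|Rest]) :-
    shot_file(File, Path),
    format(' ~a Mpeg1 ~d ~d', [Path, Start, End]),
    write_shots(Rest).

Called as write_edit_command([10/1/18,22/15/27,10/29/32]), the sketch prints a single command line of the kind stored in the auteurfile scripts above.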