
Honours Project Report
Virtual Panning for Lecture
Environments
Jared Norman
Category                                                      Min   Max   Chosen
1 Requirement Analysis and Design                               0    20       10
2 Theoretical Analysis                                          0    25       10
3 Experiment Design and Execution                               0    20        0
4 System Development and Implementation                         0    15        5
5 Results, Findings and Conclusion                             10    20       20
6 Aim Formulation and Background Work                          10    15       15
7 Quality of Report Writing and Presentation                         10       10
8 Adherence to Project Proposal and Quality of Deliverables          10       10
9 Overall General Project Evaluation                            0    10        0
Total marks                                                          80       80

Supervised by Patrick Marais
Department of Computer Science
University of Cape Town
2014
Abstract
The video recording of lectures has become increasingly popular, though much of the logistics
involved revolves around the associated costs. Recurring costs for cameramen or large overhead costs
for so-called Pan Tilt Zoom (PTZ) cameras deter many institutions from developing their own
educational materials. We report on an approach to minimising these costs which makes use of
multiple high definition (HD) cameras.
Acknowledgements
I would like to acknowledge A/Prof. Patrick Marais of the Department of Computer
Science at the University of Cape Town for his valuable input and support. I would
also like to thank Stephen Marquard of CILT for suggesting the project, his input
and his help in acquiring testing data.
I would like to further thank my team members, Chris and Terry, for their enthusiasm
and hard work on the project.
This research is made possible under a grant of the National Research Foundation
(hereafter NRF) of the Republic of South Africa. All views expressed in this report
are those of the author and not necessarily those of the NRF.
Contents

1 Introduction
  1.1 Project brief and aims
  1.2 Licensing

2 Background
  2.1 Problem context
    2.1.1 Storing video in a computer
    2.1.2 Use of computer vision
  2.2 Overview of digital camerawork
    2.2.1 Evaluating lecture recording systems
    2.2.2 Camera setup and inputs
    2.2.3 Feature detection and object tracking
    2.2.4 Event detection and rules
    2.2.5 Computer aided camerawork
  2.3 Taxonomy of previous systems
  2.4 Discussion of key papers
    2.4.1 Discussion of the work by Heck et al
    2.4.2 Discussion of the work by Yokoi and Fujiyoshi

3 Design
  3.1 Modelling the panning motion
  3.2 Experimenting with filtering
  3.3 Derivative fitting
  3.4 Genetic algorithm
  3.5 Dynamic programming approach
  3.6 Analytic approach based on extrema and heuristic line approximation

4 Testing of the proposed algorithm
  4.1 General analysis of virtual panning
  4.2 Testing of the algorithm
    4.2.1 Linear programming for determining thresholds

5 Implementation
  5.1 Constraints
  5.2 Implementation of the ClippedFrame helper class
  5.3 Implementation of the proposed algorithm
  5.4 Issues encountered during implementation
  5.5 Run time of the component
  5.6 Results of the component as integrated

6 Conclusion

7 Future work

Appendices
A Derivation of panning motion

List of Figures

1  Illustration of system components
2  Panning function constructed by Heck et al.
3  Filtered data of Yokoi and Fujiyoshi
4  Panning period constructed by Yokoi and Fujiyoshi
5  Our initial plot of the ROI over frames, from testing data
6  Our web based tool developed to explore ROI data
7  Illustration of approach which classifies data with derivative fitting
8  Proposed algorithm: cases for clean data step
9  Proposed algorithm: find intervals step
10 Proposed algorithm: creation of pans step
11 Testing of algorithm: number of pans
12 Testing of algorithm: panning total time
13 Testing of algorithm: leave frame too long penalty
14 Testing of algorithm: panning length penalty
15 Sample output of the panning component

List of Tables

1 Taxonomy of related works
1 Introduction
The video recording of lectures has become increasingly popular with the advent of
open courseware. Traditionally this has involved the use of a cameraman (or many
cameramen) to film each lecture [Nagai, 2009]. The field of computer vision provides
insight into the prospects of automating this process, essentially removing the recurring costs of hiring someone to film each lecture [Wulff et al., 2013].
Several systems have been developed which, using computer vision, remove these
costs. However, many of these systems come with their own costs and are still quite
expensive or inaccessible to the public. In this project, we began development on
open source software to cater to this task with minimal costs. The component which
we report on runs on Linux and integrates with the rest of the system.
Some work has been done in the area of generating lecture videos automatically.
In surveying the literature, we found that these systems have
the following components in common:
1. Camera setup and inputs
2. Feature detection and object tracking
3. Event detection and rules
4. Computer aided camerawork
The focus in this report is on 3 and 4, while other group members will focus on 1 and 2.
1.1 Project brief and aims
To understand the focus of this report, we outline the three components into which
our research was split and make mention of their relation to one another. A diagram
depicting this information can be found in Figure 1.
The first component, camera setup and inputs, takes as input two or more camera
feeds which have recorded a lecturer simultaneously. These feeds overlap over some
portion of the lecture venue and are stitched into a single camera feed.
The second component, feature detection and object tracking, takes this camera
feed as input and tracks the lecturer over multiple frames. The output of this
component is a list which associates with each frame an x and y coordinate,
representing the center point of the lecturer.
The third component, and the focus of this report, makes use of these to create a
clipped frame which simulates panning (hence the term “virtual panning”). This
clipped frame is written as a video file on disk.
Figure 1: This figure illustrates the components of the VIRPAN system. First the lecturer is recorded
with multiple cameras. Then the videos are stitched together and a larger video produced. The
tracking component then determines the lecturer's location in each frame of the video. The panning
component uses these locations to produce a smaller video which virtually pans to keep the lecturer
in view.
This report focuses on the design and development of the third component in the developed system. This encompasses both event detection and rules, as well as computer
aided camerawork. We had several aims for the project:
We aimed to explore how virtual panning has been implemented by previous systems,
and how the other components of these systems informed their implementation.
We aimed to design an algorithm which creates a sequence of locations for a window
which virtually pans. The algorithm would in general have access to tracking
data from the tracking component, the size of the output video and the size of the
input video (stitched video).
We aimed to use this algorithm in the implementation of a lecture recording system.
As mentioned, CILT hopes to use such a system to generate lecture content inexpensively. Each component of this system was to be implemented by a different
team member and our task was to implement the panning component.
1.2 Licensing
The software developed for this component makes use of a number of libraries. Of
particular interest, the licenses for the OpenCV and NumPy (Python) libraries require
that software making use of these libraries include a copy of their copyright notices.
The licenses do not restrict the software which we have created from being used noncommercially by CILT, or anyone else. They do not restrict the software from being
rewritten, though anyone wishing to do so should consult the licenses to better understand whether these licenses are compatible with the author's intended use.
The rest of this report is laid out as follows:
First we discuss the background literature relevant to understanding both the problem
and potential solutions to it. We then briefly survey previous systems and make
mention of how they contrast to the proposed system. Next, we explain the design
process of the proposed system and which avenues were explored with some analysis
as to their strengths and weaknesses. We then report on the implementation of our
system and conclude by discussing our results and recommending future work.
2 Background
In this chapter we cover background material relevant to this report. We first elaborate on the problem context. We then give an overview of our brief and how systems
which have employed virtual panning have typically been evaluated. We explore how
the goals and components of these systems inform their implementation of virtual
panning. We then use the underlying themes found in these systems to summarise
them and identify key papers. Finally, we embark on a discussion of key papers.
2.1 Problem context
In order to better understand our problem, we discuss here how video is stored in a
computer. We then elaborate on what computer vision is and why it is relevant to
our problem.
2.1.1 Storing video in a computer
Conceptually, computers store video media as a sequence of images called frames
(together with audio, which is not discussed here). The frame rate is the number
of frames which the computer displays in one second of showing the video. Frame
rates are typically measured in frames per second, or fps. Each frame consists of
many pixels. A pixel is a tiny light on a computer screen which displays one of many
colours (typically mixed from red, green and blue components). Numerous pixels shining together are
what constitute an image.
The number of pixels each frame consists of is given by the resolution of the video. A
video with a resolution of 1920 × 1080 will be 1920 pixels wide and 1080 pixels high.
Some screens may only have 800 × 600 pixels, in which case a rescaling of the
video may be required.
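For instance, to show such a 1920 × 1080 video on an 800 × 600 screen while preserving its aspect ratio, one could scale both dimensions by min(800/1920, 600/1080) ≈ 0.42, producing a video of roughly 800 × 450 pixels.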
2.1.2 Use of computer vision
Computer vision refers to the process of using a computer to infer information about
an image. We used techniques from computer vision in the tracking component to
identify the location of the lecturer in the stitched video. Locating an object in an
image is usually done by detecting features. Features in this context refer to pixels in
an image which give information about where an object of interest lies. For instance,
pixels constituting the edge of an object are a feature. Features may be more complex, for example they may refer to the location of an eye or face.
Another relevant technique from computer vision is object tracking, which refers to
the tracking of an object (such as the lecturer) over a sequence of frames. While this
may use feature detection, it is a distinct task.
A technique used in image processing (related to computer vision) is called filtering.
A filter transforms an image in order to reduce or enhance the image in some way. If
one thinks of an image as a mathematical function which takes x and y coordinates
and gives an intensity value, a filter takes this function and returns a new function.
A simple example of a filter is a mean filter. Such a filter sends the intensity of each
pixel to the average intensity of nearby pixels. This gives the effect of smoothing the
image.
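As an illustration only (this is not code from the system described in this report), a mean filter over a square neighbourhood of a greyscale image could be sketched in Python as follows; the neighbourhood radius is an arbitrary choice:

import numpy as np

def mean_filter(image, radius=1):
    # Replace each pixel with the average intensity of its (2*radius+1) x (2*radius+1) neighbourhood.
    padded = np.pad(image, radius, mode='edge')
    out = np.empty(image.shape, dtype=float)
    height, width = image.shape
    for y in range(height):
        for x in range(width):
            out[y, x] = padded[y:y + 2 * radius + 1, x:x + 2 * radius + 1].mean()
    return out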
2.2 Overview of digital camerawork
The general problem of editing a video has been called digital camerawork by [Chou
et al., 2010], virtual videography by [Heck et al., 2007] and [Gleicher and Masanz,
2000], or virtual camera planning in the survey paper by [Christie et al., 2005]. These
terms mostly refer to the same concept; however, they define the quality of the output differently.
One characteristic of such systems is that they are either real-time systems, or systems
in which the captured video is edited in some post-processing period. The former involve restrictions on the possible algorithms used since at any given frame the future
is unknown. Furthermore, it is important that real-time systems are computationally
efficient while post-processing systems can devote more time to computation.
As mentioned in the introduction, we found that systems employing digital camerawork tend to fit into four components. Our brief for this project was to create a
post-processing system which fits into these components as follows:
Camera setup and inputs: two or three high definition (HD) cameras recording
the lecture from the back of the theater which overlap at some region. These
streams must be stitched together to create a wider video stream.
Feature detection and object tracking: the lecturer should be tracked. The output of this component is a sequence of points which describe where the lecturer
is in each frame of the video.
Event detection and rules: events such as whether the lecturer is moving rapidly
should be detected. Rules should be created which inform
when the camera should virtually pan.
Computer aided camerawork: a moving frame must virtually pan in such a way
that the lecturer is in view and the resulting video is pleasing to watch.
We will see at the end of the chapter that such a system has not been previously
implemented. With this, we first explore what constitutes satisfactory camerawork by
surveying evaluation techniques. We then explore how systems which employ digital
camerawork fit into the categories mentioned above. Next, we present a taxonomy of
the literature to better understand the context and relevance of the proposed system.
Finally we discuss key papers which emerge from the taxonomy.
2.2.1 Evaluating lecture recording systems
Systems incorporating digital camerawork measure efficacy in the following ways:
The authors of [Chou et al., 2010] and [Gleicher and Masanz, 2000] simply commented that initial results seemed promising.
Both [Wallick et al., 2004] and [Yokoi and Fujiyoshi, 2005] compiled a questionnaire
and commented on the results.
In [Wulff et al., 2013], the aim was to show that their system improved student
success. As such they used a Mann-Whitney test. Such a test takes two samples
and outputs a value which indicates how significant the difference between the
underlying distributions is. In their study, the two groups were students who
have learned with the system and those that have not. The students wrote an
exam and the values of the input distributions were taken from their results on
this exam.
The authors of [Ariki et al., 2006] evaluated their football match recording system
using an Analytic Hierarchical Process (AHP). Generally, this method is used to
make decisions when the criteria are subjective. This is done by making pairwise
comparison judgments along a fixed set of subjective criteria. The items they
chose to compare were: their system, a previous system they had created, an HD
recording of one half of the soccer pitch and a traditional TV recording of the
match. Students rated these numerically along various subjective criteria such
as “zooming”, “panning”, “video quality”. In their paper, they found that the
TV recording was preferable but that the discrepancy with their system was not
large.
A common theme in all surveyed papers is that the reported results are
qualitative in nature. Further, there is no standard technique for evaluating such
systems. Our approach to evaluating this component is similar to that of [Heck et al.,
2007] and is explained in the design chapter of this report.
2.2.2 Camera setup and inputs
This area refers to the hardware inputs of the system. This may include one or more
cameras and audio recording devices, a Microsoft Kinect device (for depth sensing) or
any other peripheral. Variation in this area is largely motivated by arguments related
to cost.
Some systems [Chou et al., 2010, Lampi et al., 2007, Wallick et al., 2004, Winkler
et al., 2012] use Pan Tilt Zoom (PTZ) cameras which cost around R60,000. Typically these systems are realtime systems since the PTZ camera moves while the
lecture progresses. Other systems [Ariki et al., 2006, Gleicher and Masanz, 2000, Nagai, 2009, Kumano et al., 2005, Yokoi and Fujiyoshi, 2005] use HD cameras which are
significantly cheaper. There have also been systems making use of devices such as
the Microsoft Kinect device [Winkler et al., 2012], which enables depth tracking. In
the case of [Wallick et al., 2004], different types of cameras are used - an HD camera aids tracking of features while a PTZ camera produces the lecture centered stream.
Previous systems use a different number of cameras too. The aim of the system
described in [Ariki et al., 2006] was to record a football match with the final development goal being to create an automated commentary of the match. A single HD
camera was used and a clipped frame produced a lower quality video. The system
in [Chou et al., 2010] records lectures with a PTZ camera at the back of the classroom.
2.2.3 Feature detection and object tracking
As previously explained, this is the component in which features are detected and
objects identified. Methods and objects of tracking depend on the camera setup and
inputs. An obvious feature that one might wish to detect is the lecturer. This information could be coupled with information about other features as was done in [Chou
et al., 2010], who detect the lecturer's face.
The tracking component attempts to locate, at each frame, which pixels correspond
to the lecturer. In general there are many such pixels. The collective term for these
pixels is a region of interest (ROI). The tracking component condenses these points
into a single point for that frame. This location is the “centroid” of the ROI. If
v1 , v2 , ..., vn are pixel locations such that for each i, vi = (xi , yi ) corresponds to a
pixel in the ROI, then a way to associate a single point with the ROI is to take the
average location of the points which constitute the ROI. Formally the centroid is a
point c, calculated as
c = \left( \dfrac{\sum_i x_i}{n}, \; \dfrac{\sum_i y_i}{n} \right)
A common method for extracting features in the foreground (such as a lecturer) is
background subtraction (or foreground detection). In this method, the frame at a
given time t1 is compared with a reference frame at some time t0 so that the pixels
constituting the foreground can be identified. The time t0 may be taken when the
system is calibrated or may be some time close to t1 .
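As a rough sketch of how background subtraction can yield an ROI centroid per frame, the following uses OpenCV's MOG2 background subtractor. This is purely illustrative: it is not necessarily the method used by the tracking component of our system, and the input file name is hypothetical.

import cv2
import numpy as np

cap = cv2.VideoCapture("stitched_lecture.avi")   # hypothetical input file
subtractor = cv2.createBackgroundSubtractorMOG2()

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)               # non-zero pixels are foreground candidates
    ys, xs = np.nonzero(mask)
    if len(xs) > 0:
        centroid = (xs.mean(), ys.mean())        # centroid of the detected foreground pixels
        print(centroid)
cap.release()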
Another method described in the literature is temporal differencing with bilateral
filtering [Yokoi and Fujiyoshi, 2005]. The method of temporal differencing which the
paper uses refers to differencing the pixel intensities in consecutive frames, and accepting pixels whose difference is larger in magnitude than some specified threshold
value. Accepted pixels are clustered into regions which are supposed to correspond to
foreground objects. Bilateral filtering refers to a certain kind of transformation of an
image. The reason they performed bilateral filtering is that the temporal differencing
produced a sequence of ROI points which jittered. The bilateral filtering smoothed
this jittering. This is further discussed later as this is a key paper.
Another method used in this area is blob detection [Wulff et al., 2013]. This uses
mathematical methods to isolate regions of pixels where all the points in the region
(or blob) are similar.
The approach by [Chou et al., 2010] made use of Adaboost face detection. This is a
machine learning algorithm and is fairly detailed. As such it falls outside of the scope
of this discussion.
2.2.4 Event detection and rules
The purpose of this component is to extract some higher level information about
the objects in the video given the tracking information from the previous component.
Again the methods vary at this level since they depend, at their core, on what features
are tracked. Since the resulting video is evaluated on a loosely defined qualitative
level, it is not surprising that event detection and the corresponding rules are motivated by heuristics.
The lecture recording system in [Chou et al., 2010] makes use of a camera action
table. This is a hash table which associates the status of the lecturer and of their
face to some action which the PTZ should perform.
Another system, LectureSight, allows for a scriptable virtual director [Wulff and
Fecke, 2012a]. A virtual director is a module responsible for selecting which virtual cameraman to use at any given time. A virtual cameraman is simply a module
which takes as input a video stream and produces one or more shots of that video
stream.
Another system making use of scripting is [Wallick et al., 2004]. Curiously, both systems use PTZ cameras. A more general description was developed in [Christianson
et al., 1996], using a more formal understanding of cinematographic techniques, which
aims to automate event detection in 3D animation environments.
A popular data structure for simulating a virtual director is a Finite State Machine
(FSM) [Lampi et al., 2007, Wallick et al., 2004]. These work as follows:
First the author creates a set of states such as “medium shot”, each of which corresponds to some action taken by the camera (e.g. zooming the frame to 640 × 480
pixels).
Periodically (usually on the order of seconds) the algorithm takes the current state
and the set of inputs (which can be events such as “lecturer moving”) and finds
the corresponding state to move to in the state table.
An FSM can be implemented as a 2D array. The rows of the array represent the
current state. The columns represent the input to the state machine. The contents
of a given cell is an output state. Generally an FSM yields an “action” for each input/state combination, though in the case of virtual directors the action is to choose
the shot associated with the current state.
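A minimal sketch of such a table-driven FSM, with invented state and input names, might look as follows:

# Toy virtual director: states, inputs and transitions are invented for illustration.
STATES = ["wide_shot", "medium_shot"]
INPUTS = ["lecturer_still", "lecturer_moving"]

# transition[state][input] -> next state; the FSM stored as a 2D array.
transition = [
    # lecturer_still   lecturer_moving
    ["medium_shot",    "wide_shot"],    # current state: wide_shot
    ["medium_shot",    "wide_shot"],    # current state: medium_shot
]

def step(state, event):
    return transition[STATES.index(state)][INPUTS.index(event)]

state = "wide_shot"
for event in ["lecturer_still", "lecturer_moving", "lecturer_still"]:
    state = step(state, event)          # re-evaluated periodically, e.g. every few seconds
    print(event, "->", state)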
In particular, [Lampi et al., 2007] uses a more complex FSM than [Wallick et al.,
2004] which, the paper argues, is better because it enables a richer expression of the
desired cinematography.
In a system tackling a related problem, [Kumano et al., 2005] attempts to record a
soccer game by identifying the state of the game at a given time. For example, the
system detects a goal kick by noting that the ball is stationary and the players are
on average very far away from the ball. When a goal kick is about to be taken, the
players move to the center of the pitch waiting for the goal keeper to kick the ball.
When the system is in this state, it pans to the center of the pitch with the ball near
the edge of the frame.
2.2.5 Computer aided camerawork
This module defines how the camera should execute a transition given by the previous
module. In the case of PTZ cameras, moving the camera also impacts the camera's
coordinates and hence the tracking information in a calculable way. In this sense,
PTZ cameras are more complex to reason about than static cameras.
The approaches by [Yokoi and Fujiyoshi, 2005] and [Heck et al., 2007] frame the problem with constraints that are more similar to ours than those of other papers. As such they
are key papers. Both construct a “panning” function which gives the location of the
frame as it pans. The function from [Heck et al., 2007] is shown in Figure 2.
Figure 2: The panning function constructed in [Heck et al., 2007]. The function gives the position of
the clipped frame's coordinates as a function of time. Here, time and position are shifted and scaled
so that they lie in the interval (0, 1). The image shows how, as the position moves up on the graph,
the clipped window moves to the right. The parameter a is the fraction of the time which is to be
spent accelerating. Similarly, a is the fraction of time spent decelerating.
The HD-camera system of [Yokoi and Fujiyoshi, 2005] uses heuristic rules for panning. It was found that professional panning was achieved by accelerating for only 40%
of the time. Secondly, it was found that the camera should be at maximal velocity at the moment it begins to
decelerate, and thirdly the average velocity should be
dependent on the range of panning (quantified in the paper). These constraints were
used in the construction of a panning function using only basic interpolation techniques.
Virtual videography refers to an approach by [Gleicher and Masanz, 2000] and later
implemented by [Heck et al., 2007], in which stationary, unattended cameras and
microphones are used together with a novel editing system aimed at simulating a
professional videographer. The editing problem is posed mathematically as a constrained optimisation problem.
2.3 Taxonomy of previous systems
The systems which we have surveyed can all be classified by their components.
We wanted to understand how these components informed the techniques used for
digital camerawork. In particular we were interested in systems which were similar
to the one proposed. This information was synthesized, and can be found in Table
1.
Papers which used PTZ cameras were placed at the bottom since they do not apply
to us. Of those remaining, the system proposed in [Gleicher and Masanz, 2000] was
not developed until later where it was presented in [Heck et al., 2007], so the earlier
paper is lower in the taxonomy. The paper by [Ariki et al., 2006] was a realtime
system which attempted to solve a different problem. Our key papers are thus [Yokoi
and Fujiyoshi, 2005] and [Heck et al., 2007].
The systems presented in these papers make use of a single HD camera in contrast
to our system which stitches multiple cameras. Another difference is that the source
code for the software artifacts developed in these systems is not available online. The
next section discusses these papers in more detail.
Paper | Camera Setup and Inputs | Tracking | Digital Camerawork | Realtime
[Yokoi and Fujiyoshi, 2005] | One HD camera | Temporal differencing with bilateral filtering | Pan function placed at select points | No
[Heck et al., 2007] | One HD camera | “Region objects” (text on board) and lecturer tracked; gesture based | Minimise an objective function | No
[Ariki et al., 2006] | One HD camera | Player/ball segmentation through background subtraction | Pan function placed at select points | Yes
[Gleicher and Masanz, 2000] | One or two static cameras | Frame differencing, audio tracking | N/A | No
[Chou et al., 2010] | One PTZ camera | Face detection and brighter blobs extraction | Camera action table | Yes
[Nagai, 2009] | One AVCHD 1920 × 1080 camera | Differential frames | — | Not mentioned
[Wallick et al., 2004] | Arbitrary | Colour segmentation and centroids | Scriptable FSM | Yes
[Lampi et al., 2007, Lampi et al., 2006] | Two PTZ cameras | “Decorators” detect if student/lecturer | Virtual panning/zooming using a Finite State Machine (FSM) | Yes
[Winkler et al., 2012] | One PTZ camera and Microsoft Kinect device | Depth tracking | Pans based on location of lecturer and threshold | Yes
[Wulff and Fecke, 2012a, Wulff and Fecke, 2012b, Wulff and Rolf, 2011, Wulff et al., 2013] | One PTZ camera | K-means head detection | Virtual director | Yes

Table 1: Taxonomy of surveyed systems. This table shows in what ways the systems we have
surveyed relate to our system. More similar works are nearer to the top.
2.4 Discussion of key papers
2.4.1 Discussion of the work by Heck et al
The goal of [Heck et al., 2007] was more ambitious than ours, in that it encompasses zooming, shot transitions and special effects. Tracking is accomplished by
background subtraction. Objects on the board (such as writing) are identified and
their ROI points are stored. They have also used gesture recognition.
The paper frames the required output as a path in a “shot sequence graph”. Loosely
speaking, a graph is a set of vertices or nodes (represented by points) together with a
set of edges (represented by lines between nodes). A shot sequence graph is a particular kind of directed acyclic graph (DAG). This means that the edges have direction,
and there are no cycles (no node can get back to itself by following the directed edges).
The nodes of the graph are types of shots which could be selected. A path in this
graph is a sequence of nodes where each node is visited only once, and the nth node
in the sequence has a directed edge pointing to the (n + 1)st node.
The algorithm works by splitting the frames of the video into chunks of some specified
length l. At each chunk, one of several shots could occur (long shot, medium shot,
etc). Each of these possible shots is a node in the graph. Nodes at frame n have edges
pointing to each of the nodes at frame n + l. The authors provide a multi-objective
function. This is a function which takes as input a path through this graph and
gives a numerical measure as to how good the corresponding camerawork is. A graph
searching optimisation algorithm is then used to find an optimal path, corresponding
to an optimal shot sequence.
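To make the idea concrete, the following is a minimal sketch of choosing one shot per chunk by dynamic programming over such a layered graph. The shot names, per-chunk quality score and transition cost are invented for illustration; [Heck et al., 2007] use a richer multi-objective function and more shot types.

SHOTS = ["long", "medium", "close"]

def shot_quality(chunk, shot):
    # Placeholder per-chunk score; a real system would derive this from tracking data.
    return ((chunk * 3 + len(shot)) % 7) / 7.0

def transition_cost(previous_shot, shot):
    # Penalise cutting between shot types so the sequence does not change every chunk.
    return 0.0 if previous_shot == shot else 0.5

def best_shot_sequence(num_chunks):
    # best[s] = (lowest cost of any path through the graph ending at shot s, that path)
    best = {s: (-shot_quality(0, s), [s]) for s in SHOTS}
    for chunk in range(1, num_chunks):
        new_best = {}
        for s in SHOTS:
            prev = min(SHOTS, key=lambda p: best[p][0] + transition_cost(p, s))
            cost = best[prev][0] + transition_cost(prev, s) - shot_quality(chunk, s)
            new_best[s] = (cost, best[prev][1] + [s])
        best = new_best
    return min(best.values(), key=lambda v: v[0])[1]

print(best_shot_sequence(6))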
One of the reasons we did not adopt this approach was that it relies on specific tracking algorithms. The panning component of our system could not rely on specific inner
workings of other components (such as the tracking component).
Another issue is that there are many subcomponents to the system. We did not have
the time to explore how tightly these were coupled. This was especially true since
the paper was discovered halfway through our project timeline.
2.4.2 Discussion of the work by Yokoi and Fujiyoshi
The algorithm in [Yokoi and Fujiyoshi, 2005] begins by first applying a bilateral filter
to the ROI location data. We are somewhat critical of this step. Traditional bilateral
filtering occurs on an image. It sends the pixel intensity at each pixel to some other
intensity. It is known that this is edge preserving, in the sense that a large variation
of pixel intensity between adjacent pixels will remain large after the filter has been
applied. The application of the filter in [Yokoi and Fujiyoshi, 2005] was not to an
image but rather to a region of interest point where the input was time. Therefore
the edge preserving nature of the bilateral filter on the ROI data is with respect to
time. This means that if the lecturer were to suddenly race to the other side of the
lecture theater, the ROI would accelerate rapidly, and this would create a sort of edge
in the ROI data.
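To illustrate what applying a bilateral filter to the ROI trajectory (rather than to an image) involves, a minimal one-dimensional sketch follows. The kernel widths are arbitrary placeholders; the exact parameters used by [Yokoi and Fujiyoshi, 2005] are not reproduced here.

import numpy as np

def bilateral_filter_1d(signal, radius=10, sigma_t=5.0, sigma_x=30.0):
    # Each sample is replaced by a weighted average of its neighbours; the weights fall off
    # both with distance in time (sigma_t) and with distance in position (sigma_x), which is
    # what preserves "edges" (sudden large movements of the ROI).
    signal = np.asarray(signal, dtype=float)
    out = np.empty_like(signal)
    for i in range(len(signal)):
        lo, hi = max(0, i - radius), min(len(signal), i + radius + 1)
        t = np.arange(lo, hi)
        weights = np.exp(-((t - i) ** 2) / (2 * sigma_t ** 2)) \
                * np.exp(-((signal[t] - signal[i]) ** 2) / (2 * sigma_x ** 2))
        out[i] = np.sum(weights * signal[t]) / np.sum(weights)
    return out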
One insight gained from this paper, and perhaps a criticism of the paper, was that
we should not be concerned with sharp edges. We should be more concerned about
describing the lecturer's general motion on the order of seconds rather than frames.
For instance, consider the result of the bilateral filter in Figure 3.
Figure 3: This depicts the results of applying a bilateral filter to data from [Yokoi and Fujiyoshi,
2005]. This figure illustrates the negative effect bilateral filtering can have near local extrema, which
is to distort points inwards. This is about 80 seconds of data.
We can see that, at around frame 100 where the filtered region of interest is at location 400, there is a moment where the lecturer hesitates before continuing motion
to the left (closer towards location 100). In fact, there is a slight spike upwards at
this location in the unfiltered ROI data (given in the paper). This is detected as an
edge by the bilateral filter and consequently there is a point of extrema there in the
filtered data. By extrema, we mean points at which the function is bigger than both
of its neighbours or smaller than both of its neighbours. At this edge, the filtered
ROI shows a nearly vertical drop to the final location at 200 pixels (about frame 120).
We will discuss this later.
The algorithm proposed by [Yokoi and Fujiyoshi, 2005] then finds these extrema. It
is unclear how it proceeds from here, saying only that “after calculating the distance
of the detected next points, the moving period of the ROI is determined by threshold.” [Yokoi and Fujiyoshi, 2005, p.4]
The diagram for these points is also somewhat cryptic since it does not show local
extrema, but rather the so-called panning points. It can be found in Figure 4. For
example there is an extreme point at about frame 1450 which is not indicated on the
diagram.
Presumably using these points to construct panning periods works by first taking
panning points which differ by 1 frame and discarding the earlier of the two. Then,
of the remaining points, the odd points are associated with the start of a pan, ending
at the next panning point. That said, it is still not clear from the paper how these
points are found.
Figure 4: The panning period given the isolated points in [Yokoi and Fujiyoshi, 2005]. The method
of obtaining these points is somewhat unclear, as is the method of using them to construct the
panning periods.
We further analysed bilateral filters in general and found an explanation for the vertical drop behaviour so clearly seen in Figure 4. As is explained in [Buades et al.,
2006], a bilateral filter belongs to a family of filters called neighbourhood filters. Such
filters produce an effect around edges which the authors refer to as a “staircase effect”.
This means that while bilateral filters preserve edges, they also introduce edges. When
bilateral filters are applied to an image, they act on an intensity function (intensity
of each pixel as a function of its x and y coordinates). In this case, the staircase effect
applies to the intensity of pixels around the edges. Yokoi and Fujiyoshi
have applied the filter to a different function. In this case the staircase effect applies
to the coordinate of the ROI near the time of the edge.
The reason this is a problem can be seen in Figure 4, around panning period 6. Because an
edge has been falsely introduced as the lecturer moves from 600 pixels to 200 pixels,
a pan has been created over a very short time period (the pan covers 200 pixels in
just a few milliseconds). The authors of [Buades et al., 2006] further explain that
this behaviour can be circumvented by changing the filter. While the modification is
not complicated, the details are beyond the scope of this discussion, but may be of
interest in future work.
Another issue with the approach is that it does not use the original ROI locations of
the lecturer when deciding when to pan. Instead, they have used the filtered ROI location which, as we have seen, tends to be a less accurate representation of the lecturer's
location.
Lastly, they do not seem to account for the size of the clipped frame. A larger clipped
frame should need to pan less frequently and this does not seem to be compensated
for in their algorithm.
3 Design
One strong geometrical approach to better understanding the problem comes from
simply plotting the x coordinate of the ROI as a function of frame number. This
allows one to get an understanding of what is being returned from the tracker as well
as what sorts of movements a lecturer typically makes. Such a plot, taken from initial
tracking data, can be found in Figure 5.
In this section we discuss the approaches taken in designing an algorithm to fit a
panning motion to ROI data. For each approach, we discuss its strengths and weaknesses, and describe what insights we were able to gain. We then explain our proposed
algorithm in detail.
Figure 5: A plot of the lecturer's ROI as a function of the frame number. The x position is in pixels
from the left. This was an 800 × 600 recording, at 25 frames per second. This comes from some
initial tracking data which we gathered.
3.1 Modelling the panning motion
The idea adopted in [Yokoi and Fujiyoshi, 2005] is that when panning, the cameraman
typically accelerates for a fraction p of the total time and then decelerates for the
remaining fraction 1 − p. Let ∆T be the total time, ∆X the total distance, and
suppose we begin panning at time t_0 with elapsed time ∆t = t − t_0. Then
t_{mid} = t_0 + p∆T, t_f = t_0 + ∆T, and the position x during the pan can be written as:

x(t) = x_0 +
\begin{cases}
\dfrac{\Delta X}{p\,\Delta T^{2}}\,(\Delta t)^{2} & t_0 \le t \le t_{mid} \\
p\,\Delta X + \dfrac{2\,\Delta X}{\Delta T}\,(\Delta t - p\,\Delta T) - \dfrac{\Delta X}{(1-p)\,\Delta T^{2}}\,(\Delta t - p\,\Delta T)^{2} & t_{mid} < t \le t_f
\end{cases}
Our derivation of this equation can be found in the appendix. Such an equation is
derived in [Yokoi and Fujiyoshi, 2005], though the equation written there is incorrect
and does not easily allow for varying values of p. Another panning motion is derived
in [Heck et al., 2007], in which acceleration occurs for some time a, at which point
constant velocity is maintained until time 1−a where deceleration occurs. The resulting function looks similar to ones previously mentioned and can be found in Figure 2.
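A direct implementation of the panning motion derived above might be sketched as follows; the default p = 0.4 reflects the 40% acceleration heuristic of [Yokoi and Fujiyoshi, 2005] discussed earlier, but it remains a free parameter.

def pan_position(t, t0, x0, delta_x, delta_t_total, p=0.4):
    # Position of the clipped frame at time t for a pan starting at (t0, x0), covering
    # delta_x over delta_t_total seconds, accelerating for a fraction p of the time and
    # decelerating for the remaining fraction 1 - p.
    dt = t - t0
    t_mid = p * delta_t_total
    if dt <= t_mid:
        return x0 + delta_x / (p * delta_t_total ** 2) * dt ** 2
    return (x0 + p * delta_x
            + 2 * delta_x / delta_t_total * (dt - t_mid)
            - delta_x / ((1 - p) * delta_t_total ** 2) * (dt - t_mid) ** 2)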
3.2 Experimenting with filtering
Applying a smoothing filter (such as a bilateral filter) to the data is not necessarily
impractical. One advantage is that it can help to identify local extrema (points where
the lecturer has changed direction). When the data are raw there is a fair amount
of “jittering”, and filtering smooths this out, revealing where the true turning points
are. In order to investigate this idea further, a web-based interface was developed to
facilitate exploration of the data as well as its discrete derivatives. The derivatives
give us information about the motion of the ROI point. This tool allows one to
filter the data using a variable amount of neighbouring data points. It also allows
the user to filter the data multiple times. In addition it offers the ability to add
and remove points by clicking on the relevant location in the plane. The centered
difference formula was used to calculate the discrete derivatives.
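For reference, the centered difference approximation is f′(x_i) ≈ (f(x_{i+1}) − f(x_{i−1}))/2h; a minimal sketch, with one-sided differences at the endpoints (an implementation detail not specified in the text), is:

import numpy as np

def centered_difference(values, h=1.0):
    # f'(x_i) ~ (f(x_{i+1}) - f(x_{i-1})) / (2h), with one-sided differences at the ends.
    v = np.asarray(values, dtype=float)
    d = np.empty_like(v)
    d[1:-1] = (v[2:] - v[:-2]) / (2 * h)
    d[0] = (v[1] - v[0]) / h
    d[-1] = (v[-1] - v[-2]) / h
    return d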
3.3 Derivative fitting
After experimenting with the output of the web based tool, a curious idea occurred
to us. The approach comes from assuming constant acceleration, so that the third
derivative (representing “jerk”, the change in acceleration per unit time) should be 0.
In this case, the lower order derivatives can give us information as to how the lecturer
was moving. For instance, the method which was developed and considered worked
as follows:
First, using a threshold value, find the frames where the first derivative is approximately zero (representing the lecturer standing still). Classify these frames as
frames in which the lecturer is stationary.
Figure 6: A web-based interface which was coded for exploration of ROI data. The aim was to guide
design decisions for the panning algorithm. One can filter the data and view derivatives, which
correspond roughly to velocity, acceleration and jerk.
Next, using another threshold value, find in the remaining frames those where the
second derivative is about zero (representing the lecturer moving at some constant non-zero speed). Classify these as frames in which the lecturer is moving
at a constant velocity.
Finally, classify the remaining frames as those where the lecturer is accelerating or
decelerating.
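A minimal sketch of this classification, using numpy's gradient (which applies centered differences internally) and placeholder threshold values:

import numpy as np

def classify_motion(x, v_thresh=0.5, a_thresh=0.1):
    # Label each frame given position samples x; the threshold values are placeholders and
    # would in practice depend on how much smoothing was applied to the tracking data.
    x = np.asarray(x, dtype=float)
    v = np.gradient(x)        # approximate first derivative (velocity)
    a = np.gradient(v)        # approximate second derivative (acceleration)
    labels = []
    for vi, ai in zip(v, a):
        if abs(vi) < v_thresh:
            labels.append("stationary")
        elif abs(ai) < a_thresh:
            labels.append("constant velocity")
        else:
            labels.append("accelerating")
    return labels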
For a visual depiction of this approach refer to Figure 7. The algorithm classifies
each frame as either a frame in which the lecturer is stationary, moving constantly, or
has begun/stopped moving. We can then find all instances of consecutive frames of
the form “begin/stopped moving, moving constantly, moving constantly, ..., moving
constantly, begin/stopped moving”. These frames were frames over which we panned.
One problem with this approach is that it couples the panning component with the
tracking component since the amount of filtering required depends on how jittery the
ROI points from the tracker are. The filtering further determines which threshold
values are ideal in a way which is not obvious.
This approach doesn’t account for the fact that the distance between pixels and the
true distance between what those pixels represent is not a simple relationship. When
the lecturer moves by one pixel when they are on the edge of the camera view, the
amount of space which they have moved is typically larger than when they move by
one pixel in the center of the view. The exact relationship depends on the stitching
technique, camera focal length, and even depth information. This method currently
does not account for these changes, making it less likely to correctly account for the
true speed of the lecturer.
Another problem is that negligible velocities over a long period can be detected as
stationary motion if the threshold is too tight. This causes points of acceleration
to not be detected which in turn gives the effect of the lecturer leaving the frame.
A mitigation strategy is to test if stationary motion points are indeed staying still.
Through investigating this avenue we were able to gain some key insights into the
problem:
Perhaps the most important insight is that the extreme points of the ROI are what matter. If the lecturer moves to the left and then back again, we want to know
where they were when they turned around since it informs when/whether to pan.
It also informs how far we need to pan, since the lecturer won’t move further
than the local extreme point where they turned around.
Another insight is that the data may be mapped into other forms (e.g. via bilateral
filtering) but it is important when panning to make use of the original data.
We do not want to pan to some location in the filtered data, but rather to the
location in the original data which relates to information which we discovered
with the use of the bilateral filtered data.
Figure 7: This shows an approach to classifying the lecturer's motion as accelerating, moving constantly or standing still. We hold threshold values on the derivatives around 0 (represented here as
cyan bands). We then classify the lecturer as motionless if the first derivative is within its threshold,
moving constantly if the second derivative is in its threshold, or accelerating otherwise. In all cases
the frame number is along the horizontal axis. The topmost graph has the lecturer's x coordinate
as its vertical axis and the filtered ROI function is in cyan. A smoothing filter has been applied to
each function 10 times with length 15.
Further, through this process an important point was reiterated: there is no “perfect
pan”. If we are to frame this as an optimisation problem, the resulting pan just
needs to be good enough.
3.4 Genetic algorithm
With these insights in mind, a genetic algorithm (GA) was considered. For an explanation of GAs, the reader is referred to [Goldberg, 1989]. The algorithm was ultimately not
implemented, for reasons we discuss at the end of this section. The primary
reasons this seemed to be a promising approach were that:
Firstly, it is difficult to decide on a set of locations describing where to pan, but
given a set of pans we are able to measure how well they solve the problem
(using the metrics mentioned earlier).
Secondly, we don’t seek the “best” solution, we simply seek one which is good
enough.
With this in mind, we sought to represent the problem in a way that a GA could
solve. A sketch of the algorithm follows:
A chromosome is represented as an array of 0’s and 1’s. Each index in the array
corresponds to a frame in the video. A 0 represents “do not pan at this frame”
while a 1 represents “if we are panning, stop, otherwise start panning”. If we start
panning, we start from the coordinate of the lecturer at that frame. Similarly,
we stop panning at the corresponding coordinate.
Selection of 2 parents is done by randomly selecting 1 parent from the top 50% of
the population and 1 parent randomly from the entire population.
Crossover compares 2 parents, and arbitrarily selects one to be the dominant parent.
This parent is scanned through, and each pan is iterated on. One pan may be
represented in the form (t1, t2), where ti indicates an index of the chromosome
array where there is a 1. Now, we find the value at index t1 in the non-dominant
parent. If we are in the middle of a pan at t1 then we find the index of the
beginning of that pan in the non-dominant parent. Call this t′1. Otherwise there
are 2 cases: either there exists a 1 in the interval (t1, t2) or there does not. If there
does not, then presumably it is beneficial to be in the non-dominating solution
for the duration of this pan, so we remove the pan from the dominant parent.
Otherwise we take t′1 to be the first time we start panning in the non-dominant
parent after t1. We then remove the 1 at t1 in the dominant parent and add a
1 at t′1. This will either delay the pan or begin the pan earlier. The dominant
parent gets placed into the new population.
Mutation in this case would simply involve iterating through the chromosome and
for each 1 encountered, shifting it to a nearby location with some probability.
Furthermore, we can keep the top few fittest populations.
Another idea is to segment the video into 2 minute chunks and apply the algorithm on
each chunk separately. In considering this approach, objective functions were developed. These are simply functions which evaluate the efficacy of a solution. These are
given in the testing chapter of this report. Since we created many objective functions,
a multi-objective optimisation was sought.
The idea was to find so-called “Pareto optimal” solutions, or to optimise some weighted sum of
the fitness functions. To understand what a Pareto optimal solution is, we must first
define what it means for a solution to dominate another. A solution X dominates
another solution Y if:
X is no worse than Y for any of the objective functions
X is strictly better than Y for some objective function
Given this definition, a solution is Pareto optimal if no other solution dominates it.
When we have multiple objective functions, the Pareto optimal solutions are often sought,
since it is not usually true that one solution is better than all others for all objective
functions. This is a standard approach to handling multiple objective functions in
GA’s [Goldberg, 1989].
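A minimal sketch of the dominance test and of reducing a set of candidate solutions to its Pareto front, assuming every objective function is to be minimised:

def dominates(fx, fy):
    # fx dominates fy if it is no worse in every objective and strictly better in at least one
    # (all objectives are assumed to be minimised).
    return all(a <= b for a, b in zip(fx, fy)) and any(a < b for a, b in zip(fx, fy))

def pareto_front(objective_vectors):
    # Keep only the solutions which no other solution dominates.
    return [s for s in objective_vectors
            if not any(dominates(other, s) for other in objective_vectors if other is not s)]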
A problem with the above idea was that it was not clear whether the crossover step
would work. For this step to be effective, it would ideally output a chromosome that
is on average better than the parents.
Another problem was that we were not sure whether we had enough time to implement it.
We were also concerned that it was unclear how to choose the weights for the objective functions. An insight which was gained, however, was that these functions may
be useful in evaluating the quality of the desired pan.
We sought an algorithm that was likely to get satisfactory results, was deterministic
in nature and simple to understand and improve upon.
3.5 Dynamic programming approach
Another approach considered was dynamic programming. We can see that a solution
to this problem is an initial location together with a sequence of 3-tuples of the form
(ti , ti+1 , xi+1 ) which indicate when to start panning (ti ), when to stop panning (ti+1 ),
and the final location of the pan (xi+1 ).
This seemingly exhibits overlapping subproblems - if we know satisfactory panning
solutions for each half of the lecture, it seems possible to construct a panning solution
to the entire lecture. Similarly, if we know solutions for the first two quarters of the
lecture, then we can find one for half the lecture, and so on.
The problem here is that it is not clear exactly how one should merge
two solutions. One must account for the fact that the final location of the clipped
frame in the earlier solution is not generally the same as the initial location of the
clipped frame in the later solution.
If the ROI sequence is increasing between the solutions, most likely you would like to
create a new pan from some point in the earlier solution to some point in the later
solution. One strategy could be to merge overlapping, or nearby pans. That is, if the
earlier solution ends by panning, and the later solution begins by panning, the two
pans could be merged into a single pan.
Dynamic programming may work and it is another avenue that could be explored in
future work, however due to time constraints we decided not to explore this further.
This idea further inspired our approach, which is outlined in the next section.
3.6 Analytic approach based on extrema and heuristic line approximation
Each step of the proposed algorithm is elaborated upon in this section, with the
pseudocode for this approach given as:
function pan(x_thresh, y_thresh)
    get_roi_points_from_tracker(input_url)
    smooth_data()                                      # every 5 frames gets averaged
    find_cleaned_data_points(y_thresh)
    locate_local_extrema()
    define_intervals_around_extrema_points(x_thresh, y_thresh)
    merge_intervals()
    pan_between_intervals()
endfunction
First we smooth the ROI function, removing some of the jittering in the data.
After smoothing, we find a reduced subset of the data (called cleaned data). The
idea here is similar to that of [Yokoi and Fujiyoshi, 2005] although different points
are found and they are used differently.
We then locate extrema in these cleaned data. For each of the extrema, we want the
clipped frame to be at the point of extrema for as long as possible in the resulting
video. We thus approximate a horizontal line at each extrema which is not far from
the ROI data at any frame. We then pan between each horizontal line.
An important design tool used while developing this algorithm was the Python programming language, since it allowed for rapid prototyping and evaluation of various
ideas without committing to writing excessive C++ code. We now explain the algorithm in more detail:
First we smooth the data. We do this by using a variable ave_width. For instance if
ave_width = 5 then we set the location at each of the 5 frames to the average location
over those 5 frames. At 25 fps this accounts for time intervals of only 200 ms. This
step removes some of the jitter. Pseudocode for this step is:
function smooth_data()
    for i in [0, 5, 10, ..., num_frames - 5]
        sum_roi_location = 0
        for j in [0, 1, 2, 3, 4]
            sum_roi_location += roi_location[i + j]
        endfor
        ave_roi_location = sum_roi_location / 5
        for j in [0, 1, 2, 3, 4]
            roi_location[i + j] = ave_roi_location
        endfor
    endfor
endfunction
Figure 8: Illustration of the cases for the “clean data” step of the developed algorithm. We use a
y_thresh variable to add a select group of points to a new data set called the cleaned data. In the
first case we don't want to remember an intermediate extremum, while in the second case we passed over
an extremum and do want to remember it.
The second thing we do is iterate through the points until the distance between the
maximal and minimal point on that range exceeds some threshold, y_thresh. In the
analysis here, y refers to the coordinate of the lecturer and x refers to the frame number. At this point there are two cases to consider (up to symmetry). These cases
are shown in Figure 8. One case is that we increased/decreased on that range (Case
1). Another case is that we increased/decreased, hit a local maximum/minimum,
then decreased/increased and hit a local minimum/maximum lower/higher than our
original point (Case 2). In the second case, we want to record the extrema in the
middle while in the first case we just record the new extrema. We call the recorded
points the “cleaned data”. Pseudocode for getting the cleaned data is:
function find_cleaned_data_points(y_thresh)
    first_x = roi_location[0]
    cleaned_data = [first_x]
    max_roi_x = first_x
    min_roi_x = first_x
    i = 0
    while i < num_frames
        previous_cleaned_data_x = cleaned_data[-1]
        current_roi_x = roi_location[i]
        max_roi_x = max(max_roi_x, current_roi_x)
        min_roi_x = min(min_roi_x, current_roi_x)
        dx = current_roi_x - previous_cleaned_data_x
        if abs(dx) > y_thresh
            # either we only increased or only decreased on this range,
            # in which case only one of max_roi_x or min_roi_x changed
            only_decreased = (previous_cleaned_data_x == max_roi_x)
            only_increased = (previous_cleaned_data_x == min_roi_x)
            if only_decreased or only_increased
                cleaned_data.append(current_roi_x)
                max_roi_x = min_roi_x = current_roi_x      # restart the range here
                i += 1
                continue
            else if previous_cleaned_data_x > current_roi_x
                # in this case there is a local maximum
                # between previous_cleaned_data_x and current_roi_x
                cleaned_data.append(max_roi_x)
                i = frame_index_of(max_roi_x)
                max_roi_x = min_roi_x = roi_location[i]    # restart the range here
                continue
            else    # previous_cleaned_data_x < current_roi_x: a local minimum was passed
                cleaned_data.append(min_roi_x)
                i = frame_index_of(min_roi_x)
                max_roi_x = min_roi_x = roi_location[i]    # restart the range here
                continue
            endif
        else    # still haven't reached y_thresh
            i += 1
        endif
    endwhile
    return cleaned_data
endfunction
We then locate the extrema of these cleaned data. These are times around which we
would like the clipped frame to be stationary. For this reason the algorithm approximates the panning function around these extrema as being horizontal for as long as
such an approximation is accurate. For each extrema it finds the maximal interval of
frames around which the ROI function does not leave the y_thresh range. We also
define an x_thresh, which is a limit on the width of an interval.
The algorithm then iterates to the left until either the lecturer leaves the y_thresh range
or the x_thresh is exceeded; this is the beginning of the horizontal line. Similarly, the
algorithm iterates to the right to define the end of the line. This line is conceptually
an interval which includes the extrema point. We want to pan in between these
intervals. A visual depiction of the state of the algorithm at this point can be found
in Figure 9. The pseudocode for this process follows:
function locate_local_extrema()
    extrema = []
    for i in [0, 1, ..., length(cleaned_data) - 1]
        x = cleaned_data[i]
        j = frame_number_of(x)
        x_roi = frames[j]        # corresponding data point in the ROI data
        # first take care of the boundaries
        if i == 0 or i == length(cleaned_data) - 1
            extrema.append(x_roi)
            continue
        endif
        x_prev = cleaned_data[i - 1]
        x_next = cleaned_data[i + 1]
        if x_next < x and x_prev < x
            extrema.append(x_roi)
        else if x_next > x and x_prev > x
            extrema.append(x_roi)
        endif
    endfor
endfunction

function define_intervals_around_extrema_points(x_thresh, y_thresh)
    intervals = []
    for i in [0, 1, ..., num_extrema - 1]
        extrema_y = extrema[i]
        x = frame_index_of(extrema_y)        # using the ROI data
        # look left
        interval_start = x - x_thresh
        for j in [0, 1, ..., x_thresh - 1]
            roi_y = roi_data[x - j]
            dy = abs(roi_y - extrema_y)
            if dy > y_thresh
                interval_start = x - j
                break
            endif
        endfor
        # look right
        interval_end = x + x_thresh
        for j in [0, 1, ..., x_thresh - 1]
            roi_y = roi_data[x + j]
            dy = abs(roi_y - extrema_y)
            if dy > y_thresh
                interval_end = x + j
                break
            endif
        endfor
        interval = [interval_start, interval_end]
        intervals.append(interval)
    endfor
    return intervals
endfunction
Figure 9: This shows what is happening in the algorithm when we find intervals. The red stars
represent the extreme points, the cyan rectangles show the interval around each extreme point. The
intervals are determined by the number of frames over which we can stay at the extreme point
without having to pan. Overlapping intervals will be merged next.
After we find these intervals, we merge any that overlap. This leaves a list of intervals on which no panning is needed. For the final step we take each pair of adjacent intervals and pan from the end of the first interval to the beginning of the second, as in the sketch below. The result can be seen in Figure 10.
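To make this step concrete, here is a minimal Python sketch of the merge-and-pan step, assuming intervals are represented as [start_frame, end_frame] pairs; the function names are illustrative rather than those of our implementation.

    def merge_overlapping_intervals(intervals):
        """Merge intervals that overlap or touch; input is a list of
        [start_frame, end_frame] pairs."""
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                # Overlaps the previous interval: extend it.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        return merged

    def pans_between_intervals(merged):
        """Each pan starts at the end of one stationary interval and
        finishes at the start of the next."""
        return [(a_end, b_start)
                for (_, a_end), (b_start, _) in zip(merged, merged[1:])]

Each resulting (pan_start, pan_end) pair marks a span over which the panning motion (Appendix A) is applied.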
Figure 10: Final step of the proposed algorithm. This shows how the intervals in the previous step
(which have been merged) are used to create a sequence of pans. The cleaned ROI data is in blue
and the constructed pan function is in red. The thresholds were chosen so that the image better
explained what was occurring in the algorithm and as such, there is a large pan between frames 500
and 750.
4 Testing of the proposed algorithm

4.1 General analysis of virtual panning
First we put forward some functions which measure qualities that are undesirable in a pan. We will measure our method's robustness using these functions and display the results as contour graphs. These functions are similar to those defined in [Heck et al., 2007]. We discuss each of them in terms of what the undesirable quality is, how we frame it mathematically, and how we refer to the resulting metric.
The first undesirable quality is panning too frequently. This can be measured by simply counting how many pans we make per unit time. We could bin this by associating with each bin of frames the number of times a pan begins in that bin, or by associating with each frame the number of times a pan begins within some set number of frames before and after it. To keep things simple, however, we simply count the number of times we pan per minute, averaged over the entire video. We call this “number of pans”. If $fps$ is the frame rate, $frames$ is the number of frames in the video and $num\_pans$ is the number of times we pan in the video, then this metric is given by:

$$num\_pans \cdot \frac{60 \cdot fps}{frames}$$

We also don't want to spend too much of the time panning, as this may give the viewer a feeling of motion sickness and is generally undesirable. We therefore also count the total time spent panning as a proportion of the total time of the video and call this “panning total time”. If $pan\_frames$ is the number of frames during which we are panning and $frames$ is the total number of frames in the video, this metric is given by:

$$\frac{pan\_frames}{frames}$$
Another undesirable quality is having the lecturer leave the frame for a long period of time; leaving for a few milliseconds is not a problem. To pose this as a function over the entire video, we count the number of frames the lecturer is out of the clipped frame each time they leave it and square this number, which biases the score against long absences. Summing all of the squares gives the penalty, which we call “leave frame too long penalty”. If the lecturer leaves the clipped frame $n$ times, and the $i$'th absence lasts $t_i$ frames, then the metric is:

$$\sum_{i=1}^{n} t_i^2$$
The last undesirable quality we define here is panning over too short or too long a time. We measure this with a piecewise function which assigns a penalty to each pan that is too short or too long; a pan lasting longer than 10 seconds, or shorter than 50 ms, is very bad. Finally we divide the sum of the penalty scores by the number of pans to give an average penalty score, since this is less biased by how many pans there are. We call this “pan length penalty”. If the $i$'th pan lasts $f_i$ frames (compared against the time bands below via the frame rate), the penalty function, applied in order of the bands, is:

$$p_i = \begin{cases}
0   & \text{if } 1\,\text{sec} < f_i < 1.5\,\text{sec} \\
1   & \text{if } 300\,\text{ms} < f_i < 2\,\text{sec} \\
4   & \text{if } 200\,\text{ms} < f_i < 3\,\text{sec} \\
9   & \text{if } 100\,\text{ms} < f_i < 5\,\text{sec} \\
20  & \text{if } 50\,\text{ms} < f_i < 10\,\text{sec} \\
100 & \text{otherwise}
\end{cases}$$
and the pan length penalty for the video is then:

$$\frac{1}{n} \sum_{i=1}^{n} p_i$$
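As a concrete illustration of how these four scores could be computed, the following Python sketch takes a list of pan intervals, the frame rate, and a per-frame visibility flag for the lecturer. The input representation and names are assumptions made for this example; the penalty bands would be supplied in frames, converted from the time limits above via the frame rate.

    def pan_metrics(pans, fps, total_frames, lecturer_visible, penalty_bands):
        """pans: list of (start_frame, end_frame); lecturer_visible: list of
        bools, one per frame; penalty_bands: list of (lo_frames, hi_frames,
        score) tuples checked in order, mirroring the bands above."""
        n = len(pans)
        pans_per_minute = n * 60.0 * fps / total_frames
        pan_frames = sum(end - start for start, end in pans)
        panning_total_time = float(pan_frames) / total_frames

        # Sum of squared lengths of the runs in which the lecturer is absent.
        leave_penalty, run = 0, 0
        for visible in lecturer_visible:
            if not visible:
                run += 1
            elif run:
                leave_penalty, run = leave_penalty + run ** 2, 0
        leave_penalty += run ** 2

        # Average per-pan length penalty.
        def band_score(length):
            for lo, hi, score in penalty_bands:
                if lo < length < hi:
                    return score
            return 100

        pan_length_penalty = (sum(band_score(e - s) for s, e in pans) / float(n)
                              if n else 0.0)
        return (pans_per_minute, panning_total_time,
                leave_penalty, pan_length_penalty)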
4.2 Testing of the algorithm
This algorithm relies on the thresholding parameters. In Figure 10, for instance, the pans at around frames 500 and 1200 both last a considerable amount of time; this is due to the horizontal threshold, x_thresh, being set too low.
Setting y_thresh lower leads to more pans but enforces a stricter requirement that the lecturer stay within the clipped frame. Note that the algorithm does not rely on highly accurate tracking data (though accurate data does reduce error), and can even tolerate a tracker losing the lecturer for some time.
We set out to understand how the proposed algorithm performs as the threshold values are varied, measuring performance with respect to the above-mentioned metrics. Plots of these metrics as functions of the x and y thresholds can be found in Figures 11, 12, 13 and 14.
4.2.1 Linear programming for determining thresholds
We now show how, if one considers ideal score ranges, a set of feasible x and y thresholds can be determined.
Arguably, an average of 12 pans per minute or less is desirable; this works out to panning at most once every 5 seconds. Going by Figure 11, we can approximate the corresponding contour as a hyperbola and require approximately x_thresh · y_thresh > 600.
We probably don't want to pan more than 30% of the time. Judging by Figure 12, this roughly means that we want $y\_thresh > \frac{1}{3}\,x\_thresh$.
In terms of leaving the frame too long, a score of 625 corresponds to leaving the
frame for no more than 25 frames, which is 1 second. This is perfectly acceptable to
us, so given that all scores in Figure 13 are smaller than this, we make no prescription.
Figure 11: Represents the number of pans per minute over a 50 minute lecture with 75000 frames
as we vary the x and y thresholds. It seems that varying x threshold here makes only slightly less of
a difference than varying the y threshold. The inverse relationship between either of the thresholds
and the number of pans is expected since increasing either threshold creates larger intervals around
the extrema which consequently have a higher chance of being merged (fewer pans).
Figure 12: Represents the percentage of the 50 minute lecture that was spent panning as we vary
the x and y thresholds. Clearly here the x threshold makes a larger difference than the y threshold.
Figure 13: Represents the penalty for the lecturer leaving the frame over a 50 minute lecture with
75000 frames as we vary the x and y thresholds. Clearly here the x threshold makes a larger difference
than the y threshold.
Figure 14: Represents the penalty for the length of the pans over a video with 75000 frames as we
vary the x and y thresholds. An average penalty of 20 means that, on average, pans lasted around
either 50 ms or 10 sec, which is undesirable. Arguably an average of less than 15 is desirable.
Lastly, we make use of the data in Figure 14. The pan length penalty penalises each pan that is too quick or takes too long. Recall that a pan scores 9 if it lasts between 100 ms and 5 sec, and 20 if it lasts between 50 ms and 10 sec. We decided that an average pan score of 15 or less was desirable and approximated the corresponding contour as a linear interpolation between (100, 0) and (200, 300). This gives the requirement $y\_thresh > 3\,x\_thresh - 300$.
The above conditions lead to a system of inequalities:

$$y\_thresh > \frac{600}{x\_thresh}$$
$$y\_thresh > \frac{1}{3}\,x\_thresh$$
$$y\_thresh > 3\,(x\_thresh - 100)$$
The solution to this system is a region of threshold values where we feel the panning is satisfactory. We choose a point roughly in the center of this region: x_thresh = 100, y_thresh = 150.
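Checking whether a candidate threshold pair lies in this feasible region is mechanical. A minimal Python sketch, assuming the three inequalities above are the only constraints:

    def thresholds_feasible(x_thresh, y_thresh):
        """True if (x_thresh, y_thresh) satisfies all three constraints
        derived from the contour plots."""
        return (y_thresh > 600.0 / x_thresh and
                y_thresh > x_thresh / 3.0 and
                y_thresh > 3.0 * (x_thresh - 100))

    # For example, the chosen point lies in the feasible region:
    assert thresholds_feasible(100, 150)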
We would like to point out an idea for future work: the feasible threshold values could be discovered on a per-video basis. One starts with sensible values, such as those above, and calculates the scores for threshold values in the neighbourhood. These indicate, for each metric, how the x and y thresholds would have to change for that metric's score to decrease (improve). If changing one of the thresholds improves all metrics, the thresholds are changed and the process is repeated; a sketch of this procedure follows.
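The sketch below assumes a hypothetical metric_scores(x_thresh, y_thresh) function that returns a tuple of the four scores (lower being better); it is an illustration of the idea rather than an implemented feature.

    def tune_thresholds(metric_scores, x_thresh=100, y_thresh=150, step=10):
        """Greedy local search: move to a neighbouring threshold pair only if
        every metric strictly improves."""
        current = metric_scores(x_thresh, y_thresh)
        improved = True
        while improved:
            improved = False
            for dx, dy in [(step, 0), (-step, 0), (0, step), (0, -step)]:
                nx, ny = x_thresh + dx, y_thresh + dy
                if nx <= 0 or ny <= 0:
                    continue  # keep thresholds positive
                candidate = metric_scores(nx, ny)
                if all(c < s for c, s in zip(candidate, current)):
                    x_thresh, y_thresh, current = nx, ny, candidate
                    improved = True
                    break
        return x_thresh, y_thresh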
5 Implementation
In this section we discuss the implementation of the panning component and its integration with the system. It should be noted that the design and implementation processes were tightly coupled, with several iterations leading to the final algorithm. This approach was chosen so as to develop the best possible algorithm in the given time frame.
5.1 Constraints
We decided that the system should be developed in C++ using the OpenCV library, which provides many methods useful for image processing, video processing and computer vision. C++ was selected because it is fast and because compiled C++ code runs natively on Ubuntu; recall that CILT required the system to run on the Ubuntu operating system.
5.2 Implementation of the ClippedFrame helper class
The entire system has been developed in C++, and as such the digital camerawork component has been implemented so that it integrates with this system. First, we coded a class, ClippedFrame, to abstract the behaviour of a clipped frame. Although we only ever create one such object, this abstraction allows multiple clipped frames to be handled without complicating the code, and gives the design flexibility.
The object maintains a location within the boundary of the larger stitched frame, as well as its width and height. The location of the frame refers to its centermost point, calculated as:

$$(x, y) = \left(\left\lfloor \frac{width}{2} \right\rfloor, \left\lfloor \frac{height}{2} \right\rfloor\right)$$

It is able to clip a section of the larger frame and return a smaller clipped frame (in practice, this is then written to file).
The object can snap to any location. The location variables were made private, with a method provided to change them. This method gracefully handles requests to snap outside the boundary of the stitched frame by snapping to the nearest location inside the boundary instead.
The object can also move to a location over a specified amount of time (in seconds) with a specified p value (the fraction of the pan spent accelerating), taking the frame rate into account. While this ties the class to our panning function, the code is simple enough that this method could easily be rewritten.
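To illustrate the clamping behaviour of the snap method, here is a minimal Python analogue; the actual class is written in C++ and the names used here are illustrative only.

    class ClippedFrameSketch(object):
        """Keeps a (cx, cy) centre for the clipped frame inside a larger
        stitched frame of size stitched_w x stitched_h."""

        def __init__(self, width, height, stitched_w, stitched_h):
            self.width, self.height = width, height
            self.stitched_w, self.stitched_h = stitched_w, stitched_h
            self.cx, self.cy = width // 2, height // 2

        def snap_to(self, cx, cy):
            # Clamp so the clipped frame never leaves the stitched frame.
            half_w, half_h = self.width // 2, self.height // 2
            self.cx = min(max(cx, half_w), self.stitched_w - half_w)
            self.cy = min(max(cy, half_h), self.stitched_h - half_h)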
5.3 Implementation of the proposed algorithm
The algorithm which determines when to pan given the ROI points from the tracker
was implemented in Python and called from the C++ program. There are several
reasons for implementing the algorithm in Python.
The algorithm deals with a list of points which we manipulate to give another list of points. Python has many features which support manipulating lists with very little code. For instance, the following Python code gives the (forward) difference of a list:

    [b - a for a, b in zip(arr, arr[1:])]
The equivalent C++ code takes longer to write. Should we later decide to use the centered difference instead, the Python code simply becomes:

    [0.5 * (b - a) for a, b in zip(arr, arr[2:])]
C++ code will generally take longer to change than Python code. As the project progressed, it became apparent that flexibility was required: it seemed unlikely that the computational cost of the stitching component would decrease, which meant that other input components might need to be explored. This was a further reason for our decision not to rewrite the Python code in C++.
Through the design and implementation process we made many changes and tried many ideas. The code that plots the pan output and penalty metrics was written using the matplotlib library, and Python's numpy library, which supports numerical calculations, was used in much of the plotting code. Neither of these libraries is required to create the panning points from the ROI data produced by the tracking component; the only dependency is Python 2.7, which ships with Ubuntu. If one wanted to rewrite this code in C++, matplotpp is a C++ port of matplotlib and would likely make the task less challenging.
Many of the methods written in C++ and Python were accompanied by unit tests.
5.4 Issues encountered during implementation
Some issues did arise during implementation. The most pertinent one requires some background on the input video we were receiving. The input video was encoded with the H.264/MPEG-4 AVC codec; the role of a codec is to compress the video. On Ubuntu, the library used to encode and decode video with this codec is x264, and ffmpeg is a general purpose tool for encoding and decoding which calls x264. The OpenCV library in turn uses ffmpeg to encode and decode media, and the manner in which it does so caused an error, since the versions involved were not backward compatible.
The process of diagnosing and trying to work around this bug cost the project two weeks. We tried reinstalling x264, reinstalling ffmpeg, and recompiling older versions from source (as well as OpenCV and other relevant libraries). The bug persisted and we eventually had to accept being unable to write video directly in the H.264/MPEG-4 format.
We managed to circumvent this, though the workaround is not ideal: the output video is encoded with another codec and a tool such as ffmpeg is then used to convert it back to the H.264/MPEG-4 format. This extra transcoding step inevitably means the quality of the output video is reduced somewhat.
5.5 Run time of the component
Lastly, as it is important that our system processes videos quickly, we comment on the run time of our component. Recall that we created scores to measure the quality of the output. To generate the plots of these scores we ran 75,000 ROI locations through our algorithm (with some overhead for plotting code) with varying thresholds on a machine with a 1.50 GHz processor. The algorithm was run 900 times, once for each threshold pair. The scores were plotted several times, and the running time of this process was consistently under 14 minutes. Averaging over the 900 runs, we empirically conclude that our algorithm takes on the order of 1 second to complete on a reasonable machine.
The reason our algorithm runs so fast is that its input is a list of around $10^5$ numbers, which is small compared with the input to the stitching or tracking components; the input to those components is on the order of $10^7$ times larger.
Since our algorithm is clearly fast enough, no formal complexity analysis was undertaken and no optimisations were made to the algorithm.
Regarding the creation of an output video given the panning points from the algorithm, each frame in the input video is iterated upon once. At each frame, an update
method is called on the ClippedFrame object. This is a constant time operation, as it
simply updates the coordinates of the clipped frame if it is in motion. For this reason
we infer that our component can be no worse (asymptotically) than the stitching or
tracking components.
5.6 Results of the component as integrated
The results of the panning component seem promising. As mentioned, we were able
to find algorithm parameters which were sufficient. Further, we were able to produce
videos with our virtual panning component, as can be seen in Figure 15.
Figure 15: Here we illustrate the output of the panning component. The image shows a series of
frames in our input video over which it pans. The graph shows the panning function in red and the
ROI data in blue. The cyan band indicates the width of the clipped frame.
6 Conclusion
Our aims were to explore methods of virtual panning, design an algorithm which
accomplishes this task, and integrate this algorithm into a software system.
We explored and analysed techniques used in previous systems, and found those employed by [Heck et al., 2007] and [Yokoi and Fujiyoshi, 2005] to be of particular interest. Analysing these systems yielded many useful ideas, and also highlighted the limitations of applying their techniques directly to our project.
We tried various approaches to designing an algorithm which creates the required
digital camerawork. Through this process, we developed a method to test the efficacy
of such an algorithm.
The algorithm was coded and implemented as a component of the system as required.
The quality of the output video was determined through the methods which we developed to test the algorithm. We found that the component produces virtual panning
which exhibits all of the properties which were deemed to be desirable.
The system will be released as an open source project and the software is free to use. It is expected that more work will be required before CILT can make full use of the system, owing to the considerable time taken to stitch the video.
7 Future work
As development progressed, a number of avenues for future work were discovered.
It was discovered that stitching the video takes a considerable amount of time. One
suggestion for future work is to use a virtual director to decide which of the input
videos to choose at a given frame. The use of virtual directors was explored in the
background chapter of this report.
We have mentioned that other approaches to creating a virtual panning algorithm could be explored. These include constructing a genetic algorithm (GA) or framing the problem so that dynamic programming can solve it. We have also seen, in the approach of [Heck et al., 2007], that the problem can be phrased in terms of shot sequence graphs.
Our algorithm could be improved by automating the process of parameter tuning.
We have mentioned one approach to doing this by showing how to frame this as
a linear programming problem. Many efficient algorithms exist which solve linear
programming problems.
Appendices

A Derivation of panning motion
Let $\Delta T$ be the total time to pan and $\Delta X$ the total distance. The analysis here is along a single axis, but a projection to two axes is trivial (take the dot product), e.g. $s(t) \to \langle s(t)\cos(\theta_0),\, s(t)\sin(\theta_0) \rangle$.

Let also $t_0$, $t_1$ and $t_2$ be such that we begin accelerating at $t_0$, then begin decelerating at $t_1$ until $t_2$, when we stop panning. Then $\Delta T = t_2 - t_0 = \Delta t_1 + \Delta t_2$ where $\Delta t_i = t_i - t_{i-1}$.

Lastly, let $v_i$ be the velocity at $t_i$, so that $v_0 = v_2 = 0$ and $v_1 = a_0 \Delta t_1$.

From basic kinematics, the equations of motion yield:

$$x_1 - x_0 = \tfrac{1}{2} a_0 \Delta t_1^2$$
$$x_2 - x_1 = v_1 \Delta t_2 + \tfrac{1}{2} a_1 \Delta t_2^2$$

Here $a_0$, $a_1$ are coefficients of acceleration and are still to be determined.

Now we add the assumption that $\Delta t_1 = p\Delta T$ for some $p \in (0, 1)$, so:

$$t_1 = (1 - p)t_0 + p t_2$$

Note that the boundary condition $v_1 = v_{max}$ at $t_1$ implies:

$$v_{max} = v_1 = a_0 \Delta t_1 = -a_1 \Delta t_2$$

So that:

$$a_0 = \frac{p-1}{p}\, a_1, \qquad \Delta t_1 = p\Delta T, \qquad \Delta t_2 = (1-p)\Delta T$$

which enables us to rewrite the system as:

$$x_0 = x_1 - \tfrac{1}{2} \frac{p-1}{p} (p\Delta T)^2 a_1$$
$$x_2 = x_1 + \left( \frac{p-1}{p}(p\Delta T)((1-p)\Delta T) + \tfrac{1}{2}((1-p)\Delta T)^2 \right) a_1$$

which becomes:

$$x_0 = x_1 + \tfrac{1}{2}\, p(1-p)\,\Delta T^2 a_1$$
$$x_2 = x_1 - \tfrac{1}{2} ((1-p)\Delta T)^2 a_1$$

Subtracting the first from the second gives:

$$\Delta X = -\tfrac{1}{2}\, \Delta T^2 a_1 (1-p)$$

So finally:

$$a_1 = -\frac{2\Delta X}{(1-p)\Delta T^2}, \qquad a_0 = \frac{2\Delta X}{p\Delta T^2}$$

and the equations of motion become:

$$x(t) = \begin{cases}
x_0 + \dfrac{\Delta X}{p\Delta T^2}(t - t_0)^2 & \text{if } t - t_0 \le p\Delta T \\[2ex]
x_0 + p\Delta X + \dfrac{2\Delta X}{\Delta T}(t - t_0 - p\Delta T) - \dfrac{\Delta X}{(1-p)\Delta T^2}(t - t_0 - p\Delta T)^2 & \text{otherwise}
\end{cases}$$
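For reference, this piecewise expression can be evaluated directly. A minimal Python sketch, with t measured from t0 and assuming 0 < p < 1:

    def pan_position(t, x0, dX, dT, p):
        """Position along one axis at time t in [0, dT] after the pan starts,
        accelerating for the first p*dT of the pan and decelerating after."""
        if t <= p * dT:
            # Constant acceleration phase.
            return x0 + dX / (p * dT ** 2) * t ** 2
        # Constant deceleration phase.
        u = t - p * dT
        return (x0 + p * dX + (2.0 * dX / dT) * u
                - dX / ((1 - p) * dT ** 2) * u ** 2)

At t = p·dT both branches give x0 + p·dX, and at t = dT the position is x0 + dX with zero velocity, as required. The two-dimensional motion follows by applying the same profile along each axis, as noted at the start of the appendix.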
References
[Ariki et al., 2006] Ariki, Y., Kubota, S., and Kumano, M. (2006). Automatic production system of soccer sports video by digital camera work based on situation
recognition. In Multimedia, 2006. ISM’06. Eighth IEEE International Symposium
on, pages 851–860. IEEE.
[Buades et al., 2006] Buades, A., Coll, B., and Morel, J.-M. (2006). The staircasing
effect in neighborhood filters and its solution. Image Processing, IEEE Transactions
on, 15(6):1499–1505.
[Chou et al., 2010] Chou, H.-P., Wang, J.-M., Fuh, C.-S., Lin, S.-C., and Chen, S.-W.
(2010). Automated lecture recording system. In System Science and Engineering
(ICSSE), 2010 International Conference on, pages 167–172. IEEE.
[Christianson et al., 1996] Christianson, D. B., Anderson, S. E., wei He, L., Salesin,
D. H., Weld, D. S., and Cohen, M. F. (1996). Declarative camera control for
automatic cinematography. In AAAI/IAAI, Vol. 1, pages 148–155.
[Christie et al., 2005] Christie, M., Machap, R., Normand, J.-M., Olivier, P., and
Pickering, J. (2005). Virtual camera planning: A survey. In Smart Graphics, pages
40–52. Springer.
[Gleicher and Masanz, 2000] Gleicher, M. and Masanz, J. (2000). Towards virtual
videography (poster session). In Proceedings of the eighth ACM international conference on Multimedia, pages 375–378. ACM.
[Goldberg, 1989] Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA.
[Heck et al., 2007] Heck, R., Wallick, M., and Gleicher, M. (2007). Virtual videography. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), 3(1):4.
[Kumano et al., 2005] Kumano, M., Ariki, Y., and Tsukada, K. (2005). A method
of digital camera work focused on players and a ball, pages 466–473. Advances in
Multimedia Information Processing-PCM 2004. Springer.
[Lampi et al., 2007] Lampi, F., Kopf, S., Benz, M., and Effelsberg, W. (2007). An
automatic cameraman in a lecture recording system. In Proceedings of the international workshop on Educational multimedia and multimedia education, pages 11–18.
ACM.
[Lampi et al., 2006] Lampi, F., Scheele, N., and Effelsberg, W. (2006). Automatic
camera control for lecture recordings. In World Conference on Educational Multimedia, Hypermedia and Telecommunications, volume 2006, pages 854–860.
[Nagai, 2009] Nagai, T. (2009). Automated lecture recording system with avchd camcorder and microserver. In Proceedings of the 37th annual ACM SIGUCCS fall
conference, pages 47–54. ACM.
[Wallick et al., 2004] Wallick, M. N., Rui, Y., and He, L. (2004). A portable solution
for automatic lecture room camera management. In Multimedia and Expo, 2004.
ICME’04. 2004 IEEE International Conference on, volume 2, pages 987–990. IEEE.
[Winkler et al., 2012] Winkler, M. B., Hover, K. M., Hadjakos, A., and Muhlhauser,
M. (2012). Automatic camera control for tracking a presenter during a talk. In
Multimedia (ISM), 2012 IEEE International Symposium on, pages 471–476. IEEE.
[Wulff and Fecke, 2012a] Wulff, B. and Fecke, A. (2012a). Lecturesight-an open
source system for automatic camera control in lecture recordings. In Multimedia
(ISM), 2012 IEEE International Symposium on, pages 461–466. IEEE.
[Wulff and Fecke, 2012b] Wulff, B. and Fecke, A. (2012b). Lecturesight-an open
source system for automatic camera control in lecture recordings. In Multimedia
(ISM), 2012 IEEE International Symposium on, pages 461–466. IEEE.
[Wulff and Rolf, 2011] Wulff, B. and Rolf, R. (2011). Opentrack-automated camera control for lecture recordings. In Multimedia (ISM), 2011 IEEE International
Symposium on, pages 549–552. IEEE.
[Wulff et al., 2013] Wulff, B., Rupp, L., Fecke, A., and Hamborg, K.-C. (2013). The
lecturesight system in production scenarios and its impact on learning from video
recorded lectures. In Multimedia (ISM), 2013 IEEE International Symposium on,
pages 474–479. IEEE.
[Yokoi and Fujiyoshi, 2005] Yokoi, T. and Fujiyoshi, H. (2005). Virtual camerawork
for generating lecture video from high resolution images. In Multimedia and Expo,
2005. ICME 2005. IEEE International Conference on, page 4 pp. IEEE.