Honours Project Report

Virtual Panning for Lecture Environments

Jared Norman

Supervised by Patrick Marais
Department of Computer Science
University of Cape Town
2014

Category                                                      Min   Max   Chosen
1 Requirement Analysis and Design                              0    20     10
2 Theoretical Analysis                                         0    25     10
3 Experiment Design and Execution                              0    20      0
4 System Development and Implementation                        0    15      5
5 Results, Findings and Conclusion                            10    20     20
6 Aim Formulation and Background Work                         10    15     15
7 Quality of Report Writing and Presentation                  10    10     10
8 Adherence to Project Proposal and Quality of Deliverables   10    10     10
9 Overall General Project Evaluation                           0    10      0
Total marks                                                         80     80

Abstract

The video recording of lectures has become increasingly popular, though much of the logistics involved revolves around the associated costs. Recurring costs for cameramen, or large overhead costs for so-called Pan Tilt Zoom (PTZ) cameras, deter many institutions from developing their own educational materials. We report on an approach to minimising these costs which makes use of multiple high definition (HD) cameras.

Acknowledgements

I would like to acknowledge A/Prof. Patrick Marais of the Department of Computer Science at the University of Cape Town for his valuable input and support. I would also like to thank Stephen Marquard of CILT for suggesting the project, for his input and for his help in acquiring testing data. I would like to further thank my team members, Chris and Terry, for their enthusiasm and hard work on the project. This research is made possible by a grant from the National Research Foundation (NRF) of the Republic of South Africa. All views expressed in this report are those of the author and not necessarily those of the NRF.

Contents

1 Introduction  1
  1.1 Project brief and aims  1
  1.2 Licensing  3
2 Background  3
  2.1 Problem context  3
    2.1.1 Storing video in a computer  4
    2.1.2 Use of computer vision  4
  2.2 Overview of digital camerawork  4
    2.2.1 Evaluating lecture recording systems  5
    2.2.2 Camera setup and inputs  6
    2.2.3 Feature detection and object tracking  6
    2.2.4 Event detection and rules  7
    2.2.5 Computer aided camerawork  8
  2.3 Taxonomy of previous systems  9
  2.4 Discussion of key papers  11
    2.4.1 Discussion of the work by Heck et al  11
    2.4.2 Discussion of the work by Yokoi and Fujiyoshi  11
3 Design  15
  3.1 Modelling the panning motion  16
  3.2 Experimenting with filtering  16
  3.3 Derivative fitting  16
  3.4 Genetic algorithm  20
  3.5 Dynamic programming approach  21
  3.6 Analytic approach based on extrema and heuristic line approximation  23
4 Testing of the proposed algorithm  27
  4.1 General analysis of virtual panning  27
  4.2 Testing of the algorithm  29
    4.2.1 Linear programming for determining thresholds  29
5 Implementation  34
  5.1 Constraints  34
  5.2 Implementation of the ClippedFrame helper class  34
  5.3 Implementation of the proposed algorithm  35
  5.4 Issues encountered during implementation  36
  5.5 Run time of the component  36
  5.6 Results of the component as integrated  37
6 Conclusion  38
7 Future work  38
Appendices  39
A Derivation of panning motion  39

List of Figures

1 Illustration of system components  2
2 Panning function constructed by Heck et al.  9
3 Filtered data of Yokoi and Fujiyoshi  12
4 Panning period constructed by Yokoi and Fujiyoshi  13
5 Our initial plot of the ROI over frames, from testing data  15
6 Our web based tool developed to explore ROI data  17
7 Illustration of approach which classifies data with derivative fitting  19
8 Proposed algorithm: cases for clean data step  24
9 Proposed algorithm: find intervals step  27
10 Proposed algorithm: creation of pans step  27
11 Testing of algorithm: number of pans  30
12 Testing of algorithm: panning total time  31
13 Testing of algorithm: leave frame too long penalty  32
14 Testing of algorithm: panning length penalty  33
15 Sample output of the panning component  37

List of Tables

1 Taxonomy of related works  10

1 Introduction

The video recording of lectures has become increasingly popular with the advent of open courseware. Traditionally this has involved the use of a cameraman (or many cameramen) to film each lecture [Nagai, 2009]. The field of computer vision provides insight into the prospects of automating this process, essentially removing the recurring cost of hiring someone to film each lecture [Wulff et al., 2013]. Several systems have been developed which, using computer vision, remove these costs. However, many of these systems come with their own costs and are still quite expensive or inaccessible to the public. In this project, we began development of open source software which caters to this task at minimal cost. The component which we report on runs on Linux and integrates with the larger system developed by the group.

Some work has been done in the area of generating lecture videos automatically. Through the process of surveying the literature, it was found that these systems have the following components in common:

1. Camera setup and inputs
2. Feature detection and object tracking
3. Event detection and rules
4. Computer aided camerawork

The focus in this report is on 3 and 4, while other group members will focus on 1 and 2.

1.1 Project brief and aims

To understand the focus of this report, we outline the three components into which our research was split and make mention of their relation to one another.
A digram depicting this information can be found in Figure 1 The first component, camera setup and inputs, takes as input two or more camera feeds which have recorded a lecturer simultaneously. These feeds have some portion of the lecture overlapping, which are stitched into a single camera feed. The second component, feature detection and object tracking, takes this camera feed as input and tracks the lecturer over multiple frames. The output of this component is a list which associates with each frame an x and y coordinate, representing the center point of the lecturer. The third component, and the focus of this report, makes use of these to create a clipped frame which simulates panning (hence the term “virtual panning”). This clipped frame is written as a video file on disk. 1 Figure 1: This figure illustrates the components of the VIRPAN system. First the lecturer is recorded with multiple cameras. Then the videos are stitched together and a larger video produced. The tracking component then locates the lecturers location in each frame of the video. The panning component uses these locations to produce a smaller video which virtually pans to keep the lecturer in view. 2 This report focuses on the design and development of the third component in the developed system. This encompasses both event detection and rules, as well as computer aided camerawork. We had several aims for the project: We aimed to explore how virtual panning has been implemented by previous systems, and how the other components of these systems informed their implementation. We aimed to design an algorithm which create a sequence of locations for a window which virtually pans. The algorithm would in general have access to tracking data from the tracking component, the size the output video and the size of the input video (stitched video). We aimed to use this algorithm in the implementation of a lecture recording system. As mentioned CILT hope to use such a system to generate lecture content inexpensively. Each component of this system was to be implemented by a different team member and our task was to implement the panning component. 1.2 Licensing The software developed for this component makes use of a number of libraries. Of particular interest, the licenses for the opencv and Python Numpy libraries require that software making use of these libraries include a copy of their copyright notices. The licenses do not restrict the software which we have created from being used noncommercially by CILT, or anyone else. They do not restrict the software from being rewritten, though anyone wishing to do so should consult the licenses to better understand whether these licenses are compatible with the authors intended use. The rest of this report is laid out as follows: First we discuss the background literature relevant to understanding both the problem and potential solutions to it. We then briefly survey previous systems and make mention of how they contrast to the proposed system. Next, we explain the design process of the proposed system and which avenues were explored with some analysis as to their strengths and weaknesses. We then report on the implementation of our system and conclude by discussing our results and recommending future work. 2 Background In this chapter we cover background material relevant to this report. We first elaborate on the problem context. We then give an overview of our brief and how systems which have employed virtual panning have typically been evaluated. 
We explore how the goals and components of these systems inform their implementation of virtual panning. We then use the underlying themes found in these systems to summarise them and identify key papers. Finally, we embark on a discussion of key papers. 2.1 Problem context In order to better understand our problem, we discuss here how video is stored in a computer. We then elaborate on what computer vision is and why it is relevant to 3 our problem. 2.1.1 Storing video in a computer Conceptually, computers store video media as a sequence of images called frames (together with audio, which is not discussed here). The frame rate is the number of frames which the computer displays in one second of showing the video. Frame rates are typically measured in frames per second, or fps. Each frame consists of many pixels. A pixel is a tiny light on a computer screen which projects one of many colours (typically these are red, blue or green). Numerous pixels shining together are what constitute an image. The number of pixels each frame consists of is given by the resolution of the video. A video with a resolution of 1920 × 1080 will be 1920 pixels wide and 1080 pixels high. Some screens may have only have 800 × 600 pixels, in which case a rescaling of the video may be required. 2.1.2 Use of computer vision Computer vision refers to the process of using a computer to infer information about an image. We used techniques from computer vision in the tracking component to identify the location of the lecturer in the stitched video. Locating an object in an image is usually done by detecting features. Features in this context refer to pixels in an image which give information about where an object of interest lies. For instance, pixels constituting the edge of an object are a feature. Features may be more complex, for example they may refer to the location of an eye or face. Another relevant technique from computer vision is object tracking, which refers to the tracking of an object (such as the lecturer) over a sequence of frames. While this may use feature detection, it is a distinct task. A technique used in image processing (related to computer vision) is called filtering. A filter transforms an image in order to reduce or enhance the image in some way. If one thinks of an image as a mathematical function which takes x and y coordinates and gives an intensity value, a filter takes this function and returns a new function. A simple example of a filter is a mean filter. Such a filter sends the intensity of each pixel to the average intensity of nearby pixels. This gives the effect of smoothing the image. 2.2 Overview of digital camerawork The general problem of editing a video has been called digital camerawork by [Chou et al., 2010], virtual videography by [Heck et al., 2007] and [Gleicher and Masanz, 2000], or virtual camera planning in the survey paper by [Christie et al., 2005]. These terms mostly refer to the same concept however they define the quality of the output 4 differently. One characteristic of such systems is that they are either real-time systems, or systems in which the captured video is edited in some post-processing period. The former involve restrictions on the possible algorithms used since at any given frame the future is unknown. Furthermore, it is important that real-time systems are computationally efficient while post-processing systems can devote more time to computation. 
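As a concrete illustration of the filtering idea introduced in section 2.1.2, the following short Python sketch applies a simple mean filter to a greyscale image. It is an illustration only: the function name, kernel size and example image are arbitrary choices made for this report, not part of any surveyed system.

    import numpy as np

    def mean_filter(image, k=3):
        """Replace each pixel with the average of the k x k neighbourhood
        around it (a simple smoothing filter, cf. section 2.1.2)."""
        pad = k // 2
        padded = np.pad(image.astype(np.float64), pad, mode="edge")
        out = np.zeros_like(image, dtype=np.float64)
        # sum the k*k shifted copies of the image, then divide by k*k
        for dy in range(k):
            for dx in range(k):
                out += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
        return out / (k * k)

    # hypothetical usage on a noisy 600 x 800 greyscale frame
    frame = np.random.randint(0, 256, (600, 800)).astype(np.float64)
    smoothed = mean_filter(frame, k=5)

The same idea of averaging over a neighbourhood reappears later in one dimension, when the ROI data returned by the tracker is smoothed over time.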
As mentioned in the introduction, we found that systems employing digital camerawork tend to fit into four components. Our brief for this project was to create a post-processing system which fits into these components as follows: Camera setup and inputs: two or three high definition (HD) cameras recording the lecture from the back of the theater which overlap at some region. These streams must be stitched together to create a wider video stream. Feature detection and object tracking: the lecturer should be tracked. The output of this component is a sequence of points which describe where the lecturer is in each frame of the video. Event detection and rules: possibly events should be detected such as whether the lecturer is moving rapidly or not. Rules should be created which inform when the camera should virtually pan. Computer aided camerawork: a moving frame must virtually pan in such a way that the lecturer is in view and the resulting video is pleasing to watch. We will see at the end of the chapter that such a system has not been previously implemented. With this, we first explore what constitutes satisfactory camerawork by surveying evaluation techniques. We then explore how systems which employ digital camerawork fit into the categories mentioned above. Next, we present a taxonomy of the literature to better understand the context and relevance of the proposed system. Finally we discuss key papers which emerge from the taxonomy. 2.2.1 Evaluating lecture recording systems Systems incorporating digital camerawork measure efficacy in the following ways: The authors of [Chou et al., 2010] and [Gleicher and Masanz, 2000] simply commented that initial results seemed promising. Both [Wallick et al., 2004] and [Yokoi and Fujiyoshi, 2005] compiled a questionnaire and commented on the results. In [Wulff et al., 2013], the aim was to show that their system improved student success. As such they used a Mann-Whitney test. Such a test takes in two distributions and outputs a value which indicates how significant the difference is between the distributions. In their study, the two groups were students who have learned with the system and those that have not. The students wrote an exam and the values of the input distributions were taken from their results on this exam. 5 The authors of [Ariki et al., 2006] evaluated their football match recording system using an Analytic Hierarchical Process (AHP). Generally, this method is used to make decisions when the criteria are subjective. This is done by making pairwise comparison judgments along a fixed set of subjective criteria. The items they chose to compare were: their system, a previous system they had created, an HD recording of one half of the soccer pitch and a traditional TV recording of the match. Students rated these numerically along various subjective criteria such as “zooming”, “panning”, “video quality”. In their paper, they found that the TV recording was preferably but that the discrepancy with their system was not large. A common theme in all surveyed papers is that they mentioned that results are qualitative in nature. Further, there is no standard technique for evaluating such systems. Our approach to evaluating this component is similar to that of [Heck et al., 2007] and is explained in the design chapter of this report. 2.2.2 Camera setup and inputs This area refers to the hardware inputs of the system. 
This may include one or more cameras and audio recording devices, a Microsoft Kinect device (for depth sensing) or any other peripheral. Variation in this area is largely motivated by arguments related to cost. Some systems [Chou et al., 2010, Lampi et al., 2007, Wallick et al., 2004, Winkler et al., 2012] use Pan Tilt Zoom (PTZ) cameras, which cost around R60,000. Typically these are real-time systems since the PTZ camera moves while the lecture progresses. Other systems [Ariki et al., 2006, Gleicher and Masanz, 2000, Nagai, 2009, Kumano et al., 2005, Yokoi and Fujiyoshi, 2005] use HD cameras, which are significantly cheaper. There have also been systems making use of devices such as the Microsoft Kinect [Winkler et al., 2012], which enables depth tracking. In the case of [Wallick et al., 2004], different types of cameras are used: an HD camera aids tracking of features while a PTZ camera produces the lecture-centred stream.

Previous systems use a different number of cameras too. The aim of the system described in [Ariki et al., 2006] was to record a football match, with the final development goal being to create an automated commentary of the match. A single HD camera was used and a clipped frame produced a lower quality video. The system in [Chou et al., 2010] records lectures with a PTZ camera at the back of the classroom.

2.2.3 Feature detection and object tracking

As previously explained, this is the component in which features are detected and objects identified. Methods and objects of tracking depend on the camera setup and inputs. An obvious feature that one might wish to detect is the lecturer. This information could be coupled with information about other features, as was done in [Chou et al., 2010], who detect the lecturer's face.

The tracking component attempts to locate, at each frame, which pixels correspond to the lecturer. In general there are many such pixels. The collective term for these pixels is a region of interest (ROI). The tracking component condenses these points into a single point for that frame. This location is the "centroid" of the ROI. If v1, v2, ..., vn are pixel locations such that for each i, vi = (xi, yi) corresponds to a pixel in the ROI, then a way to associate a single point with the ROI is to take the average location of the points which constitute the ROI. Formally, the centroid is the point

    c = ( (Σi xi) / n , (Σi yi) / n )

A common method for extracting features in the foreground (such as a lecturer) is background subtraction (or foreground detection). In this method, the frame at a given time t1 is compared with a reference frame at some time t0 so that the pixels constituting the foreground can be identified. The time t0 may be taken when the system is calibrated or may be some time close to t1. Another method described in the literature is temporal differencing with bilateral filtering [Yokoi and Fujiyoshi, 2005]. The method of temporal differencing which the paper uses refers to differencing the pixel intensities in consecutive frames, and accepting pixels whose difference is larger in magnitude than some specified threshold value. Accepted pixels are clustered into regions which are supposed to correspond to foreground objects. Bilateral filtering refers to a certain kind of transformation of an image. The reason they performed bilateral filtering is that the temporal differencing produced a sequence of ROI points which jittered. The bilateral filtering smoothed this jittering. A small sketch of the differencing-and-centroid idea is given below.
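The following minimal Python sketch illustrates temporal differencing followed by the centroid calculation defined above. It is not the tracking component itself; the frame arrays, the threshold value and the helper name are assumptions made purely for illustration.

    import numpy as np

    def roi_centroid(prev_frame, curr_frame, threshold=30):
        """Toy temporal differencing: difference two greyscale frames and
        take the centroid of the pixels whose change exceeds a threshold.
        Returns None when nothing moved enough (e.g. the lecturer is still)."""
        diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
        ys, xs = np.nonzero(diff > threshold)   # pixels belonging to the ROI
        if xs.size == 0:
            return None
        # centroid c = (sum(x_i)/n, sum(y_i)/n), as defined in section 2.2.3
        return xs.mean(), ys.mean()

    # hypothetical usage on two 600 x 800 "frames" with a simulated moving region
    a = np.random.randint(0, 256, (600, 800), dtype=np.uint8)
    b = a.copy()
    b[200:300, 400:450] = 255
    print(roi_centroid(a, b))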
This is further discussed later as this is a key paper. Another method used in this area is blob detection [Wulff et al., 2013]. This uses mathematical methods to isolate regions of pixels where all the points in the region (or blob) are similar. The approach by [Chou et al., 2010] made use of Adaboost face detection. This is a machine learning algorithm and is fairly detailed. As such it falls outside of the scope of this discussion. 2.2.4 Event detection and rules The purpose of this component is to extract some higher level information about the objects in the video given the tracking information from the previous component. Again the methods vary at this level since they depend, at their core, on what features are tracked. Since the resulting video is evaluated on a loosely defined qualitative level, it is not surprising that event detection and the corresponding rules are motivated by heuristics. The lecture recording system in [Chou et al., 2010] makes use of a camera action table. This is a hash table which associates the status of the lecturer and of their 7 face to some action which the PTZ should perform. Another system, LectureSight, allows for a scriptable virtual director [Wulff and Fecke, 2012a]. A virtual director is a module responsible for selecting which virtual cameraman to use at any given time. A virtual cameraman is simply a module which takes as input a video stream and produces one or more shots of that video stream. Another system making use of scripting is [Wallick et al., 2004]. Curiously, both systems use PTZ cameras. A more general description was developed in [Christianson et al., 1996], using a more formal understanding of cinematographic techniques, which aims to automate event detection in 3D animation environments. A popular data structure for simulating a virtual director is a Finite State Machine (FSM) [Lampi et al., 2007, Wallick et al., 2004]. These work as follows: First the author creates a set of states such as “medium shot”, each of which correspond to some action taken by the camera (e.g. zooming the frame to 640x480 pixels). Periodically (usually on the order of seconds) the algorithm takes the current state and the set of inputs (which can be events such as “lecturer moving”) and finds the corresponding state to move to in the state table. An FSM can be implemented as a 2D array. The rows of the array represent the current state. The columns represent the input to the state machine. The contents of a given cell is an output state. Generally an FSM yields an ”action” for each input/state combination, though in the case of virtual directors the action is to choose the shot associated with the current state. In particular, [Lampi et al., 2007] uses a more complex FSM than [Wallick et al., 2004] which, the paper argues, is better because it enables a richer expression of the desired cinematography. In a system tackling a related problem, [Kumano et al., 2005] attempts to record a soccer game by identifying the state of the game at a given time. For example, the system detects a goal kick by noting that the ball is stationary and the players are on average very far away from the ball. When a goal kick is about to be taken, the players move to the center of the pitch waiting for the goal keeper to kick the ball. When the system is in this state, it pans to the center of the pitch with the ball near the edge of the frame. 2.2.5 Computer aided camerawork This module defines how the camera should execute a transition given by the previous module. 
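Before turning to how the camera type affects this module, it may help to make the finite state machine idea from section 2.2.4 concrete. The sketch below is illustrative only; the state names, inputs and transition entries are assumptions for the example, not those of any surveyed system.

    # A toy virtual director as a table-driven FSM (cf. section 2.2.4).
    # Keys are (current state, input event); values are the next state.
    TRANSITIONS = {
        ("wide_shot",   "lecturer_still"):  "medium_shot",
        ("wide_shot",   "lecturer_moving"): "wide_shot",
        ("medium_shot", "lecturer_still"):  "close_up",
        ("medium_shot", "lecturer_moving"): "wide_shot",
        ("close_up",    "lecturer_still"):  "close_up",
        ("close_up",    "lecturer_moving"): "medium_shot",
    }

    def step(state, event):
        """Look up the next state; the chosen shot is simply the state itself."""
        return TRANSITIONS[(state, event)]

    state = "wide_shot"
    for event in ["lecturer_still", "lecturer_still", "lecturer_moving"]:
        state = step(state, event)
        print(event, "->", state)

An equivalent 2D-array representation indexes rows by the current state and columns by the input, as described above. We now return to how the choice of camera affects this module.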
In the case of PTZ cameras, moving the camera also impacts the cameras coordinates and hence the tracking information in a calculable way. In this sense, PTZ cameras are more complex to reason about than static cameras. 8 The approaches by [Yokoi and Fujiyoshi, 2005] and [Heck et al., 2007] frame the problem with constraints that are more similar to ours than other papers. As such they are key papers. Both construct a “panning” function which gives the location of the frame as it pans. The function from [Heck et al., 2007] is shown in Figure 2. Figure 2: The panning function constructed in [Heck et al., 2007]. The function gives the position of the clipped frames coordinates as a function of time. Here, time and position are shifted and scaled so that they lie in the interval (0, 1). The image shows how, as the position moves up on the graph, the clipped window moves to the right. The parameter a is the fraction of the time which is to be spent accelerating. Similarly, a is the fraction of time spent decelerating. The HD camera used in [Yokoi and Fujiyoshi, 2005] looks at heuristic rules for panning. It was found that professional panning was achieved by accelerating only 40% of the time. Secondly it was found that the time at which the camera begins to decelerate should have maximal velocity and thirdly the average velocity should be dependent on the range of panning (quantified in the paper). These constraints were used in the construction of a panning function using only basic interpolation techniques. Virtual videography refers to an approach by [Gleicher and Masanz, 2000] and later implemented by [Heck et al., 2007], in which stationary, unattended cameras and microphones are used together with a novel editing system aimed at simulating a professional videographer. The editing problem is posed mathematically as a constrained optimisation problem. 2.3 Taxonomy of previous systems The systems which we have surveyed are all typically classifiable by their components. We wanted to understand how these components informed the techniques used for digital camerawork. In particular we were interested in systems which were similar to the one proposed. This information was synthesized, and can be found in Table 1. 9 Papers which used PTZ cameras were placed at the bottom since they do not apply to us. Of those remaining, the system proposed in [Gleicher and Masanz, 2000] was not developed until later where it was presented in [Heck et al., 2007], so the earlier paper is lower in the taxonomy. The paper by [Ariki et al., 2006] was a realtime system which attempted to solve a different problem. Our key papers are thus [Yokoi and Fujiyoshi, 2005] and [Heck et al., 2007]. The systems presented in these papers make use of a single HD camera in contrast to our system which stitches multiple cameras. Another difference is that the source code for the software artifacts developed in these systems is not available online. The next section discusses these sections in more detail. Paper [Yokoi and Fujiyoshi, 2005] [Heck et al., 2007] [Ariki et al., 2006] [Gleicher and Masanz, 2000] [Chou et al., 2010] [Nagai, 2009] [Wallick et al., 2004] [Lampi et al., 2007, Lampi et al., 2006] [Winkler et al., 2012] [Wulff and Fecke, 2012a, Wulff and Fecke, 2012b, Wulff and Rolf, 2011, Wulff et al., 2013] Camera Setup and Inputs One HD camera One HD camera Tracking Digital Camerawork Realtime Temporal Differencing with Bilateral Filtering “Region objects” (text on board) and lecturer tracked. Gesture based. 
Pan function placed at select points. Pan function placed at select points. Minimise an objective function. Virtual panning/zooming using Finite State Machine (FSM). N/A No Face detection and brighter blobs extraction Differential frames camera action table Yes Not mentioned No Frame differencing, audio tracking Colour segmentation and centroids Depth tracking Scriptable FSM Yes N/A Yes Pans based on location of lecturer and threshold Virtual director Yes One HD camera Player/ball segmentation through background subtraction One or two static cameras One PTZ camera N/A One AVCHD 1920 × 1080 Arbitrary Two PTZ cameras One PTZ camera and Microsoft Kinect device One PTZ camera “Decorators” detect if student/lecturer. Kmeans head detection No Yes No Yes Table 1: Taxonomy of surveyed systems. This table shows in what ways the systems we have surveyed relate to our system. More similar works are nearer to the top. 10 2.4 2.4.1 Discussion of key papers Discussion of the work by Heck et al The goal of [Heck et al., 2007] was more ambitious than ours, in that it encompasses zooming, shot transitions and special effects. Tracking is accomplished by background subtraction. Objects on the board (such as writing) are identified and their ROI points are stored. They have also used gesture recognition. The paper frames the required output as a path in a “shot sequence graph”. Loosely speaking, a graph is a set of vertices or nodes (represented by points) together with a set of edges (represented by lines between nodes). A shot sequence graph is a particular kind of directed acyclic graph (DAG). This means that the edges have direction, and there are no cycles (no node can get back to itself by following the directed edges). The nodes of the graph are types of shots which could be selected. A path in this graph is a sequence of nodes where each node is visited only once, and the nth node in the sequence has a directed edge pointing to the (n + 1)st node. The algorithm works by splitting the frames of the video into chunks of some specified length l. At each chunk, one of several shots could occur (long shot, medium shot, etc). Each of these possible shots is a node in the graph. Nodes at frame n have edges pointing to each of the nodes at frame n + l. The authors provide a multi-objective function. This is a function which takes as input a path through this graph and gives a numerical measure as to how good the corresponding camerawork is. A graph searching optimisation algorithm is then used to find an optimal path, corresponding to an optimal shot sequence. One of the reasons we did not adopt this approach was that it relies on specific tracking algorithms. The panning component of our system could not rely on specific inner workings of other components (such as the tracking component). Another issue is that there are many subcomponents to the system. We did not have the time to explore how tightly these were coupled. This was especially true since the paper was discovered half way through our project timeline. 2.4.2 Discussion of the work by Yokoi and Fujiyoshi The algorithm in [Yokoi and Fujiyoshi, 2005] begins by first applying a bilateral filter to the ROI location data. We are somewhat critical of this step. Traditional bilateral filtering occurs on an image. It sends the pixel intensity at each pixel to some other intensity. It is known that this is edge preserving, in the sense that a large variation of pixel intensity between adjacent pixels will remain large after the filter has been applied. 
The application of the filter in [Yokoi and Fujiyoshi, 2005] was not to an image but rather to a region of interest point where the input was time. Therefore the edge preserving nature of the bilateral filter on the ROI data is with respect to time. This means that if the lecturer were to suddenly race to the other side of the lecture theater, the ROI would accelerate rapidly, and this would create a sort of edge 11 in the ROI data. One insight gained from this paper, and perhaps a criticism of the paper, was that we should not be concerned with sharp edges. We should be more concerned about describing of the lecturers general motion on the order of seconds rather than frames. For instance, consider the result of the bilateral filter in Figure 3. Figure 3: This depicts the results of applying a bilateral filtering on data from [Yokoi and Fujiyoshi, 2005]. This figure illustrates the negative effect bilateral filtering can have near local extrema which is to distort points inwards. This is about 80 seconds of data. We can see that, at around frame 100 where the filtered region of interest is at location 400, there is a moment where the lecturer hesitates before continuing motion to the left (closer towards location 100). In fact, there is a slight spike upwards at this location in the unfiltered ROI data (given in the paper). This is detected as an edge by the bilateral filter and consequently there is a point of extrema there in the filtered data. By extrema, we mean points at which the function is bigger than both of its neighbours or smaller than both of its neighbours. At this edge, the filtered ROI shows a nearly vertical drop to the final location at 200 pixels (about frame 120). We will discuss this later. The algorithm proposed by [Yokoi and Fujiyoshi, 2005] then finds these extrema. It is unclear how it proceeds from here, saying only that “after calculating the distance of the detected next points, the moving period of the ROI is determined by threshold.” [Yokoi and Fujiyoshi, 2005, p.4] The diagram for these points is also somewhat cryptic since it does not show local extrema, rather the so called panning points. It can be found in Figure 4. For example there is an extreme point at about frame 1450 which is not indicated on the diagram. 12 Presumably using these points to construct panning periods works by first taking panning points which differ by 1 frame and discarding the earlier of the two. Then, of the remaining points, the odd points are associated with the start of a pan, ending at the next panning point. That said, it is still not clear from the paper how these points are found. Figure 4: The panning period given the isolated points in [Yokoi and Fujiyoshi, 2005]. The method of obtaining these points is somewhat unclear, as is the method of using them to construct the panning periods. We further analysed bilateral filters in general and found an explanation for the vertical drop behaviour so clearly seen in Figure 4. As is explained in [Buades et al., 2006], a bilateral filter belongs to a family of filters called neighbourhood filters. Such filters produce an effect around edges which the authors refer to as a “staircase effect”. This means that while bilateral filters preserve edges, they also introduce edges. When bilateral filters are applied to an image, they act on an intensity function (intensity of each pixel as a function of its x any y coordinate). In this case, the staircase effect applies to the intensity of pixels around the edges. 
Yokoi and Fujiyoshi applied the filter to a different function. In this case the staircase effect applies to the coordinate of the ROI near the time of the edge. This is a problem in Figure 4 around panning period 6: because an edge has been falsely introduced as the lecturer moves from 600 pixels to 200 pixels, a pan has been created over a very short time period (the pan covers 200 pixels in just a few milliseconds). The authors of [Buades et al., 2006] further explain that this behaviour can be circumvented by changing the filter. While the modification is not complicated, the details are beyond the scope of this discussion, but may be of interest in future work.

Another issue with the approach is that it does not use the original ROI locations of the lecturer when deciding when to pan. Instead, they have used the filtered ROI location which, as we have seen, tends to be a less accurate representation of the lecturer's location. Lastly, they do not seem to account for the size of the clipped frame. A larger clipped frame should need to pan less frequently, and this does not seem to be compensated for in their algorithm.

3 Design

One strong geometrical approach to better understanding the problem comes from simply plotting the x coordinate of the ROI as a function of frame number. This allows one to get an understanding of what is being returned from the tracker as well as what sorts of movements a lecturer typically makes. Such a plot, taken from initial tracking data, can be found in Figure 5. In this section we discuss the approaches taken in designing an algorithm to fit a panning motion to ROI data. For each approach, we discuss its strengths and weaknesses, and describe what insights we were able to gain. We then explain our proposed algorithm in detail.

Figure 5: A plot of the lecturer's ROI as a function of the frame number. The x position is in pixels from the left. This was an 800 × 600 recording, at 25 frames per second. This comes from some initial tracking data which we gathered.

3.1 Modelling the panning motion

The idea adopted in [Yokoi and Fujiyoshi, 2005] is that when panning, the cameraman typically accelerates for some fraction of the total time (call this p), and then decelerates for the remaining fraction 1 − p. Let ∆T be the total time, ∆X the total distance, and suppose we begin panning from position x0 at time t0, with elapsed time ∆t = t − t0. Then tmid = t0 + p∆T, tf = t0 + ∆T, and the position of the pan can be written as:

    x(t) = x0 + (∆X / (p ∆T²)) (∆t)²                                                    if t0 ≤ t ≤ tmid
    x(t) = x0 + p∆X + (2∆X / ∆T)(∆t − p∆T) − (∆X / ((1 − p) ∆T²))(∆t − p∆T)²            if tmid < t ≤ tf

Our derivation of this equation can be found in the appendix. Such an equation is derived in [Yokoi and Fujiyoshi, 2005], though the equation written there is incorrect and does not easily allow for varying values of p. Another panning motion is derived in [Heck et al., 2007], in which acceleration occurs for some time a, at which point constant velocity is maintained until time 1 − a, where deceleration occurs. The resulting function looks similar to the one above and can be found in Figure 2.

3.2 Experimenting with filtering

Performing a smoothing filter (such as a bilateral filter) on the data is not necessarily impractical. One advantage is that it can help to identify local extrema (points where the lecturer has changed direction). When the data are raw there is a fair amount of "jittering", and filtering smooths this out, revealing where the true turning points are.
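Returning briefly to the panning motion of section 3.1, the short Python sketch below evaluates the two-phase function numerically. The default p = 0.4 reflects the 40% heuristic reported by [Yokoi and Fujiyoshi, 2005]; all other parameter values are arbitrary and purely illustrative, and the function name is not part of the implemented system.

    def pan_position(t, t0, x0, dX, dT, p=0.4):
        """Position of the clipped frame while panning a distance dX over a
        time dT, accelerating for the first fraction p of the pan and
        decelerating for the remaining fraction 1 - p (see section 3.1)."""
        dt = t - t0
        if dt <= p * dT:                      # acceleration phase
            return x0 + (dX / (p * dT ** 2)) * dt ** 2
        s = dt - p * dT                       # deceleration phase
        return x0 + p * dX + (2 * dX / dT) * s - (dX / ((1 - p) * dT ** 2)) * s ** 2

    # illustrative pan: 300 pixels over 2 seconds, sampled every 10 frames at 25 fps
    for frame in range(0, 51, 10):
        t = frame / 25.0
        print(frame, round(pan_position(t, t0=0.0, x0=100.0, dX=300.0, dT=2.0), 1))

The sketch is easy to check: the position and velocity agree at tmid, and the pan ends exactly at x0 + ∆X. With the panning model in hand, we return to the idea of filtering the ROI data.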
In order to investigate this idea further, a web-based interface was developed to facilitate exploration of the data as well as its discrete derivatives. The derivatives give us information about the motion of the ROI point. This tool allows one to filter the data using a variable amount of neighbouring data points. It also allows the user to filter the data multiple times. In addition it offers the ability to add and remove points by clicking on the relevant location in the plane. The centered difference formula was used to calculate the discrete derivatives. 3.3 Derivative fitting After experimenting with the output of the web based tool, a curious idea occurred to us. The approach comes from assuming constant acceleration, so that the third derivative (representing “jerk”, the change in acceleration per unit time) should be 0. In this case, the lower order derivatives can give us information as to how the lecturer was moving. For instance, the method which was developed and considered worked as follows: First, using a threshold value, find the frames where the first derivative is approximately zero (representing the lecturer standing still). Classify these frames as frames in which the lecturer is stationary. 16 Figure 6: A web-based interface which was coded for exploration of ROI data. The aim was to guide design decisions for the panning algorithm. One can filter the data and view derivatives, which correspond roughly to velocity, acceleration and jerk. 17 Next, using another threshold value, find in the remaining frames those where the second derivative is about zero (representing the lecturer moving at some constant non-zero speed). Classify these as frames in which the lecturer is moving at a constant velocity. Finally, classify the remaining frames as those where the lecturer is accelerating or decelerating. For a visual depiction of this approach refer to Figure 7. The algorithm classifies each frame as either a frame in which the lecturer is stationary, moving constantly, or has begun/stopped moving. We can then find all instances of consecutive frames of the form “begin/stopped moving, moving constantly, moving constantly, ..., moving constantly, begin/stopped moving”. These frames were frames over which we panned. One problem with this approach is that it couples the panning component with the tracking component since the amount of filtering required depends on how jittery the ROI points from the tracker are. The filtering further determines which threshold values are ideal in a way which is not obvious. This approach doesn’t account for the fact that the distance between pixels and the true distance between what those pixels represent is not a simple relationship. When the lecturer moves by one pixel when they are on the edge of the camera view, the amount of space which they have moved is typically larger than when they move by one pixel in the center of the view. The exact relationship depends on the stitching technique, camera focal length, and even depth information. This method currently does not account for these changes, making it less likely to correctly account for the true speed of the lecturer. Another problem is that negligible velocities over a long period can be detected as stationary motion if the threshold is too tight. This causes points of acceleration to not be detected which in turn gives the effect of the lecturer leaving the frame. A mitigation strategy is to test if stationary motion points are indeed staying still. 
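A minimal sketch of this derivative-based classification is given below. The threshold values are placeholders, NumPy's gradient routine stands in for the centred difference formula, and no claim is made that these choices match the thresholds or filtering actually used during prototyping.

    import numpy as np

    def classify_motion(x, v_thresh=0.5, a_thresh=0.1):
        """Classify each frame of a 1-D ROI trace as 'stationary',
        'constant' (roughly constant velocity) or 'accelerating',
        using centred-difference derivatives (see section 3.3)."""
        x = np.asarray(x, dtype=float)
        v = np.gradient(x)        # first derivative (centred differences)
        a = np.gradient(v)        # second derivative
        return np.where(np.abs(v) < v_thresh, "stationary",
               np.where(np.abs(a) < a_thresh, "constant", "accelerating"))

    # hypothetical ROI trace: still, then a steady walk to the right, then still
    roi = [100.0] * 20 + [100.0 + 4 * i for i in range(30)] + [220.0] * 20
    print(classify_motion(roi)[:10])

In practice the classification is applied to smoothed data, and the thresholds interact with the amount of filtering, which is exactly the coupling problem described above.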
Through investigating this avenue we were able to gain some key insights into the problem: Perhaps the most important insight is that the extreme points of the ROI are important. If the lecturer moves to the left and then back again, we want to know where they were when they turned around since it informs when/whether to pan. It also informs how far we need to pan, since the lecturer won’t move further than that local extrema point where they turned around. Another insight is that the data may be mapped into other forms (e.g. via bilateral filtering) but it is important when panning to make use of the original data. We do not want to pan to some location in the filtered data, but rather to the location in the original data which relates to information which we discovered with the use of the bilateral filtered data. Further, through this process an important point was reiterated: there is no “perfect 18 Figure 7: This shows an approach to classifying the lecturers motion as accelerating, moving constantly or standing still. We hold threshold values on the derivatives around 0 (represented here as cyan bands). We then classify the lecturer as motionless if the first derivative is within its threshold, moving constantly if the second derivative is in its threshold, or accelerating otherwise. In all cases the frame number is along the horizontal axis. The topmost graph has the lecturers x coordinate as its vertical axis and the filtered ROI function is in cyan. A smoothing filter has been applied to each function 10 times with length 15. 19 pan”. If we are to frame this as an optimisation problem, the resulting pan just needs to be good enough. 3.4 Genetic algorithm With these insights in mind, a genetic algorithm (GA) was considered. For an explanation of GA’s, the reader is referred to [Goldberg, 1989]. The algorithm was not eventually implemented for reasons we discuss at the end of this section. The primary reasons this seemed to be a promising approach were that: Firstly, it is difficult to decide on a set of locations describing where to pan, but given a set of pans we are able to measure how well they solve the problem (using the metrics mentioned earlier). Secondly, we don’t seek the “best” solution, we simply seek one which is good enough. With this in mind, we sought to represent the problem in a way that a GA could solve. A sketch of the algorithm follows: A chromosome is represented as an array of 0’s and 1’s. Each index in the array corresponds to a frame in the video. A 0 represents “do not pan at this frame” while a 1 represents “if we are panning, stop, otherwise start panning”. If we start panning, we start from the coordinate of the lecturer at that frame. Similarly, we stop panning at the corresponding coordinate. Selection of 2 parents is done by randomly selecting 1 parent from the top 50% of the population and 1 parent randomly from the entire population. Crossover compares 2 parents, and arbitrarily selects one to be the dominant parent. This parent is scanned through, and each pan is iterated on. One pan may be represented in the form (t1 , t2 ) where ti indicates the index of the chromosome array where there is a 1. Now, we find the value at index t1 in the non-dominant parent. If we are in the middle of a pan at t1 then we find the index of the beginning of the pan in the non-dominant parent. Call this t01 . Otherwise there are 2 cases, either there exists a 1 in the interval (t1 , t2 ) or there does not. 
If there does not, then presumably it is beneficial to be in the non-dominating solution for the duration of this pan, so we remove the pan from the dominant parent. Otherwise we find t01 to be the first time we start panning in the non-dominant pan after t1 . We then remove the 1 from t1 in the dominant parent and add a 1 at t01 . This will either delay the pan or begin the pan earlier. The dominant parent gets placed into the new population. Mutation in this case would simply involve iterating through the chromosome and for each 1 encountered, shifting it to a nearby location with some probability. Furthermore, we can keep the top few fittest populations. 20 Another idea is to segment the video into 2 minute chunks and apply the algorithm on each chunk separately. In considering this approach, objective functions were developed. These are simply functions which evaluate the efficacy of a solution. These are given in the testing chapter of this report. Since we created many objective functions, a multi-objective optimisation was sought. The idea was to find so called “Pareto optimal” solutions a some weighted sum of the fitness functions. To understand what a Pareto optimal solution is, we must first define what it means for a solution to dominate another. A solution X dominates another solution Y if: X is no worse than Y for any of the objective functions X is strictly better than Y for some objective function Given this definition, a solution is Pareto optimal if no other solution dominates it. When we have multiple objective functions, the Pareto optimal solutions often sought since it is not usually true that one solution is better than all others for all objective functions. This is a standard approach to handling multiple objective functions in GA’s [Goldberg, 1989]. A problem with the above idea was that it was not clear whether the crossover step would work. For this step to be effective, it would ideally output a chromosome that is on average better than the parents. Another problem was that we were not sure whether we had enough time to implement it. We were also concerned that it was unclear how to choose the weights for the objective functions. An insight which was gained, however, was that these functions may be useful in evaluating the quality of the desired pan. We sought an algorithm that was likely to get satisfactory results, was deterministic in nature and simple to understand and improve upon. 3.5 Dynamic programming approach Another approach considered was dynamic programming. We can see that a solution to this problem is an initial location together with a sequence of 3-tuples of the form (ti , ti+1 , xi+1 ) which indicate when to start panning (ti ), when to stop panning (ti+1 ), and the final location of the pan (xi+1 ). This seemingly exhibits overlapping subproblems - if we know satisfactory panning solutions for each half of the lecture, it seems possible to construct a panning solution to the entire lecture. Similarly, if we know solutions for the first two quarters of the lecture, then we can find one for half the lecture, and so on. The problem here is that it is not actually very clear how exactly one should merge two solutions. One must account for the fact that the final location of the clipped 21 frame in the earlier solution is not generally the same as the initial location of the clipped frame in the later solution. 
If the ROI sequence is increasing between the solutions, most likely one would like to create a new pan from some point in the earlier solution to some point in the later solution. One strategy could be to merge overlapping or nearby pans. That is, if the earlier solution ends by panning, and the later solution begins by panning, the two pans could be merged into a single pan. Dynamic programming may work and it is another avenue that could be explored in future work; however, due to time constraints we decided not to explore this further. This idea further inspired our approach, which is outlined in the next section.

3.6 Analytic approach based on extrema and heuristic line approximation

Each step of the proposed algorithm is elaborated upon in this section, with the pseudocode for this approach given as:

    function pan(x_thresh, y_thresh)
        get_roi_points_from_tracker(input_url)
        smooth_data()                       # every 5 frames gets averaged
        find_cleaned_data_points(y_thresh)
        locate_local_extrema()
        define_intervals_around_extrema_points(x_thresh, y_thresh)
        merge_intervals()
        pan_between_intervals()

First we smooth the ROI function, removing some of the jittering in the data. After smoothing, we find a reduced subset of the data (called the cleaned data). The idea here is similar to that of [Yokoi and Fujiyoshi, 2005], although different points are found and they are used differently. We then locate extrema in these cleaned data. For each of the extrema, we want the clipped frame to be at the point of extremum for as long as possible in the resulting video. We thus approximate a horizontal line at each extremum which is not far from the ROI data at any frame. We then pan between each horizontal line. An important design tool used while developing this algorithm was the Python programming language, since it allowed for rapid prototyping and evaluation of various ideas without committing to writing excessive C++ code.

We now explain the algorithm in more detail. First we smooth the data. We do this by using a variable ave_width. For instance, if ave_width = 5 then we set the location at each of those 5 frames to the average location over the 5 frames. At 25 fps this accounts for time intervals of only 200 ms. This step removes some of the jitter. Pseudocode for this step is:

    function smooth_data()
        for i in [0, 5, 10, ..., num_frames - 5]
            sum_roi_location = 0
            for j in [0, 1, 2, 3, 4]
                sum_roi_location += roi_location[i + j]
            endfor
            ave_roi_location = sum_roi_location / 5
            for j in [0, 1, 2, 3, 4]
                roi_location[i + j] = ave_roi_location
            endfor
        endfor
    endfunction

Figure 8: Illustration of the cases for the "clean data" step of the developed algorithm. We use a y_thresh variable to add a select group of points to a new data set called the cleaned data. In the first case we don't want to remember intermediate extrema, while in the second case we went over some extremum and do want to remember it.

The second thing we do is iterate through the points until the distance between the maximal and minimal point on that range exceeds some threshold, y_thresh. In the analysis here, y refers to the coordinate of the lecturer and x refers to the frame number. At this point there are two cases to consider (up to symmetry). These cases are shown in Figure 8. One case is that we only increased/decreased on that range (Case 1).
Another case is that we increased/decreased, hit a local maximum/minimum, then decreased/increased and hit a local minimum/maximum lower/higher than our original point (Case 2). In the second case, we want to record the extremum in the middle, while in the first case we just record the new extremum. We call the recorded points the "cleaned data". Pseudocode for getting the cleaned data is:

    function find_cleaned_data_points()
        first_x = roi_location[0]
        cleaned_data = [first_x]
        max_roi_x = first_x
        min_roi_x = first_x
        i = 0
        while i < num_frames
            previous_cleaned_data_x = cleaned_data[-1]
            current_roi_x = roi_location[i]
            max_roi_x = max(max_roi_x, current_roi_x)
            min_roi_x = min(min_roi_x, current_roi_x)
            dx = current_roi_x - previous_cleaned_data_x
            if abs(dx) > y_thresh
                # either we only increased or only decreased on this range,
                # in which case only one of max_roi_x or min_roi_x changed
                only_decreased = (previous_cleaned_data_x == max_roi_x)
                only_increased = (previous_cleaned_data_x == min_roi_x)
                if only_decreased or only_increased
                    cleaned_data.append(current_roi_x)
                    i += 1
                    continue
                else if previous_cleaned_data_x > current_roi_x
                    # in this case there is a local maximum between
                    # previous_cleaned_data_x and current_roi_x
                    cleaned_data.append(max_roi_x)
                    i = frame_index_of(max_roi_x)
                    continue
                else if previous_cleaned_data_x < current_roi_x
                    # in this case there is a local minimum between
                    # previous_cleaned_data_x and current_roi_x
                    cleaned_data.append(min_roi_x)
                    i = frame_index_of(min_roi_x)
                    continue
                endif
            else                # still haven't reached y_thresh
                i += 1
            endif
        endwhile
        return cleaned_data
    endfunction

We then locate the extrema of these cleaned data. These are times around which we would like the clipped frame to be stationary. For this reason the algorithm approximates the panning function around these extrema as being horizontal for as long as such an approximation is accurate. For each extremum it finds the maximal interval of frames around which the ROI function does not leave the y_thresh range. We also define an x_thresh, which is a limit on the width of an interval. The algorithm then iterates to the left until either the lecturer leaves the y_thresh range or the x_thresh is exceeded; this gives the beginning of the horizontal line. Similarly, the algorithm iterates to the right to define the end of the line. This line is conceptually an interval which includes the extremum point. We want to pan between these intervals. A visual depiction of the state of the algorithm at this point can be found in Figure 9. The pseudocode for this process follows:

    function locate_local_extrema()
        extrema = []
        for i in [0, 1, ..., length(cleaned_data) - 1]
            x = cleaned_data[i]
            j = frame_number_of(x)
            x_roi = frames[j]        # corresponding data point in the ROI data
            # first take care of the boundaries
            if i == 0 or i == length(cleaned_data) - 1
                extrema.append(x_roi)
                continue
            endif
            x_prev = cleaned_data[i - 1]
            x_next = cleaned_data[i + 1]
            if x_next < x and x_prev < x
                extrema.append(x_roi)
            else if x_next > x and x_prev > x
                extrema.append(x_roi)
            endif
        endfor
    endfunction

    function define_intervals_around_extrema_points(x_thresh, y_thresh)
        intervals = []
        for i in [0, 1, ..., num_extrema - 1]
            extrema_y = extrema[i]
            x = frame_index_of(extrema_y)      # using the ROI data
            # look left
            interval_start = x - x_thresh
            for j in [0, 1, ..., x_thresh - 1]
                roi_y = roi_data[x - j]
                dy = abs(roi_y - extrema_y)
                if dy > y_thresh
                    interval_start = x - j
                    break
                endif
            endfor
            # look right
            interval_end = x + x_thresh
            for j in [0, 1, ..., x_thresh - 1]
                roi_y = roi_data[x + j]
                dy = abs(roi_y - extrema_y)
                if dy > y_thresh
                    interval_end = x + j
                    break
                endif
            endfor
            interval = [interval_start, interval_end]
            intervals.append(interval)
        endfor
        return intervals
    endfunction

Figure 9: This shows what is happening in the algorithm when we find intervals. The red stars represent the extreme points, and the cyan rectangles show the interval around each extreme point. The intervals are determined by the number of frames over which we can stay at the extreme point without having to pan. Overlapping intervals will be merged next.

After we find these intervals, we merge overlapping intervals. We then have a list of intervals which have the property that we do not need to pan on these intervals. For the final step we take each pair of adjacent intervals and start panning from the end of the first interval to the beginning of the second interval. The result can be seen in Figure 10.

Figure 10: Final step of the proposed algorithm. This shows how the intervals in the previous step (which have been merged) are used to create a sequence of pans. The cleaned ROI data is in blue and the constructed pan function is in red. The thresholds were chosen so that the image better explained what was occurring in the algorithm and, as such, there is a large pan between frames 500 and 750.

4 Testing of the proposed algorithm

4.1 General analysis of virtual panning

First we would like to put forward some functions which measure certain qualities which are desirable or undesirable in a good pan. We will measure our method's robustness using these functions and display the results in the form of contour graphs. These functions are fairly similar to those defined in [Heck et al., 2007]. We will discuss each of these in terms of what the undesirable quality is, how we can frame this mathematically, and how we refer to this metric.

The first undesirable quality is panning too frequently. This can be measured by simply counting how many pans we make per some unit time. We could bin this by associating with each bin of frames the number of times a pan has begun in that bin, or we can associate with each frame the number of times a pan has begun some set number of frames before and after it. To keep things simple, however, we simply count the number of times we pan per minute (averaged over the entire video). We call this "number of pans". If fps is the frames per second, frames is the number of frames in the video and num_pans is the number of times we pan in the video, then this metric is given by:

    num_pans · 60 · fps / frames

We also don't want to spend too much of our time panning. This may give one the feeling of motion sickness and is generally undesirable. We thus also count the total time spent panning as a percentage of the total time of the video and call this "panning total time". If pan_frames is the number of frames in which we are panning and frames is the total number of frames in the video, this metric is given by:

    pan_frames / frames
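These first two metrics are straightforward to compute from a list of pans. The Python sketch below assumes a hypothetical representation of a pan as a (start_frame, end_frame) pair; the function names and sample values are illustrative only.

    def number_of_pans_per_minute(pans, total_frames, fps=25):
        """Average number of pans per minute over the whole video."""
        return len(pans) * 60.0 * fps / total_frames

    def panning_total_time(pans, total_frames):
        """Fraction of the video spent panning."""
        pan_frames = sum(end - start for start, end in pans)
        return pan_frames / float(total_frames)

    # hypothetical pans in a 75000-frame (50 minute) video at 25 fps
    pans = [(1000, 1050), (4000, 4100), (9000, 9030)]
    print(number_of_pans_per_minute(pans, 75000), panning_total_time(pans, 75000))

The two penalty metrics defined next can be computed in much the same way.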
4 Testing of the proposed algorithm

4.1 General analysis of virtual panning

First we put forward some functions which measure qualities that are desirable or undesirable in a pan. We will measure our method's robustness using these functions and display the results in the form of contour graphs. These functions are fairly similar to those defined in [Heck et al., 2007]. We discuss each of them in terms of what the undesirable quality is, how we frame it mathematically, and how we refer to the metric.

The first undesirable quality is panning too frequently. This can be measured by counting how many pans we make per unit time. We could bin this by associating with each bin of frames the number of pans begun in that bin, or by associating with each frame the number of pans begun within some set number of frames before and after it. To keep things simple, however, we simply count the number of times we pan per minute, averaged over the entire video. We call this "number of pans". If fps is the frame rate, frames is the number of frames in the video and num_pans is the number of times we pan in the video, then this metric is given by:

\[ \frac{60 \cdot \text{fps} \cdot \text{num\_pans}}{\text{frames}} \]

We also do not want to spend too much of the time panning. This may give the viewer a feeling of motion sickness and is generally undesirable. We therefore also measure the total time spent panning as a fraction of the total time of the video and call this "panning total time". If pan_frames is the number of frames in which we are panning and frames is the total number of frames in the video, this metric is given by:

\[ \frac{\text{pan\_frames}}{\text{frames}} \]

Another quality which we do not desire is having the lecturer leave the frame for a long period of time, although leaving for a few milliseconds is not a problem. To pose this as a function over the entire video, we count the number of frames the lecturer is out of the clipped frame each time they leave it and square this number, which biases the metric against long absences. We then sum all of the squares to give a penalty. We call this "leave frame too long penalty". If the lecturer leaves the clipped frame n times, and the i'th absence lasts t_i frames, then the metric is:

\[ \sum_{i=1}^{n} t_i^2 \]

The last undesirable quality we define here is panning over too short or too long a time. A way to measure this is to create a piecewise function which assigns a penalty to each pan that is too short or too long; if a pan lasts longer than 10 seconds, or shorter than 50 ms, it is very bad. Finally, we divide the sum of the penalty scores by the number of pans to give the average penalty score, since this value is less biased by how many pans there are. We call this "pan length penalty". If the i'th pan lasts f_i (the first matching case applies), the penalty function is:

\[ p_i = \begin{cases} 0 & \text{if } 1\,\text{s} < f_i < 1.5\,\text{s} \\ 1 & \text{if } 300\,\text{ms} < f_i < 2\,\text{s} \\ 4 & \text{if } 200\,\text{ms} < f_i < 3\,\text{s} \\ 9 & \text{if } 100\,\text{ms} < f_i < 5\,\text{s} \\ 20 & \text{if } 50\,\text{ms} < f_i < 10\,\text{s} \\ 100 & \text{otherwise} \end{cases} \]

and the pan length penalty for the video is then:

\[ \frac{1}{n} \sum_{i=1}^{n} p_i \]
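As a concrete illustration of the four metrics, here is a minimal Python sketch. The frame rate, pan list and absence lengths are made-up values, and the out-of-frame durations are passed in directly rather than derived from tracking data.

    FPS = 25  # assumed frame rate for the toy example

    def pans_per_minute(pans, frames, fps=FPS):
        return 60.0 * fps * len(pans) / frames

    def panning_total_time(pans, frames):
        pan_frames = sum(end - start for start, end in pans)
        return float(pan_frames) / frames  # fraction of the video spent panning

    def leave_frame_penalty(absence_lengths):
        return sum(t * t for t in absence_lengths)  # squared length of each absence, in frames

    def pan_length_penalty(pans, fps=FPS):
        def penalty(duration_s):
            # first matching case applies, as in the piecewise definition above
            if 1.0 < duration_s < 1.5:
                return 0
            if 0.3 < duration_s < 2.0:
                return 1
            if 0.2 < duration_s < 3.0:
                return 4
            if 0.1 < duration_s < 5.0:
                return 9
            if 0.05 < duration_s < 10.0:
                return 20
            return 100
        scores = [penalty((end - start) / float(fps)) for start, end in pans]
        return sum(scores) / float(len(scores)) if scores else 0.0

    if __name__ == "__main__":
        pans = [(300, 330), (1200, 1235), (5000, 5400)]  # toy (start, end) frame pairs
        frames = 75000
        print(pans_per_minute(pans, frames), panning_total_time(pans, frames))
        print(leave_frame_penalty([5, 12, 40]), pan_length_penalty(pans))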
4.2 Testing of the algorithm

This algorithm relies on the thresholding parameters. In Figure 10, for instance, the pans at around frames 500 and 1200 both last a considerable amount of time; this is due to the horizontal threshold, x_thresh, being set too low. Setting y_thresh lower leads to more pans but enforces a stricter requirement that the lecturer stay within the clipped frame. Note that the algorithm does not rely on highly accurate tracking data (although more accurate data does reduce error), and it can even handle the situation where the tracker loses the lecturer for some time.

We set out to understand how the proposed algorithm performs as the threshold values are varied. To do this, we measure performance with respect to the above-mentioned metrics as the thresholds change. Plots of these metrics as a function of the x and y thresholds can be found in Figures 11, 12, 13 and 14.

4.2.1 Linear programming for determining thresholds

We now show how, if one considers ideal score ranges, a set of feasible x and y thresholds can be determined. Arguably, an average of 12 pans per minute or less is desirable, which works out to panning at most once every 5 seconds. Going by Figure 11 we can approximate this contour as a hyperbola and require that, approximately, x_thresh × y_thresh > 600. We probably do not want to pan more than 30% of the time; judging by Figure 12, this roughly means that we want y_thresh > (1/3) x_thresh. In terms of leaving the frame too long, a score of 625 corresponds to leaving the frame for no more than 25 frames, which is 1 second. This is perfectly acceptable to us, so given that all scores in Figure 13 are smaller than this, we make no prescription.

Figure 11: The number of pans per minute over a 50 minute lecture with 75,000 frames as we vary the x and y thresholds. Varying the x threshold makes only slightly less of a difference than varying the y threshold. The inverse relationship between either threshold and the number of pans is expected, since increasing either threshold creates larger intervals around the extrema, which consequently have a higher chance of being merged (fewer pans).

Figure 12: The percentage of the 50 minute lecture spent panning as we vary the x and y thresholds. Clearly the x threshold makes a larger difference than the y threshold.

Figure 13: The penalty for the lecturer leaving the frame over a 50 minute lecture with 75,000 frames as we vary the x and y thresholds. Clearly the x threshold makes a larger difference than the y threshold.

Figure 14: The penalty for the length of the pans over a video with 75,000 frames as we vary the x and y thresholds. An average penalty of 20 means that on average pans lasted around 50 ms or 10 s, which is undesirable. Arguably an average of less than 15 is desirable.

Lastly, we make use of the data in Figure 14. The pan length penalty assigns a score to each pan that is too quick or too slow; recall that a pan scores 9 if it lasts between 100 ms and 5 s, and 20 if it lasts between 50 ms and 10 s. We decided that an average pan score of 15 or less was desired, and we approximate the corresponding contour as a linear interpolation between (100, 0) and (200, 300). For this reason we want y_thresh > 3(x_thresh − 100).

The above conditions lead to a system of inequalities:

\[ y\_thresh > 600 \,(x\_thresh)^{-1} \]
\[ y\_thresh > \tfrac{1}{3}\, x\_thresh \]
\[ y\_thresh > 3\,(x\_thresh - 100) \]

The solution to this system is a region of threshold values in which we feel the panning is satisfactory. We choose a point roughly in the centre of this region: x_thresh = 100, y_thresh = 150.

We would like to point out an idea for future work: the feasible threshold values could be discovered on a per-video basis. One first starts with values which seem sensible, such as those above. Then, for threshold values in the neighbourhood, the scores can be calculated. These allow one to determine, for each metric, how the x and y thresholds would have to change in order for that metric's score to decrease (improve). If changing one of the thresholds improves all metrics, the threshold is changed and the process is repeated.
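A brief sketch of how such a feasibility region could be scanned programmatically. Only the three fitted constraints above are encoded here; in a per-video setting the metric scores themselves would be recomputed at each candidate threshold pair. The grid bounds are arbitrary.

    def feasible(x_thresh, y_thresh):
        """Check the three fitted constraints from Section 4.2.1."""
        return (x_thresh * y_thresh > 600 and
                y_thresh > x_thresh / 3.0 and
                y_thresh > 3.0 * (x_thresh - 100))

    def scan_thresholds(x_values, y_values):
        """Return all feasible (x_thresh, y_thresh) pairs on a coarse grid."""
        return [(x, y) for x in x_values for y in y_values if feasible(x, y)]

    if __name__ == "__main__":
        candidates = scan_thresholds(range(10, 301, 10), range(10, 301, 10))
        print(len(candidates), "feasible threshold pairs")
        print((100, 150) in candidates)  # True: the pair chosen above is feasible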
5 Implementation

In this section we discuss the implementation of the panning component and its integration with the rest of the system. It should be noted that the design process and the implementation process were tightly coupled, with several iterations leading to the final algorithm. This approach was chosen so as to develop the best algorithm in the given time frame.

5.1 Constraints

We decided that the system should be developed in C++ using the OpenCV library, which contains many methods useful in image processing, video processing and computer vision. C++ was selected because it is fast and because the Ubuntu operating system can run compiled C++ code; recall that CILT required the system to run on Ubuntu.

5.2 Implementation of the ClippedFrame helper class

The entire system has been developed in C++, and as such the digital camerawork component has been implemented so that it integrates with this system. First, we coded a class to abstract the behaviour of a clipped frame, called ClippedFrame. We conceived of only one such object ever being created; however, abstracting it this way allows multiple clipped frames to be handled without complicating the code, and also allows for flexibility. The object maintains a location within the boundary of the larger stitched frame, as well as its width and height. The location of the frame refers to its centre-most point. Specifically, it is calculated as:

\[ (x, y) = \left( \left\lfloor \frac{\text{width}}{2} \right\rfloor, \left\lfloor \frac{\text{height}}{2} \right\rfloor \right) \]

The object is able to clip a section of the larger frame and return a smaller clipped frame (in practice, this is then written to file). It can snap to any location: the location variables were made private and a method was created to change them. This method gracefully handles snapping outside the boundary of the stitched frame by instead snapping to the nearest location which is inside the boundary. The object is also able to move to a location over a specified amount of time (in seconds) with a specified p value (the percentage of the time spent accelerating), taking the fps rate into account. While this does impose our panning function, the code is simple enough that this function could easily be rewritten.

5.3 Implementation of the proposed algorithm

The algorithm which determines when to pan, given the ROI points from the tracker, was implemented in Python and called from the C++ program. There are several reasons for implementing the algorithm in Python. The algorithm takes a list of points and manipulates it to produce another list of points, and Python has many features which support manipulating such lists with very little code. For instance, the following Python code gives the forward difference of an array:

    [b - a for a, b in zip(arr, arr[1:])]

The equivalent C++ code takes longer to write. Should we later decide to use the centred difference instead, the Python code becomes:

    [0.5 * (b - a) for a, b in zip(arr, arr[2:])]

C++ code will generally take longer to change than Python code. As the project progressed, it became apparent that flexibility was required. One reason for this was that it seemed unlikely that the computational complexity of the algorithms required by the stitching component would decrease, which means that other input components may yet be explored. This was another reason for our decision not to rewrite the Python code in C++.

Through the design and implementation process we made many changes and tried many ideas. The code to plot the output of the pan and the penalty metrics was written using the matplotlib library, and Python's numpy library, which supports numerical calculations, was used in much of the plotting code. Neither of these libraries is required to create the panning points given the ROI data from the tracking component; the only dependency is Python 2.7, which is standard on Ubuntu. If one wanted to rewrite this code in C++, matplotpp is a C++ port of this library and would likely make the task less challenging. Many of the methods written in C++ and Python were accompanied by unit tests.

5.4 Issues encountered during implementation

Some issues did arise during the implementation process. The most pertinent one requires some background on the input video we were receiving. The input video was encoded with the H.264/MPEG-4 AVC codec; the role of a codec is to compress the video. On Ubuntu, a library for encoding and decoding video with this codec is x264, and a general-purpose tool for encoding and decoding is ffmpeg, which calls x264. The OpenCV library uses ffmpeg to encode and decode media, and the manner in which it does so caused an error, since the code was not backward compatible. Diagnosing and trying to work around this bug took two weeks of project time. We tried reinstalling x264, reinstalling ffmpeg, and recompiling older versions from source (as well as OpenCV and other relevant libraries). The bug persisted and we eventually had to accept being unable to write video in the H.264/MPEG-4 format directly. We did find a way to circumvent this, though it is not ideal: one can encode the output video using some other codec and then use a tool like ffmpeg to convert it back to the H.264/MPEG-4 format. This does, however, mean that the quality of the output video decreases somewhat.
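For example, the conversion step of this workaround could be driven from Python roughly as follows; the file names and container formats here are placeholders, not the project's actual pipeline.

    import subprocess

    # Re-encode an intermediate file (written with whichever codec OpenCV could use)
    # back to H.264/MPEG-4 using ffmpeg's libx264 encoder.
    subprocess.check_call([
        "ffmpeg", "-i", "panned_intermediate.avi",
        "-c:v", "libx264",
        "panned_output.mp4",
    ])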
5.5 Run time of the component

Lastly, as it is important that our system processes videos quickly, we comment on the run time of our component. Recall that we created scores to measure the quality of the output. To generate the plots of these scores we ran 75,000 ROI locations through our algorithm (with some overhead for plotting code) with varying thresholds on a machine with a 1.50 GHz processor. To plot the scores, the algorithm was run 900 times (once for each threshold pair). The scores were plotted several times, and the running time for this process was consistently under 14 minutes. Averaging this, we empirically deduce that our algorithm takes on the order of 1 second to complete on a reasonable machine.

The reason our algorithm runs so fast is that its input is a list of around 10^5 numbers, which is small compared to the input to the stitching or tracking components, whose inputs are on the order of 10^7 times larger. Since it is clear that our algorithm is sufficiently fast, no formal analysis of complexity was undertaken and no optimisations were made to the algorithm. Regarding the creation of the output video given the panning points from the algorithm, each frame in the input video is iterated over once. At each frame, an update method is called on the ClippedFrame object; this is a constant-time operation, as it simply updates the coordinates of the clipped frame if it is in motion. For this reason we infer that our component can be no worse (asymptotically) than the stitching or tracking components.

5.6 Results of the component as integrated

The results of the panning component seem promising. As mentioned, we were able to find algorithm parameters which were sufficient. Further, we were able to produce videos with our virtual panning component, as can be seen in Figure 15.

Figure 15: The output of the panning component. The image shows a series of frames in our input video over which it pans. The graph shows the panning function in red and the ROI data in blue. The cyan band indicates the width of the clipped frame.

6 Conclusion

Our aims were to explore methods of virtual panning, design an algorithm which accomplishes this task, and integrate this algorithm into a software system. We explored and analysed techniques used in previous systems and found those employed by [Heck et al., 2007] and [Yokoi and Fujiyoshi, 2005] to be of particular interest. Through analysing these systems, we found many ideas which were of use to us.
We also identified the limitations of directly applying these techniques to our project. We tried various approaches to designing an algorithm which creates the required digital camerawork, and through this process we developed a method to test the efficacy of such an algorithm. The algorithm was implemented as a component of the system, as required, and the quality of the output video was assessed using the methods we developed to test the algorithm. We found that the component produces virtual panning which exhibits all of the properties deemed desirable.

The system will be released as an open source project and the software is free to use. It is expected that more work will be required before CILT can make adequate use of the system, due to the considerable time taken when stitching the video.

7 Future work

As development progressed, a number of avenues for future work were discovered. It was found that stitching the video takes a considerable amount of time; one suggestion for future work is therefore to use a virtual director to decide which of the input videos to choose at a given frame. The use of virtual directors was explored in the background chapter of this report.

We have mentioned that other approaches to creating a virtual panning algorithm could be explored. These include the construction of a genetic algorithm, or framing the problem in such a way that dynamic programming can solve it. We have seen, in the approach by [Heck et al., 2007], that the problem can be phrased in terms of shot sequence graphs.

Our algorithm could also be improved by automating the process of parameter tuning. We have outlined one approach to doing this by showing how to frame it as a linear programming problem, and many efficient algorithms exist for solving linear programming problems.

Appendices

A Derivation of panning motion

Let ΔT be the total time to pan and ΔX the total distance. The analysis here is along a single axis, but a projection to two axes is trivial (take the dot product), e.g.

\[ s(t) \mapsto \langle s(t)\cos(\theta_0),\; s(t)\sin(\theta_0) \rangle \]

Let t_0, t_1 and t_2 be such that we begin accelerating at t_0, begin decelerating at t_1, and stop panning at t_2. Then

\[ \Delta T = t_2 - t_0 = \Delta t_1 + \Delta t_2, \qquad \text{where } \Delta t_i = t_i - t_{i-1}. \]

Lastly, let v_i be the velocity at t_i, so that v_0 = v_2 = 0 and v_1 = a_0 Δt_1. From basic kinematics, the equations of motion yield:

\[ x_1 - x_0 = \tfrac{1}{2} a_0 \Delta t_1^2 \]
\[ x_2 - x_1 = v_1 \Delta t_2 + \tfrac{1}{2} a_1 \Delta t_2^2 \]

Here a_0 and a_1 are coefficients of acceleration, still to be determined. Now we add the assumption that Δt_1 = pΔT for some p ∈ (0, 1), so that

\[ t_1 = (1 - p)\,t_0 + p\,t_2. \]

The boundary condition that v_1 = v_max at t = t_1 implies

\[ v_{\max} = v_1 = a_0 \Delta t_1 = -a_1 \Delta t_2, \]

so that

\[ a_0 = \frac{p - 1}{p}\, a_1, \qquad \Delta t_1 = p\,\Delta T, \qquad \Delta t_2 = (1 - p)\,\Delta T. \]

This enables us to rewrite the system as:

\[ x_0 = x_1 - \tfrac{1}{2}\,\frac{p - 1}{p}\,(p\,\Delta T)^2\, a_1 \]
\[ x_2 = x_1 + \left( \frac{p - 1}{p}\,(p\,\Delta T)\big((1 - p)\,\Delta T\big) + \tfrac{1}{2}\big((1 - p)\,\Delta T\big)^2 \right) a_1 \]

which becomes:

\[ x_0 = x_1 + \tfrac{1}{2}\, p\,(1 - p)\,\Delta T^2\, a_1 \]
\[ x_2 = x_1 - \tfrac{1}{2}\,\big((1 - p)\,\Delta T\big)^2\, a_1 \]

Subtracting the first from the second gives:

\[ \Delta X = -\tfrac{1}{2}\,\Delta T^2\, a_1\,(1 - p) \]

So finally:

\[ a_1 = -\frac{2\,\Delta X}{(1 - p)\,\Delta T^2}, \qquad a_0 = \frac{2\,\Delta X}{p\,\Delta T^2} \]

and the equations of motion become:

\[ x(t) = x_0 + \begin{cases} \dfrac{\Delta X}{p\,\Delta T^2}\,(t - t_0)^2 & \text{if } t - t_0 \le p\,\Delta T \\[2ex] p\,\Delta X + \dfrac{2\,\Delta X}{\Delta T}\,(t - t_0 - p\,\Delta T) - \dfrac{\Delta X}{(1 - p)\,\Delta T^2}\,(t - t_0 - p\,\Delta T)^2 & \text{otherwise} \end{cases} \]
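As a sanity check on this derivation, a short Python rendering of the resulting motion profile follows. The function name and the sample values of ΔX, ΔT and p are illustrative only; the real implementation lives in the C++ ClippedFrame class.

    def pan_position(t, x0, dx, dt, p):
        """Position along one axis at time t (seconds since the pan started),
        accelerating for the first p*dt seconds and decelerating afterwards."""
        if t <= p * dt:
            return x0 + (dx / (p * dt * dt)) * t * t
        s = t - p * dt
        return x0 + p * dx + (2 * dx / dt) * s - (dx / ((1 - p) * dt * dt)) * s * s

    if __name__ == "__main__":
        # pan 400 pixels over 2 seconds, accelerating for the first 25% of the pan
        for t in [0.0, 0.5, 1.0, 1.5, 2.0]:
            print(t, round(pan_position(t, x0=0.0, dx=400.0, dt=2.0, p=0.25), 1))
        # the pan ends at exactly x0 + dx, with zero velocity at both ends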
References

[Ariki et al., 2006] Ariki, Y., Kubota, S., and Kumano, M. (2006). Automatic production system of soccer sports video by digital camera work based on situation recognition. In Multimedia, 2006. ISM'06. Eighth IEEE International Symposium on, pages 851–860. IEEE.

[Buades et al., 2006] Buades, A., Coll, B., and Morel, J.-M. (2006). The staircasing effect in neighborhood filters and its solution. Image Processing, IEEE Transactions on, 15(6):1499–1505.

[Chou et al., 2010] Chou, H.-P., Wang, J.-M., Fuh, C.-S., Lin, S.-C., and Chen, S.-W. (2010). Automated lecture recording system. In System Science and Engineering (ICSSE), 2010 International Conference on, pages 167–172. IEEE.

[Christianson et al., 1996] Christianson, D. B., Anderson, S. E., He, L.-w., Salesin, D. H., Weld, D. S., and Cohen, M. F. (1996). Declarative camera control for automatic cinematography. In AAAI/IAAI, Vol. 1, pages 148–155.

[Christie et al., 2005] Christie, M., Machap, R., Normand, J.-M., Olivier, P., and Pickering, J. (2005). Virtual camera planning: A survey. In Smart Graphics, pages 40–52. Springer.

[Gleicher and Masanz, 2000] Gleicher, M. and Masanz, J. (2000). Towards virtual videography (poster session). In Proceedings of the eighth ACM international conference on Multimedia, pages 375–378. ACM.

[Goldberg, 1989] Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA.

[Heck et al., 2007] Heck, R., Wallick, M., and Gleicher, M. (2007). Virtual videography. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), 3(1):4.

[Kumano et al., 2005] Kumano, M., Ariki, Y., and Tsukada, K. (2005). A method of digital camera work focused on players and a ball. In Advances in Multimedia Information Processing - PCM 2004, pages 466–473. Springer.

[Lampi et al., 2007] Lampi, F., Kopf, S., Benz, M., and Effelsberg, W. (2007). An automatic cameraman in a lecture recording system. In Proceedings of the international workshop on Educational multimedia and multimedia education, pages 11–18. ACM.

[Lampi et al., 2006] Lampi, F., Scheele, N., and Effelsberg, W. (2006). Automatic camera control for lecture recordings. In World Conference on Educational Multimedia, Hypermedia and Telecommunications, volume 2006, pages 854–860.

[Nagai, 2009] Nagai, T. (2009). Automated lecture recording system with AVCHD camcorder and microserver. In Proceedings of the 37th annual ACM SIGUCCS fall conference, pages 47–54. ACM.

[Wallick et al., 2004] Wallick, M. N., Rui, Y., and He, L. (2004). A portable solution for automatic lecture room camera management. In Multimedia and Expo, 2004. ICME'04. 2004 IEEE International Conference on, volume 2, pages 987–990. IEEE.

[Winkler et al., 2012] Winkler, M. B., Hover, K. M., Hadjakos, A., and Muhlhauser, M. (2012). Automatic camera control for tracking a presenter during a talk. In Multimedia (ISM), 2012 IEEE International Symposium on, pages 471–476. IEEE.

[Wulff and Fecke, 2012a] Wulff, B. and Fecke, A. (2012a). LectureSight - an open source system for automatic camera control in lecture recordings. In Multimedia (ISM), 2012 IEEE International Symposium on, pages 461–466. IEEE.

[Wulff and Fecke, 2012b] Wulff, B. and Fecke, A. (2012b). LectureSight - an open source system for automatic camera control in lecture recordings. In Multimedia (ISM), 2012 IEEE International Symposium on, pages 461–466. IEEE.

[Wulff and Rolf, 2011] Wulff, B. and Rolf, R. (2011). OpenTrack - automated camera control for lecture recordings. In Multimedia (ISM), 2011 IEEE International Symposium on, pages 549–552. IEEE.

[Wulff et al., 2013] Wulff, B., Rupp, L., Fecke, A., and Hamborg, K.-C. (2013). The LectureSight system in production scenarios and its impact on learning from video recorded lectures. In Multimedia (ISM), 2013 IEEE International Symposium on, pages 474–479. IEEE.
[Yokoi and Fujiyoshi, 2005] Yokoi, T. and Fujiyoshi, H. (2005). Virtual camerawork for generating lecture video from high resolution images. In Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on, 4 pp. IEEE.