
Recovering Evidence from Video by fusing Video Evidence Thesaurus and Video Meta-Data (REVEAL)

An EPSRC Proposal: The Case for Support
CONTENTS

PART A: TRACK RECORD
1 Digital Imaging Research Centre, Kingston University
  Dr Graeme A. Jones, Digital Imaging Research Centre
  Professor Tim Ellis, Digital Imaging Research Centre
2 Centre for Knowledge Management, University of Surrey
  Professor Khurshid Ahmad, Centre for Knowledge Management
  Bogdan Vrusias, Centre for Knowledge Management
3 Sira Ltd
  Dr John Gilby, Innovation
4 Police Information Technology Organisation (PITO)
5 Police Scientific Development Branch (PSDB)

PART B: DESCRIPTION OF THE PROPOSED RESEARCH
5 Background
  Strategic Importance for the UK
  Previous Research
6 Programme and Methodology
  Aims and Objectives
  Methodology
7 Relevance to Beneficiaries
8 Dissemination and Exploitation
9 Justification of Resources
10 References
APPENDIX 1: DIAGRAMMATIC WORKPLAN
APPENDIX 2: LETTERS IN SUPPORT OF REVEAL PROPOSAL

PART A: TRACK RECORD

1 Digital Imaging Research Centre, Kingston University

The Digital Imaging Research Centre (DIRC) at Kingston University has developed one of the largest visual surveillance groups nationally. Research activity and industrial consultancy have focused on the development of robust, plug-and-play surveillance components, integrated wide-area multi-camera systems, and behavioural analysis for public transport applications. DIRC has been involved in many EPSRC and EU visual surveillance projects, including Assessment of Image Processing Techniques as a Means of Improving Personal Security in Public Transport (GR/M29436/02); Traffic Simulation and Optimisation on an Intelligent Video Surveillance Network (GR/N17706/01); Annotated Digital Video for Surveillance and Optimised Retrieval (ADVISOR, IST-1999-11287); and Pro-Active Integrated Systems for Security Management by Technological Institutional and Communication Assistance (PRISMATICA, GRD1-2000-10601). More recently, Kingston (along with Sira Ltd and the Universities of Kent and Reading) is leading the EPSRC ViTAB Network (Video-based Threat Assessment and Biometrics, GR/S64301/01), which aims to encourage those video-interpretation technologies most likely to have the greatest impact on future police identification, authentication and threat assessment capabilities.

Dr Graeme A. Jones, Digital Imaging Research Centre

Currently the Director of DIRC, Dr Graeme A. Jones has over 15 years' experience in image and video analysis and has (co-)authored over 100 technical papers related to computer vision. Dr Jones chaired the British Machine Vision Association workshop on Visual Surveillance and was co-chair of the IAPR Workshop on Advanced Video-based Surveillance Systems in 2001. He is currently co-investigator on an EPSRC grant (GR/N17706/01) on simulation and optimisation of video surveillance networks and on the INMOVE project (IST-2000-37422) on intelligent mobile video environments. He is principal investigator on the EPSRC ViTAB Network (see above).

Professor Tim Ellis, Digital Imaging Research Centre

Tim Ellis was recently appointed Professor and Head of School in Computing and Information Systems at Kingston University; prior to this he was a Reader in Machine Vision at City University. Starting with a contract in 1987 to investigate the feasibility of detecting and identifying visual events in short image sequences (funded by the Scientific Research and Development Branch of the Home Office), Professor Ellis has over 15 years' expertise in visual surveillance and has published over 80 technical papers. His recent EPSRC IMCASM project (Intelligent Multi-Camera Surveillance and Monitoring, GR/M58030) focused on the problem of integrating the tracking of pedestrians and vehicles across a network of video cameras. He is currently a member of the Executive Committee of the British Machine Vision Association.

2 Centre for Knowledge Management, University of Surrey

The Centre, located within a 6*-rated UoA (Electronics), has 7 academics, 3 RAs and 25 PhD students, and is amongst the most active centres in Europe in terminology, knowledge and ontology acquisition.
The Centre explores the link between different modalities of communication: text, image, and numbers. A model for integrating and co-learning visual features and textual descriptions of collections of (crime scene) images was developed in the EPSRC Scene of Crime Information System Project (GR/M89041/01). This fusion of text and image modalities led to a high-performance image retrieval system. Subsequently, the model was adapted in the EPSRC Television in Words Project (GR/R67194/01) for automatically creating audio descriptions of moving images by using the narrative structures of films. The fusion of textual information (financial news) and numerical information (share-price movements) has helped develop models of how market sentiment affects the values of financial instruments, and led to a system that generates buy/sell signals (IST GIDA Project 2000-31123, IST Project ACE 22271). The simultaneous processing of large volumes of news and time-series data requires the investigation of new architectures, especially the newly emergent GRID-based systems: a goal of the Centre's new ESRC e-Science demonstrator project FINGRID.

Professor Khurshid Ahmad, Centre for Knowledge Management

Khurshid Ahmad is Professor of Artificial Intelligence at the University of Surrey and Head of Department. He has published over 125 articles, and two books, in the areas of terminology and ontology extraction, knowledge discovery, and multi-net neural systems. He is a member of the EPSRC College of Peer Reviewers. He is the Principal Investigator of ESRC FINGRID – an e-Science project for financial analysis – and of the EU-IST GIDA project investigating the role of automatic text analysis in financial trading. He was principal investigator on the Scene of Crime Information System Project (1999-2002) and has led ten EU IST and ESPRIT projects since 1989, three EPSRC projects and one Teaching Company Scheme. He has successfully supervised 15 PhD students.
Bogdan Vrusias, Centre for Knowledge Management

Bogdan Vrusias is Lecturer in Neural Computing. His contributions lie in neural network research, data mining techniques, and expert systems. He was the RA on the Scene of Crime Information System Project between 2000 and 2002, where he specified and developed a multi-net classifier. Earlier, he developed Profiler's Workbench, a commercial tool for profiling opinion poll data using statistical and neural classifiers, under the auspices of the EPSRC/DTI Teaching Company Scheme TCS-1940 (1998-2000).

3 Sira Ltd

Sira Ltd is an independent (non-profit-distributing) research organisation which carries out collaborative research in co-operation with industry and academia. Its activities span fundamental research through to technology commercialisation in the areas of sensing and imaging. It also plays a key role in postgraduate training and has been instrumental in the development of a number of methodologies to increase the take-up of science and technology into industry and society. Core to these activities is its expertise in the provision of holistic systems incorporating physical measurements with advanced data processing. Sira is the lead partner in the Intersect (Intelligent Sensing) and Imaging Faraday Partnerships and participates in research projects supported by the EPSRC, NERC, European Space Agency, European Commission, National Measurement System, Carbon Trust and BNSC. It manages two FP5 projects in intelligent instrumentation and is a core member of the GOSPEL (IST-507610) Network of Excellence on artificial olfaction.

Dr John Gilby, Innovation

Having taken an MA in Engineering and Mathematics from the University of Cambridge, Dr John H. Gilby was awarded a PhD from the University of Surrey on adaptive control of robot arms. He is Sira's Chief Scientist, with 20 years' experience of measurement, instrumentation and sensing systems, and is currently responsible for directing Sira's research in enabling technologies. As co-ordinator of the Sira/UCL Postgraduate Training Partnership he was responsible for the overall supervision of the eighty PhD students involved in the scheme, personally supervising 16 students in areas such as machine learning, intelligent systems and systems engineering. He is a member of the UKIVA Committee and is co-opted to the Executive Committee of the BMVA. Dr Gilby is a member of the ECVision Network in Cognitive Vision and an expert advisor for the FP6 IST programme. A member of the EPSRC Peer Review College, he participates in the Software Technologies panel, chairs the People and Interactivity panel and participates in reviews of EPSRC-funded machine vision projects. He is currently expert advisor to the Police Scientific Development Branch on the VITAL project to develop a test-bed for video motion detectors.

4 Police Information Technology Organisation (PITO)

PITO is a non-departmental public body that has been charged under the Police Act of 1998 to support and manage the delivery of information and communications technologies to the police services of England and Wales.
A major part of this effort is focused on the provision of human identification capabilities to meet immediate and future requirements, ensuring that the objectives of preventing and detecting crime and reducing the fear of crime are met effectively. Amongst those requirements, both current and future, are those related to the delivery of video-based surveillance and authentication. The use of CCTV systems and related technologies within diverse areas of policing is well established. However, in supporting this proposal PITO is fulfilling its role to assist and support further development and research into improving current capabilities, to ensure their efficiency and effectiveness in meeting the strategic and technological needs of the police service now and in the future.

5 Police Scientific Development Branch (PSDB)

The Police Scientific Development Branch (PSDB) is a core part of the Home Office, providing technical and scientific advice to the UK Government and police service on a wide range of policing and security issues. This includes advice on the implementation of CCTV systems for crime reduction and security purposes. Guidance is also provided to police on the effective use of systems and techniques for extracting information from evidential video recordings. The development of new equipment and techniques is encouraged through collaborative working with industry and academia.
PART B: DESCRIPTION OF THE PROPOSED RESEARCH

5 Background

Strategic Importance for the UK

The Home Office has recently outlined a five-year strategic framework [1] whose goal is "to ensure that the police service is equipped to exploit opportunities in science and technology to deliver effective policing". A key component of this strategy has been the early identification of technology supporting future police service capabilities. Technology requirements for the interpretation of CCTV video streams include "improving the effectiveness of CCTV at detecting crime and supporting prosecutions", "effective use of intelligence-gathering technology", "maximising the forensic value of evidence", and "the effective management of investigations – especially timely collection and management of evidence".

During any major crime investigation, establishing the identity of vehicles and individuals involved in crime incidents necessitates the extremely time-consuming process of manually annotating all available CCTV tapes and digital archives. Any technology for recovering intelligence automatically from video footage must therefore be a priority for the development of the evidence-gathering capability of our police forces. This project aims to advance research in the recovery of evidence from video footage. There are two key crime-oriented applications that will directly benefit from this research. First, video summarisation of CCTV archives, i.e. the automatic generation of a gallery of mug-shots and numberplates for all moving objects; such a gallery represents the most effective method of enlisting the knowledge of local police officers and the general public. Second, automatic annotation of video footage, ensuring that all evidence is capable of automatic entry into HOLMES 2 – the investigation management system used by police forces to collect, manage and analyse intelligence data.

Two novel areas of investigation are proposed. First, methods for representing and analysing crowds are to be developed to process typically crowded scenes. Second, multi-modal data fusion couples the linguistic structure of current police annotation practice with the meta-data structure of the video interpretation process to generate a rich, homogeneous data representation that can drive the annotation process [35].

Previous Research

Video Summarisation

The proposed on-demand extraction of video evidence using keywords in effect requires the extraction of a summary of the video. Video summarisation involves the intellectual challenge of describing the contents of a sequence and the technical challenge of extracting information from, and managing, large data streams. Summarisation techniques have been pioneered in extracting key events in highly structured situations such as soccer matches, to generate highlights [2] where high-level abstractions involve manual labelling of objects and events. The description of such events should have well-grounded semantics to distinguish bounded events from unbounded events, and should take advantage of what is known about the description of events – whether one requires specialist structures to represent events which are pre-linguistic, or whether extant linguistic structures will suffice [3].
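To make this concrete, the sketch below shows one minimal form such an event abstraction might take: a record whose missing endpoint marks an unbounded (still ongoing) event. The language is Python and the field names are illustrative rather than a committed schema.

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class VideoEvent:
        """One entry in a video summary; field names are illustrative only."""
        label: str                  # e.g. "person enters", "vehicle stops"
        start_frame: int
        end_frame: Optional[int]    # None marks an unbounded (ongoing) event
        actors: Tuple[str, ...]     # identifiers of the participating tracks
        confidence: float           # detector/tracker belief in [0, 1]

        @property
        def bounded(self) -> bool:
            """Bounded events have a definite endpoint; unbounded events do not."""
            return self.end_frame is not None

    # A two-entry summary: one completed event, one still in progress.
    summary = [
        VideoEvent("person enters", 120, 180, ("track-7",), 0.92),
        VideoEvent("crowd forming", 300, None, ("region-2",), 0.71),
    ]
    print([e.label for e in summary if not e.bounded])   # ['crowd forming']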
A number of techniques have been developed for generating descriptions of video in terms of conceptual and presentational qualities, and mark-up languages have been developed to use such multimedia abstractions [4]. A combination of techniques for video summarisation embedded in a well-grounded abstraction will help in creating a system for systematically summarising video archives. The augmentation of the visual features with a semantically richer linguistic description of events will help address the intellectual challenge.

Robust Detection of Objects in Video Sequences

Threat assessment technologies rely on the detection and tracking of objects within the scene. In the USA, the ambitious Visual Surveillance and Monitoring project [5] integrated twelve research laboratories in the implementation of people and vehicle tracking using calibrated cameras. In Europe much of this work is reported through the PETS series of workshops. Recent work in the UK has focused on multi-camera tracking [6] and auto-calibration [7,8]. Indeed, learning is likely to be a key feature of future plug-and-play surveillance components, e.g. observing a large number of events to construct calibration, appearance and activity models. Such activity scene models could be used to modify the operation of a classical object tracker, or combined systems can be used for dynamic scene interpretation including behavioural analysis [9]. Despite these successes, security cameras are typically placed either close to the public (near field) or in public spaces with high levels of activity (crowds). The resulting high degree of occlusion undermines these classical detection and tracking techniques, which suit far-field scenes exhibiting relatively sparse activity – although these approaches have been extended to groups [10].
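For orientation, the classical far-field pipeline referred to above is background subtraction followed by connected-component grouping. The sketch below illustrates this baseline in Python with OpenCV; the file name and thresholds are illustrative, and this is the generic technique rather than the project's own detector.

    import cv2

    # Classical far-field detection: background subtraction followed by
    # connected-component grouping. File name and thresholds are illustrative.
    cap = cv2.VideoCapture("surveillance.avi")
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)                  # foreground mask
        mask = cv2.medianBlur(mask, 5)                  # suppress speckle noise
        _, mask = cv2.threshold(mask, 200, 255,
                                cv2.THRESH_BINARY)      # drop shadow pixels
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        boxes = [cv2.boundingRect(c) for c in contours
                 if cv2.contourArea(c) > 100]           # ignore tiny blobs
        # 'boxes' would feed the object-specific meta-data stream.
    cap.release()

In near-field or crowded views the foreground mask merges into a few large blobs, which is precisely the failure mode that motivates the crowd representations of Work Package 4.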
Predicting crowd dynamics to underpin the design of safe public spaces has attracted substantial interest over the past few years. In computer graphics, crowds are modelled by agents, cellular automata or particle systems with behaviours that obey both physical and psychological rules [11,12]. Despite this, the visual analysis of crowds is still a relatively unexplored domain [13,14].

Dynamic Scene Interpretation and Behavioural Analysis

Dynamic scene interpretation relies on object tracking systems to perform behavioural analysis or activity recognition [15]. A state-of-the-art survey of approaches to learning and understanding scene activity may be found in Buxton [16], detailing probabilistic, neural and symbolic network approaches to recognising temporal scenarios. In probabilistic/neural network approaches – e.g. Bayesian networks or classifiers, and hidden Markov models (HMMs) – nodes usually correspond to scenarios that are recognised at a given instant with a computed probability [17,18]. Events of interest within a typical scene are usually salient because of the interaction of two or more objects, whether people or vehicles. Brand et al. introduced the Coupled HMM (CHMM) formulation for modelling interactions between two or more pedestrians [19]. Symbolic network approaches to the recognition of scenarios may use a declarative representation of scenarios defined as a set of spatio-temporal and logical constraints. Traditional constraint resolution [20] or temporal constraint propagation [21] techniques are employed to recognise scenarios [22]. Vu et al. propose a novel approach to temporal scenario recognition which employs a declarative model to represent scenarios and a logic-based approach to recognise pre-defined scenario models [23]. Elsewhere, a probabilistic context-free parser has been used to build action representations of increased complexity [24].

Automatic Construction of Ontologies and Thesauri for Indexing Images

During the last three decades significant effort has been spent on image retrieval systems that focus exclusively on vision-specific features such as colour distribution, shape and texture – content-based image retrieval (CBIR) systems [25]. However, CBIR systems have an inherent limitation in that visual properties are not sufficient to identify arbitrary classes of objects [26]. The alternative of associating keywords with an image sequence is not only time-consuming, but the choice of keywords is likely to be highly variable between operators [27]. This has led to so-called multimodal systems that use linguistic features extracted from textual captions or descriptions together with the visual features for storing and retrieving images in databases [28]. Query expansion – the augmentation of search terms – derives additional search terms from a thesaurus or from related documents. However, when images are of a specialist nature, there is a need to create domain-specific thesauri. These could be built manually by expert lexicographers, but such thesauri are typically time-consuming to build, highly subjective and error-prone, and may well exhibit inadequate domain coverage. One solution is the automatic generation of thesauri for specialised domains from representative collateral texts [29,30].
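One established corpus-based measure of termhood compares a word's relative frequency in a specialist corpus against its relative frequency in a general reference corpus (the "weirdness" ratio used in the Surrey terminology work); high values flag candidate domain terms. A toy sketch, with invented corpora standing in for surveillance transcripts:

    from collections import Counter

    def weirdness(specialist_tokens, general_tokens, smoothing=1.0):
        """Ratio of a term's relative frequency in a specialist corpus to its
        relative frequency in a general corpus; high values suggest domain terms."""
        spec, gen = Counter(specialist_tokens), Counter(general_tokens)
        n_spec, n_gen = sum(spec.values()), sum(gen.values())
        return {term: (spec[term] / n_spec) /
                      ((gen[term] + smoothing) / (n_gen + smoothing))
                for term in spec}

    # Toy corpora standing in for surveillance transcripts and everyday English.
    specialist = "suspect exits vehicle suspect loiters near doorway camera pans".split()
    general = "the cat exits the room and the dog sleeps near the door".split()
    for term, score in sorted(weirdness(specialist, general).items(),
                              key=lambda kv: -kv[1])[:3]:
        print(f"{term:10s} {score:6.1f}")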
A new approach to thesaurus construction for specialist domains can be based on recent developments in corpus-based lexicography and terminology, in which the keywords [31,32] and the semantic relations of a taxonomy [33,34] of a specialist corpus can be derived from a randomly selected but systematically organised collection of texts. Demonstrated on a large collection of scene-of-crime images and collateral texts, the SOCIS system was able to extract the terms required for indexing the images from a free-text description of the contents of each image. The system is also capable of learning the correspondence between keywords and image features [35].

6 Programme and Methodology

Aims and Objectives

Within the context of the recently published Home Office Police Science and Technology Strategy [1], the strategic objective of the REVEAL project is to promote those key technologies which will enable automated extraction of evidence from CCTV archives, and to allow integration within existing HOLMES 2 crime management systems. These key issues can be summarised as the following scientific objectives:
· To develop models and algorithms for capturing the conceptual structure (i.e. the Visual Evidence Thesaurus) underpinning the work of surveillance experts annotating video streams, based on corpus-based lexicography and terminology, and making explicit the link between linguistic and meta-data content.
· To develop methods of automatically characterising CCTV sequences (including distinguishing between indoor and outdoor scenes, near and far fields, sparse and crowded scenes) and extracting the semantic landscape (including spatio-temporal activity, geometric calibration, and the occlusion structure of the scene).
· To develop crowd models which can capture the global spatio-temporal motion characteristics of multi-directional people flows, and to develop the image processing algorithms that populate such models while operating in high levels of visual occlusion.
· To develop a rich Surveillance Meta-data Model for describing extracted image content, including camera and scene descriptors, the spatio-temporal semantic landscape, and crowd dynamics, as well as the more classical moving-object descriptors.
· To develop methods of integrating the linguistic structure (the Visual Evidence Thesaurus) and the visual content (the Surveillance Meta-data Model) through co-learning, to enable the automatic annotation of video data streams and facilitate the retrieval of video evidence from high-level queries.

Methodology

Work Package 1 – Development of Visual Evidence Thesaurus (Surrey: 12 person months)

This work package will test the hypothesis that experts share a special language within the video surveillance domain, and will build a thesaurus and the accompanying conceptual structure from documents in video surveillance together with transcripts of interviews with experts as they comment on exemplar surveillance tapes. This work will build on existing corpus-based methods developed at Surrey. Protocols developed for verifying automatically extracted terminology and ontology will be tested on video surveillance experts. The key deliverable of the work package will be a conceptually organised thesaurus for specific subdomains of video surveillance, for example indoor surveillance or the monitoring of public places. The refined method should be adaptable to thesaurus generation for other surveillance specialisms. The thesaurus will be used as a source of keywords for indexing video sequences using standardised terminology, based on recent ISO standards. The index will be used for subsequent retrieval of sequences of interest within a collection of videotapes. A related deliverable will be to associate the generated terminology and ontology with the HOLMES 2 data model. This work is unique in that thesauri have typically been developed for indexing static images, and little or no work has been done on sequences.

Milestone M1: Automatic Ontology Extraction Methodology (Month 12)

Work Package 2 – Extracting Visual Semantics (Kingston: 24 person months)

The required annotation stream will be automatically derived from visual streams of camera-specific, scene-specific and object-specific meta-data, extracted using layered image processing algorithms. While classical detection/tracking algorithms can generate the object-specific meta-data associated with visual surveillance imagery, a number of significant challenges remain: geometric and chromatic self-calibration, generation of a semantic landscape capturing time-varying scene activity, and the ability to interpret crowded scenes.

Milestone M2: Object Detection and Tracking Package (Month 06)

Camera-specific meta-data includes geometric and chromatic self-calibration. Geometric calibration is vital if real-world information on velocity or inter-object distances is to be embedded in the meta-data. Existing work [7,8] on self-calibration, in which a camera infers its relationship to the world, will be extended to non-planar ground surfaces and to indoor scenes where the background is often heavily occluded. In such cases it is necessary to treat the moving objects as virtual calibration devices and perform regularisation or parametric fitting to recover the ground surface and establish a local coordinate system. To ensure maximum utility of available evidence, colour calibration will be required to support queries based on colour. Calibration techniques will extend Retinex-type approaches to take advantage of the range of daily and object colours available, to accommodate day/night cycles, street and indoor lighting, as well as inter-camera variability.
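As a point of reference for this chromatic calibration work, the sketch below implements the simple grey-world baseline from which Retinex-type methods depart: each colour channel is scaled so that the scene averages to grey. The synthetic frame and NumPy-only formulation are illustrative; the proposal's adaptive approach would go considerably further.

    import numpy as np

    def grey_world_gains(frame: np.ndarray) -> np.ndarray:
        """Per-channel gains under the grey-world assumption: scale each
        colour channel so its mean matches the overall mean intensity."""
        means = frame.reshape(-1, 3).mean(axis=0)       # mean of B, G, R
        return means.mean() / np.maximum(means, 1e-6)

    def correct(frame: np.ndarray) -> np.ndarray:
        gains = grey_world_gains(frame.astype(np.float32))
        return np.clip(frame * gains, 0, 255).astype(np.uint8)

    # A synthetic colour-cast frame; correction pulls the channel means together.
    frame = np.full((120, 160, 3), (180, 120, 90), dtype=np.uint8)
    print(correct(frame).reshape(-1, 3).mean(axis=0))   # roughly equal means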
Milestone M3: Adaptive Chromatic Calibration Algorithm (Month 12)
Milestone M4: Automatic Geometric Calibration Algorithm (Month 18)

Scene-specific meta-data is expressed as a semantic landscape – a temporally varying global description of the scene built directly from the detection and tracking data used to characterise the scene. Such a landscape would include the activity landscape and the occlusion landscape. This activity and occlusion information is vital to the object trackers, both to improve trajectory prediction and to resolve track ambiguity. We shall investigate methods of embedding this information within classical object tracking algorithms. Associating textual names supplied by human mark-up operators with the extracted modes of activity and occlusion regions will significantly enrich the annotation data stream. Offering a temporally varying probability distribution model of activity in the scene (e.g. using HMMs [9]), the activity landscape identifies the object-specific pathways, stopping places, and entrance/exit zones associated with specific structures in the environment – doorways, routes, meeting places, queuing areas.
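A minimal grid-based stand-in for such a temporally varying activity model is sketched below: per-cell, per-hour occupancy statistics accumulated from track observations. The class, grid resolution and hourly quantisation are illustrative; the HMM-based landscape proposed above is considerably richer.

    import numpy as np

    class ActivityLandscape:
        """Toy activity landscape: per-cell, per-hour occupancy counts built
        from track observations (an illustrative stand-in for the proposal's
        temporally varying HMM-based model)."""

        def __init__(self, width_cells=32, height_cells=24):
            self.counts = np.zeros((24, height_cells, width_cells))

        def observe(self, cell_x, cell_y, hour):
            self.counts[hour, cell_y, cell_x] += 1

        def activity(self, hour):
            """Probability distribution over grid cells for a given hour."""
            grid = self.counts[hour]
            total = grid.sum()
            return grid / total if total else grid

    landscape = ActivityLandscape()
    landscape.observe(5, 3, hour=9)      # tracks observed near a doorway at 9am
    landscape.observe(5, 3, hour=9)
    landscape.observe(20, 10, hour=9)
    print(landscape.activity(9)[3, 5])   # about 0.67: a busy cell at that hour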
Also automatically derived from the activity data, the occlusion landscape captures the short- and long-term occlusion regions associated with permanent and temporary scene furniture such as walls and parked cars.

Milestone M5: Automated Construction of Semantic Landscape (Month 24)

Work Package 3 – Development of Surveillance Meta-data Model (Sira: 6 person months)

The meta-data streams delivered under WP2 must be integrated to establish a visual knowledge ontology (or Surveillance Meta-data Model) that can extrapolate from low-level computer vision concepts to high-level representations of activity or behaviour. We will build on the Cognitive Vision ontology developed within the ECVision project (IST 35454) to structure this knowledge representation, from fundamental concepts of observable features (building on the work of the MPEG-7 community) through to objects, scenes and events. The resulting structure needs to incorporate subtle relationships, such as that between crowd and individual, to ensure that the ontology reflects the richness of the application context rather than ease of description. As the visual knowledge and linguistic structures are co-learnt, a methodology will be developed to enable the Visual Evidence Thesaurus to suggest and update the higher-level cognitive abstractions in the Surveillance Meta-data Model.

Milestone M6: Definition of the Surveillance Meta-data Model (Month 12)

Work Package 4 – Analysing Crowds (Kingston: 12 person months)

Since classical object tracking breaks down completely as the density of objects increases, a novel approach must be sought. Essentially a crowd is a hierarchical entity having its own fuzzy physical extent and an emergent collective behaviour, yet it is composed of individuals who join and leave. Novel representational forms that capture this dual aspect are to be developed, capable of being linked to the Visual Evidence Thesaurus. Rather than relying on traditional tracking, optical flow data can be combined with the previously described activity landscape to describe regions of varying density and of fast or slow flow, which can act as input prior probabilities for populating the crowd model. Having characterised the global characteristics, the motion patterns will be used to deploy banks of local object trackers that identify and follow individuals within the crowd. Such trajectory data can update the accuracy of the crowd model and introduce more specific quantitative information on the density and interactions of people. The trackers will adopt a head-and-shoulders active appearance model of the gradient magnitude structures associated with people in crowds, and will use optical flow information to establish temporal coherence.

Milestone M7: Crowd Analysis Software (Month 24)
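The optical-flow starting point for this crowd characterisation might look like the sketch below: dense Farnebäck flow averaged over blocks yields a coarse multi-directional flow field from which fast/slow regions can be read off per block. OpenCV is assumed; the file name, block size and flow parameters are illustrative.

    import cv2
    import numpy as np

    # Dense optical flow as a crowd-motion descriptor: per-block mean flow
    # vectors. File name, block size and flow parameters are illustrative.
    cap = cv2.VideoCapture("concourse.avi")
    ok, prev = cap.read()
    prev_grey = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    block = 16

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_grey, grey, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = grey.shape
        # Average the flow over blocks: a coarse multi-directional flow field.
        field = flow[:h - h % block, :w - w % block]
        field = field.reshape(h // block, block,
                              w // block, block, 2).mean(axis=(1, 3))
        speed = np.linalg.norm(field, axis=2)   # fast/slow regions per block
        prev_grey = grey
    cap.release()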
Work Package 5 – Text-Meta-data Mining and Multi-modal Data Fusion (Surrey: 12 person months)

Multi-modal data fusion is the integration of disparate data from a variety of sources into a rich, homogeneous and formal framework. Here, the fusion of the Visual Evidence Thesaurus and the Surveillance Meta-data Model has two outcomes: video summarisation – the automatic annotation of video evidence from the visual meta-data generated in WP2 and WP3 – and the retrieval of key sequences from a video archive using high-level textual queries. A variety of existing multi-net neural systems and adaptive information extraction systems will be evaluated to perform the fusion. Information extraction methods for text mining, developed in the SOCIS project, will be tested on a set of exemplar tapes in which experts describe events and objects appearing on a tape. This method has been successful in fusing visual and textual descriptions of still images in forensic science. The novelty here is that the two sets of features will be extracted automatically and the learning will be automated. If successful, the method can be adapted to other domains where video data is accompanied by text or sound. The deliverable of this work package will contribute to the emerging discipline of adaptive information extraction.

Milestone M8: Multi-modal Data Fusion Component (Month 24)

Work Package 6 – Video Summarisation (Surrey: 12 person months; Kingston: 12 person months; Sira: 6 person months)

Using the digital video recording platform supplied by Overview Ltd, an automatic annotation prototype will be developed to validate the effectiveness of the coupling of the Visual Evidence Thesaurus and the Surveillance Meta-data Model. Larger numbers of exemplar video sequences will be indexed on the different modalities (WP1, WP2 and WP4), and the system will learn the correspondence between the modalities using unsupervised neural networks. The generation of textual descriptors will require commentary from experts in video surveillance, relevance feedback from potential end-users, and collections of text collateral to specific video sequences.
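In miniature, this co-learning step might resemble the following: a single self-organising map trained on concatenated text and visual feature vectors, so that a text-only query can later be mapped to a map unit and the associated visual content read off. This is a toy stand-in for the multi-net architectures of [35]; the dimensions and data are invented.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy joint vectors: the first 4 dimensions stand for thesaurus-term
    # indicators, the last 3 for visual meta-data features (all invented).
    data = rng.random((200, 7))

    grid_h, grid_w, dim = 6, 6, data.shape[1]
    weights = rng.random((grid_h, grid_w, dim))   # small self-organising map

    for t, x in enumerate(data):
        lr = 0.5 * (1 - t / len(data))            # decaying learning rate
        dist = np.linalg.norm(weights - x, axis=2)
        by, bx = np.unravel_index(dist.argmin(), dist.shape)    # winning unit
        ys, xs = np.ogrid[:grid_h, :grid_w]
        hood = np.exp(-((ys - by) ** 2 + (xs - bx) ** 2) / 2.0)  # neighbourhood
        weights += lr * hood[..., None] * (x - weights)

    # Query time: match a text-only vector on the text dimensions alone and
    # read the winning unit's visual weights off as the associated content.
    query = data[0, :4]
    dist = np.linalg.norm(weights[..., :4] - query, axis=2)
    by, bx = np.unravel_index(dist.argmin(), dist.shape)
    print(weights[by, bx, 4:])                    # associated visual features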
For the management and processing of large volumes of video data, emerging Grid middleware will be evaluated for the secure provision of data management and processing. Extensive user testing is involved in this work package, not only to test the system's efficacy but also, informed by WP3, to enrich the joint index.

Milestone M9: Prototype Video Summarisation System (Month 36)
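To make the intended retrieval outcome concrete, the toy sketch below shows thesaurus-driven query expansion over annotation records: a high-level textual query is expanded with related terms and matched against indexed sequences. The thesaurus entries and annotations are invented for illustration.

    # Toy thesaurus-driven query expansion over annotation records.
    thesaurus = {
        "vehicle": {"car", "van", "lorry"},
        "loiter": {"wait", "linger"},
    }
    annotations = [
        {"tape": "T041", "frame": 1520, "terms": {"van", "stops", "doorway"}},
        {"tape": "T041", "frame": 9100, "terms": {"person", "linger", "entrance"}},
        {"tape": "T102", "frame": 300,  "terms": {"crowd", "platform"}},
    ]

    def expand(query_terms):
        """Augment the query with related thesaurus terms (query expansion)."""
        expanded = set(query_terms)
        for term in query_terms:
            expanded |= thesaurus.get(term, set())
        return expanded

    def retrieve(query):
        terms = expand(set(query.lower().split()))
        return [a for a in annotations if terms & a["terms"]]

    print(retrieve("vehicle loiter"))   # matches both T041 records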
7 Relevance to Beneficiaries

The principal beneficiaries of the results will be users of CCTV archives – especially those involved in the administration of justice, i.e. the UK police forces represented by the partners PITO and PSDB. Given the proportion of the effort of any investigation spent reviewing video data, advancing annotation technologies will free up officers, allowing crime investigation teams to be more effective at solving crimes. As importantly, the extraction and integration of evidence into crime management systems will facilitate the cross-referencing of evidence from many CCTV sources, and access to this evidence will be eased for a range of interested parties (detectives, lawyers, judges and juries). Our approach, based on knowledge extraction and linking different information modalities, will be of benefit to cognitive scientists and linguists interested in how video events are understood, interpreted and described by humans. Other relevant academic areas include multimedia databases and digital libraries. Overview Ltd, who are specifically interested in annotation processes which enable accurate retrieval, will contribute their meta-data-based digital video recording platform. The second contribution will be in crowd analysis, benefiting video surveillance technologies and associated markets in people counting, marketing, and the design of safe public spaces, as well as the academic image processing community in general. Both Ipsotek Ltd and Crowd Dynamics Ltd are providing expertise and software to support this activity, together with access to appropriate video datasets.

8 Dissemination and Exploitation

Research results will be disseminated in the proceedings of international conferences and journals in the fields of computer vision, artificial intelligence and information systems – specifically BMVC, ICCV, IJCNN and SCI. Project progress will be reported on a specially designed website hosted by Kingston University, and a functional prototype of the final system will be demonstrated on this website.

9 Justification of Resources

Personnel: Over this 3-year grant, Kingston University will appoint two Research Assistants contributing 36 person months to the computer vision activity (WP2, WP4) and twelve person months to the common Video Summarisation activity (WP6). The University of Surrey will appoint one RA for three years to address the textual data mining and data fusion activities (WP1, WP5) and to integrate and validate these within the Video Summarisation work package (WP6). Finally, Sira Ltd will contribute twelve person months over the three years to the development of the Surveillance Meta-data Model and its refinement and validation within the Video Summarisation work package (WP6). Kingston is also requesting funding for technical support for the integration activities associated with the Video Summarisation prototype supplied by Overview Ltd to Kingston University. In addition, a small level of funding is requested to support the administrative and clerical activity of coordinating a project with seven partners.

Travel: Funding is requested for travel to the quarterly Steering Committee meetings and to allow the collaborating partners to meet to discuss the ongoing work. In addition, conference costs are requested to disseminate project results.
Consumables: Each of the Research Assistants will be furnished with a workstation appropriately specified to support the video acquisition, high-volume storage, and high computation demands expected of the development work. Data protection issues also require the purchase of suitable secure physical storage for video media.

10 References

[1] Police Science and Technology Strategy 2003-2008, Home Office Policy Unit, 16 January (2003)
[2] Y. Fu, A. Ekin, A.M. Tekalp and R. Mehrotra, "Temporal Segmentation of Video Objects for Hierarchical Object-Based Motion Description", IEEE Transactions on Image Processing, Vol. 11(2), pp 135-145 (2002)
[3] C. Tenny and J. Pustejovsky (eds.), Events as Grammatical Objects, CSLI Publications, California (2000)
[4] E. Megalou and T. Hadzilacos, "Semantic Abstractions in the Multimedia Domain", IEEE Transactions on Knowledge and Data Engineering, Vol. 15(1), pp 136-160 (2003)
[5] R. Collins, A. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, N. Enomoto and O. Hasegawa, "A System for Video Surveillance and Monitoring", VSAM Final Report, CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University, May (2000)
[6] J. Black and T. Ellis, "Multi-Camera Image Tracking", Proceedings of the 2nd IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, Kauai, Hawaii, 9 December (2001)
[7] G.A. Jones, J. Renno and P. Remagnino, "Auto-Calibration in Multiple-Camera Surveillance Environments", IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, Copenhagen, June (2002)
[8] T.J. Ellis, D. Makris and J.K. Black, "Learning a Multi-Camera Topology", Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Nice, 11-12 October (2003)
[9] P. Remagnino and G.A. Jones, "Spatial and Probabilistic Modelling of Pedestrian Behaviour", British Machine Vision Conference, pp 685-694 (2001)
[10] F. Cupillard, F. Brémond and M. Thonnat, "Behaviour Recognition for Individuals, Groups of People and Crowds", IEE Symposium on Intelligent Distributed Surveillance Systems, London, 26 February (2003)
[11] S. Raupp Musse and D. Thalmann, "Hierarchical Model for Real Time Simulation of Virtual Human Crowds", IEEE Transactions on Visualization and Computer Graphics, Vol. 7(2), pp 152-164 (2001)
[12] L. Heigeas, A. Luciani, J. Thollot and N. Castagne, "A Physically-Based Particle Model of Emergent Crowd Behaviors", International Conference on Computer Graphics and Vision, Moscow, 5-10 September (2003)
[13] D.B. Yang, H.H. González-Baños and L.J. Guibas, "Counting People in Crowds with a Real-Time Network of Simple Image Sensors", International Conference on Computer Vision, Nice, pp 122-129 (2003)
[14] B. Maurin, O. Masoud and N.P. Papanikolopoulos, "Monitoring Crowded Traffic Scenes", IEEE 5th International Conference on Intelligent Transportation Systems, Singapore, 3-6 September (2002)
[15] A. Ali and J. Aggarwal, "Segmentation and Recognition of Continuous Human Activity", Proceedings of the IEEE Workshop on Detection and Recognition of Events in Video, pp 28-35, 8 July (2001)
[16] H. Buxton, "Generative Models for Learning and Understanding Scene Activity", Technical Report DIKU-TR-2002/01 (ISSN 0107-8283), University of Copenhagen, pp 71-81, June (2002)
[17] S. Hongeng, F. Brémond and R. Nevatia, "Representation and Optimal Recognition of Human Activities", IEEE Conference on Computer Vision and Pattern Recognition, South Carolina, USA (2000)
[18] N.M. Oliver, B. Rosario and A.P. Pentland, "A Bayesian Computer Vision System for Modeling Human Interactions", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22(8), pp 831-843 (2000)
[19] M. Brand, N. Oliver and A. Pentland, "Coupled Hidden Markov Models for Complex Action Recognition", IEEE Conference on Computer Vision and Pattern Recognition, Puerto Rico, pp 994-999 (1996)
[20] N. Rota, "Contribution à la reconnaissance de comportements humains à partir de séquences vidéos", PhD thesis, INRIA/Université de Nice Sophia Antipolis, October (2001)
[21] N.T. Siebel and S.J. Maybank, "Fusion of Multiple Tracking Algorithms for Robust People Tracking", Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, May (2002)
[22] F. Cupillard, F. Brémond and M. Thonnat, "Group Behavior Recognition with Multiple Cameras", IEEE Workshop on Applications of Computer Vision, Orlando, December (2002)
[23] V.-T. Vu, F. Brémond and M. Thonnat, "Temporal Constraints for Video Interpretation", ECAI, Lyon (2002)
Grimson, "Video Surveillance of Interactions", IEEE Conference on Computer Vision and Pattern Recognition Workshop on Visual Surveillance, Fort Collins, November (1999) [25] R.C. Veltkamp and M. Tanase, “Content­Based Image Retrieval Systems: A Survey. (Technical Report UU­CS­ 2000­34). Institute of Information and Computing Sciences, Univ. of Utrecht, The Netherlands (2000) [26] McG.D Squire, W. Muller, H. Muller, and T. Pun, T, “Content­Based Query of Image databases: Inspirations from Text Retrieval”, Pattern Recognition Letters 21. Elsevier, pp 1193­1198 (2000) [27] Ogle, V.E., Stonebraker, M.: “Chabot: retrieval from a relational database of images”, IEEE Computer Maga­ zine, Vol. 28(9), pp 40­48 (1995) [28] Paek, S., Sable C.L., Hatzivassiloglou, V., Jaimes, A., Schiffman, B.H., Chang, S.F., McKeown, K.R.: Integration of Visual and Text­Based Approaches for the Content Labeling and Classification of Photographs. ACM SIGIR'99 Workshop on Multimedia Indexing and Retrieval, Berkeley, CA (1999) [29] Jing, Y., Croft, W.B.: “An Association Thesaurus for Information Retrieval”. In: Bretano, F., Seitz, F.: (eds.): Proceedings of the RIAO’94 Conference. CIS­CASSIS, Paris, France (1994) 146­160 [30] Grefenstette, G.: Explorations in Automatic Thesaurus Discovery. Kluwer Academic, Boston, USA (1994) [31] K. Ahmad, M. Tariq, B. Vrusias and C. Handy, "Corpus­Based Thesaurus Construction for Image Retrieval in Specialist Domains", 25th European Conf. on Information Retrieval Research, Pisa, April 14­16, pp 502­510 (2003) [32] Leech, G.: The State of the Art in Corpus Linguistics. In: Aijmer, K., Altenberg, B. (eds.): English Corpus Lin­ guistics: In honour of Jan Svartvik. Longman, London (1991) [33] Ahmad, K., Rogers, M.A.: Corpus­Based Terminology Extraction. In: Budin, G., Wright S.A. (eds.): Handbook of Terminology Management, Vol.2. John Benjamins Publishers, Amsterdam (2000) 725­760. [34] Hearst, M.: Automatic Acquisition of Hyponyms from Large Text Corpora. Proceedings of the Fourteenth Interna­ tional Conference on Computational Linguistics. Nantes, France. (1992)
[35] K. Ahmad, M. Casey and B. Vrusias, "Combining Multiple Modes of Information using Unsupervised Neural Classifiers", in T. Windeatt and F. Roli (eds.), International Workshop on Multiple Classifier Systems, LNCS 2709, Springer-Verlag, Heidelberg, pp 236-245 (2003)
APPENDIX 1: DIAGRAMMATIC WORKPLAN

[Gantt chart: the six work packages plotted against quarters Q1-Q12 of the three project years, with milestones M1-M9 marked.]

WP1 Development of Visual Evidence Thesaurus
WP2 Extracting Visual Semantics
WP3 Development of Surveillance Meta-data Model
WP4 Analysing Crowds
WP5 Text-Meta-data Mining and Multi-modal Data Fusion
WP6 Video Summarisation

Milestones
M1 Automatic Ontology Extraction Methodology (Month 12)
M2 Object Detection and Tracking Package (Month 06)
M3 Adaptive Chromatic Calibration Algorithm (Month 12)
M4 Automatic Geometric Calibration Algorithm (Month 18)
M5 Automated Construction of Semantic Landscape (Month 24)
M6 Definition of the Surveillance Meta-data Model (Month 12)
M7 Crowd Analysis Software (Month 24)
M8 Multi-modal Data Fusion Component (Month 24)
M9 Prototype Video Summarisation System (Month 36)

APPENDIX 2: LETTERS IN SUPPORT OF REVEAL PROPOSAL