Recovering Evidence from Video by fusing Video Evidence Thesaurus and Video MetaData (REVEAL)

An EPSRC Proposal: The Case for Support

CONTENTS

PART A: TRACK RECORD
1 Digital Imaging Research Centre, Kingston University
  Dr Graeme A. Jones, Digital Imaging Research Centre
  Professor Tim Ellis, Digital Imaging Research Centre
2 Centre for Knowledge Management, University of Surrey
  Professor Khurshid Ahmad, Centre for Knowledge Management
  Bogdan Vrusias, Centre for Knowledge Management
3 Sira Ltd
  Dr John Gilby, Innovation
4 Police Information Technology Organisation (PITO)
5 Police Scientific Development Branch (PSDB)

PART B: DESCRIPTION OF THE PROPOSED RESEARCH
5 Background
  Strategic Importance for the UK
  Previous Research
6 Programme and Methodology
  Aims and Objectives
  Methodology
7 Relevance to Beneficiaries
8 Dissemination and Exploitation
9 Justification of Resources
10 References
APPENDIX 1: DIAGRAMMATIC WORKPLAN
APPENDIX 2: LETTERS IN SUPPORT OF REVEAL PROPOSAL

PART A: TRACK RECORD

1 Digital Imaging Research Centre, Kingston University
The Digital Imaging Research Centre (DIRC) at Kingston University has developed one of the largest visual surveillance groups nationally. Research activity and industrial consultancy have focused on the development of robust, plug-and-play surveillance components, integrated wide-area multi-camera systems, and behavioural analysis for public transport applications. DIRC has been involved in many EPSRC and EU visual surveillance projects including Assessment of Image Processing Techniques as a Means of Improving Personal Security in Public Transport (GR/M29436/02); Traffic Simulation and Optimisation on an Intelligent Video Surveillance Network (GR/N17706/01); Annotated Digital Video for Surveillance and Optimised Retrieval (ADVISOR, IST-1999-11287); and Pro-Active Integrated Systems for Security Management by Technological Institutional and Communication Assistance (PRISMATICA, GRD1-2000-10601). More recently Kingston (along with Sira Ltd and the Universities of Kent and Reading) is leading the EPSRC ViTAB Network (Video-based Threat Assessment and Biometrics, GR/S64301/01), which aims to encourage those video-interpretation technologies most likely to have the greatest impact on future police identification, authentication and threat assessment capabilities.

Dr Graeme A. Jones, Digital Imaging Research Centre
Currently the Director of DIRC, Dr Graeme A. Jones has over 15 years' experience in image and video analysis and has (co-)authored over 100 technical papers related to computer vision. Dr Jones chaired the British Machine Vision Association workshop on Visual Surveillance and was co-chair of the IAPR Workshop on Advanced Video-based Surveillance Systems in 2001. He is currently co-investigator on an EPSRC grant (GR/N17706/01) on simulation and optimisation of video surveillance networks and on the INMOVE project (IST-2000-37422) on intelligent mobile video environments. He is principal investigator on the EPSRC ViTAB Network (see above).

Professor Tim Ellis, Digital Imaging Research Centre
Tim Ellis was recently appointed Professor and Head of School in Computing and Information Systems at Kingston University, and was prior to this a Reader in Machine Vision at City University.
Starting with a contract in 1987 to investigate the feasibility of detecting and identifying visual events in short image sequences (funded by the Scientific Research and Development Branch of the Home Office), Professor Ellis has over 15 years' expertise in visual surveillance and has published over 80 technical papers. His recent EPSRC IMCASM project (Intelligent Multi-Camera Surveillance and Monitoring, GR/M58030) focused on the problem of integrating the tracking of pedestrians and vehicles across a network of video cameras. He is currently a member of the Executive Committee of the British Machine Vision Association.

2 Centre for Knowledge Management, University of Surrey
The Centre, located within a 6*-rated UoA (Electronics), has 7 academics, 3 RAs and 25 PhD students, and is amongst the most active centres in Europe in terminology, knowledge and ontology acquisition. The Centre explores the link between different modalities of communication: text, image, and numbers. A model for integrating and co-learning visual features and textual descriptions of collections of (crime scene) images was developed in the EPSRC Scene of Crime Information System Project (GR/M89041/01). This fusion of text and image modalities led to a high-performance image retrieval system. Subsequently, the model was adapted in the EPSRC Television in Words Project (GR/R67194/01) for automatically creating audio descriptions of moving images by using the narrative structures of films. The fusion of textual information (financial news) and numerical information (share-price movement) has helped develop models of how market sentiment affects the values of financial instruments and led to a system that generates buy/sell signals (IST GIDA Project 2000-31123, IST Project ACE 22271). The simultaneous processing of large volumes of news and time-series data requires the investigation of new architectures, especially the newly emergent Grid-based systems: a goal of the Centre's new ESRC e-Science demonstrator project FINGRID.

Professor Khurshid Ahmad, Centre for Knowledge Management
Khurshid Ahmad is Professor of Artificial Intelligence at the University of Surrey and Head of Department. He has published over 125 articles, and two books, in the areas of terminology and ontology extraction, knowledge discovery, and multi-net neural systems. He is a member of the EPSRC College of Peer Reviewers. He is the Principal Investigator of ESRC FINGRID, an e-Science project for financial analysis, and of the EU-IST GIDA project investigating the role of automatic text analysis in financial trading. He was principal investigator on the Scene of Crime Information Systems Project (1999-2002) and has led ten EU IST and ESPRIT projects since 1989, three EPSRC projects and one Teaching Company Scheme. He has successfully supervised 15 PhD students.

Bogdan Vrusias, Centre for Knowledge Management
Bogdan Vrusias is Lecturer in Neural Computing. His contributions lie in neural network research, data mining techniques, and expert systems. He was the RA on the Scene of Crime Information System Project between 2000 and 2002, where he specified and developed a multi-net classifier. Earlier, he developed the Profiler's Workbench, for commercially profiling opinion poll data using statistical and neural classifiers, under the auspices of the EPSRC/DTI Teaching Company Scheme TCS1940 (1998-2000).
3 Sira Ltd
Sira Ltd is an independent (non-profit-distributing) research organisation which carries out collaborative research in cooperation with industry and academia. Activities span fundamental research through to technology commercialisation in the areas of sensing and imaging. It also plays a key role in postgraduate training and has been instrumental in the development of a number of methodologies to increase the take-up of science and technology into industry and society. Core to these activities is its expertise in the provision of holistic systems incorporating physical measurements with advanced data processing. Sira is the lead partner in the Intersect (Intelligent Sensing) and Imaging Faraday Partnerships and participates in research projects supported by the EPSRC, NERC, European Space Agency, European Commission, National Measurement System, Carbon Trust and BNSC. It manages two FP5 projects in intelligent instrumentation and is a core member of GOSPEL (IST-507610), a Network of Excellence on artificial olfaction.

Dr John Gilby, Innovation
Having taken an MA in Engineering and Mathematics from the University of Cambridge, Dr John H Gilby was awarded a PhD from the University of Surrey on the adaptive control of robot arms. He is Sira's Chief Scientist with 20 years' experience of measurement, instrumentation and sensing systems, and is currently responsible for directing Sira's research in enabling technologies. As coordinator of the Sira/UCL Postgraduate Training Partnership he was responsible for the overall supervision of the eighty PhD students involved in the scheme, personally supervising 16 students in areas such as machine learning, intelligent systems and systems engineering. He is a member of the UKIVA Committee and is co-opted to the Executive Committee of the BMVA. Dr Gilby is a member of the ECVision Network in Cognitive Vision and an expert advisor for the FP6 IST programme. A member of the EPSRC Peer Review College, he participates in the Software Technologies panel, chairs the People and Interactivity panel and participates in reviews of EPSRC-funded machine vision projects. He is currently expert advisor to the Police Scientific Development Branch on the VITAL project to develop a testbed for video motion detectors.

4 Police Information Technology Organisation (PITO)
PITO is a non-departmental public body that has been charged under the Police Act of 1998 to support and manage the delivery of information and communications technologies to the police services of England and Wales. A major part of this effort is focussed on the provision of human identification capabilities to meet immediate and future requirements by ensuring that the objectives of preventing and detecting crime and reducing the fear of crime are met effectively. Amongst those requirements, both current and future, are those related to the delivery of video-based surveillance and authentication. The use of CCTV systems and related technologies within diverse areas of policing is well established. However, in supporting this proposal PITO is fulfilling its role to assist and support further development and research into improving current capabilities, ensuring their efficiency and effectiveness in meeting the strategic and technological needs of the police service now and in the future.
5 Police Scientific Development Branch (PSDB)
Police Scientific Development Branch (PSDB) is a core part of the Home Office, providing technical and scientific advice to the UK Government and police service on a wide range of policing and security issues. This includes advice on the implementation of CCTV systems for crime reduction and security purposes. Guidance is also provided to police on the effective use of systems and techniques for extracting information from evidential video recordings. The development of new equipment and techniques is encouraged through collaborative working with industry and academia.

PART B: DESCRIPTION OF THE PROPOSED RESEARCH

5 Background

Strategic Importance for the UK
The Home Office has recently outlined a five-year strategic framework [1] whose goal is "to ensure that the police service is equipped to exploit opportunities in science and technology to deliver effective policing". A key component of this strategy has been the early identification of technology supporting future police service capabilities. Technology requirements for the interpretation of CCTV video streams include "improving the effectiveness of CCTV at detecting crime and supporting prosecutions", "effective use of intelligence-gathering technology", "maximising the forensic value of evidence", and "the effective management of investigations especially timely collection and management of evidence".

During any major crime investigation, establishing the identity of vehicles and individuals involved in crime incidents necessitates the extremely time-consuming process of manually annotating all available CCTV tapes and digital archives. Any technology for recovering intelligence automatically from video footage must therefore be a priority for the development of the evidence-gathering capability of our Police Forces. This project aims to advance research in the recovery of evidence from video footage. There are two key crime-oriented applications that will directly benefit from this research. First, video summarisation of CCTV archives, i.e. the automatic generation of a gallery of mugshots and number plates for all moving objects. Such a gallery represents the most effective method of enlisting the knowledge of local police officers and the general public. Second, automatic annotation of video footage, ensuring that all evidence is capable of automatic entry into HOLMES 2 – the investigation management system used by police forces to collect, manage and analyse intelligence data.

Two novel areas of investigation are proposed. First, methods for representing and analysing crowds are to be developed to process typically crowded scenes. Second, multimodal data fusion couples the linguistic structure of current police annotation practice with the metadata structure of the video interpretation process to generate a rich homogeneous data representation that can drive the annotation process [35].

Previous Research

Video Summarisation
The proposed on-demand extraction of video evidence using keywords, in effect, requires the extraction of a summary of the video. Video summarisation involves the intellectual challenge of describing the contents of a sequence and the technical challenge of extracting information from, and managing, large data streams.
Summarisation techniques have been pioneered in extracting key events in highly structured situations like soccer matches to generate highlights [2], where high-level abstractions involve the manual labelling of objects and events. The description of such events should have well-grounded semantics to distinguish bounded events from unbounded events, and should take advantage of what is known about the description of events – whether specialist structures are required to represent events which are pre-linguistic, or whether extant linguistic structures will suffice [3]. A number of techniques have been developed for generating descriptions of video in terms of conceptual and presentational qualities, and markup languages have been developed to use such multimedia abstractions [4]. A combination of techniques for video summarisation embedded in a well-grounded abstraction will help in creating a system for systematically summarising video archives. The augmentation of visual features with a semantically richer linguistic description of events will help address the intellectual challenge.

Robust Detection of Objects in Video Sequences
Threat assessment technologies rely on the detection and tracking of objects within the scene. In the USA, the ambitious Visual Surveillance and Monitoring project [5] integrated twelve research laboratories in the implementation of people and vehicle tracking using calibrated cameras. In Europe much of this work is reported through the PETS series of workshops. Recent work in the UK has focussed on multi-camera tracking [6] and auto-calibration [7,8]. Indeed, learning is likely to be a key feature of future plug-and-play surveillance components, e.g. observing a large number of events to construct calibration, appearance and activity models. Such activity scene models could be used to modify the operation of a classical object tracker, or combined with systems for dynamic scene interpretation including behavioural analysis [9]. Despite these successes, security cameras are typically placed either close to the public (near field) or in public spaces with high levels of activity (crowds). The resultant high degree of occlusion undermines classical detection and tracking techniques, which suit far-field scenes exhibiting relatively sparse activity – although these approaches have been extended to groups [10]. Predicting crowd dynamics to underpin the design of safe public spaces has attracted substantial interest over the past few years. In computer graphics, crowds are modelled by agents, cellular automata or particle systems with behaviours that obey both physical and psychological rules [11,12] (a minimal particle sketch is given below). Despite this, the visual analysis of crowds is still a relatively unexplored domain [13,14].
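To make the flavour of these physically based crowd models concrete, the following is a minimal particle-system sketch in which each agent balances attraction toward a shared exit against repulsion from near neighbours. It is purely illustrative: the constants, scene geometry and single shared goal are invented, and the cited models [11,12] additionally encode psychological rules omitted here.

```python
"""Toy particle-system crowd sketch: goal attraction plus local repulsion.
All parameters are invented for illustration."""
import numpy as np

rng = np.random.default_rng(1)
N = 50
positions = rng.random((N, 2)) * 10.0      # agents in a 10 x 10 m space
goal = np.array([10.0, 5.0])               # shared exit point (assumed)

DT, GOAL_GAIN, REPULSE_GAIN, RADIUS = 0.1, 1.0, 0.5, 0.7
for step in range(200):
    to_goal = goal - positions
    # Unit-speed pull toward the exit (small epsilon avoids divide-by-zero).
    desired = GOAL_GAIN * to_goal / (np.linalg.norm(to_goal, axis=1,
                                                    keepdims=True) + 1e-9)
    # Pairwise repulsion from neighbours closer than RADIUS.
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=2) + 1e-9
    push = (dist < RADIUS)                 # self-terms contribute zero force
    repulsion = (diff / dist[..., None] * push[..., None]).sum(axis=1) \
        * REPULSE_GAIN
    positions += DT * (desired + repulsion)

print("mean distance to exit after 20 s:",
      np.linalg.norm(goal - positions, axis=1).mean())
```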
Dynamic scene interpretation and behavioural analysis
Dynamic scene interpretation relies on object tracking systems to perform behavioural analysis or activity recognition [15]. A state-of-the-art survey of approaches to learning and understanding scene activity may be found in Buxton [16], detailing probabilistic, neural and symbolic network approaches to recognising temporal scenarios. In probabilistic/neural network approaches, nodes usually correspond to scenarios that are recognised at a given instant with a computed probability [17,18], e.g. Bayesian networks or classifiers, and hidden Markov models (HMMs). Events of interest within a typical scene are usually salient because of the interaction of two or more objects – whether people or vehicles. Brand et al. introduced the Coupled HMM (CHMM) formulation for modelling interactions between two or more pedestrians [19]. Symbolic network approaches to the recognition of scenarios may use a declarative representation of scenarios defined as a set of spatio-temporal and logical constraints. Traditional constraint resolution [20] or temporal constraint propagation [21] techniques are employed to recognise scenarios [22]. Vu et al. propose a novel approach to temporal scenario recognition which employs a declarative model to represent scenarios and a logic-based approach to recognise predefined scenario models [23]. Elsewhere a probabilistic context-free parser was used to build action representations of increased complexity [24].

Automatic Construction of Ontology and Thesauri for Indexing Images
During the last three decades significant effort has been spent on image retrieval systems that focus exclusively on vision-specific features such as colour distribution, shape and texture – content-based image retrieval (CBIR) systems [25]. However, CBIR systems have an implicit limitation in that visual properties are not sufficient to identify arbitrary classes of objects [26]. The alternative of associating keywords with an image sequence is not only time-consuming but the choice of keywords between operators is likely to be highly variable [27]. This has led to so-called multimodal systems that use linguistic features extracted from textual captions or descriptions together with visual features for storing and retrieving images in databases [28]. Query expansion – the augmentation of search terms – derives additional search terms from a thesaurus or from related documents. However, when images are of a specialist nature, there is a need to create domain-specific thesauri. These could be built manually by expert lexicographers but are typically time-consuming to build, highly subjective and error-prone, and may well exhibit inadequate domain coverage. One solution is the automatic generation of thesauri for specialised domains from representative collateral texts [29,30]. A new approach to thesaurus construction for specialist domains can be based on recent developments in corpus-based lexicography and terminology, in which the keywords [31,32] and the semantic relations of a taxonomy [33,34] of a specialist corpus can be derived from a randomly but systematically organised collection of texts. Demonstrated on a large collection of scene-of-crime images and collateral texts, the SOCIS system was able to extract the terms required for indexing the images from a free-text description of the contents of the image. The system is also capable of learning the correspondence between keywords and image features [35].
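As a concrete illustration of the corpus-comparison idea behind [31,33], the sketch below scores words by how much more frequent they are in a specialist corpus than in general language, a ratio in the spirit of the measures cited above. The tokeniser, smoothing and threshold are simplifying assumptions, and the two miniature corpora are invented.

```python
"""Minimal sketch of corpus-comparison keyword extraction for a specialist
thesaurus. Tokenisation, smoothing and threshold are illustrative choices."""
from collections import Counter
import re

def tokens(text):
    # Crude tokeniser; a real system would use a proper NLP pipeline.
    return re.findall(r"[a-z]+", text.lower())

def keywords(specialist_text, reference_text, threshold=2.0):
    spec = Counter(tokens(specialist_text))
    ref = Counter(tokens(reference_text))
    n_spec, n_ref = sum(spec.values()), sum(ref.values())
    scored = {}
    for word, f_spec in spec.items():
        # Relative frequency in the specialist corpus divided by relative
        # frequency in general language; +1 smooths unseen reference words.
        ratio = (f_spec / n_spec) / ((ref.get(word, 0) + 1) / n_ref)
        if ratio >= threshold:
            scored[word] = ratio
    return sorted(scored.items(), key=lambda kv: -kv[1])

surveillance_corpus = ("the suspect entered the underpass and camera two "
                       "lost track of the suspect near the exit barrier")
general_corpus = ("the cat sat on the mat and the dog looked at the cat "
                  "while the children played near the house")
for word, score in keywords(surveillance_corpus, general_corpus)[:8]:
    print(f"{word:12s} {score:6.1f}")
```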
6 Programme and Methodology

Aims and Objectives
Within the context of the recently published Home Office Police Science and Technology Strategy [1], the strategic objective of the REVEAL project is to promote those key technologies which will enable the automated extraction of evidence from CCTV archives, and to allow integration within existing HOLMES 2 crime management systems. These key issues can be summarised as the following scientific objectives:
· To develop models and algorithms for capturing the conceptual structure (i.e. the Visual Evidence Thesaurus) underpinning the work of surveillance experts annotating video streams, based on corpus-based lexicography and terminology, and making explicit the link between linguistic and metadata content.
· To develop methods of automatically characterising CCTV sequences (including distinguishing between indoor and outdoor scenes, near and far fields, sparse and crowded scenes) and extracting the semantic landscape (including spatio-temporal activity, geometric calibration, and the occlusion structure of the scene).
· To develop crowd models which can capture the global spatio-temporal motion characteristics of multi-directional people flows, and to develop the image processing algorithms that populate such models while operating in high levels of visual occlusion.
· To develop a rich Surveillance Metadata Model for describing extracted image content including camera and scene descriptors, the spatio-temporal semantic landscape, crowd dynamics, as well as the more classical moving object descriptors.
· To develop methods of integrating the linguistic structure (the Visual Evidence Thesaurus) and the visual content (the Surveillance Metadata Model) through co-learning to enable the automatic annotation of video data streams, and facilitate the retrieval of video evidence from high-level queries.

Methodology

Work Package 1 – Development of Visual Evidence Thesaurus (Surrey: 12 person months)
This work package will test the hypothesis that experts share a special language within the video surveillance domain, and will build a thesaurus and the accompanying conceptual structure from documents in video surveillance together with transcripts of interviews with experts as they comment on exemplar surveillance tapes. This work will build on existing corpus-based methods developed at Surrey. Protocols developed for verifying automatically extracted terminology and ontology will be tested on video surveillance experts. The key deliverable of the work package will be a conceptually organised thesaurus for specific subdomains of video surveillance, for example indoor surveillance or the monitoring of public places. The refined method should be adaptable to thesaurus generation for other surveillance specialisms. The thesaurus will be used as a source of keywords for indexing video sequences using standardised terminology, based on recent ISO standards. The index will be used for subsequent retrieval of sequences of interest within a collection of videotapes. A related deliverable will be to associate the generated terminology and ontology with the HOLMES 2 data model. This work is unique in that thesauri are typically developed for indexing static images; little or no work has been done in the area of sequences.

Milestone M1: Automatic Ontology Extraction Methodology (Month 12)

Work Package 2 – Extracting Visual Semantics (Kingston: 24 person months)
The required annotation stream will be automatically derived from visual-based streams of camera-specific, scene-specific and object-specific metadata that are extracted using layered image processing algorithms. While classical detection/tracking algorithms can generate the object-specific metadata associated with visual surveillance imagery (a minimal sketch of this classical detection stage is given below), a number of significant challenges remain: geometric and chromatic self-calibration, generation of a semantic landscape capturing time-varying scene activity, and the ability to interpret crowded scenes.

Milestone M2: Object Detection and Tracking Package (Month 06)
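For illustration, the sketch below shows the kind of classical detection stage assumed by the milestone above, using OpenCV's MOG2 background subtractor as a stand-in for the project's own detector. The input file, morphology and area threshold are placeholders, not the project's actual pipeline.

```python
"""Illustrative classical detection stage: background subtraction followed
by connected components. File name and thresholds are assumptions."""
import cv2

capture = cv2.VideoCapture("cctv_sample.avi")   # hypothetical input tape
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

frame_index = 0
while True:
    ok, frame = capture.read()
    if not ok:
        break
    # Foreground mask: moving pixels against the learned background model.
    mask = subtractor.apply(frame)
    # Drop MOG2's grey shadow label, then remove speckle noise.
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN,
                            cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)))
    # Connected components become candidate object detections.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        if w * h > 200:   # discard tiny blobs; the threshold is arbitrary
            print(frame_index, x, y, w, h)  # one object-specific metadata record
    frame_index += 1
capture.release()
```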
Camera-specific metadata includes geometric and chromatic self-calibration. Geometric calibration is vital if real-world information on velocity or inter-object distances is to be embedded in the metadata. Existing work [7,8] on self-calibration, in which a camera infers its relationship to the world, will be extended to non-planar ground surfaces and indoor scenes where the background is often heavily occluded. In such cases it is necessary to treat the moving objects as virtual calibration devices and perform regularisation or parametric fitting to recover the ground surface and to establish a local coordinate system. To ensure maximum utility of the available evidence, colour calibration will be required to support queries based on colour. Calibration techniques will extend Retinex-type approaches to take advantage of the range of daily and object colours available, accommodating day/night cycles, street or indoor lighting as well as inter-camera variability.

Milestone M3: Adaptive Chromatic Calibration Algorithm (Month 12)
Milestone M4: Automatic Geometric Calibration Algorithm (Month 18)

Scene-specific metadata is expressed as a semantic landscape – a temporally varying global description of the scene built directly from the detection and tracking data used to characterise the scene. Such a landscape would include the activity landscape and the occlusion landscape. This activity and occlusion information is vital to the object trackers, both to improve trajectory prediction and to resolve track ambiguity. We shall investigate methods of embedding this information within the classical object tracking algorithms. Associating textual names supplied by human markup operators with these extracted modes of activity and occlusion regions will significantly enrich the annotation data stream. Offering a temporally varying probability distribution model of activity in the scene (e.g. using HMMs [9]), the activity landscape identifies the object-specific pathways, stopping places, and entrance/exit zones associated with specific structures in the environment – doorways, routes, meeting places, queuing areas. Also automatically derived from the activity data, the occlusion landscape captures the short- and long-term occlusion regions associated with permanent and temporary scene furniture such as walls and parked cars.

Milestone M5: Automated Construction of Semantic Landscape (Month 24)
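The following toy sketch illustrates one way the activity landscape could be represented: detections are accumulated into a coarse spatial grid per hour of day, yielding a time-varying occupancy distribution over the image. A simple grid histogram is used here in place of the HMM-based model discussed above; the grid resolution and frame size are arbitrary assumptions.

```python
"""Toy activity-landscape sketch: a per-hour occupancy grid built from
detection positions. Grid and frame dimensions are illustrative."""
import numpy as np

GRID_W, GRID_H, TIME_BINS = 32, 24, 24   # cells across the image, hour bins

class ActivityLandscape:
    def __init__(self, frame_w, frame_h):
        self.frame_w, self.frame_h = frame_w, frame_h
        self.counts = np.zeros((TIME_BINS, GRID_H, GRID_W))

    def observe(self, x, y, hour):
        # Map an image-plane detection into its grid cell and hour bin.
        gx = min(int(x * GRID_W / self.frame_w), GRID_W - 1)
        gy = min(int(y * GRID_H / self.frame_h), GRID_H - 1)
        self.counts[hour % TIME_BINS, gy, gx] += 1

    def probability(self, hour):
        # Normalised activity distribution for one time slice; peaks mark
        # pathways, stopping places and entrance/exit zones.
        slice_ = self.counts[hour % TIME_BINS]
        total = slice_.sum()
        return slice_ / total if total else slice_

landscape = ActivityLandscape(frame_w=768, frame_h=576)
landscape.observe(x=400, y=300, hour=9)   # e.g. one tracked pedestrian
print(landscape.probability(9).max())
```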
Work Package 3 – Development of Surveillance Metadata Model (Sira: 6 person months)
The metadata streams delivered under WP2 must be integrated to establish a visual knowledge ontology (or Surveillance Metadata Model) that can extrapolate from low-level computer vision concepts to high-level representations of activity or behaviour. We will build on the Cognitive Vision ontology developed within the ECVision project (IST 35454) to structure this knowledge representation from fundamental concepts of observable features (building on the work of the MPEG-7 community) through to objects, scenes and events. The resulting structure needs to incorporate subtle relationships, such as that between crowd and individual, to ensure that the ontology reflects the richness of the application context rather than ease of description. As the visual knowledge and linguistic structures are co-learnt, a methodology will be developed to enable the Visual Evidence Thesaurus to suggest and update the higher-level cognitive abstractions in the Surveillance Metadata Model.

Milestone M6: Definition of the Surveillance Metadata Model (Month 12)

Work Package 4 – Analysing Crowds (Kingston: 12 person months)
Since classical object tracking completely breaks down as the density of objects increases, a novel approach must be sought. Essentially a crowd is a hierarchical entity having its own fuzzy physical extent and an emergent collective behaviour, yet it is composed of individuals who join and leave. Novel representational forms that capture this dual aspect are to be developed, capable of being linked to the Visual Evidence Thesaurus. Rather than relying on traditional tracking, optical flow data can be combined with the previously described activity landscape to describe regions of varying density and fast or slow flow, which can act as input prior probabilities for populating the crowd model (see the flow-field sketch below). Having characterised the global characteristics, the motion patterns will be used to deploy banks of local object trackers that identify and follow individuals within the crowd. Such trajectory data can update crowd model accuracy and introduce more specific quantitative information on the density and interactions of people. The trackers will adopt a head-and-shoulder active appearance model of the gradient magnitude structures associated with people in crowds, and use optical flow information to establish temporal coherence.

Milestone M7: Crowd Analysis Software (Month 24)
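The flow-field sketch referred to above is given here: dense optical flow is computed between consecutive frames and averaged over fixed-size blocks, so regions of fast, slow or multi-directional movement show up as differences between neighbouring cells. The Farneback parameters, block size and input footage are illustrative assumptions rather than the project's chosen settings.

```python
"""Sketch of crowd flow analysis without per-object tracking: per-block
mean optical flow gives a coarse direction/speed field. All parameters
and the input file are assumptions."""
import cv2
import numpy as np

capture = cv2.VideoCapture("crowd_scene.avi")   # hypothetical footage
ok, previous = capture.read()
assert ok, "could not read footage"
previous = cv2.cvtColor(previous, cv2.COLOR_BGR2GRAY)

BLOCK = 16   # flow is summarised over BLOCK x BLOCK pixel cells
while True:
    ok, frame = capture.read()
    if not ok:
        break
    current = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense Farneback flow: one (dx, dy) vector per pixel.
    flow = cv2.calcOpticalFlowFarneback(previous, current, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = flow.shape[:2]
    # Average flow per block: fast/slow and multi-directional regions
    # emerge as differences between neighbouring cells.
    trimmed = flow[: h - h % BLOCK, : w - w % BLOCK]
    cells = trimmed.reshape(h // BLOCK, BLOCK, w // BLOCK, BLOCK, 2)
    mean_flow = cells.mean(axis=(1, 3))
    speed = np.linalg.norm(mean_flow, axis=2)
    print("fastest cell speed (px/frame):", speed.max())
    previous = current
capture.release()
```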
Work Package 5 – Text-Metadata Mining and Multimodal Data Fusion (Surrey: 12 person months)
Multimodal data fusion is the integration of disparate data from a variety of sources into a rich, homogeneous and formal framework. Here, the fusion of the Visual Evidence Thesaurus and the Surveillance Metadata Model has two outcomes: video summarisation – the automatic annotation of video evidence from the visual metadata generated in WP2 and WP3; and the retrieval of key sequences from a video archive using high-level textual queries. A variety of existing multi-net neural systems and adaptive information extraction systems will be evaluated to perform the fusion. Information-extraction-based methods for text mining, developed in the SOCIS project, will be tested on a set of exemplar tapes in which experts describe events and objects appearing on a tape. This method has been successful in fusing visual and textual descriptions of still images in forensic science. The novelty here is that the two sets of features will be extracted automatically and the learning will be automated. If successful, the method can be adapted to other domains where video data is accompanied by text or sound. The deliverable of this work package will contribute to the emerging discipline of adaptive information extraction.

Milestone M8: Multimodal Data Fusion Component (Month 24)

Work Package 6 – Video Summarisation (Surrey: 12 person months; Kingston: 12 person months; Sira: 6 person months)
Using the digital video recording platform supplied by Overview Ltd, an automatic annotation prototype will be developed to validate the effectiveness of the coupling of the Visual Evidence Thesaurus and the Surveillance Metadata Model. Larger numbers of exemplar video sequences will be indexed on the different modalities (WP1, WP2 and WP4) and the system will learn the correspondence between the modalities using unsupervised neural networks (a toy sketch of such co-learning is given below). The generation of textual descriptors will require commentary from experts in video surveillance, relevance feedback from potential end-users and collections of text collateral to specific video sequences. For the management and processing of large volumes of video data, emerging Grid middleware will be evaluated for the secure provision of data management and processing. Extensive user testing is involved in this work package, not only for testing the system's efficacy but also, informed by WP3, for enriching the joint index.

Milestone M9: Prototype video summarisation system (Month 36)
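As a toy illustration of such unsupervised co-learning, the sketch below trains a single self-organising map on concatenated visual and keyword feature vectors, after which a text-only query activates the map unit holding its learnt visual counterpart. This one-map version is a deliberate simplification of the multi-net systems cited earlier [35], and the random feature vectors stand in for real extracted features.

```python
"""Toy co-learning sketch: one self-organising map over fused
visual + keyword vectors; queried with the keyword part alone.
Dimensions, data and schedule are illustrative assumptions."""
import numpy as np

rng = np.random.default_rng(0)
VISUAL_DIM, TEXT_DIM, MAP_SIDE, EPOCHS = 8, 5, 6, 50

# Fused training vectors: [visual features | keyword indicator vector].
data = np.hstack([rng.random((100, VISUAL_DIM)),
                  rng.integers(0, 2, (100, TEXT_DIM)).astype(float)])

weights = rng.random((MAP_SIDE, MAP_SIDE, VISUAL_DIM + TEXT_DIM))
coords = np.stack(np.meshgrid(np.arange(MAP_SIDE), np.arange(MAP_SIDE),
                              indexing="ij"), axis=-1)

for epoch in range(EPOCHS):
    lr = 0.5 * (1 - epoch / EPOCHS)              # decaying learning rate
    radius = MAP_SIDE / 2 * (1 - epoch / EPOCHS) + 1
    for vector in data:
        # Best-matching unit, then a Gaussian neighbourhood update.
        bmu = np.unravel_index(
            np.argmin(((weights - vector) ** 2).sum(axis=2)),
            (MAP_SIDE, MAP_SIDE))
        dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
        influence = np.exp(-dist2 / (2 * radius ** 2))[..., None]
        weights += lr * influence * (vector - weights)

# Text-only query: match against the keyword part of each map unit only.
query = np.zeros(TEXT_DIM)
query[2] = 1.0   # e.g. a hypothetical keyword such as "vehicle"
unit = np.unravel_index(
    np.argmin(((weights[..., VISUAL_DIM:] - query) ** 2).sum(axis=2)),
    (MAP_SIDE, MAP_SIDE))
print("map unit activated by the query:", unit)
```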
7 Relevance to Beneficiaries
The principal beneficiaries of the results will be users of CCTV archives – especially those involved in the administration of justice, i.e. the UK Police Forces represented by partners PITO and PSDB. Given the proportion of effort of any investigation spent reviewing video data, advancing annotation technologies will free up officers, allowing crime investigation teams to be more effective at solving crimes. As importantly, the extraction and integration of evidence into crime management systems will facilitate the cross-referencing of evidence from many CCTV sources. Access to this evidence will be eased for a range of interested parties (detectives, lawyers, judges and juries). Our approach, based on knowledge extraction and linking different information modalities, will be of benefit to cognitive scientists and linguists interested in how video events are understood, interpreted and described by humans. Other relevant academic areas include multimedia databases and digital libraries. Overview Ltd, who are specifically interested in annotation processes which enable accurate retrieval, will contribute their metadata-based digital video recording platform. The second contribution will be in crowd analysis, benefiting video surveillance technologies and associated markets in people counting, marketing, and the design of safe public spaces, as well as the academic image processing community in general. Both Ipsotek Ltd and Crowd Dynamics Ltd are providing expertise and software to support this activity, and access to appropriate video datasets.

8 Dissemination and Exploitation
Research results will be disseminated in the proceedings of international conferences and journals in the fields of Computer Vision, Artificial Intelligence and Information Systems – specifically BMVC, ICCV, IJCNN and SCI. Project progress will be reported on a specially designed website hosted by Kingston University, and a functional prototype of the final system will be demonstrated on this website.

9 Justification of Resources
Personnel: Over this 3-year grant, Kingston University will appoint two Research Assistants contributing 36 person months to the computer vision activity (WP2, WP4) and twelve person months to the common Video Summarisation activity (WP6). The University of Surrey will appoint one RA for three years to address the textual data mining and data fusion activities (WP1, WP5) and to integrate and validate these within the Video Summarisation (WP6) work package. Finally, Sira Ltd will contribute twelve person months over the three years to the development of the Surveillance Metadata Model and its refinement and validation within the Video Summarisation (WP6) work package. Kingston is also requesting funding for technical support for the integration activities associated with the Video Summarisation prototype supplied by Overview Ltd to Kingston University. In addition, a small level of funding is requested to support the administrative and clerical activity of coordinating a project with seven partners.

Travel: Funding is requested for travel to the quarterly Steering Committee and to allow the collaborating partners to meet to discuss the ongoing work. In addition, conference costs are requested to disseminate project results.

Consumables: Each of the Research Assistants will be furnished with a workstation appropriately specified to support the video acquisition, high-volume storage and high computation demands expected of the development work. Data protection issues require the purchase of suitable secure physical storage for video media.

10 References
[1] Police Science and Technology Strategy 2003-2008, Home Office Policy Unit, 16 January 2003.
[2] C. Tenny and J. Pustejovsky (eds.), Events as Grammatical Objects, CSLI Publications, California, 2000.
[3] Y. Fu, A. Ekin, A.M. Tekalp and R. Mehrotra, "Temporal segmentation of video objects for hierarchical object-based motion description", IEEE Trans. Image Processing, Vol. 11(2), pp 135-145, 2002.
[4] E. Megalou and T. Hadzilacos, "Semantic abstractions in the multimedia domain", IEEE Transactions on Knowledge and Data Engineering, Vol. 15(1), pp 136-160, 2003.
[5] R. Collins, A. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, N. Enomoto and O. Hasegawa, "A System for Video Surveillance and Monitoring", VSAM Final Report, CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University, May 2000.
[6] J. Black and T. Ellis, "Multi-Camera Image Tracking", Proceedings 2nd IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, Kauai, Hawaii, 9 December 2001.
[7] G.A. Jones, J. Renno and P. Remagnino, "Auto-Calibration in Multiple-Camera Surveillance Environments", IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, Copenhagen, June 2002.
[8] T.J. Ellis, D. Makris and J.K. Black, "Learning a Multi-Camera Topology", Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Nice, 11-12 October 2003.
[9] P. Remagnino and G.A. Jones, "Spatial and Probabilistic Modelling of Pedestrian Behaviour", British Machine Vision Conference, pp 685-694, 2001.
[10] F. Cupillard, F. Brémond and M. Thonnat, "Behaviour Recognition for Individuals, Groups of People and Crowd", IEE Symposium on Intelligent Distributed Surveillance Systems, London, 26 February 2003.
[11] S. Raupp Musse and D. Thalmann, "Hierarchical Model for Real Time Simulation of Virtual Human Crowds", IEEE Transactions on Visualization and Computer Graphics, Vol. 7(2), pp 152-164, 2001.
[12] L. Heigeas, A. Luciani, J. Thollot and N. Castagne, "A Physically-Based Particle Model of Emergent Crowd Behaviors", International Conference on Computer Graphics and Vision, Moscow, 5-10 September 2003.
Guibas, "Counting People in Crowds with a RealTime Network of Simple Image Sensors", International Conference on Computer Vision, Nice, pp 122129 (2003) [14] B. Maurin, O. Masoud and N.P. Papanikolopoulos, Monitoring Crowded Traffic Scenes, IEEE 5th International Conference on Intelligent Transportation Systems. Singapore, September 3–6, (2002) [15] A. Ali and J. Aggarwal, “Segmentation and Recognition of Continuous Human Activity”, Proceedings IEEE Workshop on Detection and Recognition of Events in Video, pp 2835, July 8 (2001) [16] H. Buxton, “Generative Models for Learning and Understanding Scene Activity”, Technical Report DIKUTR 2002/01 (ISSN 01078283), The University of Copenhagen, pp 7181, June (2002) [17] S. Hongeng, F. Brémond and R. Nevatia. “Representation and Optimal Recognition of Human Activities”, In IEEE Proceedings of Computer Vision and Pattern Recognition, South Carolina, USA, 2000. [18] N.M. Oliver, B. Rosario and A.P. Pentland, “A Bayesian Computer Vision System for Modeling Human Interac tions”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), pp 831843, 2000. [19] M. Brand, N. Oliver, and A. Pentland. Coupled Hidden Markov Models for Complex Action Recognition. IEEE Conference on Computer Vision and Pattern Recognition, pages 994–999, Puerto Rico, (1996) [20] N. Rota, “Contribution à la reconnaissance de comportements humains à partir de séquences vidéos”. Thèse, INRIAUniversité de Nice Sophia Antipolis, 2001/10. [21] N.T. Siebel and S.J. Maybank, S. J. “Fusion of Multiple Tracking Algorithms for Robust People Tracking”. In Proceedings 7th European Conference on Computer Vision, Copenhagen, Denmark, May 2002. [22] F. Cupillard, F. Brémond and M. Thonnat, “Group Behavior Recognition with Multiple Cameras”, IEEE Proc. of the Workshop on Applications of Computer Vision, Orlando, 2002/12. [23] V.T. Vu, F. Brémond and M. Thonnat, “Temporal Constraints for Video Interpretation”, ECAI Lyon (2002) [24] Y. Ivanov and C. Stauffer and A. Bobick and E. Grimson, "Video Surveillance of Interactions", IEEE Conference on Computer Vision and Pattern Recognition Workshop on Visual Surveillance, Fort Collins, November (1999) [25] R.C. Veltkamp and M. Tanase, “ContentBased Image Retrieval Systems: A Survey. (Technical Report UUCS 200034). Institute of Information and Computing Sciences, Univ. of Utrecht, The Netherlands (2000) [26] McG.D Squire, W. Muller, H. Muller, and T. Pun, T, “ContentBased Query of Image databases: Inspirations from Text Retrieval”, Pattern Recognition Letters 21. Elsevier, pp 11931198 (2000) [27] Ogle, V.E., Stonebraker, M.: “Chabot: retrieval from a relational database of images”, IEEE Computer Maga zine, Vol. 28(9), pp 4048 (1995) [28] Paek, S., Sable C.L., Hatzivassiloglou, V., Jaimes, A., Schiffman, B.H., Chang, S.F., McKeown, K.R.: Integration of Visual and TextBased Approaches for the Content Labeling and Classification of Photographs. ACM SIGIR'99 Workshop on Multimedia Indexing and Retrieval, Berkeley, CA (1999) [29] Jing, Y., Croft, W.B.: “An Association Thesaurus for Information Retrieval”. In: Bretano, F., Seitz, F.: (eds.): Proceedings of the RIAO’94 Conference. CISCASSIS, Paris, France (1994) 146160 [30] Grefenstette, G.: Explorations in Automatic Thesaurus Discovery. Kluwer Academic, Boston, USA (1994) [31] K. Ahmad, M. Tariq, B. Vrusias and C. Handy, "CorpusBased Thesaurus Construction for Image Retrieval in Specialist Domains", 25th European Conf. 
[32] G. Leech, "The State of the Art in Corpus Linguistics", in K. Aijmer and B. Altenberg (eds.), English Corpus Linguistics: In Honour of Jan Svartvik, Longman, London, 1991.
[33] K. Ahmad and M.A. Rogers, "Corpus-Based Terminology Extraction", in G. Budin and S.A. Wright (eds.), Handbook of Terminology Management, Vol. 2, John Benjamins, Amsterdam, pp 725-760, 2000.
[34] M. Hearst, "Automatic Acquisition of Hyponyms from Large Text Corpora", Proceedings of the Fourteenth International Conference on Computational Linguistics, Nantes, France, 1992.
[35] K. Ahmad, M. Casey and B. Vrusias, "Combining Multiple Modes of Information using Unsupervised Neural Classifiers", in T. Windeatt and F. Roli (eds.), International Workshop on Multiple Classifier Systems, LNCS 2709, Springer-Verlag, Heidelberg, pp 236-245, 2003.

APPENDIX 1: DIAGRAMMATIC WORKPLAN

Work packages, scheduled across quarters Q1-Q12 of Years 1-3:
WP1 Development of Visual Evidence Thesaurus
WP2 Extracting Visual Semantics
WP3 Development of Surveillance Metadata Model
WP4 Analysing Crowds
WP5 Text-Metadata Mining and Multimodal Data Fusion
WP6 Video Summarisation

Milestones:
M1 Automatic Ontology Extraction Methodology (Month 12)
M2 Object Detection and Tracking Package (Month 06)
M3 Adaptive Chromatic Calibration Algorithm (Month 12)
M4 Automatic Geometric Calibration Algorithm (Month 18)
M5 Automated Construction of Semantic Landscape (Month 24)
M6 Definition of the Surveillance Metadata Model (Month 12)
M7 Crowd Analysis Software (Month 24)
M8 Multimodal Data Fusion Component (Month 24)
M9 Prototype video summarisation system (Month 36)

APPENDIX 2: LETTERS IN SUPPORT OF REVEAL PROPOSAL