ERC ALLEGRO: Active large-scale learning for visual recognition
Cordelia Schmid, INRIA Grenoble

ERC advanced grant ALLEGRO
• Massive and ever-growing amount of digital image and video content
  – Flickr and YouTube
  – Audiovisual archives (BBC, INA)
  – Personal collections

ERC advanced grant ALLEGRO
• Comes with additional information
  – Text
  – Audio
  – Other metadata
• A rather sparse and noisy, yet rich and diverse source of annotation

Active large-scale learning for visual recognition
• Object detection
• Active large-scale learning
• Action recognition

Some recent results
• Weakly supervised learning of actors and actions from scripts
  – Data: movies (DVDs) + transcripts obtained from the web
• Large-scale event recognition
  – Data: NIST TrecVid multimedia event detection dataset
• Weakly supervised learning of object detectors from YouTube videos
  – Data: YouTube videos collected by keyword search

Scripts as weak supervision
• Challenges:
  – Imprecise temporal localization
  – No explicit spatial localization
  – NLP problems: script text ≠ training labels
• Example: the action label "Get-out-car" falls between timestamps 24:25 and 24:51, with uncertainty about the exact moment, and its script realizations vary: "… Will gets out of the Chevrolet. …" vs. "… Erin exits her new truck …"

Joint learning of actors and actions [Bojanowski et al., ICCV 2013]
• Given the script sentence "Rick walks up behind Ilsa": which person track is Rick, and which track performs the action "walks"?
• Actors and actions are resolved jointly, so that "Rick" and "walks" are assigned to the correct track
• Full reference: Finding Actions and Actors in Movies [Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, ICCV 2013]

Some recent results (continued)
• Large-scale event recognition
  – Data: NIST TrecVid multimedia event detection dataset

Event recognition
• Recognize high-level events, for example a birthday party or a parade
• Example categories: "Birthday party", "Grooming an animal"
• NIST TrecVid multimedia event detection task (MED)

TrecVid MED 2013 datasets
• Training: 100 positive video clips per event category, 5000 negative videos, 30 events
• Testing: 98,000 video clips, i.e., 4000 hours; test annotations are done manually, are not provided, and the evaluation is done directly by NIST
• Videos come from publicly available, user-generated content on various Internet sites

Our approach for event classification
• Visual features: motion with dense trajectory features, static appearance with SIFT
• Audio features, OCR, ASR
• EU IP AXES ranked first for event detection

TrecVid MED 2013 – results
• [Figures: top-ranked video clips (ranks 1–3) for the events "Horse riding competition" and "Tuning a musical instrument"]

Some recent results (continued)
• Weakly supervised learning of object detectors from YouTube videos
  – Data: YouTube videos collected by keyword search

Weakly supervised learning from YouTube videos
• Data: YouTube videos collected by keyword search, for example "birds" and "cats"
• Goal: localize object tubes in these videos

Localizing and selecting tubes
• Pipeline: motion segments → candidate tubes → automatically selected tube (a minimal sketch of this selection step follows below)
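What such a selection step could look like is sketched below. This is only a minimal illustration, not the method actually used in the project: it assumes each candidate tube already carries a motion-based score and an appearance descriptor (both placeholder inputs here), and it replaces a joint optimization with a simple greedy refinement that encourages the tubes picked across videos to look alike.

```python
# Minimal sketch of cross-video tube selection (illustrative only).
# Assumes: scores[v][t]      = motion-based score of candidate tube t in video v
#          descriptors[v][t] = appearance descriptor of that tube
# Both inputs are placeholders; the slides do not specify them.
import numpy as np

def select_tubes(scores, descriptors, n_iters=5):
    """Pick one tube per video: initialize with the top-scoring candidate,
    then iteratively re-pick the tube that also matches the tubes currently
    selected in the other videos, so the selections agree on one object class."""
    picked = [int(np.argmax(s)) for s in scores]
    for _ in range(n_iters):
        for v in range(len(scores)):
            # Mean appearance of the tubes currently selected in the other videos.
            others = [descriptors[u][picked[u]]
                      for u in range(len(scores)) if u != v]
            if not others:          # single video: keep the motion-based pick
                continue
            proto = np.mean(others, axis=0)
            # Cosine similarity of every candidate in video v to that prototype.
            sim = descriptors[v] @ proto / (
                np.linalg.norm(descriptors[v], axis=1) * np.linalg.norm(proto) + 1e-8)
            # Trade off per-video motion evidence against cross-video consistency.
            picked[v] = int(np.argmax(scores[v] + sim))
    return picked
```

Coupling per-video evidence with cross-video agreement is what allows noisy, keyword-searched videos to act as weak supervision for each other.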
Issues with data access
• Which image/video datasets can be used for research?
• Which images/videos can be used for illustration in talks and presentations?
• What can be made available on-line?
• Large-scale collection and annotation is a significant effort
  – A precise definition of the requirements and of the legal issues is a must

Issues with data access – ERC ALLEGRO
• Ethical issues had to be addressed prior to the start of ERC ALLEGRO
• Ethical advisor at INRIA (a committee including a legal expert)
• Responded to my open questions
• Indicated a large number of limitations on data use

Initial solutions
• Use of existing large-scale datasets
  – Good practice: comparison to the state of the art
  – NIST (National Institute of Standards and Technology) TrecVid
  – ImageNet: 20K classes, 14M images, hosted by Stanford University
  – Autism dataset involving children, from Georgia Tech
• Filming our own dataset
  – Explicit agreement from all participants (members of our team)
• Images/videos with a Creative Commons license
  – For videos, such a license is almost never present

Open problems / questions
• Large-scale harvesting of data from the web, in particular videos?
• Providing the data to the community for comparison?
• In the future: open access to archive data?
• Large-scale collection and annotation is a significant effort
  – A precise definition of the requirements and of the legal issues is a must