
ERC ALLEGRO
Active large-scale learning
for visual recognition
Cordelia Schmid
INRIA Grenoble
ERC advanced grant ALLEGRO
Massive and ever-growing amount of digital image and video content
– Flickr and YouTube
– Audiovisual archives (BBC, INA)
– Personal collections
ERC advanced grant ALLEGRO
Comes with additional information
‒ Text
‒ Audio
‒ Other metadata
A rather sparse and noisy, yet rich
and diverse source of annotation
Active large-scale learning for visual recognition
[Figure: project overview diagram linking active large-scale learning with object detection and action recognition.]
Some recent results
• Weakly supervised learning of actors and actions from
scripts
– Data: movies (DVDs) + transcripts obtained from the web
• Large-scale event recognition
– Data: NIST TrecVid Multimedia event detection dataset
• Weakly supervised learning of object detectors from
YouTube videos
– Data: YouTube videos collected by keyword search
Scripts as weak supervision
Challenges:
• Imprecise temporal localization
• No explicit spatial localization
• NLP problems, scripts ≠ training labels
[Figure: script-to-video alignment example. The script lines "… Will gets out of the Chevrolet. …" (at 24:25) and "… Erin exits her new truck …" (at 24:51) both map to the action label "Get-out-car", with temporal uncertainty between script time and video time.]
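The NLP step can be made concrete with a small sketch: mapping free-form script sentences to canonical action labels. The verb-to-label table, the example sentence, and the use of spaCy are illustrative assumptions here, not the project's actual pipeline.

```python
# A minimal sketch of turning a script line into (subject, action-label)
# pairs. The VERB_TO_ACTION table is hypothetical.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

# Hypothetical mapping from verb lemmas to canonical action labels
VERB_TO_ACTION = {"get": "Get-out-car", "exit": "Get-out-car", "walk": "Walk"}

def script_line_to_labels(sentence):
    """Extract (subject, action-label) pairs from one script sentence."""
    doc = nlp(sentence)
    pairs = []
    for token in doc:
        if token.pos_ == "VERB" and token.lemma_ in VERB_TO_ACTION:
            for child in token.children:
                if child.dep_ == "nsubj":
                    pairs.append((child.text, VERB_TO_ACTION[token.lemma_]))
    return pairs

print(script_line_to_labels("Will gets out of the Chevrolet."))
# e.g. [('Will', 'Get-out-car')]
```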
Joint Learning of Actors and Actions
[Bojanowski et al. ICCV 2013]
[Figure: for the script line "Rick walks up behind Ilsa", the candidate person tracks first carry uncertain labels ("Rick?", "Walks?"), which the joint model resolves to "Rick" and "Walks".]
Finding Actors and Actions in Movies
[Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, 2013]
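To make the resolution step concrete, here is a toy sketch of assigning the script pair (Rick, walk) to one of the clip's person tracks. The scores are invented, and the actual method of Bojanowski et al. solves a joint convex relaxation over all clips, not a per-clip argmax.

```python
# Conceptual sketch: a script constraint such as "Rick walks up behind
# Ilsa" means at least one person track in the clip is (Rick, walk).
import numpy as np

# Hypothetical classifier outputs, one entry per candidate person track
actor_scores = np.array([0.9, 0.2, 0.4])   # P(track is Rick)
action_scores = np.array([0.7, 0.8, 0.1])  # P(track performs 'walk')

# Joint score for assigning the (Rick, walk) pair to each track
joint = actor_scores * action_scores
best_track = int(np.argmax(joint))
print(f"assign (Rick, walk) to track {best_track}")  # track 0 here
```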
Some recent results
• Weakly supervised learning of actors and actions from
scripts
– Data: movies (DVDs) + transcripts obtained from the web
• Large-scale event recognition
– Data: NIST TrecVid Multimedia event detection dataset
• Weakly supervised learning of object detectors from
YouTube videos
– Data: YouTube videos collected by keyword search
Event recognition
• For example birthday party, parade …
[Figure: example frames for the events "Birthday party" and "Grooming an animal" from the NIST TrecVid Multimedia event detection task (MED).]
TrecVid MED 2013 datasets
• Training: 100 positive video clips per event category,
5000 negative videos, 30 events
• Testing on 98,000 video clips, i.e., 4,000 hours; test annotations
are done manually and not provided, the evaluation is run
directly by NIST
• Videos come from publicly available, user-generated
content on various Internet sites
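As a rough illustration of how such a detection run can be scored, below is a toy computation of average precision over a ranked list of clips. The labels and scores are invented; the official metric and annotations belong to NIST.

```python
# Toy ranking metric for event detection: average precision over a
# ranked list of test clips.
from sklearn.metrics import average_precision_score

y_true = [1, 0, 1, 0, 0, 1]               # 1 = clip contains the event
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]  # detector confidence per clip
print(average_precision_score(y_true, y_score))
```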
Our approach for event classification
• Visual features: motion described with dense trajectory features,
static appearance with SIFT
• Audio features, OCR, ASR
• The EU IP AXES submission ranked first for event detection
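A common way to combine such heterogeneous channels is late fusion: one classifier per feature type, with the decision scores averaged. The sketch below uses random stand-in features and linear SVMs; it does not reproduce the exact features or fusion weights of the actual submission.

```python
# Minimal late-fusion sketch: train one classifier per modality and
# average the decision scores on the test set.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_train, n_test = 200, 50
modalities = {"motion": 64, "sift": 64, "audio": 32}  # hypothetical dims

y_train = rng.integers(0, 2, n_train)
fused = np.zeros(n_test)
for name, dim in modalities.items():
    X_train = rng.normal(size=(n_train, dim))  # stand-in features
    X_test = rng.normal(size=(n_test, dim))
    clf = LinearSVC().fit(X_train, y_train)
    fused += clf.decision_function(X_test)     # unweighted late fusion
fused /= len(modalities)
```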
TrecVid MED 2013 - results
[Figure: top-ranked test videos (ranks 1-3) for the event "Horse riding competition".]
[Figure: top-ranked test videos (ranks 1-3) for the event "Tuning a musical instrument".]
Some recent results
• Weakly supervised learning of actors and actions from
scripts
– Data: movies (DVDs) + transcripts obtained from the web
• Large-scale event recognition
– Data: NIST TrecVid Multimedia event detection dataset
• Weakly supervised learning of object detectors from
YouTube videos
– Data: YouTube videos collected by keyword search
WS learning from YouTube videos
• Data: YouTube videos collected by keyword search, for
example birds and cats
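Keyword-based collection can be sketched with present-day tooling. The snippet below uses yt-dlp's "ytsearch" pseudo-URL as a modern stand-in; it is not the tool used in the original work, and any such harvesting must respect the data-access issues discussed later in this deck.

```python
# Minimal sketch of collecting candidate videos by keyword search,
# fetching metadata only (no download).
from yt_dlp import YoutubeDL

def search_videos(query, n=10):
    opts = {"quiet": True, "extract_flat": True}  # metadata only
    with YoutubeDL(opts) as ydl:
        result = ydl.extract_info(f"ytsearch{n}:{query}", download=False)
    return [(e["id"], e.get("title")) for e in result["entries"]]

for vid, title in search_videos("cat", n=5):
    print(vid, title)
```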
WS learning from YouTube videos
[Figure: localize object tubes in the collected videos.]
Localizing and selecting tubes
[Figure: pipeline from motion segments to candidate tubes to the automatically selected tube.]
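The selection step boils down to scoring each candidate spatio-temporal tube with the current object model and keeping the best one per video. Below is a minimal sketch with a hypothetical linear scoring function standing in for the classifier learned in the weakly supervised loop.

```python
# Conceptual sketch of tube selection: keep the candidate tube with
# the highest model score.
import numpy as np

def select_tube(candidate_tubes, score_fn):
    """Return the candidate tube with the highest model score."""
    scores = [score_fn(t) for t in candidate_tubes]
    return candidate_tubes[int(np.argmax(scores))]

# Toy usage: tubes as arbitrary feature vectors, scored by a linear model
w = np.array([0.5, -0.2, 1.0])
tubes = [np.array([1.0, 0.0, 0.2]), np.array([0.1, 0.3, 0.9])]
best = select_tube(tubes, lambda t: float(w @ t))
```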
Issues with data access
• Which image/video datasets can be used for research?
• Which images/videos can be used for illustration in talks
and presentations?
• What can be made available on-line?
• Large-scale collection and annotation is a significant
effort
– A precise definition of the requirements and the legal issues is a must
Issues with data access – ERC ALLEGRO
• Ethical issues review requested prior to the start of ERC ALLEGRO
• Ethical advisor at INRIA (committee with a legal expert)
• Responded to my open questions
• Indicated a large number of limitations on data use
Initial solutions
• Use of existing large-scale datasets
– Good practice: comparison to the state of the art
– NIST (National Institute of Standards and Technology) TrecVid
– ImageNet, 20K classes, 14M images, hosted by Stanford Univ.
– Autism dataset involving children from Georgia Tech
• Filming of our own dataset
– Explicit agreement from all the participants (members of our team)
• Images/videos with a creative commons license
– In the case of videos, almost never present
Open problems / questions
• Large-scale harvesting of data from the web, in particular
videos?
• Providing the data to the community for comparison?
• In the future: open access data to archives?
• Large-scale collection and annotation is a significant
effort
– A precise definition of the requirements and the legal issues is a must