Results References Acknowledgment Problem Statement Approach

Monte Carlo Tree Search for Scheduling Activity Recognition
Mohamed R. Amer, Sinisa Todorovic, Alan Fern
Oregon State University, Corvallis, OR, USA
Song-Chun Zhu
University of California, Los Angeles, CA, USA
Activity Parsing as a Search Problem
Motivation
Querying surveillance videos requires running many
detectors of:
β€’ group activities,
β€’ individual actions,
β€’ objects.
Running all the detectors is inefficient as they may
provide little to no information for answering the query.
Problem Statement
Given: a large video with high resolution and multiple
activities happening at the same time, and a query.
Goal: perform cost-sensitive parsing to answer the query.
Results
Action: π‘Žπ‘‘ ∈ π‘π‘Ÿπ‘œπ‘π‘’π‘ π‘ , π‘‘π‘’π‘‘π‘’π‘π‘‘π‘œπ‘Ÿ, π‘ π‘π‘Žπ‘‘π‘–π‘œπ‘‘π‘’π‘šπ‘π‘œπ‘Ÿπ‘Žπ‘™ π‘£π‘œπ‘™π‘’π‘šπ‘’ ,
State: 𝑆0:𝑑 = 𝑠0 π‘Ž0 , π‘Ž1 , … , π‘Žπ‘‘βˆ’1 ,
New from [4]
Policy: Ξ  = 𝑠0 , π‘Ž0:𝑑 𝑑 = 1,2, … , 𝐡 , where 𝐡 is the inference budget,
Reward: π‘Ÿ(Ξ ) = I log 𝑝 π’‘π’ˆ(Ξ ) > πœƒ , where πœƒ is an input threshold.
The cost-sensitive inference for a policy, Ξ , is conducted by taking the set of
actions, π‘Ž0:𝐡 , that provides the maximum utility, 𝑄, for states, 𝑆0:𝐡 :
Ξ  = argmaxa 𝑄 𝑆0:𝑑 , π‘Žπ‘‘ , βˆ€ 𝑑 = 1,2, … , 𝐡 ,
1
π‘Žπ‘π‘π‘’π‘π‘‘,
𝑖𝑓 π‘Ÿ Ξ  =
0
π‘Ÿπ‘’π‘—π‘’π‘π‘‘.
Classification accuracy on New Collective Activity dataset
Parsing spatiotemporal And-Or Graphs can be defined in
terms of (Ξ±, Ξ², Ι£, Ο‰):
β€’ Ξ±: detecting objects, individual actions, group activities
directly from video features.
β€’ Ξ²: predicting an activity from its detected parts by
bottom-up composition.
β€’ Ι£: top-down prediction of an activity using its context.
β€’ Ο‰: predicting an activity at a given time based on
tracking across the video.
log 𝑝
∝
𝑇
𝑀𝛼
πœ™
𝜏
π‘₯𝑙 , 𝑦, β„Ž
+
Ξ± detector of activity
+
𝑖,𝑗
𝑇
𝑀𝛽
πœ™
𝜏
𝜏
π‘₯𝑖𝑙+ , π‘₯𝑗𝑙+ , 𝑦, β„Ž
Ξ² relations between parts of the activities
Limited budget
Infinite budget
[6] Tracking detections:
Approach
𝜏
π’‘π’ˆπ‘™
Classification accuracy on UCLA Courtyard dataset
𝑇
𝑀𝛾
πœ™
(V3), (QL), [4] Parse graph with no tracking:
(V2), [9] Parse graph with tracking the query:
Classification accuracy on Collective Activity dataset
𝜏
π‘₯π‘™βˆ’ , 𝑦, β„Ž
Ι£ context of activity
+
𝑇
𝑀𝛼
πœ™
𝜏 πœβˆ’
π‘₯𝑙 , π‘₯𝑙 , 𝑦, β„Ž
NEW from [4]
(V1) Parse graph with tracking all:
Ο‰ tracking of activity
New from [4]
References
Learning
Selection
Simulation
Backpropagation
Selection: select state 𝑆0:𝑑 .
Simulation: update the expected utility for the simulations, given π‘Ÿ Ξ π‘ π‘–π‘š .
Backpropagation: propagate the expected utility back to the root node by updating all
the nodes on the path using:
β€²
𝑄
𝑆0:𝑑 , π‘Žπ‘‘ ←
β€²
𝑄
𝑆0:𝑑 , π‘Žπ‘‘ + 𝐢 π‘Ÿ Ξ π‘ π‘–π‘š
βˆ€ 𝑆0:𝑑 ∈ Ξ sim , 𝑑 = {1,2, … , 𝐡}
[4] M. Amer, D. Xie, M. Zhao, S. Todorovic, S-C. Zhu. β€œCost-sensitive topdown/bottom-up inference for multi-scale activity recognition” ECCV12.
[6] W. Choi, S. Savarse. β€œA unified frame work for multi-target tracking and
collective activity recognition” ECCV12.
[7] W. Choi, K. Shahid, S. Savarse. β€œWhat are they doing?: Collective
activity classification using spatio-temporal relationship among people.”
ICCV-W09
[9] S. Khamis, V. Morariu, L.S. Davis. ”Combining per-frame and per-track
cues for multi-person action recognition.” ECCV12
Acknowledgment
NSF IIS 1018490, ONR MURI N00014-10-1-0933,
DARPA MSEE FA 8650-11-1-7149