Monte Carlo Tree Search for Scheduling Activity Recognition
Mohamed R. Amer, Sinisa Todorovic, Alan Fern
Oregon State University, Corvallis, OR, USA
Song-Chun Zhu
University of California, Los Angeles, CA, USA
Activity Parsing as a Search Problem
Motivation
Querying surveillance videos requires running many
detectors of:
β’ group activities,
β’ individual actions,
β’ objects.
Running all of the detectors is inefficient, as many of them may provide little or no information for answering the query.
Problem Statement
Given: a large video with high resolution and multiple
activities happening at the same time, and a query.
Goal: perform cost-sensitive parsing to answer the query.
Search formulation (new from [4]):

Action: $a_t = (\text{process}, \text{detector}, \text{spatiotemporal volume})$,
State: $s_{0:t} = (s_0, a_0, a_1, \ldots, a_{t-1})$,
Policy: $\Pi = (s_0, a_{0:t})$, $t = 1, 2, \ldots, B$, where $B$ is the inference budget,
Reward: $R(\Pi) = \mathbb{1}[\log p(pg(\Pi)) > \theta]$, where $\theta$ is an input threshold and

$pg(\Pi) = \begin{cases} 1, & \text{correct}, \\ 0, & \text{incorrect}. \end{cases}$

The cost-sensitive inference for a policy, $\Pi$, is conducted by taking the set of actions, $a_{0:B}$, that provides the maximum utility, $U$, for states, $s_{0:B}$:

$\Pi^* = \operatorname*{argmax}_a U(s_{0:t}, a_t), \quad \forall t = 1, 2, \ldots, B.$

Results

[Plot: classification accuracy on the New Collective Activity dataset]
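The search formulation above can be sketched in code. The following is a minimal, hypothetical Python sketch; the type names, the toy scoring function, and the default threshold are illustrative, not from the authors' implementation:

```python
from typing import Callable, List, NamedTuple

class Action(NamedTuple):
    """One inference action a_t: run `detector` via `process` on a video sub-volume."""
    process: str   # one of the parsing processes, e.g. "alpha"
    detector: str  # e.g. "crossing", "waiting"
    volume: tuple  # spatiotemporal volume (t0, t1, x0, y0, x1, y1)

def reward(log_posterior: float, theta: float) -> int:
    """R(Pi) = 1[log p(pg(Pi)) > theta]."""
    return 1 if log_posterior > theta else 0

def run_policy(actions: List[Action], budget: int,
               log_posterior_of: Callable[[List[Action]], float],
               theta: float = 0.0) -> int:
    """Execute at most `budget` actions (a_{0:B}) and score the resulting parse."""
    taken = actions[:budget]
    return reward(log_posterior_of(taken), theta)
```

Under this sketch, a policy that clears the log-posterior threshold within its budget earns reward 1, and 0 otherwise.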
Parsing spatiotemporal And-Or graphs can be defined in terms of (α, β, γ, ϖ):
• α: detecting objects, individual actions, and group activities directly from video features.
• β: predicting an activity from its detected parts by bottom-up composition.
• γ: top-down prediction of an activity using its context.
• ϖ: predicting an activity at a given time based on tracking across the video.
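As a reading aid, the four processes can be viewed as the types of inference moves a scheduler chooses among. A hypothetical sketch (the enum and helper are illustrative only, not the authors' code):

```python
from enum import Enum
from itertools import product

class Process(Enum):
    """The four parsing processes for a spatiotemporal And-Or graph."""
    ALPHA = "detect directly from video features"
    BETA = "compose bottom-up from detected parts"
    GAMMA = "predict top-down from context"
    VARPI = "predict over time by tracking"

def candidate_moves(graph_nodes):
    """All (process, node) pairs an inference scheduler could try next."""
    return list(product(Process, graph_nodes))
```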
Approach

The log-posterior of a parse graph $pg$ sums linear potentials over the four processes (new from [4]):

$\log p(pg \mid x) \propto \underbrace{\sum_i w_\alpha^T \varphi(x_i, y, h)}_{\text{α: detector of activity}} + \underbrace{\sum_{i,j} w_\beta^T \varphi(x_{i\partial^+}, x_{j\partial^+}, y, h)}_{\text{β: relations between parts of the activities}} + \underbrace{\sum_i w_\gamma^T \varphi(x_{i\partial^-}, y, h)}_{\text{γ: context of activity}} + \underbrace{\sum_{i,j} w_\varpi^T \varphi(x_i, x_j, y, h)}_{\text{ϖ: tracking of activity}}$

[Plots: classification accuracy on the UCLA Courtyard and Collective Activity datasets, under a limited budget and an infinite budget, comparing: (V1) parse graph with tracking all; (V2), [9] parse graph with tracking the query; (V3), (QL), [4] parse graph with no tracking; and [6] tracking detections.]
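Each term above is a linear potential $w^T \varphi(\cdot)$. A toy numeric sketch of scoring a parse graph by summing such potentials (the weights and feature values are made-up numbers, for illustration only):

```python
def dot(w, phi):
    """w^T phi for plain Python lists."""
    return sum(wi * fi for wi, fi in zip(w, phi))

def log_posterior(potentials):
    """Sum the alpha/beta/gamma/varpi potentials of a parse graph.

    `potentials` is a list of (w, phi) pairs; each contributes w^T phi,
    so the result is proportional to log p(pg | x).
    """
    return sum(dot(w, phi) for w, phi in potentials)

# Toy parse graph: one alpha (detector) term and one beta (part-relation) term.
score = log_posterior([
    ([0.5, 1.0], [2.0, 1.0]),  # alpha: w_a^T phi(x_i, y, h) = 2.0
    ([0.2, 0.3], [1.0, 1.0]),  # beta:  w_b^T phi = 0.5
])
```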
Learning

The policy is learned with Monte Carlo tree search, iterating selection, simulation, and backpropagation:

Selection: select a state $s_{0:t}$.
Simulation: update the expected utility for the simulations, given $R(\Pi_{sim})$.
Backpropagation: propagate the expected utility back to the root node by updating all the nodes on the path using:

$U(s_{0:t}, a_t) \leftarrow U(s_{0:t}, a_t) + C\, R(\Pi_{sim}), \quad \forall s_{0:t} \in \Pi_{sim},\ t = 1, 2, \ldots, B.$
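The three steps can be sketched as a toy tree-search loop. This is a simplified stand-in (greedy selection with random tie-breaking replaces a proper UCB rule, and every name is illustrative), not the authors' implementation:

```python
import random

def mcts_utilities(actions, budget, simulate_reward, iters=200, C=1.0, seed=0):
    """Toy MCTS: repeatedly select a path of `budget` actions, simulate its
    reward R(Pi_sim), and backpropagate C * R to every node on the path."""
    rng = random.Random(seed)
    U = {}  # expected utility, keyed by (t, action) as a stand-in for (s_{0:t}, a_t)
    for _ in range(iters):
        # Selection: walk down greedily on current utility, random tie-break.
        path = [(t, max(actions, key=lambda a: (U.get((t, a), 0.0), rng.random())))
                for t in range(budget)]
        # Simulation: score the sampled policy Pi_sim.
        R = simulate_reward([a for _, a in path])
        # Backpropagation: U(s_{0:t}, a_t) <- U(s_{0:t}, a_t) + C * R(Pi_sim).
        for node in path:
            U[node] = U.get(node, 0.0) + C * R
    return U
```

In this toy version, actions that appear on rewarded simulated paths accumulate utility and are preferred in later selections.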
References

[4] M. R. Amer, D. Xie, M. Zhao, S. Todorovic, S.-C. Zhu. "Cost-sensitive top-down/bottom-up inference for multi-scale activity recognition." ECCV 2012.
[6] W. Choi, S. Savarese. "A unified framework for multi-target tracking and collective activity recognition." ECCV 2012.
[7] W. Choi, K. Shahid, S. Savarese. "What are they doing?: Collective activity classification using spatio-temporal relationship among people." ICCV Workshops 2009.
[9] S. Khamis, V. I. Morariu, L. S. Davis. "Combining per-frame and per-track cues for multi-person action recognition." ECCV 2012.
Acknowledgment
NSF IIS 1018490, ONR MURI N00014-10-1-0933,
DARPA MSEE FA 8650-11-1-7149