[Figure: human perception is progressive — with increasing presentation time (ms), subjects report "bright object" (40 ms), "possibly outdoors" (60 ms), and "on the coast, at least three birds" (500 ms).]
‣ We deal with varied images, classes, and recognition actions.
‣ Even when the budget does not let us compute everything, we want to provide the best performance.

Figure 1: Top: visualization of specific object classification tasks of interest in daily life (e.g. feline, dog, vehicle), which are often subtrees in a large-scale object taxonomy, e.g. the ImageNet hierarchy. Bottom: adapting the ImageNet classifier allows us to perform accurate predictions (bold: retriever, tabby cat, garbage truck), while the original classifier's predictions (in parentheses: ice bear, dungeness crab, boathouse) suffer from higher confusion.
Related Work
‣ Static feature selection.
‣ Cascades: Viola & Jones CVPR 2001; Chen et al. ICML 2012.
‣ Dynamic feature selection: Gao & Koller NIPS 2011; Benbouzid et al. 2012; Karayev et al. NIPS 2012.
‣ Hierarchies: Deng et al. CVPR 2012.

…classification tasks corresponding to subtrees in this hierarchy. While it is reasonable to train separate classifiers for specific tasks, this quickly becomes intractable, as there is a huge number of possible tasks: any subtree in the hierarchy may be a latent task one may encounter. Thus, it would be beneficial to have a system which handles a large number of object categories in the world, and is able to quickly adapt to specific incoming tasks once deployed. We are particularly interested in the scenario where tasks are not explicitly given, but are implicitly specified with a set of query images, or a stream of query images in an online fashion. This has practical importance: for example, one may want to have a single mobile app that adapts to plant recognition on a field trip after a few queries, and that shifts to grocery recognition when one stops by the grocery store. This is a new challenge beyond …
‣ We learn to select the action of maximum expected value from every state.
‣ State is updated with the result of the selected action.

The featurization function φ(s) extracts the following features from the state:
• Bit vector m of length F: initially all bits are 1 and are set to 0 when the corresponding feature is computed.
• For each h_f, a vector of size d_f representing the values; 0 until observed.
• Cost feature c ∈ [0, 1], for the fraction of the budget spent.
• Bias feature 1.
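As a concrete illustration of this state representation, here is a minimal sketch in Python. The class and method names are hypothetical (not the authors' code); it simply packages the mask, observed values, budget fraction, and bias into one vector.

```python
import numpy as np

class AnytimeState:
    """Dynamic state for budgeted feature selection (illustrative sketch)."""

    def __init__(self, feature_dims, budget):
        self.feature_dims = list(feature_dims)      # d_f for each of the F features
        self.budget = float(budget)
        self.mask = np.ones(len(self.feature_dims))             # bit vector m: 1 = not yet computed
        self.values = [np.zeros(d) for d in self.feature_dims]  # observed values; 0 until observed
        self.spent = 0.0                                         # cost spent so far

    def observe(self, f, value, cost):
        """Record the result of computing feature f and pay its cost."""
        self.mask[f] = 0.0
        self.values[f] = np.asarray(value, dtype=float)
        self.spent += cost

    def phi(self):
        """phi(s): mask, observed values, fraction of budget spent, and a bias of 1."""
        cost_fraction = min(self.spent / self.budget, 1.0)
        return np.concatenate([self.mask] + self.values + [[cost_fraction, 1.0]])
```

The static-state baseline would use the same vector with the observed-values blocks left out.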
[Plot residue: feature-selection trajectories on the synthetic example (features d0–d2 and q0–q7 vs. number in action sequence) for the Random and Optimal policies.]
‣ We train classifiers on subsets of features, to give an answer at any time.

These features define the dynamic state, presenting enough information to have a closed-loop (dynamic) policy that may select different features for different test instances. The static state has all of the above features except for the observed feature values. This enables only an open-loop (static) policy, which is exactly the same for all instances. The policy learned with the static state is used as a baseline in experiments.

Synthetic-example features: d_i — sign of dimension i; q_o — label of the datapoint if it lies in quadrant o.

Policy: the state-action feature function φ(s, a) effectively block-codes these features: it is 0 everywhere except the block corresponding to the action considered. In implementation, we train F separate regressions with a tied regularization parameter, which is K-fold cross-validated.

Reward definition and action-value function: the reward and the linear action-value Q(s, a) = θᵀφ(s, a) are given below.
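A small sketch of the block-coded state-action features and the linear action-value they induce; the function names are hypothetical, and the single stacked weight vector θ is equivalent to keeping one regression per action with tied regularization.

```python
import numpy as np

def phi_state_action(phi_s, action, num_actions):
    """Block-code the state features: zero everywhere except the block of `action`."""
    d = len(phi_s)
    out = np.zeros(num_actions * d)
    out[action * d:(action + 1) * d] = phi_s
    return out

def q_value(theta, phi_s, action, num_actions):
    """Linear action-value Q(s, a) = theta^T phi(s, a)."""
    return float(theta @ phi_state_action(phi_s, action, num_actions))
```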
[Plot residue: feature-selection trajectories (d0–d2, q0–q7 vs. number in action sequence) for the Static non-myopic, Dynamic greedy, and Dynamic non-myopic policies; bar charts of the area under the error vs. cost curve for each policy and for different parameter settings when learning the policy, with Optimal and Random references.]
4.2. Scene recognition
[Plot residue: accuracy vs. test-time cost legend — Random, Static greedy, Static non-myopic, Dynamic greedy, Dynamic non-myopic.]
Figure 3. Accuracy as a function of cpu-cost during test-time. The curve is generated by gradually increasing λ. Miser champions the accuracy/cost tradeoff and obtains similar accuracy as the SVM with multiple kernels with only half its test-time cost.
Scene Recognition. The Scene-15 data set (Lazebnik et al.,
2006) is from a very different data domain. It contains 4485
images from 15 scene classes and the task is to classify images according to scene. Figure 4 shows one example image for each scene category. We follow the procedure used
by Lazebnik et al. (2006); Li et al. (2010), randomly sampling 100 images from each class, resulting in 1500 training
images. From the remaining 2985 images, we randomly
sample 20 images from each class as validation, and leave the remaining 2685 for testing.
Figure 4 shows the results, including learned policy trajectories. For all evaluated budgets, our dynamic, non-myopic method outperforms all others on the area under the error vs. cost curve metric.
Our results on this dataset match the reported results of Active Classification [12] (Figure 2) and Greedy Miser [26] (Figure 3), although both methods use an additional powerful feature channel (ObjectBank).

We use a diverse set of visual descriptors varying in computation time and accuracy: GIST, spatial HOG, Local Binary Pattern, self-similarity, texton histogram, geometric texton, geometric color, and Object Bank (Li et al., 2010). The authors of Object Bank apply 177 object detectors to each image, where each object detector works independently of the others. We treat each object detector as an independent descriptor.
The Scene-15 dataset [18] contains 4485 images from
15 visual scene classes. The task is to classify images according to scene. Following [24], we extracted 14 different
visual features (GIST, HOG, TinyImages, LBP, SIFT, Line
Histograms, Self-Similarity, Textons, Color Histograms,
and variations). The features vary in cost from 0.3 seconds to 8 seconds, and in single-feature accuracy from
0.32 (TinyImages) to 0.82 (HOG). Separate multi-class linear SVMs were trained on each feature channel, using a random 100 positive example images per class for training. We used the liblinear implementation, and K-fold cross-validated the penalty parameter C. The trained SVMs were evaluated on the images not used for training, resulting in a dataset of 2238 vectors of 210 confidence values: 15 classes for each of the 14 feature channels. This dataset was split 60-40 into training and test sets for our experiments.
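A rough sketch of this per-channel setup, using scikit-learn's liblinear-backed LinearSVC as a stand-in for the liblinear implementation; the array layout, parameter grid, and function names are assumptions, not the original code.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def train_channel_svms(features_per_channel, labels, Cs=(0.01, 0.1, 1.0, 10.0)):
    """Train one multi-class linear SVM per feature channel, cross-validating the penalty C."""
    classifiers = []
    for X in features_per_channel:                  # X: (n_images, d_channel)
        search = GridSearchCV(LinearSVC(), {"C": list(Cs)}, cv=5)
        search.fit(X, labels)
        classifiers.append(search.best_estimator_)
    return classifiers

def confidence_vectors(classifiers, features_per_channel):
    """Stack per-channel decision values into one vector per image (classes x channels)."""
    scores = [clf.decision_function(X) for clf, X in zip(classifiers, features_per_channel)]
    return np.hstack(scores)
```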
Training algorithm

Input: D = {x_n, y_n}_{n=1}^N; budget L_B
Result: Trained π, g

π_0 ← random;
for i ← 1 to max iterations do
    States, Actions, Costs, Labels ← GatherSamples(D, π_{i−1});
    g_i ← UpdateClassifier(States, Labels);
    Rewards ← ComputeRewards(States, Costs, Labels, g_i, L_B, γ);
    π_i ← UpdatePolicy(States, Actions, Rewards);
end

Algorithm 1: Because reward computation depends on the classifier, and the distribution of states depends on the policy, g and π are trained iteratively.
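The same loop, sketched as a Python skeleton. The subroutines are passed in as arguments because Algorithm 1 only names them; any concrete signatures here are assumptions, and the fixed iteration count stands in for the validation-based stopping rule described later.

```python
def train_policy_and_classifier(data, budget, gamma,
                                gather_samples, update_classifier,
                                compute_rewards, update_policy,
                                initial_policy, max_iterations=10):
    """Alternate between fitting the classifier g and the feature-selection policy pi."""
    policy, classifier = initial_policy, None
    for i in range(1, max_iterations + 1):
        # Roll out episodes with the current policy to collect training traces.
        states, actions, costs, labels = gather_samples(data, policy)
        # Re-fit the classifier on the feature subsets actually observed in those states.
        classifier = update_classifier(states, labels)
        # Rewards depend on the classifier (information gain), the costs, and the budget.
        rewards = compute_rewards(states, costs, labels, classifier, budget, gamma)
        # Fit the policy (e.g. the linear Q regression) to the new rewards.
        policy = update_policy(states, actions, rewards)
    return policy, classifier
```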
Evaluation
ImageNet
‣ Actions are class evaluations.
‣ Progressively enhance specificity with increasing budget. Detailed results for this and other experiments are on the project page (see front page for the link).

4.3. ImageNet. The full ImageNet dataset has over 10K categories and over a million images [5]. The classes are organized in a hierarchical structure, which can be exploited for novel recognition tasks. We evaluate on a 65-class subset introduced …
We end up with a total of 184 different visual descriptors. We split the training data 30/70 and use the smaller subset to construct a kernel and train 15 one-vs-all SVMs for each descriptor. We use the predictions of these SVMs on the larger subset as the features of miser (totaling d = 184 × 15 = 2760 features). As loss function ℓ, we use the multi-class log-loss (Hastie et al., 2009) and maintain 15 tree-ensemble classifiers H_1, …, H_15, one for each class. During each iteration, we construct 15 regression trees (depth 3) and update all classifiers. For a given image, each classifier's (normalized exponential) output represents the probability of this data point belonging to one class.

• Our method is the Dynamic, non-myopic policy: it observes feature values and plans ahead.
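A tiny sketch of the normalized-exponential readout over the per-class ensemble scores H_1(x), …, H_15(x); this is a generic softmax illustration, not the Greedy Miser implementation.

```python
import numpy as np

def class_probabilities(scores):
    """Normalized exponential (softmax) over per-class ensemble scores."""
    z = scores - np.max(scores)     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(class_probabilities(np.array([1.0, 2.0, 0.5])))  # probabilities summing to 1
```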
Figure 4. Sample images of the Scene-15 classification task (suburban, industrial, kitchen, living room, coast, forest, highway, inside city, mountain, open country, street, tall building, office, store, bedroom).
[Plot residue: learned feature-selection trajectories on Scenes-15 (number in action sequence vs. feature channel — geo color, geo texton, geo map8x8, texton, ssim, sparse sift, gistPadding, line hists, denseSIFT, lbphf, lbp, tiny image, hog2x2, gist) for the static, greedy and dynamic, non-myopic policies.]

We compute the feature-extraction cost as the cpu-time required for the computation of the visual descriptor, the kernel construction, and the SVM evaluation. Each visual descriptor is used by 15 one-vs-all features. The moment any one of these features is used, we set the feature-extraction cost of all other features that are based on the same visual descriptor to only the SVM evaluation time (e.g. if the first HOG-based feature is used, the cost of all other HOG-based features is reduced to the time required to evaluate the SVM).

Figure 3 summarizes the results on the Scene-15 data set. As baseline we use stage-wise regression (Friedman, 2001) and an SVM with the averaged kernel of all descriptors. We also apply stage-wise regression with Early Exits. As this is multi-class classification instead of regression, we introduce an early exit every 10 trees (300 in total), and we remove test inputs whose maximum class likelihood is greater than a threshold s. We generate the curve of early exit by gradually increasing the value for s.
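A minimal sketch of the Early Exit thresholding rule just described; the array names and the example threshold are illustrative.

```python
import numpy as np

def early_exit_mask(class_probs, threshold):
    """True for test inputs that can stop early: max class likelihood exceeds the threshold."""
    return class_probs.max(axis=1) > threshold

# With s = 0.4, inputs whose most likely class already has probability > 0.4
# are removed from the remaining (more expensive) stages.
probs = np.array([[0.20, 0.50, 0.30],
                  [0.34, 0.33, 0.33]])
print(early_exit_mask(probs, threshold=0.4))   # [ True False]
```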
The last baseline is original vision features with ℓ1 regularization, and we notice that its accuracy never exceeds 0.74, and therefore we do not plot it. The miser curve is generated by varying the loss/feature-cost trade-off λ. For each setting we choose the iteration that has the best validation accuracy, and all results are obtained by averaging over 10 randomly generated training/testing splits.

Both the multiple-kernel SVM and stage-wise regression achieve high accuracy, but their need to extract all features significantly increases their cost. Early Exit has only limited improvement due to the inability to select a few expensive but important features in early iterations. As before, miser champions the cost/accuracy trade-off and its accuracy drops gently with increasing λ.

All experiments (on both data sets) were conducted on a desktop with dual 6-core Intel i7 cpus at 2.66 GHz. The training time for miser requires a comparable amount of time …
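A small sketch of the descriptor-sharing cost rule described earlier (once a visual descriptor has been computed, every other feature built on it costs only its SVM evaluation time); the per-feature record layout and function name are assumptions.

```python
def remaining_feature_costs(features, used_descriptors):
    """Cost of acquiring each feature, given which visual descriptors are already computed."""
    costs = {}
    for f in features:   # f: dict with 'name', 'descriptor', 'descriptor_time', 'svm_time'
        if f["descriptor"] in used_descriptors:
            costs[f["name"]] = f["svm_time"]
        else:
            costs[f["name"]] = f["descriptor_time"] + f["svm_time"]
    return costs

# Example: after the first HOG-based feature is extracted, the other HOG-based
# features drop to SVM-evaluation cost only.
features = [
    {"name": "hog_class_1", "descriptor": "hog", "descriptor_time": 5.0, "svm_time": 0.1},
    {"name": "hog_class_2", "descriptor": "hog", "descriptor_time": 5.0, "svm_time": 0.1},
]
print(remaining_feature_costs(features, used_descriptors={"hog"}))
```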
Figure 2. Illustration of the formulation with a simple hierarchy — (a) semantic hierarchy (entity → animal {dog, bird}, vehicle {car, boat}), (b) accuracy of prediction, (c) reward of prediction. The numbers correspond to the accuracy (middle) and the reward (information gain) of a prediction (right) given the ground truth. [Figure from Deng et al. CVPR 2012]
• Static, non-myopic: does not observe feature values, but plans ahead
(using the MDP with γ = 1).
• Dynamic, greedy: observes feature values, but selects actions greedily.
… expensive features (cost 150) are always extracted within
early iterations. This highlights a great advantage of miser
over some other cascade algorithms (Raykar et al., 2010),
which learn cascades with pre-assigned feature costs and
cannot extract good but expensive features until the very
end.
3.4. Learning the classifier.
Reward definition. For an action a = h_f (computing feature h_f) taken in state s:

r(s, h_f) = I_{H_s}(Y; h_f) (B_s − c_f) + 0.5 · c_f · I_{H_s}(Y; h_f)

where I_{H_s}(Y; h_f) is the information gain of the feature given the current state, c_f is its cost, and B_s is the remaining budget.

Effect of γ. Note that solving the MDP with these features and with γ = 0 finds a Static, greedy policy: the value of taking an action in a state is exactly the expected reward to be obtained. When γ = 1, the value of taking an action is the entire area above the curve as defined in Figure 1, and we learn the Static, non-myopic policy — another baseline.
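Read this way, the reward is a one-liner; the interpretation of B_s as the remaining budget (and of I_{H_s}(Y; h_f) as the feature's information gain under the current classifier) is the assumption spelled out above.

```python
def feature_reward(info_gain, cost, remaining_budget):
    """r(s, h_f) = I(Y; h_f) * (B_s - c_f) + 0.5 * c_f * I(Y; h_f)."""
    return info_gain * (remaining_budget - cost) + 0.5 * cost * info_gain
```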
Figure 3: Evaluation on the synthetic example (best viewed in color). The data and the feature costs are shown at top left; the sample feature trajectories of different policies at top right. (The opacity of the edges corresponds to their prevalence during policy execution; the opacity of the nodes corresponds to the amount of reward obtained in that state.) Note that the static, non-myopic policy correctly learns to select the cheap features first, but is not able to correctly branch, while our dynamic, non-myopic approach finds the optimal strategy. The plots in the bottom half give the error vs. cost numbers for the baselines and our method, trained given the correct minimal budget.
[Plot residue: Cost, Final Error, and Area under the Error vs. Cost curve axes; performance on the Scene-15 data set.]
Scenes-15
‣ Actions are feature computations.
[Plot residue: error vs. cost comparison of the Static greedy, Static non-myopic, Dynamic greedy, and Dynamic non-myopic policies against The Greedy Miser.]
• Static, greedy: does not observe feature values, selects actions greedily.
We evaluate these baseline policies:
Synthetic Example
‣ On a task where the optimal policy is known to be dynamic and non-myopic, we show that our method successfully matches it.

Model
‣ We model the problem of state traversal as a Markov Decision Process.
During training, we gather samples starting from either a random feasible state, with probability ε, or from the initial empty state otherwise. Both the ε and τ parameters decay exponentially with the number of training iterations. Training is terminated if π_{θ_{i+1}} returns the exact same sequence of episodes ξ on a validation set as π_{θ_i}.
Anytime Recognition of Objects and Scenes
Sergey Karayev · Mario Fritz · Trevor Darrell
UC Berkeley · MPI for Informatics
ICCV — http://sergeykarayev.com/recognition-on-a-budget/

Motivation
‣ Human perception is Anytime and progressive.

As commonly done, we learn the θ by policy iteration. First, we gather (s, a, r, s′) samples by running episodes (to completion) with the current policy parameters θ_i. From these samples, Q̂(s, a) values are computed, and θ_{i+1} is given by the L2-regularized least-squares solution to Q̂(s, a) = θᵀφ(s, a), on all states that we have seen in training.
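A compact sketch of one such policy-iteration update, fitting Q̂(s, a) = θᵀφ(s, a) by L2-regularized least squares (ridge regression); the episode rollout that produces the block-coded feature rows and Q-value targets is assumed to exist.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_q_weights(phi_sa, q_targets, l2=1.0):
    """Regress Monte-Carlo Q-value targets onto block-coded features phi(s, a)."""
    model = Ridge(alpha=l2, fit_intercept=False)
    model.fit(phi_sa, q_targets)     # phi_sa: (num_samples, dim), q_targets: (num_samples,)
    return model.coef_               # theta_{i+1}

def best_action(theta, phi_s, feasible_actions, num_actions):
    """Greedy action under the learned linear Q: argmax_a theta^T phi(s, a)."""
    d = len(phi_s)
    def q(a):
        block = np.zeros(num_actions * d)
        block[a * d:(a + 1) * d] = phi_s
        return float(theta @ block)
    return max(feasible_actions, key=q)
```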
[Figure residue: anytime ImageNet predictions over the class hierarchy (object → animal {dog, cat, bird}, vehicle {car, boat}) shown at increasing budgets (Cost: 1, 9, 17, 26), becoming more specific as the budget grows.]
…relevant in many visual tasks with high-level semantic labels, e.g., detection, scene understanding, describing images with sentences, etc. Our focus here, multiclass image classification, serves as a building block. To our knowledge this is the first time that optimizing the accuracy-specificity trade-off has been addressed in large-scale visual recognition. In this work, we make the following contributions: (1) introducing the problem of classification in a hierarchy subject to an accuracy bound while maximizing information gain (or other measures), (2) proposing the Dual Accuracy …