2014 22nd International Conference on Pattern Recognition
Learning with Hidden Information using
A Max-Margin Latent Variable Model
Ziheng Wang, Tian Gao, and Qiang Ji
Department of Electrical, Computer & Systems Engineering
Rensselaer Polytechnic Institute
Troy, NY, USA, 12180
Email: [email protected]
Abstract—Classifier learning is challenging when the training data is inadequate in either quantity or quality. Prior knowledge is hence important in such cases to improve classification performance. In this paper we study a specific type of prior knowledge called hidden information, which is available only during training and not during testing. Hidden information has abundant applications in many areas but has not been thoroughly studied. In this paper, we propose to exploit hidden information during training to help design an improved classifier. Towards this goal, we introduce a novel approach which automatically learns and transfers the useful hidden information through a latent variable model. Experiments on both digit recognition and gesture recognition tasks demonstrate the effectiveness of the proposed method in capturing hidden information for improved classification.
Fig. 1: Examples of different hidden information in different applications: facial action units for expression recognition, joint positions for gesture recognition, attributes for object recognition, and bounding boxes for action recognition.

I. INTRODUCTION
Classification is a fundamental problem in pattern recognition. Classifier learning has been predominantly data-driven and can be formulated as follows: given a set of n i.i.d. training pairs (x1, y1), · · · , (xn, yn) sampled from an unknown distribution P(x, y), where x ∈ X is the feature vector and y ∈ Y is the class label, learn a classifier f : X → Y that classifies unseen samples as accurately as possible.
However, data-driven learning can be challenging when the training data is limited in either quantity or quality. To address this issue, a growing body of work incorporates extra sources of information, in addition to the training data, to improve classification performance. For instance, contextual information (e.g., related objects near the target object) [1]–[3] and object attributes [4], [5] have been exploited to improve image-based object recognition. Depth videos have been combined with RGB videos to enhance the performance of gesture recognition [6].

Despite their success, these approaches require the extra information to be explicitly or implicitly obtained during both training and testing, which can be impractical in many applications for two reasons. First, acquiring useful information during testing can be expensive; for example, capturing depth videos is costly, especially in large-scale surveillance applications. Second, information such as object attributes must be indirectly derived from the measurements in the testing phase, which may be inaccurate. Errors can propagate and even hurt classification.

These issues motivate us to ask whether it is possible to incorporate information to which we only have access during training, and if so, whether it can still improve learning performance. In this paper we give positive answers by proposing a novel approach that integrates all the related information through a latent variable model. We call information that is available during training but not during testing hidden information. Hidden information exists in different formats; here we focus on the form that can be represented as additional features hi ∈ H for each training sample (xi, yi). In the remainder of this paper we also refer to x as the primary features and to h as the hidden features.

In fact, hidden information can be utilized in many important areas. Besides the primary training data, hidden information supplies additional sources of information to learn the classifier. Figure 1 shows some examples of hidden information in different applications. For example, human-labeled action units can be obtained along with the training images for facial expression recognition. Human joint positions can be collected for each training instance to help recognize gestures from RGB measurements. Attributes can also be used as hidden information for image-based object classification. Likewise, for action recognition we can use bounding boxes as hidden information.
1051-4651/14 $31.00 © 2014 IEEE
DOI 10.1109/ICPR.2014.248
Learning with hidden information also has a close analogy to human learning. Humans usually turn to teachers or books to learn more efficiently and effectively. Likewise, classifier learning can also benefit from hidden information. Learning with hidden information is similar to constructing a building with a scaffold: the hidden information facilitates the learning of the model during training and is then disassembled during testing.
Nevertheless, incorporating hidden information is challenging. It cannot simply be used as additional features combined with the primary features, since it is absent during testing. Hence we need a learning algorithm that can successfully transfer the useful hidden information into the target classifier y = f(x). In this paper we propose to incorporate hidden information through a latent variable, which serves as a pivot relating the primary feature x, the target label y, and the hidden information h simultaneously. By connecting x and y through a latent variable z, the model can better interpret the complex intermediate structure of the feature space and thereby better capture the relationship between the input feature and the output class label. Such a latent variable model has been demonstrated to yield superior classification performance, especially when large intra-class variations exist [7]. Moreover, by connecting h to z, the learnt latent space changes according to the hidden information. The useful prior knowledge within the hidden information is implicitly captured and transferred to the target classifier through the latent variable, and the model is learned in a max-margin framework for optimal discrimination performance. Our basic assumption is that learning jointly with the hidden information will positively influence our estimate of the relationship between x and y, and therefore result in a better classifier than one learned purely from the training data.

The remainder of this paper is organized as follows. A brief review of related work is provided in Section II. The proposed algorithm is introduced in detail in Section III. Experimental results are presented in Section IV. Finally, we conclude the paper in Section V.

II. RELATED WORK

Learning with hidden information was originally proposed by Vapnik et al. [8], [9], where hidden information is also called privileged information. Since then, various approaches have been developed to capture hidden information for different applications. Vapnik et al. [8] proposed the SVM+ algorithm, where the slack variable ξi for each training instance is modeled as a function of the privileged information hi. The basic idea is that privileged information indicates which samples are easy to classify and which are hard. Corresponding theory has also proved that SVM+ improves the learning rate of SVM from O(1/√n) to O(1/n) when the privileged information is the ground-truth value of the slack variables [8], [10]. Efficient implementations of SVM+ have been proposed in [11], [12]. Niu and Wu [13] further studied using L1 regularization in SVM+ to capture hidden information. Instead of relating hidden information to the slack variables, Sharmanska et al. [14] proposed to transfer the score ranking generated from the hidden information to the primary data modality. Empirical evaluations on multiple datasets demonstrate that rank transfer achieves comparable results to SVM+ but is easier to implement. Besides, Lapin et al. [15] proved that SVM+ can be reformulated as a special case of instance-weighted SVM, and Liang et al. [16] studied the connection between SVM+ and multi-task learning.

Hidden information has also been used to improve the learning of classifiers other than SVM. Chen et al. [17] proposed to incorporate hidden information into the AdaBoost classifier, where hidden information is used as additional targets to construct weak classifiers. Yang and Patras [18] use privileged information to help select split functions when constructing a conditional regression forest. Hidden information in these two approaches serves as auxiliary targets that provide richer information about the class label.

However, all of these approaches relate the hidden information to the primary data or to the classifier parameters through direct regression, which is too strong an assumption for many applications. Instead, our proposed method assumes a latent relationship between the primary data and the hidden information, which is automatically learned during training through a max-margin approach. In addition, the existing approaches are limited to binary classification problems, while the proposed method applies to multi-class classification.

Besides classifier learning, hidden information has also been used in other learning problems. For instance, [19] used hidden information as auxiliary targets to aid feature selection, based on the assumption that features that are commonly effective for both the target and the hidden information will be better for classification. Feyereisl and Aickelin [20] used hidden information for clustering. Fouad et al. [21] incorporated hidden information into metric learning.

III. PROPOSED ALGORITHM

Instead of associating hidden information and the primary data through direct regression, the proposed model relates them with a latent variable and implicitly captures and transfers the information from h to learn the target classifier f : X → Y. In this section, we first present a mathematical definition of learning with hidden information, and then introduce the proposed approach in detail.
A. Problem Definition

Learning with hidden information represents a paradigm shift from the traditional classification learning problem. Mathematically, it is stated as follows: given a set of training data

(x1, y1), · · · , (xn, yn),  x ∈ X, y ∈ Y,

where x is the input feature and y is the output class label, as well as hidden information that can be represented as additional features for each training instance,

h1, · · · , hn,  h ∈ H,

the goal of learning with hidden information is to learn a classifier f : X → Y that classifies unseen samples better than a classifier learned only from the training samples.
We would like to emphasize that the learned classifier f : X → Y uses only the primary feature x as input, since the hidden information h is not present during testing. Therefore, hidden information h cannot simply be treated as additional features combined with the primary feature x. However, it indirectly and implicitly influences the choice of the classifier (or its parameters) during the training phase.
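This asymmetric interface — fit on (x, h, y), predict from x alone — can be illustrated with a minimal sketch. The class below is our own illustration, not the paper's method: `fit` accepts the hidden features but the toy learning rule (per-class centroids of the primary features) ignores them, standing in for a real algorithm that would use H to shape the model, as Section III does.

```python
import numpy as np

class HiddenInfoClassifier:
    """Interface sketch for learning with hidden information:
    `fit` receives the hidden features H, `predict` does not."""

    def fit(self, X, H, y):
        # Toy stand-in for the real learning algorithm: per-class
        # centroids of the primary features. A real method would use
        # H to shape the model (e.g., the latent space in Section III).
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0)
                                    for c in self.classes_])
        return self

    def predict(self, X):
        # Only the primary features x are available at test time.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None], axis=2)
        return self.classes_[d.argmin(axis=1)]

clf = HiddenInfoClassifier().fit(
    np.array([[0., 0], [0, 1], [5, 5], [5, 6]]),  # primary features X
    np.zeros((4, 3)),                             # hidden features H (train only)
    np.array([0, 0, 1, 1]))                       # labels y
pred = clf.predict(np.array([[0., 0.5], [5, 5.5]]))
```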
B. Capturing Hidden Information with a Latent Variable

The key to learning with hidden information is how to properly extract and transfer useful pieces of hidden information to the original data modality. In this paper we achieve this goal with a latent variable model. Figure 2 shows a graphical depiction of the proposed model, where x ∈ X stands for the primary feature, h ∈ H represents the hidden information, y ∈ Y is the class label, and z is a latent node connected to the other three variables x, h, and y. In this paper we consider a discrete latent node z ∈ {1, · · · , K}; however, the model can be extended to a discrete vector or a continuous node. Besides, we study the case where the class label y can take C values {1, · · · , C}.
The total potential of the model Ψ(x, h, z, y) is defined in Equation 1, where {b_z^k} and {d_y^c} are the biases for each state of the latent variable z and each class label y. The first set of parameters {w_xz^k} measures the compatibility between the primary feature x and each latent state. Similarly, the second set of parameters {w_hz^k} measures the compatibility between the hidden information h and each state of the latent variable. The third set of parameters {w_yz^{k,c}} models the compatibility between each state of the latent node and each class. 1(·) is the indicator function.

Ψ(x, h, z, y) = Σ_{k=1}^{K} w_xz^k · x · 1(z = k)
             + Σ_{k=1}^{K} w_hz^k · h · 1(z = k)
             + Σ_{k=1}^{K} Σ_{c=1}^{C} w_yz^{k,c} · 1(z = k, y = c)
             + Σ_{k=1}^{K} b_z^k · 1(z = k) + Σ_{c=1}^{C} d_y^c · 1(y = c)    (1)

Concatenating the parameters together, the total potential can be written as a linear model

Ψ(x, h, z, y) = w · Φ(x, h, z, y)    (2)

where w is a vector consisting of all the model parameters {w_xz^k}, {w_hz^k}, {w_yz^{k,c}}, {b_z^k}, {d_y^c}, and Φ(x, h, z, y) is the joint feature vector constructed by arranging all the features in the order of the corresponding parameters in w.
Fig. 2: The proposed model to incorporate hidden information, where y is the class label, x is the primary feature, h is the hidden information, and z is a latent node connecting all the other three nodes.

We can see that the latent variable z acts as a pivot connecting all the related components (x, h, y) during training. Below we analyze the proposed model in detail.

Part 1 – Connecting X and Y through Z (X-Z-Y): First, the input feature x and the output class variable y are not directly related as in SVM or other popular classifiers; they are connected through a latent variable z. From a bottom-up point of view, the raw input features x are decomposed into a set of latent states before being used to discriminate the class label y. This allows us to capture a more complex intermediate structure of the feature space and better characterize the relationship between the input and the output. By transforming the feature space into a discrete latent space, the model can also more effectively deal with the large intra-class variations that are prevalent in many classification applications.
Part 2 – Connecting X and H through Z (X-Z-H): Second, the latent variable also serves as a bridge relating the hidden information h to the primary features x; in this way the latent space also captures relationships between the primary data modality and the hidden information. Compared to existing works that assume a direct regression between hidden information and primary data, our model makes no explicit assumptions: the relationship between x and h is entirely latent and is automatically discovered through learning. Moreover, by attaching the hidden information h to the latent variable, the estimate of the relation between x and y (i.e., the parameters {w_yz^{k,c}} and {w_xz^k}) also changes accordingly during training, and hence the hidden information is implicitly transferred for classifying y from x. We assume that hidden information brings positive influence by providing auxiliary and useful information to the learning problem.

Part 3 – Connecting H and Y through Z (H-Z-Y): Finally, the latent variable also relates the hidden information h to the target class label y. This ensures that the information extracted and transferred from the hidden information favors the discrimination of class y.
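Because both z and y are discrete (K latent states, C classes), inference over the pivot reduces to enumerating K × C combinations. A minimal sketch of this enumeration, our own illustration assuming some scoring function `score(z, y)` that evaluates the potential Ψ for a fixed sample:

```python
import itertools

def map_latent_and_label(score, K, C):
    """Enumerate all (z, y) pairs and return the highest-scoring one.

    `score(z, y)` should return the potential Psi(x, h, z, y) for the
    sample under consideration; with K latent states and C classes
    there are only K * C combinations to check.
    """
    return max(itertools.product(range(K), range(C)),
               key=lambda zy: score(*zy))

# toy score peaking at z = 1, y = 0
best = map_latent_and_label(lambda z, y: -(z - 1) ** 2 - y, K=3, C=2)
```

The same exhaustive maximization underlies both the loss-augmented inference during training and the prediction rule at test time.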
C. Max-Margin Model Learning and Inference

The model is learned in a max-margin Latent Structural SVM framework [22], which has demonstrated superior classification performance in many applications. Compared to other probabilistic learning approaches for undirected graphical models, the max-margin approach is also more efficient, since it does not need to deal with the partition function.

To simultaneously learn the relationships among x, h, and y (i.e., the parameters {w_xz^k}, {w_hz^k}, {w_yz^{k,c}}, {b_z^k}, {d_y^c}), during training we maximize the margin for classifying y based on x and h through the latent variable z, using the training data {(xi, yi)}_{i=1}^n and the hidden information {hi}_{i=1}^n. This is equivalent to minimizing the objective function shown in Equation 3.

min_w  1/2 ||w||² + C Σ_{i=1}^{n} max_{ŷi,ẑi} [w · Φ(xi, hi, ŷi, ẑi) + Δ(yi, ŷi)]
                  − C Σ_{i=1}^{n} max_{zi} w · Φ(xi, hi, yi, zi)    (3)

The concave-convex procedure (CCCP) [22] is employed to solve this optimization problem, by iteratively finding the maximum a posteriori (MAP) estimate of the latent variables and solving a standard structural SVM problem with the latent variables completely observed. In addition, a subgradient descent algorithm is adopted to solve the structural SVM optimization subroutine.

However, such a model cannot be directly used for testing, since it requires the hidden information h as input. The hidden information node h has to be properly detached from the latent variable. To address this issue we propose the following method. First, we infer the latent state for each training instance with Equation 4.

z*_i = arg max_{zi} w · Φ(xi, hi, yi, zi)    (4)

Then we use the memorized latent states {z*_i} as well as the training data {(xi, yi)}_{i=1}^n to relearn a baseline latent variable model without the node h (see Figure 3), and use that model for testing. The latent variable is still unknown during testing but is now observed during training, so learning can be formulated as a standard structural SVM with the objective function shown in Equation 5.

min_w  1/2 ||w||² + C2 Σ_{i=1}^{n} max_{ŷi,ẑi} [w · Φ(xi, ŷi, ẑi) + Δ(yi, ŷi, ẑi, z*_i)]
                  − C2 Σ_{i=1}^{n} w · Φ(xi, yi, z*_i)    (5)

Fig. 3: A baseline latent variable model without the hidden information node h. It is used during testing to classify unseen samples based only on the primary feature x.

In this way, the hidden information is indirectly transferred for learning the baseline model through the memorized latent states of the training instances.

Compared to a baseline latent variable model learned purely from the training data, the parameter values obtained through the proposed algorithm are different, since the latent space is learned under the guidance of the hidden information. In this sense the latent variable can be seen as "partially latent", since it incorporates information from external expertise.

During testing, the label of each sample is predicted with Equation 6.

f_y(x) = max_z w · Φ(x, z, y),    y* = arg max_y f_y(x)    (6)

IV. EXPERIMENT

In this section we demonstrate the effectiveness of the proposed algorithm on two applications, namely handwritten digit recognition and human gesture recognition. Holistic descriptions of the digit images and human joint positions extracted from the depth videos are used as the hidden information, respectively.

To be consistent with the related works, we name the baseline latent variable model (see Figure 3) learned purely from the training data LSSVM and the proposed algorithm LSSVM+, and compare both of them with the support vector machine (SVM) and the SVM+ model. While all the models can be extended to use more sophisticated kernels, here we use linear kernels to compare their performances. For the two latent variable models, the K-means algorithm is used to cluster the data to initialize the hidden states. The corresponding experiment is performed 20 times for these two models and we report their average results.

A. Digit Recognition

In the first experiment we test the performance of the proposed algorithm for digit recognition, following the same evaluation protocol used in [8]. The goal of this experiment is to classify images of digit 5 and digit 8. The training data contains 50 images of digit 5 and 50 images of digit 8, and the testing data consists of 1866 images selected from the MNIST handwritten digit dataset [23]. Note that the original experiment in [8] was based on an idealistic setting where a huge validation dataset (about 4000 digit samples) is available, so that the learned model is almost optimal through model selection. In practice, however, one may never be able to obtain a validation dataset larger than the testing data. Hence in our experiments we tune the parameters for all models using five-fold cross-validation within the available training data.

Images are resized to 10 × 10 pixels and the pixel values are used as the primary feature x. Figure 4 illustrates some digit samples from the training data.

Fig. 4: Example of 10 × 10 digit images in the training data. The top row corresponds to digit 5 and the bottom row corresponds to digit 8.

The hidden information we use in this case is the holistic description made by a domain expert for each digit picture in the training set. The descriptions cover a total of 21 properties of each digit, such as the thickness of the stroke and the degree of tilt. Each property is measured with an integer from 0 to m. For example, a subset of these properties is: tilting to the right (0–3); thickness of the line (0–4); stability (0–3); uniformity (0–3). Figure 5 shows two digits and their corresponding quantized holistic descriptions of four properties. The description for each digit is translated into a 21-dimensional feature vector and used as hidden information.
Fig. 5: Two digit images and the corresponding values of four properties. Left – digit 5 (Tilt 2, Thickness 3, Stability 3, Uniformity 2). Right – digit 8 (Tilt 3, Thickness 0, Stability 0, Uniformity 3).
Figure 6 shows the performance of all the models as we gradually increase the amount of training data. For every training data size less than 100, 12 different random samples are selected and the average results are reported, as in [8]. Both latent variable models, LSSVM and LSSVM+, use 5 hidden states.

Fig. 6: Average classification accuracy of each model with respect to the number of training data.

From the results we can see that by incorporating the expert descriptions of the digit images, both the SVM+ and LSSVM+ models outperform their counterparts, which demonstrates the usefulness of hidden information and the effectiveness of the proposed methods. Moreover, the improvement achieved by LSSVM+ is significantly larger than the improvement achieved by SVM+. Regardless of the number of training samples, the proposed method LSSVM+ always achieves the best performance compared to the other models, with or without hidden information. Another observation is that as the number of training samples increases, the improvement from incorporating hidden information gradually decreases. This makes sense because more data increases classification performance and therefore leaves less room for improvement. It also suggests that hidden information is more important when the amount of training data is smaller.

B. Gesture Recognition

The second experiment is to classify 10 unique gestures in the devel01 gesture dataset [24]. In the dataset, each gesture has an RGB video and a corresponding depth video for both training and testing. For our purpose, RGB videos are used as the primary measurement, while the depth videos are excluded during testing and used as the source of hidden information. Figure 7 shows a sample RGB image frame and its corresponding depth image.

The dataset is designed for one-shot learning; in other words, each gesture has only one RGB video during training. Therefore training a video-based SVM or LSSVM classifier is impossible. In this experiment we instead perform frame-based classification: the classifier is learned to predict the gesture label for each frame, and a video sequence is assigned to the gesture that receives the most votes from its frames.

HOG and HOF features were extracted from each RGB frame as the primary input feature x. To obtain these features, the gradient and the optical flow were first computed for each pixel and quantized into one of 8 directions. The image was then sequentially decomposed into 1, 4, and 9 blocks, and the histograms of gradients (HOG) and optical flows (HOF) were calculated for each block. Concatenating all the histograms finally results in a 112-dimensional feature vector x for each frame.

Fig. 7: (a) An RGB image frame from a gesture video; (b) the corresponding depth image, 3D joint positions, and the hidden information h.

The depth video can provide a lot of useful information, and in this experiment we chose to use the manually labeled human joint positions as hidden information. As shown in Figure 7b, the centers of the head and the two hands, as well as the two shoulder joints and two elbow joints, are manually labeled in the provided depth videos for all the training data. Their 3D positions are concatenated into a 21-dimensional feature vector as the hidden information h. A total of 80 RGB videos are used during testing.
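The frame-voting scheme described above can be sketched as follows. This is our own minimal illustration; `classify_frame` stands in for the learned frame-level classifier, and the toy threshold rule is not part of the paper.

```python
from collections import Counter

def classify_video(frames, classify_frame):
    """Assign a video the gesture label that receives the most
    votes from its per-frame predictions (majority vote)."""
    votes = Counter(classify_frame(f) for f in frames)
    return votes.most_common(1)[0][0]

# toy usage with a stand-in frame classifier: 3 of 5 frames vote "wave"
label = classify_video([0.1, 0.2, 0.9, 0.8, 0.85],
                       lambda f: "wave" if f > 0.5 else "point")
```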
The latent variable models are evaluated with respect to the number of latent states, as illustrated in Figure 8, where the x axis represents the number of latent states and the y axis represents the average accuracy rate. As the number of latent states increases, the accuracy rates of both LSSVM and LSSVM+ increase, reaching their peaks at approximately 40 latent states. By incorporating joint positions as hidden information, LSSVM+ always outperforms LSSVM, regardless of the number of latent states. In particular, LSSVM+ improves over LSSVM by about 5% on average. The two horizontal lines in Figure 8 show the accuracies of SVM and SVM+, which are much lower than the best performance of the proposed method LSSVM+. Moreover, the improvement of
LSSVM+ is much greater than that of SVM+ over their respective counterparts.

Fig. 8: Average classification accuracy for each algorithm with respect to the number of hidden states.

TABLE I: Classification Accuracy for Each Gesture

Algorithm | G1  | G2    | G3    | G4    | G5    | G6    | G7    | G8    | G9  | G10   | Average
SVM       | 100 | 100   | 85.71 | 42.86 | 100   | 100   | 75.00 | 12.50 | 100 | 37.50 | 77.52
SVM+      | 100 | 100   | 71.43 | 42.86 | 100   | 100   | 66.67 | 50.00 | 100 | 37.50 | 78.65
LSSVM     | 100 | 95.83 | 32.14 | 73.21 | 95.83 | 98.75 | 94.79 | 53.13 | 100 | 65.63 | 83.57
LSSVM+    | 100 | 97.22 | 55.36 | 75.00 | 100   | 100   | 95.83 | 56.25 | 100 | 85.94 | 88.48

Detailed classification accuracy rates for each gesture are provided in Table I to compare performance on individual gestures, where both latent models use 40 hidden states. The recognition accuracies for 9 out of 10 gestures are increased by the proposed algorithm to varying degrees. In particular, significant improvements are observed for gestures G4 and G10.

V. CONCLUSION

In this paper, we studied a novel problem of learning with hidden information, where additional information about the training samples is available during training but absent during testing. We further proposed a novel approach to incorporate hidden information through a max-margin latent variable model. Experiments on both digit and gesture recognition tasks demonstrated the feasibility and effectiveness of our approach in capturing hidden information. The proposed method can be readily extended to more sophisticated models that involve a richer set of latent components.

ACKNOWLEDGEMENT

The work described in this paper is supported in part by grant IIS 1145152 from the National Science Foundation.

REFERENCES

[1] T. Antonio, "Contextual priming for object detection," Int. J. Comput. Vision, vol. 53, pp. 169–191, 2003.
[2] M. Marszalek, I. Laptev, and C. Schmid, "Actions in context," in Computer Vision and Pattern Recognition, IEEE Conference on, 2009, pp. 2929–2936.
[3] X. Wang and Q. Ji, "Incorporating contextual knowledge to dynamic bayesian networks for event recognition," in International Conference on Pattern Recognition, 2012.
[4] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, "Describing objects by their attributes," in Computer Vision and Pattern Recognition, IEEE Conference on, June 2009, pp. 1778–1785.
[5] Y. Wang and G. Mori, "A discriminative latent model of object classes and attributes," in ECCV, 2010, pp. 155–168.
[6] J. Wang, Z. Liu, Y. Wu, and J. Yuan, "Mining actionlet ensemble for action recognition with depth cameras," in Computer Vision and Pattern Recognition, IEEE Conference on, 2012.
[7] A. Quattoni, S. Wang, L. P. Morency, M. Collins, and T. Darrell, "Hidden-state conditional random fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
[8] V. Vapnik and A. Vashist, "A new learning paradigm: Learning using privileged information," Neural Networks, vol. 22, pp. 544–557, July 2009.
[9] V. Vapnik, A. Vashist, and N. Pavlovitch, "Learning using hidden information (learning with teacher)," in Neural Networks, 2009. IJCNN 2009. International Joint Conference on. IEEE, 2009, pp. 3188–3195.
[10] D. Pechyony and V. Vapnik, "On the theory of learning with privileged information," in Advances in Neural Information Processing Systems 23, 2010.
[11] D. Pechyony, R. Izmailov, A. Vashist, and V. Vapnik, "SMO-style algorithms for learning using privileged information," in DMIN'10, 2010, pp. 235–241.
[12] D. Pechyony and V. Vapnik, "Fast optimization algorithms for solving SVM+," in Statistical Learning and Data Science. Chapman & Hall, 2011.
[13] L. Niu and J. Wu, "Nonlinear L1 support vector machines for learning using privileged information," in Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on. IEEE, 2012, pp. 495–499.
[14] V. Sharmanska, N. Quadrianto, and C. H. Lampert, "Learning to rank using privileged information," in ICCV, 2013.
[15] M. Lapin, M. Hein, and B. Schiele, "Learning using privileged information: SVM+ and weighted SVM," arXiv preprint arXiv:1306.3161, 2013.
[16] L. Liang and V. Cherkassky, "Connection between SVM+ and multi-task learning," in Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on. IEEE, 2008, pp. 2048–2054.
[17] J. Chen, X. Liu, and S. Lyu, "Boosting with side information," in Proceedings of the 11th Asian Conference on Computer Vision, Nov 2012.
[18] H. Yang and I. Patras, "Privileged information-based conditional regression forest for facial feature detection," in Automatic Face and Gesture Recognition, 10th IEEE International Conference and Workshops on, 2013.
[19] H. Wang, F. Nie, H. Huang, S. Risacher, A. J. Saykin, and L. Shen, "Identifying AD-sensitive and cognition-relevant imaging biomarkers via joint classification and regression," in MICCAI'11, 2011, pp. 115–123.
[20] J. Feyereisl and U. Aickelin, "Privileged information for data clustering," Information Sciences, vol. 194, pp. 4–23, 2012.
[21] S. Fouad, P. Tino, S. Raychaudhury, and P. Schneider, "Learning using privileged information in prototype based models," in Artificial Neural Networks and Machine Learning – ICANN 2012. Springer, 2012, pp. 322–329.
[22] C.-N. J. Yu and T. Joachims, "Learning structural SVMs with latent variables," in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 1169–1176.
[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, 1998.
[24] ChaLearn, "ChaLearn gesture dataset (CGD2011)," California, 2011.