
Is Attribute-Based Zero-Shot Learning
an Ill-Posed Strategy?
Ibrahim Alabdulmohsin¹, Moustapha Cisse², and Xiangliang Zhang¹
¹ Computer, Electrical and Mathematical Sciences and Engineering Division,
King Abdullah University of Science and Technology (KAUST),
Thuwal 23955-6900, Saudi Arabia
{ibrahim.alabdulmohsin,xiangliang.zhang}@kaust.edu.sa
² Facebook Artificial Intelligence Research (FAIR), Menlo Park, USA
[email protected]
http://mine.kaust.edu.sa
Abstract. One transfer learning approach that has gained wide popularity lately is attribute-based zero-shot learning. Its goal is to learn novel classes that were never seen during the training stage. The classical route towards realizing this goal is to incorporate prior knowledge, in the form of a semantic embedding of classes, and to learn to predict classes indirectly via their semantic attributes. Despite the amount
of research devoted to this subject lately, no known algorithm has yet
reported a predictive accuracy that could exceed the accuracy of supervised learning with very few training examples. For instance, the direct
attribute prediction (DAP) algorithm, which forms a standard baseline
for the task, is known to be as accurate as supervised learning when as
few as two examples from each hidden class are used for training on some
popular benchmark datasets! In this paper, we argue that this lack of
significant results in the literature is not a coincidence; attribute-based
zero-shot learning is fundamentally an ill-posed strategy. The key insight
is the observation that the mechanical task of predicting an attribute
is, in fact, quite different from the epistemological task of learning the
“correct meaning” of the attribute itself. This renders attribute-based
zero-shot learning fundamentally ill-posed. In more precise mathematical terms, attribute-based zero-shot learning is equivalent to the mirage
goal of learning with respect to one distribution of instances, with the
hope of being able to predict with respect to any arbitrary distribution.
We demonstrate this overlooked fact on some synthetic and real datasets.
The data and software related to this paper are available at https://mine.
kaust.edu.sa/Pages/zero-shot-learning.aspx.
Keywords: Zero-shot learning · Attribute-based classification · Multi-label classification

1 Introduction
Humans are capable of learning new concepts using a few empirical observations. This remarkable ability is arguably accomplished via transfer learning
© Springer International Publishing AG 2016
P. Frasconi et al. (Eds.): ECML PKDD 2016, Part I, LNAI 9851, pp. 749–760, 2016.
DOI: 10.1007/978-3-319-46128-1_47
techniques, such as the bootstrapping learning strategy, where agents learn simple tasks first before tackling more complex activities [22]. For instance, humans
begin to cruise and crawl before they learn how to walk. Learning to cruise and
crawl allows infants to improve their locomotion skills, body balance, control
of limbs, and the perception of depth, all of which are crucial pre-requisites for
learning the more complex activity of walking [11,16].
In many machine learning applications, a similar transfer learning strategy is desired when labeled examples that faithfully represent the entire target set Y are difficult to obtain. This is often the case, for example, in image classification and in neural image decoding [14,17]. The transfer learning strategy typically employed in this setting is called few-shot, one-shot, or zero-shot learning, depending on how many labeled examples are available during the training
stage [10,14]. Here, a desired target set Y (classes) is learned indirectly by learning semantic attributes instead. These attributes are, then, used to predict the
classes in Y.
The motivation behind the attribute-based learning approach with scarce
data is close in spirit to the rationale of the bootstrapping learning strategy.
In brief terms, it helps to learn simple tasks first before attempting to learn
more complex activities. In the context of classification, semantic attributes are
(chosen to be) abundant, where a single attribute spans multiple classes. Hence,
labeled examples for the semantic attributes are more plentiful, which makes
the task of predicting attributes relatively easy. Moreover, the target set Y is
embedded in the space of semantic attributes, a.k.a. the semantic space, which
makes it possible, perhaps, to predict classes that were rarely seen, if ever, during
the training stage.
In this paper, we focus on the attribute-based zero-shot learning setting,
where a finite number of semantic attributes is used to predict novel classes that
were never seen during the training stage. More formally [14]:
Definition 1 (Attribute-Based Zero-Shot Setting). In the attribute-based zero-shot setting, we have an instance space X, a semantic space A, and a target set Y, where |A| < ∞ and |Y| < ∞. A sample S comprises m examples {(Xi, Yi, Ai)}i=1,...,m, with Xi ∈ X, Yi ∈ Y, and Ai ∈ A. Moreover, Y is partitioned into two non-empty subsets: the set of visible classes YV = ∪_{(Xi,Yi,Ai)∈S} {Yi} and the set of hidden classes YH = Y\YV. The goal is to use S to learn a hypothesis f : X → YH that can correctly predict the hidden classes YH.
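As a minimal sketch, the visible/hidden partition in Definition 1 can be computed directly from a sample; the sample triples and class names below are illustrative toys, not drawn from any benchmark:

```python
# Sketch of the visible/hidden class partition from Definition 1.
# The sample triples and class names below are illustrative only.

def split_visible_hidden(sample, all_classes):
    """Return (YV, YH): classes appearing in the sample S, and the rest."""
    visible = {y for (_x, y, _a) in sample}
    hidden = set(all_classes) - visible
    return visible, hidden

# Toy sample S of (instance, class, attribute-vector) triples.
S = [
    ("img1", "square", (1, 1, 0)),
    ("img2", "equilateral triangle", (0, 1, 1)),
]
Y = {"square", "equilateral triangle", "non-square rectangle"}

Y_V, Y_H = split_visible_hidden(S, Y)
```

Note that YH is defined only relative to the sample, so a different sample induces a different partition.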
The key part of Definition 1 is the final goal. Unlike the traditional setting
of learning, we no longer assume that the sample size m is large enough for
all classes in Y to be seen during the training stage. In general, we allow Y
to be partitioned into two non-empty subsets YV and YH , which, respectively,
correspond to the visible and the hidden classes in the given sample S. The
classical argument for why the goal of learning to predict the hidden classes
is possible in this setting is that the hidden classes YH are coupled with the
instances X and the visible classes YV via the semantic space A [14].
Fig. 1. In the polygon shape recognition problem, the instances are images of polygons
and we have five disjoint classes: equilateral triangles, non-equilateral triangles, squares,
non-square rectangles, and non-rectangular parallelograms.
To illustrate the traditional argument for attribute-based zero-shot learning,
let us consider the simple polygon shape recognition problem shown in Fig. 1.
In this problem, the instance space X is the set of images of polygons, i.e. two-dimensional shapes bounded by a closed chain of a finite number of line segments,
while the target set Y is the set of the five disjoint classes shown in Fig. 1.
In the traditional setting of learning, a large sample of instances S would be
collected and a classifier would be trained on the sample (e.g. using one-vs-all
or one-vs-one). One of the fundamental assumptions in the traditional setting of
learning for guaranteeing generalization is the stationarity assumption; examples
in the sample S are assumed to be drawn i.i.d. from the same distribution as
the future examples. Along with a few additional assumptions, learning in the
traditional setting can be rigorously shown to be feasible [1,5,19,23].
In the zero-shot learning setting, by contrast, it is assumed that the target set
Y is partitioned into two non-empty subsets Y = YV ∪ YH . During the training
stage, only instances from the visible classes YV are seen. The goal is to be
able to predict the hidden classes correctly. This goal is arguably achieved by
introducing a coupling between YV and YH via the semantic space. For example,
we recognize that the five classes in Fig. 1 can be completely determined by the
values of the following three binary attributes:
– a1 : Does the polygon contain 4 sides?
– a2 : Are all sides in the polygon of equal length?
– a3 : Does the polygon contain, at least, one acute angle?
The set of all possible answers to these three binary questions forms a semantic space A for the polygon shape recognition problem. Given an instance X with
the semantic embedding A = (a1, a2, a3) ∈ {0, 1}^3, its class can be uniquely
determined. For example, any equilateral triangle has the semantic embedding
(0, 1, 1), which means that the latter polygon (1) does not contain four sides, (2)
its sides are all of equal length, and (3) it contains some acute angles. Among
the five classes in Y, only the class of equilateral triangles satisfies this semantic embedding. Similarly, the four remaining classes have unique semantic embeddings as well.
Because the classes can be inferred from the values of the three binary
attributes mentioned above, it is often argued that hidden classes can be predicted by, first, learning to predict the values of the semantic attributes based
on a sample (training set) S, and, second, by using those predicted attributes
to predict the hidden classes in YH via some hand-crafted mappings [7,9,13–15,17,20,21]. In our example in Fig. 1, suppose the class of non-square rectangles
is never seen during the training stage. If we know that a polygon has the semantic embedding (1, 0, 0), which means that it has four sides, its sides are not all of
equal length, and it does not contain any acute angles, then it seems reasonable
to conclude that it is a non-square rectangle even if we have not seen any non-square rectangles in the sample S. Does this imply that zero-shot learning is a
well-posed approach? We will show that the answer is, in fact, negative. The key
ingredient in our argument is the fact that the mechanical task of predicting an
attribute is quite different from the epistemological task of learning the correct
meaning of the attribute.
The rest of the paper is structured as follows. We first explain why the
two tasks of “predicting” an attribute and “learning” an attribute are quite
different from each other. We will illustrate this overlooked fact on the simple
shape recognition problem of Fig. 1 and demonstrate it in a greater depth on
some synthetic and real datasets afterward. Next, we use such a distinction
between “predicting” and “learning” to argue that the attribute-based zero-shot
learning approach is fundamentally ill-posed, which, we believe, explains why
the previous zero-shot learning algorithms proposed in the literature have not
performed significantly better than supervised learning with very few training
examples.
2 Why Learning and Predicting Are Two Different Tasks

2.1 The Polygon Shape Recognition Problem
Let us return to the original polygon shape recognition example of Fig. 1. Suppose that the two classes of non-square rectangles and non-rectangular parallelograms are hidden from the sample S. That is:
YV = {equilateral triangles, non-equilateral triangles, squares}
YH = {non-square rectangles, non-rectangular parallelograms}
In the attribute-based zero-shot learning setting, we learn to predict the three
semantic attributes (a1 , a2 , a3 ) mentioned earlier based on the sample S that only
contains examples from the visible three classes. Once we learn to predict them
correctly based on the sample S, we are supposed to be able to recognize the two
hidden classes via their semantic embeddings. The semantic embedding for non-square rectangles is, in this example, (1, 0, 0), while the semantic embedding for non-rectangular parallelograms is the set {(1, 0, 1), (1, 1, 1)}.
To see why this is, in fact, an incorrect approach, we note that the task of
predicting an attribute aims, by definition, at utilizing all the relevant information in the sample S that aids the prediction task. In our example, since only the
three visible classes YV are seen in the sample S, a good predictor should infer
from S the following logical assertions:
1. If a polygon does not contain four sides, then it contains at least one acute angle.
Formally:
(a1 = 0) → (a3 = 1)
From this, the contrapositive assertion (a3 = 0) → (a1 = 1) is deduced as
well.
2. If the sides of a polygon are not of equal length, then it does not contain four
sides. Formally:
(a2 = 0) → (a1 = 0)
Again, its contrapositive assertion also holds.
3. If the polygon does not contain an acute angle, then all of its sides are of
equal length. Formally:
(a3 = 0) → (a2 = 1)
These logical assertions and others are likely to be used by a good predictor,
at least implicitly, since they are always true in the sample S. In addition, such
a predictor would have a good generalization ability if the instances continued
to be drawn i.i.d. from the same distribution as the training sample S, i.e. if the
set of visible classes remained unchanged.
If, on the other hand, an instance is now drawn from a hidden class in YH ,
then some of these logical assertions would no longer hold and the original algorithm that was trained to predict the semantic attributes would fail. This follows
from the fact that instances drawn from the hidden classes have a different distribution. Therefore, the fact that classes can be uniquely determined by the
values of the semantic attributes is of little importance here because the semantic attributes are likely to be predicted correctly for the visible classes only.
Needless to say, this violates the original goal of the attribute-based zero-shot learning setting.
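These implications can be checked mechanically against the class embeddings (derived here from the attribute definitions): they hold for every visible-class embedding, yet are violated by the hidden classes.

```python
# The three implications a good predictor would infer from S, checked
# against the (a1, a2, a3) embeddings of visible vs. hidden classes.

def implies(p, q):
    return (not p) or q

RULES = [
    lambda a: implies(a[0] == 0, a[2] == 1),  # (a1 = 0) -> (a3 = 1)
    lambda a: implies(a[1] == 0, a[0] == 0),  # (a2 = 0) -> (a1 = 0)
    lambda a: implies(a[2] == 0, a[1] == 1),  # (a3 = 0) -> (a2 = 1)
]

# Embeddings of the visible classes (equilateral triangles,
# non-equilateral triangles, squares) and the hidden classes
# (non-square rectangles, non-rectangular parallelograms).
VISIBLE = [(0, 1, 1), (0, 0, 1), (1, 1, 0)]
HIDDEN = [(1, 0, 0), (1, 0, 1), (1, 1, 1)]

def all_hold(embeddings):
    return all(rule(a) for a in embeddings for rule in RULES)
```

The rules are consistent on VISIBLE, but rules 2 and 3 already fail on the hidden embedding (1, 0, 0) of non-square rectangles.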
2.2 Optical Digit Recognition
To show that the previous argument on the polygon shape recognition problem
is not a contrived argument, let us look into a real classification problem, in
which we can visualize the decision rule used by the predictors. We will use the
optical digit recognition problem to illustrate our argument. In order to be able
to interpret the decision rule used by the predictor, we will use the linear support
vector machine (SVM) algorithm [4,6], trained without the bias term using the
LIBLINEAR package [8].
One way of introducing a semantic space for the ten digits is to use the seven-segment display shown in Fig. 2. That is, the instance space X is the set of noisy
digits, the classes are the ten digits Y = {0, 1, 2, . . . , 9}, and the semantic space
is A = {−1, +1}^7 corresponding to the seven segments. For example, using
the order of segments (a,b,c,d,e,f,g) shown in Fig. 2, the digit 0 in Fig. 2
has the semantic embedding (1, 1, 1, 0, 1, 1, 1) while the digit 1 has the semantic
embedding (0, 0, 1, 0, 0, 1, 0), and so on.
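As a sketch, the digit-to-segment mapping and the nearest-neighbor decoding step can be written as follows; note that we use the conventional (a, b, c, d, e, f, g) segment labeling, which may differ from the ordering in Fig. 2, although the ten embeddings are unique under any consistent labeling:

```python
# Seven-segment embeddings of the ten digits under the conventional
# (a, b, c, d, e, f, g) labeling: a = top, d = bottom, g = middle.
SEGMENTS = {
    0: (1, 1, 1, 1, 1, 1, 0),
    1: (0, 1, 1, 0, 0, 0, 0),
    2: (1, 1, 0, 1, 1, 0, 1),
    3: (1, 1, 1, 1, 0, 0, 1),
    4: (0, 1, 1, 0, 0, 1, 1),
    5: (1, 0, 1, 1, 0, 1, 1),
    6: (1, 0, 1, 1, 1, 1, 1),
    7: (1, 1, 1, 0, 0, 0, 0),
    8: (1, 1, 1, 1, 1, 1, 1),
    9: (1, 1, 1, 1, 0, 1, 1),
}

def nearest_digit(embedding):
    """Nearest-neighbor decoding in Hamming distance, used when a
    predicted attribute vector matches no digit exactly."""
    def hamming(u, v):
        return sum(x != y for x, y in zip(u, v))
    return min(SEGMENTS, key=lambda d: hamming(SEGMENTS[d], embedding))
```

Since all ten embeddings are distinct, a correctly predicted segment vector identifies the digit; nearest-neighbor decoding additionally tolerates a small number of flipped segments.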
In our implementation, we run the experiment as follows (the MATLAB code that generates the images in this section is available at https://mine.kaust.edu.sa/Pages/zero-shot-learning.aspx). First, a perfect digit is generated, which is then contaminated with noise: every pixel is flipped with probability 0.1. As a result, the instance space is the set of noisy digits, as depicted in Fig. 3. Then, five digits are chosen to form the visible set of classes, and the remaining digits are hidden during the training stage. We train classifiers that predict the attributes (i.e. segments) using the visible set of classes, and use those classifiers to predict the hidden classes afterward.

Fig. 2. In optical digit recognition, a semantic space can be introduced using the seven-segment display shown in this figure.

Fig. 3. The instance space in the optical digit recognition problem is the set of noisy digits, where the value of every pixel is flipped with probability 0.1.
Now, the classical argument for attribute-based zero-shot learning goes as
follows:
1. Every digit can be uniquely determined by its seven-segment display. When
an exact match is not found, one can carry out a nearest neighbor search
[7,9,15,17,20,21] or a maximum a posteriori estimation method [13,14].
2. Every segment in {a,b,c,d,e,f,g} is a concept class by itself that spans
multiple digits. Hence, the number of training examples available for each
segment is large, which makes it easier to predict.
3. Because each of the seven segments spans multiple classes, we no longer need
to see all of the ten digits during the training stage in order to learn to predict
the seven segments reliably.
This argument clearly rests on the assumption that “learning” a concept is equivalent to the task of reliably “predicting” it. From the earlier discussion of the polygon shape recognition problem, this assumption is, in fact, invalid. Figure 4
shows what happens when a proper subset of the ten digits is seen during the
training stage. As shown in the figure, the linear classifier trained using SVM
exploits the relevant information available in the training set to maximize its
prediction accuracy of the attributes.
For example, when we train a classifier to predict the segment ‘a’ using the
visible classes {0, 1, 2, 3, 4}, a good predictor would use the fact that the segment ‘a’, which is the target concept, always co-occurs with the segment ‘g’. By contraposition, the absence of ‘g’ then implies the absence of the segment ‘a’. This is clearly seen in Fig. 4 (top left corner). Of course,
what the predictor learns is even more complex than this, as shown in Fig. 4.
When novel instances from the hidden classes are present, these correlations no
longer hold and the algorithm fails to predict the semantic attributes correctly.
To reiterate, such a failure is fundamentally due to the fact that the hidden
classes constitute a different distribution of instances from the one seen during
the training stage.
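The co-occurrence shortcut described above can be caricatured in a few lines. Under the conventional (a, b, c, d, e, f, g) segment labeling (which may order segments differently from Fig. 2), the top segment ‘a’ happens to equal the bottom segment ‘d’ on every visible digit in {0, 1, 2, 3, 4}, so "copy segment d" is a perfect predictor of ‘a’ on the visible classes, yet it breaks on a hidden digit:

```python
# A caricature of the shortcut a predictor can exploit: on the visible
# digits {0, 1, 2, 3, 4}, segment 'a' (top) equals segment 'd' (bottom),
# so copying 'd' predicts 'a' perfectly -- until a hidden digit arrives.
SEGMENTS = {
    0: (1, 1, 1, 1, 1, 1, 0), 1: (0, 1, 1, 0, 0, 0, 0),
    2: (1, 1, 0, 1, 1, 0, 1), 3: (1, 1, 1, 1, 0, 0, 1),
    4: (0, 1, 1, 0, 0, 1, 1), 5: (1, 0, 1, 1, 0, 1, 1),
    6: (1, 0, 1, 1, 1, 1, 1), 7: (1, 1, 1, 0, 0, 0, 0),
    8: (1, 1, 1, 1, 1, 1, 1), 9: (1, 1, 1, 1, 0, 1, 1),
}
A, D = 0, 3  # indices of segments 'a' (top) and 'd' (bottom)

def shortcut_predict_a(digit):
    """Predict segment 'a' by copying segment 'd'."""
    return SEGMENTS[digit][D]

visible_correct = all(
    shortcut_predict_a(d) == SEGMENTS[d][A] for d in range(5))
hidden_errors = [
    d for d in range(5, 10) if shortcut_predict_a(d) != SEGMENTS[d][A]]
```

Here the shortcut is exact on the visible digits yet mislabels the hidden digit 7, whose top segment is on while its bottom segment is off.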
The results of applying a linear SVM using binary relevance to predict the seven segments are shown in Fig. 4. In this figure, the blue regions correspond
to the pixels that contribute positively to the decision rule for predicting the
corresponding segment, while the red regions contribute negatively. There are
two key takeaways from this figure. First, the prediction rule used by the classifier
does not correspond to the “true” meaning of the semantic attribute. After all,
the goal of classification is to be able to “predict” the attribute as opposed to
learning what it actually means. Second, changing the set of visible classes can
change the prediction rule for the same attribute quite notably. Both observations
challenge the rationale behind the attribute-based zero-shot learning setting.
2.3 Zero-Shot Learning on Popular Datasets
Next, we examine the performance of zero-shot learning on benchmark datasets.
Two of the most popular datasets for evaluating zero-shot learning algorithms
are the Animals with Attributes (AwA) dataset [14] and the aPascal-aYahoo (aP-aY) dataset [9], available at http://attributes.kyb.tuebingen.mpg.de/ and http://vision.cs.uiuc.edu/attributes/, respectively. We briefly describe each dataset next.

The Animals with Attributes (AwA) Dataset: The AwA dataset was collected by querying search engines, such as Google and Flickr, for images of 50 animals. Afterward, these images were manually curated to remove outliers and duplicates. The final dataset contains 30,475 images, where the minimum number of images per class is 92 and the maximum is 1,168. In addition, 85
Fig. 4. In this figure, linear SVM was implemented to predict the seven segments of
noisy optical images. The columns correspond to the seven segments {a, b, c, d, e, f, g}
while the rows correspond to different choices of visible classes. From top to bottom,
the visible classes are {0, 1, 2, 3, 4} for the first row, {5, 6, 7, 8, 9} for the second row,
{0, 2, 4, 6, 8} for the third row, and {1, 3, 5, 7, 9} for the fourth row. Every figure depicts
the coefficient vector w that is learned by linear SVM, where blue regions correspond
to the pixels that contribute positively towards the corresponding segment and red
regions contribute negatively. The black figures are not applicable because the training
sample S either lacks negative examples for the corresponding attribute or it lacks
positive examples. (Color figure online)
attributes are introduced. In the zero-shot learning setting, 40 (visible) classes
are used for training and 10 (hidden) classes are used for testing [14].
The aPascal-aYahoo (aP-aY) Dataset: The aP-aY dataset contains 12,695 images, which were chosen from the PASCAL VOC 2008 dataset [9]. These
images are used during the training stage of the zero-shot learning setting. In
addition, a total of 2,644 images were collected from the Yahoo image search
engine to be used during the test stage. Both sets of images have disjoint classes.
More specifically, the training dataset contains 20 classes while the test dataset
contains 12 classes. Moreover, every image has been annotated with 64 binary
attributes.
Results: Table 1 presents some fairly recent reported results on the two datasets
AwA and aP-aY. The zero-shot learning algorithms provided in the table are
the direct attribute prediction (DAP) algorithm proposed in [13,14], which is
one of the standard baseline methods for this task, the indirect attribute prediction (IAP) algorithm proposed in [13,14], the embarrassingly simple zero-shot
learning algorithm proposed in [18], and the zero-shot random forest algorithm
proposed in [12]. The best reported prediction accuracy for the AwA dataset
is 49.3 % while the best reported prediction accuracy for the aP-aY dataset
is 26.0 %.
Table 1. Some fairly recent results of zero-shot learning algorithms: multiclass prediction accuracy on the two datasets AwA and aP-aY, with figures taken from the original papers [12–14,18].

Algorithm                                      AwA      aP-aY
Direct attribute prediction                    41.4 %   19.1 %
Indirect attribute prediction                  42.2 %   16.9 %
Embarrassingly simple zero-shot                49.3 %   15.1 %
Zero-shot random forest                        43.0 %   26.0 %
Average # training examples/visible class      607.38   634.75
Best reported accuracy                         49.3 %   26.0 %
Equivalent # training examples/hidden class    ≈ 20     ≈ 2
In order to properly interpret the reported results, we have also provided
in Table 1 the number of training examples from the hidden classes that would
suffice, in a traditional supervised learning setting, to obtain the same accuracy
reported by the zero-shot learning algorithms in the literature. These latter
figures are obtained from the experimental study conducted in [14]. Note that
while about 600 examples per visible class are used during the training stage, the
best reported zero-shot prediction accuracy on the hidden classes is equivalent
to the accuracy of supervised learning using fewer than 20 training examples
per hidden class. In fact, the zero-shot learning accuracy reported on the aP-aY is worse than the accuracy of supervised learning when as few as 2 training
examples per hidden class are used.
When the area under the curve (AUC) is used as a performance measure, which is known to be more robust to class imbalance than the prediction accuracy, the apparent merit of zero-shot learning becomes even more questionable. For instance, the popular direct attribute prediction (DAP) on the AwA
dataset achieves an AUC of 0.81, which is equivalent to the performance of
supervised learning using as few as 10 training examples from each hidden class only (cf. Tables 4 and 7b in [14]). Recall, by contrast, that over 600 examples
per visible class are used for training.
3 A Mathematical Formalism
The above argument and empirical evidence on the ill-posedness of attribute-based zero-shot learning can be formalized mathematically. Incidentally, this
will allow us to identify paradigms of zero-shot learning for which the above
argument no longer holds.
As stated in Sect. 2 and illustrated in Fig. 4, the fundamental problem with
attribute-based zero-shot learning is that it aims at learning concept classes
(semantic attributes) with respect to one distribution of instances (i.e. when
conditioned on the visible set of classes) with the goal of being able to predict
those concept classes for an arbitrary distribution of instances (i.e. when conditioned on the unknown hidden set of classes). Clearly, this is an ill-posed strategy
that violates the core assumptions of statistical learning theory.
To remedy this problem, we can cast zero-shot learning as a domain adaptation problem [18]. In the standard domain adaptation setting, it is assumed that
the training examples are drawn i.i.d. from some source distribution DS whereas
future test examples are drawn i.i.d. from a different target distribution DT . Let
h : X → Y be a predictor. Then, the average misclassification error rate of h
with respect to DT is bounded by:
E_{(X,Y)∼DT} 1{h(X) ≠ Y} ≤ E_{(X,Y)∼DS} 1{h(X) ≠ Y} + dTV(DS, DT),    (1)
where dTV(DS, DT) is the total-variation distance between the two probability distributions DS and DT [2]. Similar bounds that also hold with high probability can be found in [3]. Hence, learning a good predictor h with respect to
some source distribution DS does not guarantee a good prediction accuracy with
respect to an arbitrary target distribution DT unless the two distributions are
nearly identical.
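A small numeric example illustrates the bound in Eq. (1); the distributions and the predictor below are made-up toys over a three-instance space:

```python
# Numeric illustration of the bound in Eq. (1): target error is at most
# source error plus the total-variation distance between distributions.

def tv_distance(p, q):
    """Total-variation distance between discrete distributions given as
    dicts mapping outcomes to probabilities."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def error(dist, h):
    """Misclassification probability of h under the distribution."""
    return sum(p for (x, y), p in dist.items() if h[x] != y)

# Source (visible-class) and target (hidden-class) distributions over
# (instance, label) pairs.
D_S = {("x1", 0): 0.5, ("x2", 1): 0.5}
D_T = {("x1", 0): 0.1, ("x2", 1): 0.1, ("x3", 1): 0.8}

h = {"x1": 0, "x2": 1, "x3": 0}  # perfect on D_S, errs on x3

lhs = error(D_T, h)                          # target error
rhs = error(D_S, h) + tv_distance(D_S, D_T)  # source error + TV distance
```

Here the source error is zero, yet the target error reaches 0.8, exactly the total-variation gap; the bound is tight in this toy example.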
Therefore, in order to turn zero-shot learning into a well-posed strategy, it is
imperative that a common representation R(X) is used, such that the induced
distribution of R(X) remains nearly unchanged when the instances X are conditioned on the visible set of classes or on the hidden set of classes. Then, by
learning to predict semantic attributes given R(X), generalization bounds, such
as the one provided in Eq. (1), guarantee a good prediction accuracy in the
zero-shot setting. One method that can accomplish this goal is to divide the
instances Xi into multiple local segments Xi → (Zi,1, Zi,2, . . .) ∈ Z^r such that a
classifier h : Z → A is trained to predict the semantic attributes in every local
segment separately. If these local segments have a stable distribution across the
visible and hidden set of classes, then zero-shot learning is feasible. A prototypical example of this approach is segmenting sounds into phonemes in word
recognition systems and using those phonemes to recognize words (classes) [17].
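The local-segment remedy can be sketched as follows; `segment` and the toy classifier are hypothetical stand-ins for a real segmentation routine and a trained attribute classifier h : Z → A:

```python
# Sketch of the local-segment remedy: split each instance into local
# segments whose distribution is (assumed) stable across visible and
# hidden classes, and apply one shared classifier h : Z -> A per segment.
# `segment` and the toy classifier below are hypothetical stand-ins.

def segment(x, width):
    """Split an instance (here, a flat sequence) into fixed-width segments."""
    return [x[i:i + width] for i in range(0, len(x), width)]

def predict_attributes(x, classify_segment, width=2):
    """Predict one semantic attribute per local segment; the classifier
    only ever sees local evidence, never whole-class context."""
    return [classify_segment(z) for z in segment(x, width)]

# Toy phoneme-like classifier mapping a two-sample window to a symbol.
toy = lambda z: "high" if sum(z) >= 1 else "low"
attrs = predict_attributes((1, 1, 0, 0, 1, 0), toy)
```

As with phonemes in word recognition, the per-segment predictions can then be mapped to classes without ever observing the hidden classes themselves.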
4 Conclusion
Attribute-based zero-shot learning is a transfer learning strategy that has been
widely studied in the literature. Its aim is to learn to predict novel classes
that are never seen during the training stage by learning to predict semantic
attributes instead. In this paper, we argue that attribute-based zero-shot learning is an ill-posed strategy because the two tasks of “predicting” and “learning”
an attribute are fundamentally different. We demonstrate our argument on synthetic datasets and use it, finally, to explain the poor performance results that
have been reported so far in the literature for various zero-shot learning algorithms on popular benchmark datasets.
Acknowledgment. Research reported in this publication was supported by King
Abdullah University of Science and Technology (KAUST) and the Saudi Arabian Oil
Company (Saudi Aramco).
References
1. Abu-Mostafa, Y.S., Magdon-Ismail, M., Lin, H.T.: Learning from data (2012)
2. Alabdulmohsin, I.: Algorithmic stability and uniform generalization. In: NIPS, pp.
19–27. Curran Associates, Inc. (2015)
3. Ben-David, S., Blitzer, J., Crammer, K., Pereira, F.: Analysis of representations
for domain adaptation. In: Schölkopf, B., Platt, J., Hoffman, T. (eds.) Advances
in Neural Information Processing Systems 19, pp. 137–144. MIT Press, Cambridge
(2006). http://books.nips.cc/papers/files/nips19/NIPS2006 0838.pdf
4. Boser, B.E., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In: Fifth Annual Workshop on Computational Learning Theory, pp. 144–152
(1992)
5. Bousquet, O., Boucheron, S., Lugosi, G.: Introduction to statistical learning theory.
In: Bousquet, O., von Luxburg, U., Rätsch, G. (eds.) Machine Learning 2003. LNCS
(LNAI), vol. 3176, pp. 169–207. Springer, Heidelberg (2004)
6. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)
7. Dinu, G., Baroni, M.: Improving zero-shot learning by mitigating the hubness
problem. In: ICLR: Workshop Track (2015). arXiv:1412.6568
8. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: a library
for large linear classification. JMLR 9, 1871–1874 (2008)
9. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their
attributes. In: CVPR, pp. 1778–1785. IEEE (2009)
10. Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE
Trans. Pattern Anal. Mach. Intell. 28(4), 594–611 (2006)
11. Haehl, V., Vardaxis, V., Ulrich, B.: Learning to cruise: Bernstein’s theory applied
to skill acquisition during infancy. Hum. Mov. Sci. 19(5), 685–715 (2000)
12. Jayaraman, D., Grauman, K.: Zero-shot recognition with unreliable attributes. In:
NIPS, pp. 3464–3472 (2014)
13. Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen object
classes by between-class attribute transfer. In: IEEE Conference on Computer
Vision and Pattern Recognition, CVPR 2009, pp. 951–958. IEEE (2009)
14. Lampert, C.H., Nickisch, H., Harmeling, S.: Attribute-based classification for zero-shot visual object categorization. IEEE Trans. Pattern Anal. Mach. Intell. 36(3), 453–465 (2014)
15. Liu, J., Kuipers, B., Savarese, S.: Recognizing human actions by attributes. In:
CVPR, pp. 3337–3344. IEEE (2011)
16. Rader, N., Bausano, M., Richards, J.E.: On the nature of the visual-cliff-avoidance
response in human infants. Child Dev. 51(1), 61–68 (1980)
17. Palatucci, M., Pomerleau, D., Hinton, G.E., Mitchell, T.M.: Zero-shot learning
with semantic output codes. In: NIPS, pp. 1410–1418 (2009)
18. Romera-Paredes, B., Torr, P.: An embarrassingly simple approach to zero-shot
learning. In: ICML, pp. 2152–2161 (2015)
19. Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory
to Algorithms. Cambridge University Press, Cambridge (2014)
20. Shigeto, Y., Suzuki, I., Hara, K., Shimbo, M., Matsumoto, Y.: Ridge regression,
hubness, and zero-shot learning. In: Appice, A., Rodrigues, P.P., Santos Costa, V.,
Soares, C., Gama, J., Jorge, A. (eds.) ECML PKDD 2015. LNCS (LNAI), vol.
9284, pp. 135–151. Springer, Heidelberg (2015). doi:10.1007/978-3-319-23528-8_9
21. Socher, R., Ganjoo, M., Manning, C.D., Ng, A.: Zero-shot learning through cross-modal transfer. In: NIPS, pp. 935–943 (2013)
22. Thrun, S., Mitchell, T.M.: Lifelong robot learning. Rob. Auton. Syst. 15, 25–46
(1995)
23. Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw.
10(5), 988–999 (1999)