Learning Affordances for Categorizing Objects and Their Properties
Nilgün Dağ, İlkay Atıl, Sinan Kalkan, Erol Şahin
KOVAN Research Lab, Middle East Technical University, Ankara, TURKEY
{nilgundag,ilkayatil,skalkan,erol}@ceng.metu.edu.tr
Abstract
In this paper, we demonstrate that simple interactions with objects in the environment lead to a manifestation of the perceptual properties of the objects. This is achieved by deriving a condensed representation of the effects of actions (called effect prototypes in the paper), and by investigating the relevance between the perceptual features extracted from the objects and the actions that can be applied to them. With this at hand, we show that the agent can categorize (i.e., partition) its raw sensory perceptual feature vector, extracted from the environment, which is an important step for the development of concepts and language. Moreover, after learning how to predict the effect prototypes of objects, the agent can categorize objects based on the predicted effects of the actions that can be applied on them.
1 Introduction
The main goal of the computer vision community is to make computers see as well as we, humans, do. One promising approach to achieving this goal is a developmental one, which proposes to learn about the visual perception of objects and events in the environment by interacting with them (i.e., by applying the actions in the agent's repertoire).
The concept of affordances, first proposed by J. J. Gibson [4], is one way of linking the development of action and perception in an embodied environment. Gibson defined affordances as the action possibilities offered by the environment to the agent. By interacting with the environment, the developing organism can discover the actions that the objects in the environment afford. However, affordances are not only related to the action-based development of the organism: as also argued by E. J. Gibson [3], learning affordances is discovering features and invariant properties of things.
In this paper, we take a developmental approach towards learning about the objects in the environment. A developing organism is, at first, not aware of the meaning of the properties of the objects. By interacting with the objects, it can derive the relation between the actions and the objects' properties that are affected by them. This derivation leads to a categorization of the objects' features, which is an important step for understanding and communicating about objects.
We model an affordance (a) as a relation between an object (e), an action (b) and an effect (f):

a = (e, b, f),    (1)

which can be extended using equivalence classes to model relations between multiple instances of objects, actions and effects (see, for example, [2]).
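To make the relation in Eq. (1) concrete, the following minimal sketch represents an affordance instance as a plain data structure; the class and field names are ours and merely illustrate the triplet, they are not part of the formalization in [2].

```python
# Illustrative sketch of the affordance triplet in Eq. (1); names are our own.
from dataclasses import dataclass
import numpy as np

@dataclass
class Affordance:
    entity: np.ndarray   # perceptual features of the object (e)
    behavior: str        # identifier of the applied action (b), e.g., "push-left"
    effect: np.ndarray   # change in the features caused by the action (f)

# One affordance instance relating a 21-dimensional feature vector to an action.
a = Affordance(entity=np.zeros(21), behavior="push-left", effect=np.zeros(21))
```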
Using developmental methods for perception (and artificial intelligence in general), especially methods linking perception and action, is an active research area. For example, Kraft et al. [5] use Object-Action Complexes (which are, in essence, the same as affordances) for deriving the objectness of a set of local visual descriptors that move together in space. Montesano and Lopes [6] propose an affordance-based approach for learning object grasping. We refer to [2] for a detailed review of the relevant perception-action-based studies.
The work most relevant to the current paper is by Ugur et al. [8], who learn affordances of the objects in the environment through simple interactions and create an effect prototype for each affordance (a condensed representation of the effect produced by the relevant action). Ugur et al. [8] use these effect prototypes for making plans for reaching a goal that requires more than one action. In contrast, we make use of these effect prototypes for a more vision-related reason: namely, (1) for investigating the perceptual properties of objects and (2) for categorizing objects based on the effect prototypes of the actions that can be applied on them. Especially the second contribution of the current paper is an important novelty; to the authors' knowledge, the current paper is the first to utilize the effects of actions on objects for categorizing the objects.
2 Methods
In this section, we describe how we acquire our data
and process it.
2.1 Data
We acquired the range data using a SwissRanger 4000 range camera, which can capture the depth of scenes with a resolution of 176x144 at 30 fps. The SwissRanger 4000 provides three kinds of information (as three 176x144 images): the range data, the amplitude of the signal and the confidence of the signal.
We used a setup that had a low-amplitude background, which allowed us to make a clean segmentation of the objects.
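As an illustration of how such a segmentation could be carried out, the sketch below thresholds the amplitude image and masks the range image accordingly. The threshold value and function signature are our assumptions rather than the exact procedure used in the experiments.

```python
# A possible segmentation of the object against a low-amplitude background
# (assumed procedure; the amplitude threshold is an arbitrary example value).
import numpy as np

def segment_object(range_img, amplitude_img, amp_threshold=0.1):
    """Return the depth values of the object and the corresponding boolean mask."""
    mask = amplitude_img > amp_threshold   # foreground pixels have high amplitude
    return range_img[mask], mask
```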
We use ten objects: four cups of different sizes, three cubes of different sizes, a salt shaker, a sphere (ball) and a cylinder (see Figure 1).
We perform six actions, namely push-left, push-right, rotate-45-degrees-cw (clockwise), rotate-45-degrees-ccw (counterclockwise), rotate-90-degrees-cw and rotate-90-degrees-ccw, on the ten different objects (see Figure 1), giving us 360 samples (20 for each push action and 80 for each rotation). Scene captures are performed before and after each execution of the six actions; we name them the initial and final captures of the scene. The actions are performed by the user as though they were applied by a robot arm.
2.2 Features
From the segmented range data, we extract the following position, orientation, shape and size related features: the X position of the visible object center on the image plane, the average Z position (i.e., depth) of the object's data points in the scene, 8 orientation sizes, 8 first- and second-order statistics, and 3 Principal Component Analysis sizes, giving us 21 features in total.
We find three principal components for each object: we fix the first principal component to be the vertical axis, and the other two axes are then the principal components extracted from the top view of the object.
The first-order statistics are the mean gray level, gray-level standard deviation, coefficient of variation, kurtosis, energy and entropy; the second-order statistics are the angular second moment and the auto-correlation properties of the projected image. These correspond to shape-related features of the object.
The orientation sizes of the object are the distances between the extreme points of the top view of the object along eight different orientations (0, 45, 90, 135, 180, 225, 270, 315 degrees). The sizes along these eight orientations provide an estimate of the orientation of the object.
The features extracted from the object before the action are called initial features, whereas the ones extracted after the action are called final features. The difference between the final and the initial features defines the effect features.
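For illustration, the sketch below computes a representative subset of these features from a segmented range image. How the top view is obtained, the epsilon guarding the division, and the exact statistics included are our assumptions; this is not the authors' implementation.

```python
# A sketch of extracting some of the 21 features from segmented range data
# (our assumptions about coordinate conventions; not the authors' exact code).
import numpy as np

def extract_features(points, mask, range_img):
    """points: Nx3 array of (x, y, z) object points, y vertical; mask: boolean object mask."""
    feats = {}
    ys, xs = np.nonzero(mask)
    feats["x_center"] = xs.mean()                 # X position of the visible object center
    feats["mean_depth"] = points[:, 2].mean()     # average Z (depth) of the object points

    # A few first-order statistics of the object's gray (range) values.
    gray = range_img[mask]
    feats["mean_gray"] = gray.mean()
    feats["std_gray"] = gray.std()
    feats["coef_var"] = gray.std() / (abs(gray.mean()) + 1e-9)

    # Orientation sizes: extents of the top view along eight directions (0..315 degrees).
    top = points[:, [0, 2]] - points[:, [0, 2]].mean(axis=0)
    for k, angle in enumerate(np.deg2rad(np.arange(0, 360, 45))):
        direction = np.array([np.cos(angle), np.sin(angle)])
        proj = top @ direction
        feats[f"orientation_size_{k}"] = proj.max() - proj.min()

    # PCA sizes: the first axis is fixed to the vertical; the other two come
    # from the principal components of the top view.
    feats["pca_size_vertical"] = points[:, 1].max() - points[:, 1].min()
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(top.T)))[::-1]
    feats["pca_size_1"], feats["pca_size_2"] = np.sqrt(eigvals)
    return feats
```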
2.3 Deriving Effect Prototypes of Actions
In this subsection, we describe how we derive, for each action, a map over the feature vector that indicates which parts of the feature vector are relevant for which actions. We call this condensed representation of the effects of an action its effect prototype.
For deriving the effect prototypes, we compute the mean of each element of the initial set of features (denoted as µ_i) and the mean and standard deviation of each element of the effect features (denoted as µ_e and σ_e). If the standard deviation of a feature element is too big (σ_e/|µ_i| > T_c), then the change caused by the action on that feature element is not consistent. If the variance is not big (σ_e/|µ_i| < T_c), we have three different cases: (1) the feature element does not change with the action if |µ_e|/|µ_i| < T_d; (2) the feature element consistently decreases with the action if µ_e < 0; (3) otherwise, the feature element consistently increases. We have empirically determined T_c and T_d as 0.22 and 0.18, respectively.
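The thresholding rule above translates directly into code. The sketch below follows the symbols in the text (µ_i, µ_e, σ_e, T_c, T_d); the small epsilon guarding the division and the output symbols are our additions.

```python
# Effect-prototype derivation for one action, following the rule in the text.
import numpy as np

def effect_prototype(initial_feats, effect_feats, T_c=0.22, T_d=0.18):
    """initial_feats, effect_feats: (num_samples x num_features) arrays for one action.
    Returns one symbol per feature element: '?' (inconsistent), '<->' (no change),
    'down' (consistent decrease) or 'up' (consistent increase)."""
    mu_i = initial_feats.mean(axis=0)
    mu_e = effect_feats.mean(axis=0)
    sigma_e = effect_feats.std(axis=0)

    proto = []
    for mi, me, se in zip(mu_i, mu_e, sigma_e):
        mi = abs(mi) + 1e-9               # epsilon is ours, to avoid division by zero
        if se / mi > T_c:                 # change caused by the action is not consistent
            proto.append("?")
        elif abs(me) / mi < T_d:          # the feature element does not change
            proto.append("<->")
        elif me < 0:                      # consistent decrease
            proto.append("down")
        else:                             # consistent increase
            proto.append("up")
    return proto
```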
2.4 Predicting Effects of Actions
Before learning the mapping between the initial set of features and the effect features, we cluster the effect space using the spectral clustering method in [1]. The clusters in the effect space are used as labels for a Support Vector Machine (SVM) classifier that maps the initial features to the effect clusters.
Using the SVM, we find which effect cluster a new object maps to, and we take the mean of the effects in that cluster as the predicted effect for the new object (see Figure 2).
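A minimal sketch of this pipeline is given below, with scikit-learn's SpectralClustering and SVC standing in for the components named in the text; the number of clusters and the SVM settings are placeholder assumptions, not values reported in the paper.

```python
# Sketch of the learning and prediction pipeline of Figure 2 (assumed settings).
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.svm import SVC

def train_effect_predictor(initial_feats, effect_feats, n_clusters=5):
    """initial_feats, effect_feats: (num_samples x num_features) arrays for one action."""
    # 1. Cluster the effect space; the cluster labels become the classifier targets.
    labels = SpectralClustering(n_clusters=n_clusters).fit_predict(effect_feats)
    # 2. The mean effect of each cluster represents that cluster.
    cluster_means = np.array([effect_feats[labels == c].mean(axis=0)
                              for c in range(n_clusters)])
    # 3. Learn a mapping from the initial features to the effect clusters.
    svm = SVC(kernel="rbf").fit(initial_feats, labels)
    return svm, cluster_means

def predict_final_features(svm, cluster_means, new_initial):
    """Predicted final features = initial features + mean effect of the predicted cluster."""
    c = svm.predict(new_initial.reshape(1, -1))[0]
    return new_initial + cluster_means[c]
```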
2.5 Categorizing Objects Using Effect Prototypes
We assume that the combined set of effects caused by the actions on objects can be a discriminative feature for categorizing objects. For each object e, we have a set of effect clusters {EC_1, ..., EC_N} (where EC_i denotes the label of the effect cluster of the i-th action on object e). Using nominal k-means clustering in this set space, we find clusters, which we assume to correspond to categories of objects.

Figure 1. The objects used in the experiments. From left to right: big cup, mid-sized cup, small cup, small half-cup, big cube, mid-sized cube, small cube, salt shaker, sphere, cylinder.

Figure 2. Clusters in the effect space are used for training an SVM, which allows predicting the effects of an action on a new object. (Learning: the effect features, computed as final minus initial features, are clustered by spectral clustering, and the resulting class labels are used to train an SVM on the initial features. Prediction: the SVM maps the initial features of a new object to a class i, and adding the class mean µ_i to the initial features gives the predicted final features.)
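Regarding the nominal k-means clustering step described above: since the effect-cluster labels are nominal values rather than numbers, a standard Euclidean k-means does not apply directly. The sketch below therefore reads "nominal k-means" as a k-modes-style procedure that uses the number of mismatching labels as the distance and the per-action mode as the cluster center; this reading, and all parameter values, are our assumptions.

```python
# A k-modes-style sketch of nominal clustering over effect-cluster label vectors.
import numpy as np

def nominal_kmeans(label_vectors, k=3, n_iters=20, seed=0):
    """label_vectors: (num_objects x num_actions) array of non-negative integer labels."""
    rng = np.random.default_rng(seed)
    X = np.asarray(label_vectors)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()  # random initialization
    for _ in range(n_iters):
        # Assign each object to the center with the fewest mismatching labels.
        dists = np.array([[np.sum(x != c) for c in centers] for x in X])
        assignment = dists.argmin(axis=1)
        # Update each center to the per-action mode (most frequent label) of its members.
        for j in range(k):
            members = X[assignment == j]
            if len(members) > 0:
                centers[j] = [np.bincount(col).argmax() for col in members.T]
    return assignment, centers
```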
Table 1. Prototypes of effects caused by actions. A horizontal double-arrow, up-arrow, down-arrow and question mark denote, respectively, no change, increase, decrease and inconsistent change in the corresponding feature elements.

Action          | Position (X) | Position (Y) | Orientation | Shape | Size
Rotate 45° cw   |      ↔       |      ↔       |      ?      |   ↔   |  ↔
Rotate 45° ccw  |      ↔       |      ↔       |      ?      |   ↔   |  ↔
Rotate 90° cw   |      ↔       |      ↔       |      ?      |   ↔   |  ↔
Rotate 90° ccw  |      ↔       |      ↔       |      ?      |   ↔   |  ↔
Push Left       |      ↑       |      ↔       |      ↔      |   ↔   |  ↔
Push Right      |      ↓       |      ↔       |      ↔      |   ↔   |  ↔

3 Results

3.1 Categorizing Object Features
Table 1 displays the prototypes of the effects caused by the different actions. We see that the push-left and push-right actions cause a consistent change in the position of the objects, whereas the rest of the feature vector is irrelevant for the push actions. For the rotate actions, on the other hand, the position, shape and size of the objects are irrelevant, but there is considerable change in the estimated orientation of the objects. This change is not consistent because the orientation estimate for balls is the same for all orientations, causing a high variation in the effects.
Table 1 is promising since it is an important step for an agent to discover the relation between its actions and the properties of the objects. Using the outcome of this relation, the agent can make plans by combining the set of actions in its repertoire. Moreover, the outcome of this relation is crucial for learning the stable and variable properties (and affordances) of objects.

3.2 Categorizing Objects

Using the method described in Section 2.5, we categorize the objects. Figure 3 displays the three categories derived by nominal k-means clustering. We see that different objects populate different categories. For example, cluster 3 is filled only with cups, whereas balls appear only in cluster 1. Although the majority of the objects in cluster 2 are cubes, there are also cup instances in this cluster, since both types of objects are strongly affected by viewpoint changes.
Next, we test whether the categories in Figure 3 are meaningful for new objects. Using the SVM introduced in Section 2.4, we predict the effect cluster EC_i for each action to form the set {EC_1, ..., EC_N}, which is then compared to the centers of the categories in Figure 3, and the closest category is assigned to the new object. Table 2 shows that new objects (like cylinders or shakers) whose properties are different from those in the training set are categorized reasonably.
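For concreteness, the sketch below assigns a new object to its closest category by counting the mismatching labels between its effect-cluster vector and each category center; the distance choice and the example centers are our assumptions, and only the query vector is taken from Table 2 (the one reported for Cup 1).

```python
# Assigning a new object to the closest category (Section 3.2); distance is assumed.
import numpy as np

def assign_category(effect_cluster_vector, category_centers):
    """effect_cluster_vector: one effect-cluster label per action, e.g., [1, 5, 5, 2, 1, 3].
    category_centers: (num_categories x num_actions) array of labels from the clustering."""
    x = np.asarray(effect_cluster_vector)
    dists = [np.sum(x != c) for c in category_centers]
    return int(np.argmin(dists))   # index of the closest category

# Hypothetical category centers; the query vector is the one reported for Cup 1 in Table 2.
centers = np.array([[1, 2, 3, 4, 2, 2],
                    [3, 5, 2, 4, 1, 3],
                    [1, 5, 3, 2, 1, 3]])
print(assign_category([1, 5, 5, 2, 1, 3], centers))   # -> 2, i.e., the third category
```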
Figure 3. Categories of the objects, showing how the cups, cubes and balls are distributed over clusters 1, 2 and 3. Numbers represent the number of members in each category.
Table 2. Object categories for new objects.

Object name  | Effect cluster vector | Closest cluster
Cup 1        | 1,5,5,2,1,3           | Cluster 3
Cup 2        | 1,5,3,2,3,3           | Cluster 3
Cup 3        | 3,2,3,4,3,1           | Cluster 1
Cup 4        | 1,5,3,2,1,3           | Cluster 3
Cube 1       | 3,5,2,4,1,3           | Cluster 2
Cube 2       | 3,3,2,5,1,2           | Cluster 2
Cube 3       | 3,5,2,2,1,3           | Cluster 2
Ball         | 1,2,3,4,2,2           | Cluster 1
Salt Shaker  | 3,2,3,3,2,2           | Cluster 1
Cylinder     | 3,2,3,3,2,2           | Cluster 1

4 Conclusion

Developmental methods, especially those linking perception with action, are promising for acquiring perceptual abilities. As also argued by others (for example, [7], which states that "[...] it may only be through development that we will understand intelligence or hope to emulate it in machines"), a vision system whose capabilities are comparable to those of humans may only be possible through developmental approaches.

In this paper, we have demonstrated that simple interactions with the objects in the environment lead to a manifestation of the properties of the objects. With this, the agent can attribute meaning to the raw sensory feature vectors extracted from the environment, which is an important step for the development of concepts and language.

We believe that categorizing objects based on the effects of the actions applied on them is an important contribution. Such an approach is developmentally relevant, since a developing infant interacts with the objects in its environment and learns the properties of the objects as well as a categorization of them based on the actions that can be applied and their outcomes.

We are aware that neither the initial attempt presented in this paper nor its extensions can perform at the level of state-of-the-art computer vision methods for object recognition (or categorization). The contribution of the paper is to demonstrate how object categorization might develop in humans and robots, and how it might be extended in later stages of development by incorporating appearance information of objects, etc.; this constitutes one of our future research directions.

Acknowledgments

This work is partially funded by the EU projects ROSSI (FP7-ICT-21625) and RobotCub (FP6-ICT-004370), and by TUBITAK (Turkish Scientific and Technical Council) through project no. 109E033.
References

[1] W.-Y. Chen, Y. Song, H. Bai, C.-J. Lin, and E. Chang. Parallel spectral clustering. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2008.
[2] E. Şahin, M. Çakmak, M. R. Doğar, E. Uğur, and G. Üçoluk. To afford or not to afford: A new formalization of affordances toward affordance-based robot control. Adaptive Behavior, 15(4):447–472, 2007.
[3] E. J. Gibson. Perceptual learning in development: Some basic concepts. Ecological Psychology, 12(4):295–302, 2000.
[4] J. J. Gibson. The Ecological Approach to Visual Perception. Lawrence Erlbaum Associates, 1986.
[5] D. Kraft, N. Pugeault, E. Baseski, M. Popović, D. Kragić, S. Kalkan, F. Wörgötter, and N. Krüger. Birth of the object: Detection of objectness and extraction of object shape through object-action complexes. International Journal of Humanoid Robotics, 5(2):247, 2008.
[6] L. Montesano and M. Lopes. Learning grasping affordances from local visual descriptors. In IEEE 8th International Conference on Development and Learning, 2009.
[7] L. B. Smith and C. Breazeal. The dynamic lift of developmental process. Developmental Science, 10(1):61–68, 2007.
[8] E. Uğur, E. Şahin, and E. Öztop. Affordance learning from range data for multi-step planning. In International Conference on Epigenetic Robotics, 2009.