2013 XXIV International Conference on
Information, Communication and Automation Technologies (ICAT)
October 30 – November 01, 2013, Sarajevo, Bosnia and Herzegovina
Recognizing Actions with the Associative
Self-Organizing Map
Miriam Buonamente and Haris Dindo
RoboticsLab, DICGIM, University of Palermo,
Viale delle Scienze, Ed. 6, 90128 Palermo, Italy
Email: {miriam.buonamente, haris.dindo}@unipa.it

Magnus Johnsson
Lund University Cognitive Science,
Lundagård, 222 22 Lund, Sweden
Email: [email protected]
Abstract—When artificial agents interact and cooperate with
other agents, either human or artificial, they need to recognize
others’ actions and infer their hidden intentions from the sole
observation of their surface level movements. Indeed, action and
intention understanding in humans is believed to facilitate a
number of social interactions and is supported by a complex
neural substrate (i.e. the mirror neuron system). Implementation
of such mechanisms in artificial agents would pave the route to
the development of a vast range of advanced cognitive abilities,
such as social interaction, adaptation, and learning by imitation,
just to name a few.
We present a first step towards a fully-fledged intention recognition system by enabling an artificial agent to internally represent
action patterns, and to subsequently use such representations to
recognize - and possibly to predict and anticipate - behaviors
performed by others. We investigate a biologically-inspired approach by adopting the formalism of Associative Self-Organizing
Maps (A-SOMs), an extension of the well-known Self-Organizing
Maps. The A-SOM learns to associate its activities with different
inputs over time, where inputs are high-dimensional and noisy
observations of others’ actions. The A-SOM maps actions to
sequences of activations in a dimensionally reduced topological
space, where each centre of activation provides a prototypical
and iconic representation of the action fragment. We present
preliminary experiments on an action recognition task using a publicly
available database of thirteen commonly encountered actions, with
promising results.
I. INTRODUCTION
Advancements in the field of human-robot interaction are
crucial to create the next generation of robots able to assist
humans and to support elderly and disabled in everyday life.
A major prerequisite to this goal is the ability to recognize
other agents’ actions and understand their short- or long-term
intentions, where an intention can be defined as “a plan of
action the organism chooses and commits itself to the pursuit
of a goal-an intention thus includes both a means (action plan)
as well as a goal” [1]. Implementation of similar mechanisms
in artificial agents, in our view, would be beneficial to the
development of a vast range of advanced cognitive abilities,
such as social interaction [2], adaptation [3] and learning
by imitation [4], [5], [6]. Indeed, building systems capable
of complex intention reading has been recognized to be of
immense interest in a variety of domains involving collaborative and competitive scenarios such as computer games,
surveillance, ambient intelligence, decision support, intelligent
tutoring and obviously robotics [7].
Humans continuously observe others’ behavior and, based
on their (noisy) observations, infer their intentions and make
appropriate decisions to act. Humans do not rely on verbal
communication alone, but also take advantage of information
hidden in the observable behavior. They do this apparently
effortlessly, and replicating similar abilities in artificial agents
is of paramount importance in building the next generation of
social robots.
It is believed that action and intention recognition processes
in humans are governed by the mirror neuron system (MNS):
the same (premotor) neurons responsible for the execution of
an action fire also when a similar action is solely observed
[8], [9]. A functional view of this “motor resonance” phenomenon postulates that humans understand others’ intentions
by internally simulating what they would have done in a
similar situation, adopting the same motor programs as if they
were actually performing the action. This view entails
that humans are able to create rich internal models mirroring
their own motor capabilities and the complex dynamics observed
in the outer world. By exploiting their inner world, humans
can simulate different actions, and foresee and evaluate their
consequences [10], [11], [12]. However, apart from some
isolated implementations of internal models in robotics [13],
[14], their full representational and computational power has
not yet been investigated in real-world applications.
In this paper we present a first step towards a fully-fledged
intention recognition system by enabling an artificial agent
to internally represent action patterns, and subsequently to
use such representations to recognize behaviors performed by
others. While the field of action recognition has been an active
one, especially in the machine vision community (see [15] for
a recent survey), here we investigate a biologically-inspired approach which builds on the idea of internal representations. We
adopt the Associative Self-Organizing Map [16], a variation of
Self-Organizing Map (SOM) [17], to parsimoniously represent
and to efficiently recognize human actions in real-time. We
adopt the usual distinction between “actions” and “activities”,
where - loosely speaking - the former are characterized by
simple motion patterns typically executed by a single human
(e.g. walking), while the latter are more complex and involve a
sequence of actions (e.g. dancing) - which might also involve
coordinated actions among a small number of humans [18].
While this paper concentrates on action recognition with
an A-SOM network, our long-term goal is to reuse the same
computational substrate to fully implement the action simulation idea. In other words, the agent should also be able to
simulate the likely continuation of the recognized action. For
instance, if we look at a man crossing the street, we are able to
infer the continuation of the observed action (i.e. the intention
to cross the street) even if an obstacle obscures our view.
Our goal is to build agents able to infer the likely continuation of the observed action even if their view is obscured by an obstacle or other factors. Indeed, as we will see below, the A-SOM can remember perceptual sequences by associating the current network activity with its own earlier activity. Due to this ability, the A-SOM can receive an incomplete input pattern and still elicit the likely continuation, effectively performing sequence completion of perceptual activity over time. The
results presented here are the first step in this direction.
We have tested the A-SOM in the action recognition task on
a publicly available dataset of movies representing 13 common
actions1 : check watch, cross arms, scratch head, sit down, get
up, turn around, walk, wave, punch, kick, point, pick up, throw
(see Fig. 1).

Fig. 1. Prototypical postures of 13 different actions in our dataset: check watch, cross arms, get up, kick, pick up, point, punch, scratch head, sit down, throw, turn around, walk, wave hand.
The paper is structured as follows. An overview of the A-SOM network is given in section II. Experiments for assessing
the performance of the proposed method are described in
section III. Finally, conclusions and future works are outlined
in section IV.
1 The movie is taken from the “INRIA 4D repository”, available at
http://4drepository.inrialpes.fr. It offers several movies representing sequences
of actions. Each video is captured from 5 different cameras. For the experiments in this paper we chose the movie “Julien1” and only one point of view,
the frontal camera “cam0”.
II. ASSOCIATIVE SELF-ORGANIZING MAP (A-SOM)
Our approach is based on the use of the Associative Self-Organizing Map (A-SOM), which is related to the Self-Organizing Map (SOM). The SOM is a neural network that is trained using unsupervised learning to produce a smaller, discretized
representation of the input space of the training samples. It
resembles the functioning of the brain in pattern recognition
tasks: when presented with an input, it excites neurons in a
specific area. The goal of learning in A-SOMs is to cause
different parts of the network to respond similarly to similar
input patterns while clustering a large input space onto a smaller
output space. Self-Organizing Maps are different from other
artificial neural networks because they use a neighborhood
function to preserve the topological properties of the input
space. The SOM algorithm builds a lattice of neurons, where neurons located close to each other have similar characteristics. The SOM structure is made of one input layer and one output layer, the latter known as the Kohonen layer. The neurons of the two layers are fully connected to each other, whereas the output neurons are also connected to their neighboring neurons. Only one of these neurons can be the winner for each input provided to the network. This winner identifies the class to which the input belongs. Furthermore, the SOM has the capability
to generalize, i.e. the network can recognize or characterize
inputs it has never encountered before.
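For illustration, a minimal plain-SOM sketch in C++ is given below. It is not the Ikaros-based implementation used in this paper; the lattice handling, learning rate and Euclidean winner metric are illustrative assumptions, but it shows the winner-take-all selection and Gaussian neighborhood update that give the map its topology-preserving behavior.

```cpp
#include <cmath>
#include <cstdlib>
#include <vector>

// Minimal plain-SOM sketch (illustrative only; not the Ikaros-based implementation).
struct Som {
    int rows, cols, dim;                    // lattice size and input dimensionality
    std::vector<std::vector<double>> w;     // one weight vector per neuron

    Som(int r, int c, int d) : rows(r), cols(c), dim(d), w(r * c, std::vector<double>(d)) {
        for (auto& v : w)
            for (auto& x : v) x = std::rand() / (double)RAND_MAX;  // random init in [0, 1]
    }

    // Winner = neuron whose weight vector is closest to the input (squared Euclidean distance).
    int winner(const std::vector<double>& x) const {
        int best = 0; double bestDist = 1e300;
        for (int i = 0; i < rows * cols; ++i) {
            double d = 0.0;
            for (int k = 0; k < dim; ++k) d += (x[k] - w[i][k]) * (x[k] - w[i][k]);
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
    }

    // Move the winner and its lattice neighbours towards the input, weighted by a
    // Gaussian neighbourhood of width sigma; this is what preserves the topology.
    void update(const std::vector<double>& x, double alpha, double sigma) {
        int c = winner(x), cr = c / cols, cc = c % cols;
        for (int i = 0; i < rows * cols; ++i) {
            int ir = i / cols, ic = i % cols;
            double dist2 = (ir - cr) * (ir - cr) + (ic - cc) * (ic - cc);
            double g = std::exp(-dist2 / (2.0 * sigma * sigma));
            for (int k = 0; k < dim; ++k) w[i][k] += alpha * g * (x[k] - w[i][k]);
        }
    }
};
```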
The A-SOM is an extension to the SOM which learns to
associate its activity with activity of other neural networks. It
can be considered as a SOM with additional ancillary input from other networks (Fig. 2). The use of the A-SOM leads to a lattice representation of the input data, in which the spatial locations of the resulting prototypes in the lattice are indicative of intrinsic statistical features of the input postures. Moreover, the A-SOM can be connected to itself, associating its activity with its own earlier activity. This makes the A-SOM able to remember and
to complete perceptual sequences over time. Several simulations have shown that the A-SOM, once it has received some initial input, can continue to elicit the activity likely to follow in the near future even though no further input is received [19], [20].

Fig. 2. A-SOM network connected with two other SOM networks. They provide the ancillary input to the main A-SOM (see the main text for more details).
The A-SOM learns to associate its activity with (possibly
delayed) additional inputs. It consists of an I × J grid of
neurons with a fixed number of neurons and a fixed topology.
Each neuron $n_{ij}$ is associated with $r+1$ weight vectors $w^a_{ij} \in \mathbb{R}^n$ and $w^1_{ij} \in \mathbb{R}^{m_1}, w^2_{ij} \in \mathbb{R}^{m_2}, \ldots, w^r_{ij} \in \mathbb{R}^{m_r}$.
All the elements of all the weight vectors are initialized by
real numbers randomly selected from a uniform distribution
between 0 and 1, after which all the weight vectors are
normalized, i.e. turned into unit vectors.
At time $t$ each neuron $n_{ij}$ receives $r+1$ input vectors $x^a(t) \in \mathbb{R}^n$ and $x^1(t-d_1) \in \mathbb{R}^{m_1}, x^2(t-d_2) \in \mathbb{R}^{m_2}, \ldots, x^r(t-d_r) \in \mathbb{R}^{m_r}$, where $d_p$ is the time delay for input vector $x^p$, $p = 1, 2, \ldots, r$.
The main net input $s_{ij}$ is calculated using the standard cosine metric

$$s_{ij}(t) = \frac{x^a(t) \cdot w^a_{ij}(t)}{\|x^a(t)\|\,\|w^a_{ij}(t)\|} \qquad (1)$$
The activity in the neuron $n_{ij}$ is given by

$$y_{ij}(t) = \frac{y^a_{ij}(t) + y^1_{ij}(t) + y^2_{ij}(t) + \ldots + y^r_{ij}(t)}{r+1} \qquad (2)$$
where the main activity $y^a_{ij}$ is calculated by using the softmax function [21]

$$y^a_{ij}(t) = \frac{\left(s_{ij}(t)\right)^m}{\max_{ij}\left(s_{ij}(t)\right)^m} \qquad (3)$$

where $m$ is the softmax exponent.
The ancillary activity $y^p_{ij}(t)$, $p = 1, 2, \ldots, r$, is calculated by again using the standard cosine metric

$$y^p_{ij}(t) = \frac{x^p(t-d_p) \cdot w^p_{ij}(t)}{\|x^p(t-d_p)\|\,\|w^p_{ij}(t)\|}. \qquad (4)$$
The neuron $c$ with the strongest main activation is selected:

$$c = \arg\max_{ij} y^a_{ij}(t) \qquad (5)$$

The weights $w^a_{ijk}$ are adapted by

$$w^a_{ijk}(t+1) = w^a_{ijk}(t) + \alpha(t)\,G_{ijc}(t)\left[x^a_k(t) - w^a_{ijk}(t)\right] \qquad (6)$$

where $0 \le \alpha(t) \le 1$ is the adaptation strength with $\alpha(t) \to 0$ when $t \to \infty$. The neighbourhood function $G_{ijc}(t) = e^{-\frac{\|r_c - r_{ij}\|}{2\sigma^2(t)}}$, where $r_c \in \mathbb{R}^2$ and $r_{ij} \in \mathbb{R}^2$ are the location vectors of neurons $c$ and $n_{ij}$, is a Gaussian function decreasing with time.

The weights $w^p_{ijl}$, $p = 1, 2, \ldots, r$, are adapted by

$$w^p_{ijl}(t+1) = w^p_{ijl}(t) + \beta\,x^p_l(t - d_p)\left[y^a_{ij}(t) - y^p_{ij}(t)\right] \qquad (7)$$

where $\beta$ is the adaptation strength.

All weights $w^a_{ijk}(t)$ and $w^p_{ijl}(t)$ are normalized after each adaptation.

In this paper the input vector $x^1$ is the activity of the A-SOM from the previous iteration, rearranged into a vector, and $d_1 = 1$.
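To make equations (1)-(7) concrete, the following C++ sketch performs one A-SOM step for the configuration used here (a single ancillary input equal to the map's own activity at the previous time step). It is a simplified rendering written for this text, not the authors' Ikaros code; the softmax exponent value, the fixed random seed and the class layout are illustrative assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <vector>

// Simplified A-SOM step following Eqs. (1)-(7); an illustrative sketch, not the Ikaros code.
// One ancillary input (r = 1): the map's own activity from the previous step (d1 = 1).
struct Asom {
    int rows, cols, n;                      // lattice size and main input dimensionality
    int m = 10;                             // softmax exponent (assumed value)
    std::vector<std::vector<double>> wa;    // main weights w^a_ij (one vector per neuron)
    std::vector<std::vector<double>> w1;    // ancillary weights w^1_ij (dim = rows*cols)
    std::vector<double> prevActivity;       // x^1(t-1): the map's own activity one step back

    Asom(int r, int c, int dim)
        : rows(r), cols(c), n(dim),
          wa(r * c, std::vector<double>(dim)),
          w1(r * c, std::vector<double>(r * c)),
          prevActivity(r * c, 0.0) {
        std::srand(1);  // illustrative fixed seed
        for (auto& v : wa) { for (auto& x : v) x = std::rand() / (double)RAND_MAX; normalize(v); }
        for (auto& v : w1) { for (auto& x : v) x = std::rand() / (double)RAND_MAX; normalize(v); }
    }

    static double cosine(const std::vector<double>& a, const std::vector<double>& b) {
        double dot = 0, na = 0, nb = 0;
        for (size_t k = 0; k < a.size(); ++k) { dot += a[k] * b[k]; na += a[k] * a[k]; nb += b[k] * b[k]; }
        return (na > 0 && nb > 0) ? dot / (std::sqrt(na) * std::sqrt(nb)) : 0.0;  // Eqs. (1), (4)
    }

    static void normalize(std::vector<double>& v) {
        double norm = 0; for (double x : v) norm += x * x;
        norm = std::sqrt(norm);
        if (norm > 0) for (double& x : v) x /= norm;
    }

    // One A-SOM step: returns the total activity y_ij and adapts all weights.
    std::vector<double> step(const std::vector<double>& xa, double alpha, double sigma, double beta) {
        int N = rows * cols;
        std::vector<double> s(N), ya(N), y1(N), y(N);
        for (int i = 0; i < N; ++i) s[i] = cosine(xa, wa[i]);                  // Eq. (1)
        double denom = std::pow(*std::max_element(s.begin(), s.end()), m);
        for (int i = 0; i < N; ++i) {
            ya[i] = denom > 0 ? std::pow(s[i], m) / denom : 0.0;               // Eq. (3), softmax
            y1[i] = cosine(prevActivity, w1[i]);                               // Eq. (4), ancillary
            y[i]  = (ya[i] + y1[i]) / 2.0;                                     // Eq. (2), r = 1
        }
        int c = (int)(std::max_element(ya.begin(), ya.end()) - ya.begin());    // Eq. (5)
        int cr = c / cols, cc = c % cols;
        for (int i = 0; i < N; ++i) {
            int ir = i / cols, ic = i % cols;
            double d = std::sqrt((double)((ir - cr) * (ir - cr) + (ic - cc) * (ic - cc)));
            double g = std::exp(-d / (2.0 * sigma * sigma));                   // neighbourhood G_ijc
            for (int k = 0; k < n; ++k)
                wa[i][k] += alpha * g * (xa[k] - wa[i][k]);                    // Eq. (6)
            for (int l = 0; l < N; ++l)
                w1[i][l] += beta * prevActivity[l] * (ya[i] - y1[i]);          // Eq. (7)
            normalize(wa[i]); normalize(w1[i]);                                // unit length
        }
        prevActivity = y;                                                      // becomes x^1 at t+1
        return y;
    }
};
```

Feeding the map one posture vector per frame and reading out the winner coordinates at each step then yields the motion patterns analysed in section III.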
III. EXPERIMENT

We tested the representational capabilities of the A-SOM on an action recognition task. Neurobiological studies argue that the human brain can perceive actions by observing only the human body poses, called postures, during action execution [22]. Thus, actions can be described as sequences of consecutive human body poses, in terms of human body silhouettes [23], [24], [25].

The dataset of actions we chose consists of more than 700 postural images reproducing 13 different actions. Since we want the agent to be able to recognize one action at a time, we split the original movie into 13 different movies: one movie for each action (see Fig. 1). Each frame is preprocessed to reduce the noise induced by the image acquisition process, and so-called posture vectors are extracted (see section III-A below); these are used to create the training set required to train the A-SOM. Our final training set is composed of about 20000 samples, where every sample is a posture vector.

This input is used to train the A-SOM network and to obtain the A-SOM weights; the training lasted about 88000 iterations. The generated weight file is then used to run the tests. All code for the experiments presented in this paper was implemented in C++ using the neural modeling framework "Ikaros" [26]. The next sections detail the preprocessing phase as well as the results obtained.

A. Preprocessing phase
In order to reduce the computational load, preprocessing operations for spatial and temporal reduction were performed. The temporal reduction is achieved by reducing the number of images in each movie: we established that all of the 13 movies should have the same duration of 10 frames. This is a good compromise that keeps the actions seamless and fluid while preserving the quality of the movie. Fig. 3 shows the images reproducing the "walk action" movie; as the picture shows, the reduction of the number of images, and the consequent reduction of the duration, does not affect the quality of the action reproduction.

Fig. 3. Walk action movie created with a reduced number of images.
The spatial reduction is performed by resizing the images. The images are centered at the person's centre of mass, and bounding boxes of size equal to the maximum bounding box enclosing the person's body are extracted. We crop the images using the identified bounding box containing the person performing the action. In this way, we simulate an attentive process in which the human eye observes and follows only the salient part of the action. To further improve the spatial reduction and obtain a faster execution, every image is shrunk to $N_H \times N_W$ pixels to produce binary images of fixed size. Binary posture images are represented as matrices, and these matrices are vectorized to produce posture vectors $p \in \mathbb{R}^D$, $D = N_H \times N_W$; each posture image is thus represented by a posture vector $p$. In this way every action, consisting of 10 frames, is represented by a set of posture vectors $p_i \in \mathbb{R}^D$, $i = 1, \ldots, 10$. The result is 13 movies with the same duration, each of them reproducing one action. In the experiment presented in this paper the values $N_H = 15$, $N_W = 15$ have been used, and binary posture images are scanned row-wise.
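A stand-alone sketch of this preprocessing step is given below. It assumes the silhouette has already been segmented into a binary image; the helper name, the nearest-neighbour resampling and the clamping of the crop window are our own simplifications, since the paper does not specify these details.

```cpp
#include <algorithm>
#include <vector>

// Turn one binary silhouette image (1 = person, 0 = background) into a posture
// vector p in R^D, D = NH x NW, as described in Sec. III-A. Illustrative sketch only.
std::vector<double> postureVector(const std::vector<std::vector<int>>& img,
                                  int boxH, int boxW,      // maximum bounding box of the body
                                  int NH = 15, int NW = 15) {
    int H = (int)img.size(), W = (int)img[0].size();

    // 1) Centre of mass of the silhouette.
    long sumR = 0, sumC = 0, count = 0;
    for (int r = 0; r < H; ++r)
        for (int c = 0; c < W; ++c)
            if (img[r][c]) { sumR += r; sumC += c; ++count; }
    int cr = count ? (int)(sumR / count) : H / 2;
    int cc = count ? (int)(sumC / count) : W / 2;

    // 2) Crop a boxH x boxW window centred on the centre of mass (clamped to the image).
    int top  = std::max(0, std::min(cr - boxH / 2, H - boxH));
    int left = std::max(0, std::min(cc - boxW / 2, W - boxW));

    // 3) Shrink the crop to NH x NW with nearest-neighbour sampling and
    //    flatten it row-wise into the posture vector.
    std::vector<double> p(NH * NW);
    for (int i = 0; i < NH; ++i)
        for (int j = 0; j < NW; ++j) {
            int r = top  + i * boxH / NH;
            int c = left + j * boxW / NW;
            p[i * NW + j] = img[r][c] ? 1.0 : 0.0;
        }
    return p;   // an action is then a sequence of 10 such vectors
}
```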
B. Action classification
The goal of the action classification experiment is to verify
whether the A-SOM is able to discriminate actions. We evaluate
the A-SOM by setting up a system consisting of one A-SOM
connected to itself. To this end, 13 sets containing 10 posture
vectors each (representing the binary images that form the
videos) were constructed as explained above.
We fed the A-SOM with one set at a time, and the centers of activity of the A-SOM generated for each sample of the set were recorded for all these tests. The coordinates of these centers were plotted in a diagram that shows how they are located in the map, as shown in Fig. 4. The centers are connected to each other by arrows describing the temporal evolution of the represented action. These diagrams depict the motion pattern of each action and allow us to evaluate whether the A-SOM can discriminate the binary images that form each action. What we expect is a different motion pattern for each action, and different centers of activity for each distinct posture vector. In this way the classification procedure is greatly facilitated and can easily be implemented through stochastic state machines (e.g. Markov processes or similar machine learning methods [27]).
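As a concrete illustration of such a classification procedure, the sketch below matches the sequence of activity centres elicited by an unknown movie against the stored motion pattern of each of the 13 actions and picks the closest one. This nearest-pattern matcher is only a hypothetical stand-in for the stochastic state machines mentioned above.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <utility>
#include <vector>

// A motion pattern is the sequence of activity-centre coordinates (one per frame)
// elicited by an action movie on the A-SOM lattice.
using MotionPattern = std::vector<std::pair<int, int>>;

// Hypothetical classifier sketch: assign an observed pattern to the stored action
// pattern with the smallest summed frame-by-frame distance between activity centres.
int classify(const MotionPattern& observed, const std::vector<MotionPattern>& stored) {
    int best = -1;
    double bestCost = std::numeric_limits<double>::max();
    for (size_t a = 0; a < stored.size(); ++a) {
        double cost = 0.0;
        size_t T = std::min(observed.size(), stored[a].size());
        for (size_t t = 0; t < T; ++t) {
            double dr = observed[t].first  - stored[a][t].first;
            double dc = observed[t].second - stored[a][t].second;
            cost += std::sqrt(dr * dr + dc * dc);   // distance between centres at frame t
        }
        if (cost < bestCost) { bestCost = cost; best = (int)a; }
    }
    return best;   // index of the recognized action (0..12 for the 13 actions)
}
```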
In the presented experiment, the A-SOM elicits different centers of activity for different posture vectors, creating in this way different motion patterns for different actions, and demonstrating its ability to discriminate actions properly. As mentioned earlier, the patterns are made of centers of activity; therefore they also give information about the ability of the A-SOM to discriminate between the images that form each action. Since the A-SOM elicits the same center of activity for similar posture vectors, action movies made of similar images have few centers of activation, whereas actions composed of images with different characteristics present several centers of activation, one for each different image. For example, in the "check watch" movie, Fig. 4 a), the pattern presents only four centers of activation, whereas in the "punch" movie, Fig. 4 g), the pattern presents ten different centers of activation.
The plotted diagrams, furthermore, show that the A-SOM creates topological maps in which similar binary postures are located close to each other. Consider the motion pattern in Fig. 4 c) of the action "Get up" in Fig. 1 c): the first four binary images, depicting the person sitting, are located close to each other; the next three, representing the person standing up, lie in a nearby region; and the last three images, depicting the completion of the activity, are located farther away. We can thus argue that the A-SOM is able to recognize and classify body postures representing actions.

Fig. 4. Motion patterns for 13 actions: a) Check Watch; b) Cross Arms; c) Get Up; d) Kick; e) Pick Up; f) Point; g) Punch; h) Scratch Head; i) Sit Down; l) Throw; m) Turn Around; n) Walk; o) Wave Hand. The points in the diagrams represent the actions' centers of activity and the arrows indicate the evolution of the action over time. Actions made of similar posture vectors present few centers of activity, whereas movies made of posture vectors with different characteristics present several centers of activity. The diagrams indicate the ability of the A-SOM to create topological maps in which similar binary postures are located close to each other.
IV. CONCLUSION

A new method based on a different type of artificial neural network, the A-SOM, has been proposed to recognize, classify and simulate actions. The proposed method highlights the strength of the A-SOM in classifying the observed action and, thanks to its ability to remember perceptual sequences, the A-SOM should also be suitable for predicting the likely continuation of the perceived behavior of an agent. It has been shown that the A-SOM can properly recognize and classify actions. Future experiments are intended to demonstrate that the A-SOM can receive some initial sensory input and elicit the activity that continues the perceptual sequence even if no further input is received. The A-SOM should be able to internally simulate the sequence of activity likely to follow the activity elicited by the agents' initial behavior, thus providing a way to read the agents' intentions.
ACKNOWLEDGMENT
The authors gratefully acknowledge the support from the
Linnaeus Centre Thinking in Time: Cognition, Communication, and Learning, financed by the Swedish Research Council,
grant no. 349-2007-8695.
REFERENCES
[1] M. Tomasello, M. Carpenter, J. Call, T. Behne, and H. Moll, "Understanding and sharing intentions: the origins of cultural cognition," Behavioral and Brain Sciences, vol. 28, no. 5, pp. 675–91, Oct 2005.
[2] C. Breazeal, Designing Sociable Robots. The MIT Press, 2004.
[3] G. Pezzulo and H. Dindo, "What should I do next? Using shared representations to solve interaction problems," Experimental Brain Research, vol. 211, no. 3-4, pp. 613–630, 2011.
[4] A. Chella, H. Dindo, and I. Infantino, "A cognitive framework for imitation learning," Robotics and Autonomous Systems, vol. 54, no. 5, pp. 403–408, 2006.
[5] ——, "Imitation learning and anchoring through conceptual spaces," Applied Artificial Intelligence, vol. 21, no. 4-5, pp. 343–359, April 2007.
[6] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.
[7] Y. Demiris, "Prediction of intent in robotics and multi-agent systems," Cognitive Processing, vol. 8, no. 3, pp. 151–158, 2007.
[8] G. Rizzolatti and L. Craighero, "The mirror-neuron system," Annual Review of Neuroscience, vol. 27, pp. 169–192, 2004.
[9] M. Iacoboni, I. Molnar-Szakacs, V. Gallese, G. Buccino, J. C. Mazziotta, and G. Rizzolatti, "Grasping the intentions of others with one's own mirror neuron system," PLoS Biol, vol. 3, no. 3, p. e79, Mar 2005.
[10] L. W. Barsalou, "Perceptual symbol systems," Behavioral and Brain Sciences, vol. 22, no. 4, pp. 577–660, 1999.
[11] D. M. Wolpert, K. Doya, and M. Kawato, "A unifying computational framework for motor control and social interaction," Philos. Trans. Royal Soc. London B Biol. Sci., vol. 358, no. 1431, pp. 593–602, Mar 2003.
[12] R. Grush, "The emulation theory of representation: motor control, imagery, and perception," Behavioral and Brain Sciences, vol. 27, no. 3, pp. 377–396, 2004.
[13] A. Dearden and Y. Demiris, "Learning forward models for robotics," in Proceedings of IJCAI-2005, Edinburgh, 2005, pp. 1440–1445.
[14] H. Dindo, D. Zambuto, and G. Pezzulo, "Motor simulation via coupled internal models using sequential Monte Carlo," in Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI), July 16-22 2011, pp. 2113–2119.
[15] R. Poppe, "A survey on vision-based human action recognition," Image and Vision Computing, vol. 28, no. 6, pp. 976–990, 2010.
[16] M. Johnsson, C. Balkenius, and G. Hesslow, "Associative self-organizing map," in IJCCI, 2009, pp. 363–370.
[17] T. Kohonen, "The self-organizing map," Neurocomputing, vol. 21, no. 1, pp. 1–6, 1998.
[18] P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, "Machine recognition of human activities: A survey," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 11, pp. 1473–1488, 2008.
[19] M. Johnsson, D. Gil, C. Balkenius, and G. Hesslow, "Supervised architectures for internal simulation of perceptions and actions," in Proceedings of BICS, 2010.
[20] M. Johnsson, D. G. Mendez, G. Hesslow, and C. Balkenius, "Internal simulation in a bimodal system," in SCAI, 2011, pp. 173–182.
[21] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[22] M. A. Giese and T. Poggio, "Neural mechanisms for the recognition of biological movements," Nature Reviews Neuroscience, vol. 4, no. 3, pp. 179–192, 2003.
[23] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 12, pp. 2247–2253, 2007.
[24] N. Gkalelis, A. Tefas, and I. Pitas, "Combining fuzzy vector quantization with linear discriminant analysis for continuous human movement recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 11, pp. 1511–1521, 2008.
[25] A. Iosifidis, A. Tefas, and I. Pitas, "View-invariant action recognition based on artificial neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 3, pp. 412–424, 2012.
[26] C. Balkenius, J. Morén, B. Johansson, and M. Johnsson, "Ikaros: Building cognitive models for robots," Advanced Engineering Informatics, vol. 24, no. 1, pp. 40–48, 2010.
[27] K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.