Advances in Cognitive Systems 3 (2014) 67–86
Submitted 9/2013; published 8/2014
A Cognitive System for Understanding Human Manipulation Actions
Yezhou Yang
[email protected]
Anupam Guha
[email protected]
Cornelia Fermüller
[email protected]
Yiannis Aloimonos
[email protected]
4470 A. V. Williams Building, University of Maryland, College Park, MD 20742 USA
Abstract
This paper describes the architecture of a cognitive system that interprets human manipulation
actions from perceptual information (image and depth data) and that includes interacting modules for perception and reasoning. Our work contributes to two core problems at the heart of
action understanding: (a) the grounding of relevant information about actions in perception (the
perception-action integration problem), and (b) the organization of perceptual and high-level symbolic information for interpreting the actions (the sequencing problem). At the high level, actions
are represented with the Manipulation Action Grammar, a context-free grammar that organizes actions as a sequence of sub-events. Each sub-event is described by the hand, movements, objects
and tools involved, and the relevant information about these factors is obtained from biologically inspired perception modules. These modules track the hands and objects, and they recognize the
hand grasp, objects and actions using attention, segmentation, and feature description. Experiments
on a new data set of manipulation actions show that our system extracts the relevant visual information and semantic representation. This representation could further be used by the cognitive agent
for reasoning, prediction, and planning.
1. Introduction
Cognitive systems that interact with humans must be able to interpret actions. Here we are concerned with manipulation actions. These are actions performed by agents (humans or robots) on
objects, which result in some physical change of the object. There has been much work recently on
action recognition, with most studies considering short-lived actions, where the beginning and end
of the sequence is defined. Most efforts have focused on two problems of great interest to the study
of perception: the recognition of movements and the recognition of associated objects. However,
the more complex an action, the less reliable individual perceptual events are for the characterization
of actions. Thus, the problem of interpreting manipulation actions involves many more challenges
than simply recognizing movements and objects, due to the many ways that humans can perform
them.
Since perceptual events do not suffice, how do we determine the beginning and end of action
segments, and how do we combine the individual segments into longer ones corresponding to a
manipulation action? An essential component in the description of manipulations is the underlying
goal. The goal of a manipulation action is the physical change induced on the object. To accomplish
it, the hands must perform a sequence of sub actions on the object, such as when the hand grasps or
releases the object, or when the hand changes the grasp type during a movement. Centered around
this idea, we develop a grammatical formalism for parsing and interpreting action sequences, and
we also develop vision modules to obtain from dynamic imagery the symbolic information used in
the grammatical structure.
Our formalism for describing manipulation actions uses a structure similar to natural language.
What do we gain from this formal description of action? This is equivalent to asking what one gains
from a formal description of language. Chomsky's contribution was the formal description of language through his generative and transformational grammar (1957). This revolutionized
language research, opened up new roads for its computational analysis and provided researchers
with common, generative language structures and syntactic operations on which language analysis
tools were built. Similarly, a grammar for action provides a common framework of the syntax and
semantics of action, so that basic tools for action understanding can be built. Researchers can then
use these tools when developing action interpretation systems, without having to start from scratch.
The input into our system for interpreting manipulation actions is perceptual data, specifically
sequences of images and depth maps. Therefore, a crucial part of our system is the vision process,
which obtains atomic symbols from perceptual data. In Section 3, we introduce an integrated vision
system with attention, segmentation, hand tracking, grasp classification, and action recognition. The
vision processes produce a set of symbols: the “Subject”, “Action”, and “Object” triplets, which
serve as input to the reasoning module. At the core of our reasoning module is the manipulation
action context-free grammar (MACFG). This grammar comes with a set of generative rules and a
set of parsing algorithms. The parsing algorithms use two main operations – “construction” and
“destruction” – to dynamically parse a sequence of tree (or forest) structures made up from the
symbols provided by the vision module. The sequence of semantic tree structures could then be
used by the cognitive system to perform reasoning and prediction. Figure 1 shows the flow chart of
our cognitive system.
Figure 1. Overview of the manipulation action understanding system, including feedback loops within and
between some of the modules. The feedback is denoted by the dotted arrow.
2. Related Work
The problem of human activity recognition and understanding has attracted considerable interest
in computer vision in recent years. Both visual recognition methods and nonvisual methods using
motion capture systems (Guerra-Filho, Fermüller, & Aloimonos, 2005; Li et al., 2010) have been
used. Moeslund et al. (2006), Turaga et al. (2008), and Gavrila (1999) provide surveys of the
former. There are many applications for this work in areas such as human computer interaction,
biometrics, and video surveillance. Most visual recognition methods learn the visual signature of
each action from many spatio-temporal feature points (e.g., Dollár et al., 2005; Laptev, 2005; Wang
& Suter, 2007; Willems, Tuytelaars, & Van Gool, 2008). Work has focused on recognizing single
human actions like walking, jumping, or running (Ben-Arie et al., 2002; Yilmaz & Shah, 2005).
Approaches to more complex actions have employed parametric models such as hidden Markov
models (Kale et al., 2004) to learn the transitions between image frames (e.g., Aksoy et al., 2011;
Chaudhry et al., 2009; Hu et al., 2000; Saisan et al., 2001).
The problem of understanding manipulation actions is also of great interest in robotics, which
focuses on execution. Much work has been devoted to learning from demonstration (Argall et al.,
2009), such as the problem of a robot with hands learning to manipulate objects by mapping the
trajectory observed for people performing the action to the robot body. These approaches have emphasized signal-to-signal mapping and lack the ability to generalize. More recently, within the domain of robot control research, Fainekos et al. (2005) have used temporal logic for hybrid controller
design, and later Dantam and Stilman (2013) suggested a grammatical formal system to represent
and verify robot control policies. Wörgötter et al. (2012) and Aein et al. (2013) created a library of
manipulation actions through semantic object-action relations obtained from visual observation.
There have also been many syntactic approaches to human activity recognition that use the
concept of context-free grammars, because they provide a sound theoretical basis for modeling
structured processes. Brand (1996) used a grammar to recognize disassembly tasks that contain hand
manipulations. Ryoo and Aggarwal (2006) used the context-free grammar formalism to recognize
composite human activities and multi-person interactions. It was a two-level hierarchical approach
in which the lower level was composed of hidden Markov models and Bayesian networks while
the higher-level interactions were modeled by CFGs. More recent methods have used stochastic
grammars to deal with errors from low-level processes such as tracking (Ivanov & Bobick, 2000;
Moore & Essa, 2002). This work showed that grammar-based approaches can be practical in activity
recognition systems and provided insight for understanding human manipulation actions. However,
as mentioned, thinking about manipulation actions solely from the viewpoint of recognition has
obvious limitations. In this work, we adopt principles from CFG-based activity recognition systems,
with extensions to a minimalist grammar that accommodates not only the hierarchical structure of
human activity, but also human-tool-object interactions. This approach lets the system serve as the
core parsing engine for manipulation action interpretation.
Chomsky (1993) suggested that a minimalist generative structure, similar to the one in human
language, also exists for action understanding. Pastra and Aloimonos (2012) introduced a minimalist grammar of action, which defines the set of terminals, features, non-terminals and production
rules for the grammar in the sensorimotor domain. However, this was a purely theoretical description. The first implementation used only objects as sensory symbols (Summers-Stay et al., 2013).
Then Guha et al. (2013) proposed a minimalist set of atomic symbols to describe the movements in
manipulation actions. In the field of natural language understanding, which traces back to the 1960s,
Schank and Tesler proposed the Conceptual Dependency theory (1969) to represent content inferred
from natural language input. In this theory, a sentence is represented as a series of diagrams representing both mental and physical actions that involve agents and objects. These actions are built
from a set of primitive acts, which include atomic symbols like GRASP and MOVE. Manikonda
et al. (1999) have also discussed the relationship between languages and motion control.
Here we extend the minimalist action grammar of Pastra and Aloimonos (2012) to dynamically
parse the observations by providing a set of operations based on a set of context-free grammar rules.
Then we provide a set of biologically inspired visual processes that compute from the low-level
signals the symbols used as input to the grammar in the form of (Subject, Action, Object). By
integrating the perception modules with the reasoning module, we obtain a cognitive system for
human manipulation action understanding.
3. A Cognitive System For Understanding Human Manipulation Actions
In this section, we first describe the Manipulation Action Context-Free Grammar and the parsing
algorithms based on it. Then we discuss the vision methods: the attention mechanism, the hand
tracking and action recognition, the object monitoring and recognition, and the action consequence
classification.
3.1 A Context-Free Manipulation Action Grammar
Our system includes vision modules that generate a sequence of "Subject", "Action", "Patient"
triplets from the visual data, and a reasoning module that takes in the sequence of triplets and builds
them into semantic tree structures. The binary tree structure represents the parse trees, in which
leaf nodes are observations from the vision modules and the non-leaf nodes are non-terminal symbols. At any given stage of the process, the representation may have multiple tree structures. For
implementation reasons, we use a DUMMY root node to combine multiple trees. Extracting semantic trees from observed manipulation actions is the goal of the cognitive system.
Before introducing the parsing algorithms, we first introduce the core of our reasoning module:
the Manipulation Action Context-Free Grammar (MACFG). In formal language theory, a context-free grammar is a formal grammar in which every production rule is of the form V → w, where
V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (w can be
empty). The basic recursive structure of natural languages, the way in which clauses nest inside
other clauses, and the way in which lists of adjectives and adverbs are followed by nouns and verbs,
is described exactly by a context-free grammar.
Similarly for manipulation actions, every complex activity is built of smaller blocks. Using linguistic notation, a block consists of a "Subject", "Action", and "Patient" triplet. Here a "Subject"
can be either a hand or an object, and the same holds for the “Patient”. Furthermore, a complex activity also has a basic recursive structure, and can be decomposed into simpler actions. For example,
the typical manipulation activity “sawing a plank” is described by the top-level triplet “handsaw saw
plank”, and has two lower-level triplets (which come before the top-level action in time), namely
"hand grasp saw" and "hand grasp plank". Intuitively, the process of observing and interpreting
manipulation actions is syntactically quite similar to natural language understanding. Thus, the
Manipulation Grammar (Table 1) is presented to parse manipulation actions.

Table 1. A Manipulation Action Context-Free Grammar.

    AP → A O | A HP            (1)
    HP → H AP | HP AP          (2)
    H → h,   A → a,   O → o    (3)
The nonterminals H, A, and O represent the hand, the action and the object (the tools and objects
under manipulation) respectively, and the terminals h, a, and o are the observations. AP stands for
Action Phrase and HP for Hand Phrase. They are proposed here as XPs following the X-bar theory,
which is used to construct the logical form of the semantic structure (Jackendorff, 1977).
The design of this grammar is motivated by three observations: (i) Hands are the main driving
force in manipulation actions, so a specialized nonterminal symbol H is used for their representation; (ii) An Action (A) can be applied to an Object (O) directly or to a Hand Phrase (HP), which in
turn contains an Object (O), as encoded in Rule (1), which builds up an Action Phrase (AP); (iii) An
Action Phrase (AP) can be combined either with the Hand (H) or a Hand Phrase, as encoded in rule
(2), which recursively builds up the Hand Phrase. The rules discussed in Table 1 form the syntactic
rules of the grammar used in the parsing algorithms.
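To make these rules concrete, the following minimal sketch encodes Table 1 with NLTK's context-free grammar utilities and parses a single observed (hand, action, object) triplet; the use of NLTK here is an illustrative choice of ours, not part of the original system.

    import nltk

    # Table 1 written in NLTK's grammar syntax; the start symbol is HP, and the
    # quoted lower-case symbols are the terminal observations from the vision modules.
    MACFG = nltk.CFG.fromstring("""
    HP -> H AP | HP AP
    AP -> A O | A HP
    H -> 'h'
    A -> 'a'
    O -> 'o'
    """)

    # A triplet such as "hand grasp knife" maps to the terminal string h a o.
    parser = nltk.ChartParser(MACFG)
    for tree in parser.parse(['h', 'a', 'o']):
        tree.pretty_print()   # draws the tree (HP (H h) (AP (A a) (O o)))

The same chart-parsing machinery underlies the dynamic construction and destruction operations described next.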
3.2 Cognitive MACFG Parsing Algorithms
Our aim for this project is not only to provide a grammar for representing manipulation actions, but
also to develop a set of operations that can automatically parse (create or dissolve) the semantic tree
structures. This is crucial for practical purposes, since parsing a manipulation action is inherently an
on-line process. The observations are obtained in a temporal sequence. Thus, the parsing algorithm
for the grammar should be able to dynamically update the tree structures. At any point, the current
leaves of the semantic forest structures represent the actions and objects involved so far. When a new
triplet of (“Subject”, “Action”, “Patient”) arrives, the parser updates the tree using the construction
or destruction routine.
Theoretically, the non-regular context-free language defined in Table 1 can be recognized by
a non-deterministic pushdown automaton. However, unlike language input, the perception
input naturally arrives as a temporal sequence of observations. Thus, instead of simply building a non-deterministic pushdown automaton, the system requires a special set of parsing operations.
Our parsing algorithm differentiates between constructive and destructive actions. Constructive
actions are the movements that start with the hand (or a tool) coming in contact with an object and
usually result in a certain physical change on the object (a consequence), e.g., "Grasp", "Cut", or
"Saw". Destructive actions are movements at the end of physical-change-inducing actions, when
the hand (or tool) separates from the object; some examples are "Ungrasp" or "FinishedCut".

Figure 2. The (a) construction and (b) destruction operations. Fine dashed lines are newly added connections, crosses mark node deletions, and fine dotted lines are connections to be deleted.

A
constructive action may or may not have a corresponding destructive action, but every destructive
action must have a corresponding constructive action. Otherwise the parsing algorithm detects
an error. In order to facilitate the action recognition, a look-up table that stores the constructive-destructive action pairs is used. This knowledge can be learned and further expanded.
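For illustration, such a look-up table can be as small as a dictionary that maps each destructive symbol to its constructive counterpart; the pairs below are those listed later in Table 2, and the helper names are ours.

    # Constructive-destructive action pairs (see Table 2); the table can be
    # extended as new pairs are learned.
    CONSTRUCTIVE_OF = {
        "Ungrasp": "Grasp",
        "FinishedCut": "Cut",
        "FinishedSaw": "Saw",
        "FinishedAssemble": "Assemble",
    }

    def is_destructive(action: str) -> bool:
        # A destructive action closes a known constructive action.
        return action in CONSTRUCTIVE_OF

    def is_constructive(action: str) -> bool:
        return action in CONSTRUCTIVE_OF.values()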
The algorithm builds a tree structure Ts. This structure is updated as new observations are
received. Once an observation triplet "Subject", "Action", and "Patient" arrives, the algorithm
checks whether the "Action" is constructive or destructive and then follows one of two pathways. If
the "Action" is constructive, a construction routine is used. Otherwise a destruction routine is used
(refer to Algorithms 1, 2, and 3 for details). The process continues until the
last observation. Two illustrations in Figure 2 demonstrate how the construction and the destruction
routines work. The parse operation amounts to a chart parser (Younger, 1967), which takes in the
three nonterminals and performs bottom-up parsing following the context-free rules from Table 1.
Algorithm 1 Dynamic manipulation action tree parsing
Initialize an empty tree group (forest) Ts
while New observation (subject s, action a, patient p) do
  if a is a constructive action then
    construction(Ts, s, a, p)
  end if
  if a is a destructive action then
    destruction(Ts, s, a, p)
  end if
end while
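In Python, Algorithm 1 reduces to a dispatch loop over the incoming triplets, using the look-up helpers sketched above; the construction and destruction arguments stand for Algorithms 2 and 3, which are assumed to be implemented separately.

    def parse_stream(observations, forest, construction, destruction):
        # Algorithm 1: each (subject, action, patient) triplet updates the
        # forest through the construction or destruction routine.
        for subject, action, patient in observations:
            if is_constructive(action):
                construction(forest, subject, action, patient)
            elif is_destructive(action):
                destruction(forest, subject, action, patient)
            else:
                raise ValueError("unknown action symbol: " + action)
        return forest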
Algorithm 2 construction(Ts, s, a, p)
Input: previous tree group (forest) Ts and new observation (subject s, action a, patient p)
if s is a hand h and p is an object o then
  Find the highest subtrees Th and To from Ts containing h and o. If h or o is not in the current
  forest, create a new subtree Th or To, respectively.
  parse(Th, a, To), and attach the result to update Ts.
end if
if s is an object o1 and p is another object o2 then
  Find the highest subtrees To1 and To2 from Ts containing o1 and o2, respectively. If either o1 or o2
  is not in the current forest, create a new subtree To1 or To2. If neither o1 nor o2 is in the current
  forest, raise an error.
  parse(To1, a, To2), and attach the result to update Ts.
end if
Figure 3 shows a typical manipulation action example. The parsing algorithm takes as input
a sequence of key observations: “LeftHand Grasp Knife; RightHand Grasp Eggplant; Knife Cut
Eggplant; Knife FinishedCut Eggplant; RightHand Ungrasp Eggplant; LeftHand Ungrasp Knife”.
Then a sequence of six tree structures is built up or dissolved along the timeline. We provide
more examples in Section 4, and a sample implementation of the parsing algorithm at
http://www.umiacs.umd.edu/~yzyang/MACFG. For clarity, Figure 3 uses a dummy root node to create a
tree structure from a forest and numbers the nonterminal nodes.
3.3 Attention Mechanism with the Torque Operator
It is essential for our cognitive system to have an effective attention mechanism, because the amount
of information in real world images is vast. Visual attention, the process of driving an agent’s
attention to a certain area, is based on both bottom-up processes defined on low-level visual features
and top-down processes influenced by the agent’s previous experience and goals (Tsotsos, 1990).
Recently, Nishigaki et al. (2012) have provided a vision tool, called the image torque, that captures
Algorithm 3 destruction(Ts, s, a, p)
Input: previous tree structures Ts and new observation (subject s, action a, patient p)
Find the corresponding constructive action of a from the look-up table and denote it as a′
if there exists a lowest subtree Ta′ that contains both s and a′ then
  Remove every node on the path from the root of Ta′ to a′.
  if Ta′ has a parent node then
    Connect the highest subtree that contains s with Ta′'s parent node.
  end if
  Leave all the remaining subtrees as individual trees.
end if
Set the rest of Ts as the new Ts.
Figure 3. Here the system observes a typical manipulation action example, “Cut an eggplant”, and builds a
sequence of six trees to represent this action.
the concept of closed contours using bottom-up processes. Basically, the torque operator takes
simple edges as input and computes, over regions of different sizes, a measure of how well the
edges are aligned to form a closed, convex contour.
The underlying motivation is to find object-like regions by computing the “coherence” of the
edges that support the object. Edge coherence is measured via the cross-product between the tangent
vector of the edge pixel and a vector from a center point to the edge pixel, as shown in Figure 4(a).
Formally, the value of the torque τ_pq of an edge pixel q within a discrete image patch with center p is
defined as

    τ_pq = ‖r⃗_pq‖ sin θ_pq ,    (1)

where r⃗_pq is the displacement vector from p to q, and θ_pq is the angle between r⃗_pq and the tangent
vector at q. The sign of τ_pq depends on the direction of the tangent vector, and for this work, our
system computes the direction based on the intensity contrast along the edge pixel. The torque of an
image patch P is defined as the sum of the torque values of all edge pixels E(P) within the patch:

    τ_P = (1 / (2|P|)) Σ_{q ∈ E(P)} τ_pq .    (2)
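As a rough sketch of Equations (1) and (2), the torque of a square patch can be computed from edge pixel coordinates and their tangent orientations with NumPy; the array layout and the way edges and tangents are obtained are our assumptions, not the original implementation.

    import numpy as np

    def patch_torque(edge_xy, tangent_angle, center, half_size):
        # Sum ||r_pq|| * sin(theta_pq) over the edge pixels inside a square
        # patch of side 2*half_size centered at `center`, normalized by 2|P|.
        edge_xy = np.asarray(edge_xy, dtype=float)        # (N, 2) edge coordinates
        r = edge_xy - np.asarray(center, dtype=float)     # displacement vectors r_pq
        inside = np.all(np.abs(r) <= half_size, axis=1)   # edge pixels E(P) in the patch
        r = r[inside]
        ang = np.asarray(tangent_angle, dtype=float)[inside]
        tangent = np.stack([np.cos(ang), np.sin(ang)], axis=1)   # unit tangent vectors
        # 2-D cross product r x t = ||r|| sin(theta), since ||t|| = 1.
        cross = r[:, 0] * tangent[:, 1] - r[:, 1] * tangent[:, 0]
        area = (2 * half_size) ** 2                       # |P|, the patch area in pixels
        return cross.sum() / (2 * area)

Patches whose torque magnitude is among the largest in the frame then yield the candidate fixation points shown in Figure 4(c).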
Figure 4. (a) Torque for images, (b) a sample input frame, and (c) the torque operator response. Crosses mark the
pixels with the most extreme torque values, which serve as potential fixation points.
In this work, our system processes the testing sequences by applying the torque operators to obtain
possible initial fixation points for the object monitoring process. Figure 4 shows an example of the
application of the torque operator.
The system also employs a top-down attention mechanism; it uses the hand location to guide
the attention. Here, we integrate the bottom-up torque operator output with hand tracking. Potential
objects under manipulation are found when one of the hand regions intersects a region with high
torque responses, after which the object monitoring system (Section 3.5) monitors it.
3.4 Hand Tracking, Grasp Classification and Action Recognition
With the recent development of a vision-based, markerless, fully articulated model-based human
hand tracking system (Oikonomidis, Kyriazis, & Argyros, 2011) (http://cvrlcode.ics.forth.gr/handtracking/),
the system is able to track a 26-degree-of-freedom model of the hand. It is worth
noting, however, that for a simple classification of movements into a small number of actions, the
location of the hands and objects would be sufficient. Moreover, with the full hand model, a finer
granularity of description can be achieved by classifying the tracked hand-model into different grasp
types.
We collected training data from different actions, which was then processed. A set of bio-inspired features, following hand anatomy (Tubiana, Thomine, & Mackin, 1998), was extracted.
Intuitively, the arches formed by the fingers are crucial for differentiating grasp types. Figure 5
shows that the fixed and mobile parts of the hand adapt to various everyday tasks by forming bony
arches: longitudinal arches (the rays formed by finger bones and associated metacarpal bones), and
oblique arches (between the thumb and four fingers).
In each image frame, our system computed the oblique and longitudinal arches to obtain an
eight-parameter feature vector, as in Figure 6(a). We further reduced the dimensionality of the
feature space by Principal Component Analysis and then applied k-means clustering to discover
the underlying four general types of grasp, which are Rest, Firm Grasp, Delicate Grasp (Pinch)
and Extension Grasp. To classify a given test sequence, the data was processed as described above
and then the grasp type was computed using a naive Bayesian classifier. Figure 6(c) and (d) show
examples of the classification result.
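A sketch of this grasp classification pipeline with scikit-learn is shown below; the 8-parameter arch features are assumed to be given as an array, the number of retained principal components is our choice, and the mapping from cluster indices to the four grasp names still has to be assigned by inspection.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.naive_bayes import GaussianNB

    GRASP_TYPES = ["Rest", "Firm Grasp", "Delicate Grasp (Pinch)", "Extension Grasp"]

    def train_grasp_classifier(arch_features):
        # arch_features: (N, 8) array of oblique/longitudinal arch parameters.
        pca = PCA(n_components=3)                       # reduce the feature space
        reduced = pca.fit_transform(arch_features)
        labels = KMeans(n_clusters=4, n_init=10).fit_predict(reduced)   # four grasp clusters
        clf = GaussianNB().fit(reduced, labels)         # naive Bayes on the cluster labels
        return pca, clf

    def classify_grasp(pca, clf, frame_features):
        # Classify one frame's 8-parameter feature vector into a cluster index.
        return int(clf.predict(pca.transform(np.atleast_2d(frame_features)))[0])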
Figure 5. (a) Bones of the human hand. (b) Arches of the hand: (1) one of the oblique arches; (2) one of the
longitudinal arches of the digits; (3) transverse metacarpal arch; (4) transverse carpal arch. Source: Kapandji
& Honoré, 2007.
The grasp classification is used to segment the image sequence in time and also serves as part of
the action description. In addition, our system uses the trajectory of the mass center of the hands to
classify the actions. The hand-tracking software provides the hand trajectories (of the given action
sequence between the onset of grasp and release of the object), from which our system computed
global features of the trajectory, including the frequency and velocity components. Frequency is
encoded by the first four real coefficients of the Fourier transform in all the x, y and z directions,
which gives a 24-dimensional vector over both hands. Velocity is encoded by averaging the difference in hand positions between adjacent timestamps, which gives a six-dimensional vector.
These features are then combined to yield a 30-dimensional vector that the system uses for action
recognition (Teo et al., 2012).
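The 30-dimensional trajectory descriptor can be assembled as sketched below: four real Fourier coefficients per axis for each hand (24 values) plus the mean frame-to-frame velocity per axis (6 values); the (T, 3) array layout for the hand trajectories is an assumption of ours.

    import numpy as np

    def trajectory_features(left_xyz, right_xyz):
        # left_xyz, right_xyz: (T, 3) hand-center positions over T frames.
        # 4 FFT coefficients x 3 axes x 2 hands + 3 velocities x 2 hands = 30 values.
        feats = []
        for traj in (np.asarray(left_xyz, float), np.asarray(right_xyz, float)):
            for axis in range(3):
                coeffs = np.fft.rfft(traj[:, axis])
                feats.extend(np.real(coeffs[:4]))              # frequency components
            feats.extend(np.diff(traj, axis=0).mean(axis=0))   # average velocity per axis
        return np.asarray(feats)                               # shape (30,)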
3.5 Object Monitoring and Recognition
Manipulation actions commonly involve objects. In order to obtain the information necessary to
monitor the objects being worked on, our system applies a new method combining segmentation and
tracking (Yang, Fermüller, & Aloimonos, 2013).

Figure 6. (a) One example of fully articulated hand model tracking, (b) a 3-D illustration of the tracked model, and (c-d) examples of grasp type recognition for both hands.

Figure 7. Flow chart of the active segmentation and tracking method for object monitoring: (1) weighting; (2) weighted graph-cut segmentation; (3) point propagation and filtering; (4) updating the target tracking model.

This method combines stochastic tracking (Han
et al., 2009) with a fixation-based active segmentation (Mishra, Fermüller, & Aloimonos, 2009).
The tracking module provides a number of tracked points. The locations of these points define an
area of interest and a fixation point for the segmentation. The data term of the segmentation module
uses the color immediately surrounding the fixation points. The segmentation module segments
the object and updates the appearance model for the tracker. Using this method, the system tracks
objects as they deform and change topology (two objects can merge, or an object can be divided
into parts).
Figure 7 illustrates the method over time. The method used is a dynamic closed-loop process,
where active segmentation provides the target model for the next tracking step and stochastic tracking provides the attention field for the active segmentation.
For object recognition, our system simply uses color information. The system uses a color
distribution model to be invariant to various textures or patterns. A function h(xi ) is defined to
create a color histogram, which assigns one of the m bins to a given color at location x_i. To make
the algorithm less sensitive to lighting conditions, the system uses the Hue-Saturation-Value color
space with less sensitivity in the V channel (8 × 8 × 4 bins). The color distribution for segment s^(n)
is denoted as

    p_{s^(n)}(u) = γ Σ_{i=1..I} k(‖y − x_i‖) δ[h(x_i) − u] ,    (3)

where u = 1 . . . m, δ(·) is the Kronecker delta function, and γ = 1 / Σ_{i=1..I} k(‖y − x_i‖) is the
normalization term. k(·) is a weighting function designed from the intuition that not all pixels in the
sampling region are equally important for describing the color model. Specifically, pixels that are
farther away from the fixation point are assigned smaller weights:

    k(r) = 1 − r²  if r < a,  and  0  otherwise ,    (4)

where the parameter a is used to adapt the size of the region, and r is the distance from the fixation point.
By applying the weighting function, we increase the robustness of the distribution by weakening the
influence of boundary pixels that may belong to the background or are occluded.
Since the objects in our experiments have distinct color profiles, the color distribution model
used was sufficient to recognize the segmented objects. We manually labelled several examples
from each object class as training data and used a k-nearest-neighbours classifier. Figure 9 shows
sample results.
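A sketch of Equations (3) and (4) in NumPy is given below; it assumes the segment's pixels are already in HSV with values scaled to [0, 1), and it normalizes the distance r by the region size a, which is our reading of how the parameter adapts the region. Recognition then reduces to nearest-neighbour matching of these histograms against the labelled exemplars.

    import numpy as np

    def weighted_hsv_histogram(hsv_pixels, xy, fixation, a, bins=(8, 8, 4)):
        # hsv_pixels: (N, 3) HSV values in [0, 1); xy: (N, 2) pixel coordinates;
        # fixation: (2,) fixation point; a: region-size parameter of Eq. (4).
        hsv = np.asarray(hsv_pixels, dtype=float)
        r = np.linalg.norm(np.asarray(xy, float) - np.asarray(fixation, float), axis=1)
        w = np.where(r < a, 1.0 - (r / a) ** 2, 0.0)          # weighting function k(r)
        # h(x_i): map each HSV value to one of the 8 x 8 x 4 bins.
        idx = [np.clip((hsv[:, c] * bins[c]).astype(int), 0, bins[c] - 1)
               for c in range(3)]
        flat = (idx[0] * bins[1] + idx[1]) * bins[2] + idx[2]
        hist = np.bincount(flat, weights=w, minlength=bins[0] * bins[1] * bins[2])
        return hist / max(hist.sum(), 1e-12)                  # gamma normalization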
3.6 Detection of Manipulation Action Consequences
Taking an object-centric point of view, manipulation actions can be classified into six categories
according to how the manipulation transforms the object, or, in other words, what consequence the
action has on the object. These categories are Divide, Assemble, Create, Consume, Transfer, and
Deform.
To describe these action categories we need a formalism. We use the visual semantic graph
(VSG) inspired by the work of Aksoy et al. (2011). This formalism takes as input computed
object segments, their spatial relationship, and the temporal relationship over consecutive frames.
An undirected graph G(V, E, P) represents each frame. The vertex set V represents the set of
semantically meaningful segments, and the edge set E represents the spatial relations between any
two segments. Two segments are connected when they share parts of their borders, or when one
of the segments is contained in the other. If two nodes v1, v2 ∈ V are connected, E(v1, v2) = 1;
otherwise, E(v1, v2) = 0. In addition, every node v ∈ V is associated with a set of properties P(v)
that describes the attributes of the segment. This set of properties provides additional information
to discriminate between the different categories, and in principle many properties are possible. Here
we use location, shape, and color.
The system needs to compute the changes in the object over time. Our formulation expresses
this as the change in the VSG. At any time instance t, we consider two consecutive graphs, the
graph at time t − 1, denoted as Ga(Va, Ea, Pa), and the graph at time t, denoted as Gz(Vz, Ez, Pz). We
then use this formalism to represent four consequences, where ← denotes a temporal correspondence
between two vertices and ↛ denotes no correspondence:

• Divide: {∃v1 ∈ Va; v2, v3 ∈ Vz | v1 ← v2, v1 ← v3} or
  {∃v1, v2 ∈ Va; v3, v4 ∈ Vz | Ea(v1, v2) = 1, Ez(v3, v4) = 0, v1 ← v3, v2 ← v4}    Condition (1)

• Assemble: {∃v1, v2 ∈ Va; v3 ∈ Vz | v1 ← v3, v2 ← v3} or
  {∃v1, v2 ∈ Va; v3, v4 ∈ Vz | Ea(v1, v2) = 0, Ez(v3, v4) = 1, v1 ← v3, v2 ← v4}    Condition (2)

• Create: {∀v ∈ Va; ∃v1 ∈ Vz | v ↛ v1}    Condition (3)

• Consume: {∀v ∈ Vz; ∃v1 ∈ Va | v ↛ v1}    Condition (4)
While the above actions can be defined purely on the basis of topological changes, there are no such
changes for Transfer and Deform. Therefore, we must define them through changes in property:
• Transfer: {∃v1 ∈ Va; v2 ∈ Vz | P_a^L(v1) ≠ P_z^L(v2)}    Condition (5)

• Deform: {∃v1 ∈ Va; v2 ∈ Vz | P_a^S(v1) ≠ P_z^S(v2)}    Condition (6)

P^L represents properties of location, and P^S represents properties of appearance (e.g., shape or
color). Figure 8 illustrates Conditions (1) to (6). The active object monitoring process introduced in
Section 3.5 is used to (1) find correspondences (←) between Va and Vz and (2) monitor the location
property P^L and the appearance property P^S in the VSG.
Figure 8. Illustration of the changes for condition (1) to (6).
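The condition checks can be phrased directly over two consecutive VSGs, as in the sketch below; it assumes the correspondence relation ← has already been computed by the monitoring process and is passed in as a set of (va, vz) pairs, and the property-change predicates are supplied by the caller.

    def detect_consequence(Va, Vz, corr, edges_a, edges_z, loc_changed, shape_changed):
        # Va, Vz: vertex sets of the VSGs at t-1 and t; corr: set of (va, vz) pairs;
        # edges_a / edges_z: sets of frozenset({u, v}) for connected segments;
        # loc_changed / shape_changed: predicates on a corresponded (va, vz) pair.
        targets, sources = {}, {}
        for va, vz in corr:
            targets.setdefault(va, set()).add(vz)
            sources.setdefault(vz, set()).add(va)
        # Conditions (1)/(2), vertex form: a segment splits or several segments merge.
        if any(len(t) > 1 for t in targets.values()):
            return "Divide"
        if any(len(s) > 1 for s in sources.values()):
            return "Assemble"
        # Conditions (1)/(2), edge form: connectivity disappears or appears.
        pairs = list(corr)
        for i, (va1, vz1) in enumerate(pairs):
            for va2, vz2 in pairs[i + 1:]:
                a_conn = frozenset({va1, va2}) in edges_a
                z_conn = frozenset({vz1, vz2}) in edges_z
                if a_conn and not z_conn:
                    return "Divide"
                if z_conn and not a_conn:
                    return "Assemble"
        # Conditions (3)/(4): a segment appears without a source, or vanishes.
        if any(vz not in sources for vz in Vz):
            return "Create"
        if any(va not in targets for va in Va):
            return "Consume"
        # Conditions (5)/(6): a property changes without a topological change.
        if any(loc_changed(va, vz) for va, vz in corr):
            return "Transfer"
        if any(shape_changed(va, vz) for va, vz in corr):
            return "Deform"
        return None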
Our system integrates the visual modules in the following manner. Since hands are the most important components in manipulation actions, a state-of-the-art markerless hand tracking system obtains
skeleton models of both hands. Using this data, our system classifies the manner in which the human grasps the objects into four primitive categories. On the basis of the grasp classification, our
system finds the start and end points of action sequences. Our system then classifies the action from
the hand trajectories, the hand grasp, and the consequence on the object (as explained above). To
obtain objects, our system monitors the manipulated object using a process that combines stochastic tracking with active segmentation, then recognizes the segmented objects using color. Finally,
based on the monitoring process, our system checks and classifies the effect on the object into one
of four fundamental types of "consequences". The final output is a sequence of "Subject", "Action",
"Patient" triplets, which the manipulation grammar parser takes as input to build up semantic
structures.
4. Experiments
The theoretical framework we have presented suggests two hypotheses that deserve empirical tests:
(a) manipulation actions performed by single human agents obey a manipulation action context-free grammar that includes manipulators, actions, and objects as terminal symbols; (b) a variant
on chart parsing that includes both constructive and destructive actions, combined with methods
for hand tracking and action and object recognition from RGBD data, can parse observed human
manipulation actions.
To test the two hypotheses empirically, we need to define a set of performance variables and
how they relate to our predicted results. The first hypothesis relates to representations, and we
can empirically test if it is possible to manually generate target trees for each manipulation action
in the test set. The second hypothesis has to do with processes, and can be tested by observing
how many times our system builds a parse tree successfully from observed human manipulation
actions.

Table 2. "Hands", "Objects", and "Actions" involved in the experiments.

Category        Hand                  Object                      Constructive Action     Destructive Action
Kitchen         LeftHand, RightHand   Bread, Cheese, Tomato,      Grasp, Cut, Assemble    Ungrasp, FinishedCut,
                                      Eggplant, Cucumber                                  FinishedAssemble
Manufacturing   LeftHand, RightHand   Plank, Saw                  Grasp, Saw, Assemble    Ungrasp, FinishedSaw,
                                                                                          FinishedAssemble

The theoretical framework for this consists of two parts: (1) a visual system to detect
(subject, action, object) triplets from input sensory data and (2) a variant of the chart parsing system
to transform the sequence of triplets into tree representations. Thus, we further separate the test
for our second hypothesis into two sub-tasks: (1) we measure the precision and recall metrics by
comparing the detected triplets from our visual system with human-labelled ground truth data and
(2) given the ground truth triplets as input, we measure the success rate of our variant of the chart
parsing system by comparing its output with the target trees for each action. We consider a parse successful
if the generated tree is identical with the manually generated target parse. Therefore, we consider
the second hypothesis supported when (1) our visual system achieves high precision and recall, and
(2) our parsing system achieves a high success rate. We use the ground truth triplets as input instead
of the detected ones because we cannot expect the visual system to generate the triplets with 100%
precision and recall, due to occlusions, shadows, and unexpected events. As a complete system, we
expect the visual module to have high precision and recall, thus the detected triplets can be used as
input to the parsing module in practice.
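For concreteness, the triplet-level precision and recall can be computed as in this small sketch, which treats the detected and ground-truth triplets as multisets; the function and variable names are ours.

    from collections import Counter

    def triplet_precision_recall(detected, ground_truth):
        # detected / ground_truth: lists of (subject, action, patient) triplets.
        matched = sum((Counter(detected) & Counter(ground_truth)).values())
        precision = matched / len(detected) if detected else 0.0
        recall = matched / len(ground_truth) if ground_truth else 0.0
        return precision, recall

With the counts reported in Section 4 (29 correct detections out of 34, against 32 ground-truth triplets), this gives a precision of 29/34 ≈ 85.3% and a recall of 29/32 ≈ 90.6%.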
We designed our experiments around a single RGBD camera (a Kinect sensor) in a fixed location.
We asked human subjects to perform a set of manipulation actions in front of the camera while
both objects and tools remained within the camera's view during the
activity. We collected RGBD sequences of manipulation actions being performed by one human,
and to ensure some diversity, we collected these from two domains, namely the kitchen and the
manufacturing environments. The kitchen action set included “Making a sandwich”, “Cutting an
eggplant”, and “Cutting bread”, and the manufacturing action set included “Sawing a plank into two
pieces" and "Assembling two pieces of the plank". To further diversify the data set, we adopted two
different viewpoints for each action set. For the kitchen actions, we used a front view setting; for
manufacturing actions, a side view setting. The five sequences have 32 human-labelled ground truth
triplets. Table 2 gives a list of “Subjects”, “Objects”, and “Actions” involved in our experiments.
To evaluate the visual system, we applied the vision techniques introduced in Sections 3.3 to
3.6. To be specific, the grasp type classification module provides a “Grasp” signal when the hand
status changes from “Rest” to one of the three other types, and an “Ungrasp” signal when it changes
back to "Rest". At the same time, the object monitoring and segmentation-based object recognition module provides the "Object" symbol when either of the hands touches an object. Also, the
hand tracking module provides trajectory profiles that enable the trajectory-based action recognition module to produce “Action” symbols such as “Cut” and “Saw”. The action “Assemble”
did not have a distinctive trajectory profile, so we simply generated it when the “Cheese” merged
with the "Bread" based on the object monitoring process.

Figure 9. The second row shows the hand tracking and object monitoring. The third row shows the object recognition result, where each segmentation is labelled with an object name and a bounding box in a different color. The fourth and fifth rows depict the hand speed profile and the Euclidean distances between hands and objects. The sixth row shows the consequence detection.

At the end of each recognized action, a corresponding destructive symbol, as defined in Table 2, is produced, and the consequence checking
module is called to confirm the action consequence. Figure 9 shows intermediate and final results of
vision modules from a sequence of a person making a sandwich. In this scenario, our system reliably
tracks, segments and recognizes both hands and objects, recognizes “grasp”, “ungrasp” and “assemble” events, and generates a sequence of triplets along the time-line. To evaluate the parsing system,
given the sequence of ground truth triplets as inputs, a sequence of trees (or forests) is created or
dissolved dynamically using the manipulation action context-free grammar parser (Sections 3.1 and 3.2).
Our experiments produced three results: (i) we were able to manually generate a sequence of
target tree representations for each of the five sequences in our data set; (ii) our visual system detected 34 triplets, of which 29 were correct (compared with the 32 ground truth labels), and yielded
a precision of 85.3% and a recall of 90.6%; (iii) given the sequence of ground truth triplets, our
parsing system successfully parsed all five sequences in our data set into tree representations that matched the target parses. Figure 10 shows the tree structures built from the sequence of triplets
of the "Making a sandwich" sequence. More results for the remaining manipulation actions in the data
set can be found at http://www.umiacs.umd.edu/~yzyang/MACFG. Overall, (i) supports our
first hypothesis that human manipulation actions obey a manipulation action context-free grammar
that includes manipulators, actions, and objects as terminal symbols, while (ii) and (iii) support
our second hypothesis that the implementation of our cognitive system can parse observed human
manipulation actions. 1
1. As the experiments demonstrate, the system was robust enough to handle situations that involve hesitation, in which
the human grasps a tool, finds that it is not the desired one, and ungrasps it (as in Figure 11).
Figure 10. The tree structures generated from the “Make a Sandwich” sequence. Figure 9 depicts the corresponding visual processing. Since our system detected six triplets temporally from this sequence, it produced
a set of six trees. The order of the six trees is from left to right.
The experimental results support our hypotheses, but we have not tested our system on a large
data set with a variety of manipulation actions. We are currently testing the system on a larger set
of kitchen actions and checking to see if our hypotheses are still supported.
5. Conclusion and Future Work
Manipulation actions are actions performed by agents (humans or robots) on objects that result in
some physical change of the object. A cognitive system that interacts with humans should have the
capability to interpret human manipulation actions. Our hypotheses are that (1) a minimalist generative structure exists for manipulation action understanding and (2) this generative structure organizes
primitive semantic terminal symbols such as manipulators, actions, and objects into hierarchical and
recursive representations. In this work, we presented a cognitive system for understanding human
manipulation actions. The system integrates vision modules that ground semantically meaningful
events in perceptual input with a reasoning module based on a context-free grammar and associated
parsing algorithms, which dynamically build the sequence of structural representations. Experimental results showed that the cognitive system can extract the key events from the raw input and can
interpret the observations by generating a sequence of tree structures.
In future work we will further generalize this framework. First, since the grammar is context-free, a direct extension is to make it probabilistic. Second, since the grammar does not assume
constraints such as the number of operators, it can be further adapted to process scenarios with multiple agents doing complicated manipulation actions once the perception tools have been developed.
Moreover, we also plan to investigate operations that enable the system to reason during observation. After the system observes a significant number of manipulation actions, it can build a
database of all sequences of trees. By querying this database, we expect the system to predict things
such as which object will be manipulated next or which action will follow. Also, the action trees
could be learned not only from observation but also from language resources, such as dictionaries,
recipes, and manuals. This link to computational linguistics constitutes an interesting avenue for
future research. Finally, the manipulation action grammar introduced in this work is still a purely syntactic
grammar. We are currently investigating the possibility of coupling manipulation action grammar
rules with semantic rules using lambda expressions, through the formalism of combinatory categorial grammar developed by Steedman (2002).

Figure 11. Example of the grammar that deals with hesitation. This figure shows key frames of the input visual data and the semantic tree structures.
6. Acknowledgements
An earlier version of this paper appeared in the Proceedings of the Second Annual Conference on
Advances in Cognitive Systems in December, 2013. The authors thank the reviewers and editors for
their insightful comments and suggestions. Various stages of the research were funded in part by
the support of the European Union under the Cognitive Systems program (project POETICON++),
the National Science Foundation under INSPIRE grant SMA 1248056, and support from the US
Army, Grant W911NF-14-1-0384 under the Project: Shared Perception, Cognition and Reasoning
for Autonomy. The first author was supported in part by a Qualcomm Innovation Fellowship.
References
Aein, J. M., Aksoy, E. E., Tamosiunaite, M., Papon, J., Ude, A., & Wörgötter, F. (2013). Toward a
library of manipulation actions based on semantic object-action relations. Proceedings of the 2013
IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 4555–4562). Tokyo:
IEEE.
Aksoy, E., Abramov, A., Dörr, J., Ning, K., Dellen, B., & Wörgötter, F. (2011). Learning the semantics of object-action relations by observation. The International Journal of Robotics Research,
30, 1229–1249.
Argall, B. D., Chernova, S., Veloso, M., & Browning, B. (2009). A survey of robot learning from
demonstration. Robotics and Autonomous Systems, 57, 469–483.
Ben-Arie, J., Wang, Z., Pandit, P., & Rajaram, S. (2002). Human activity recognition using multidimensional indexing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24,
1091–1104.
Brand, M. (1996). Understanding manipulation in video. Proceedings of the Second International
Conference on Automatic Face and Gesture Recognition (pp. 94–99). Killington, VT: IEEE.
Chaudhry, R., Ravichandran, A., Hager, G., & Vidal, R. (2009). Histograms of oriented optical
flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human
actions. Proceedings of the 2009 IEEE International Conference on Computer Vision and Pattern
Recognition (pp. 1932–1939). Miami, FL: IEEE.
Chomsky, N. (1957). Syntactic structures. Berlin: Mouton de Gruyter.
Chomsky, N. (1993). Lectures on government and binding: The Pisa lectures. Berlin: Walter de
Gruyter.
Dantam, N., & Stilman, M. (2013). The motion grammar: Analysis of a linguistic method for robot
control. Transactions on Robotics, 29, 704–718.
Dollár, P., Rabaud, V., Cottrell, G., & Belongie, S. (2005). Behavior recognition via sparse spatiotemporal features. Proceedings of the Second Joint IEEE International Workshop on Visual
Surveillance and Performance Evaluation of Tracking and Surveillance (pp. 65–72). San Diego,
CA: IEEE.
Fainekos, G. E., Kress-Gazit, H., & Pappas, G. J. (2005). Hybrid controllers for path planning:
A temporal logic approach. Proceedings of the Forty-fourth IEEE Conference on Decision and
Control (pp. 4885–4890). Philadelphia, PA: IEEE.
Gavrila, D. M. (1999). The visual analysis of human movement: A survey. Computer vision and
image understanding, 73, 82–98.
Guerra-Filho, G., Fermüller, C., & Aloimonos, Y. (2005). Discovering a language for human activity. Proceedings of the AAAI 2005 Fall Symposium on Anticipatory Cognitive Embodied Systems.
Washington, DC: AAAI.
Guha, A., Yang, Y., Fermüller, C., & Aloimonos, Y. (2013). Minimalist plans for interpreting
manipulation actions. Proceedings of the 2013 International Conference on Intelligent Robots
and Systems (pp. 5908–5914). Tokyo: IEEE.
Han, B., Zhu, Y., Comaniciu, D., & Davis, L. (2009). Visual tracking by continuous density propagation in sequential Bayesian filtering framework. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 31, 919–930.
Hu, C., Yu, Q., Li, Y., & Ma, S. (2000). Extraction of parametric human model for posture recognition using genetic algorithm. Proceedings of the 2000 IEEE International Conference on Automatic Face and Gesture Recognition (pp. 518–523). Grenoble, France: IEEE.
Ivanov, Y. A., & Bobick, A. F. (2000). Recognition of visual activities and interactions by stochastic
parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 852–872.
Jackendorff, R. (1977). X-bar-syntax: A study of phrase structure. Linguistic Inquiry Monograph,
2.
Kale, A., Sundaresan, A., Rajagopalan, A., Cuntoor, N., Roy-Chowdhury, A., Kruger, V., & Chellappa, R. (2004). Identification of humans using gait. IEEE Transactions on Image Processing,
13, 1163–1173.
Kapandji, I. A., & Honoré, L. (2007). The physiology of the joints. Boca Raton, FL: Churchill
Livingstone.
Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64,
107–123.
Li, Y., Fermüller, C., Aloimonos, Y., & Ji, H. (2010). Learning shift-invariant sparse representation of actions. Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern
Recognition (pp. 2630–2637). San Francisco, CA: IEEE.
Manikonda, V., Krishnaprasad, P. S., & Hendler, J. (1999). Languages, behaviors, hybrid architectures, and motion control. New York: Springer.
Mishra, A., Fermüller, C., & Aloimonos, Y. (2009). Active segmentation for robots. Proceedings
of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 3133–
3139). St. Louis, MO: IEEE.
Moeslund, T., Hilton, A., & Krüger, V. (2006). A survey of advances in vision-based human motion
capture and analysis. Computer Vision and Image Understanding, 104, 90–126.
Moore, D., & Essa, I. (2002). Recognizing multitasked activities from video using stochastic
context-free grammar. Proceedings of the National Conference on Artificial Intelligence (pp.
770–776). Menlo Park, CA: AAAI.
Nishigaki, M., Fermüller, C., & DeMenthon, D. (2012). The image torque operator: A new tool for
mid-level vision. Proceedings of the 2012 IEEE International Conference on Computer Vision
and Pattern Recognition (pp. 502–509). Providence, RI: IEEE.
Oikonomidis, I., Kyriazis, N., & Argyros, A. (2011). Efficient model-based 3D tracking of hand
articulations using Kinect. Proceedings of the 2011 British Machine Vision Conference (pp. 1–
11). Dundee, UK: BMVA.
Pastra, K., & Aloimonos, Y. (2012). The minimalist grammar of action. Philosophical Transactions
of the Royal Society: Biological Sciences, 367, 103–117.
Ryoo, M. S., & Aggarwal, J. K. (2006). Recognition of composite human activities through contextfree grammar based representation. Proceedings of the 2006 IEEE Conference on Computer
Vision and Pattern Recognition (pp. 1709–1718). New York: IEEE.
Saisan, P., Doretto, G., Wu, Y., & Soatto, S. (2001). Dynamic texture recognition. Proceedings
of the 2001 IEEE International Conference on Computer Vision and Pattern Recognition (pp.
58–63). Kauai, HI: IEEE.
Schank, R. C., & Tesler, L. (1969). A conceptual dependency parser for natural language. Proceedings of the 1969 Conference on Computational Linguistics (pp. 1–3). Sanga-Saby, Sweden:
Association for Computational Linguistics.
Steedman, M. (2002). Plans, affordances, and combinatory grammar. Linguistics and Philosophy,
25, 723–753.
Summers-Stay, D., Teo, C., Yang, Y., Fermüller, C., & Aloimonos, Y. (2013). Using a minimal
action grammar for activity understanding in the real world. Proceedings of the 2013 IEEE/RSJ
International Conference on Intelligent Robots and Systems (pp. 4104–4111). Vilamoura, Portugal: IEEE.
Teo, C., Yang, Y., Daume, H., Fermüller, C., & Aloimonos, Y. (2012). Towards a Watson that sees:
Language-guided action recognition for robots. Proceedings of the 2012 IEEE International
Conference on Robotics and Automation (pp. 374–381). Saint Paul, MN: IEEE.
Tsotsos, J. (1990). Analyzing vision at the complexity level. Behavioral and Brain Sciences, 13,
423–469.
Tubiana, R., Thomine, J.-M., & Mackin, E. (1998). Examination of the hand and the wrist. Boca
Raton, FL: CRC Press.
Turaga, P., Chellappa, R., Subrahmanian, V., & Udrea, O. (2008). Machine recognition of human
activities: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 18, 1473–
1488.
Wang, L., & Suter, D. (2007). Learning and matching of dynamic shape manifolds for human action
recognition. IEEE Transactions on Image Processing, 16, 1646–1661.
Willems, G., Tuytelaars, T., & Van Gool, L. (2008). An efficient dense and scale-invariant spatiotemporal interest point detector. Proceedings of the 2008 IEEE European Conference on Computer Vision (pp. 650–663). Marseille, France: Springer.
Wörgötter, F., Aksoy, E. E., Kruger, N., Piater, J., Ude, A., & Tamosiunaite, M. (2012). A simple ontology of manipulation actions based on hand-object relations. IEEE Transactions on Autonomous
Mental Development, 1, 117–134.
Yang, Y., Fermüller, C., & Aloimonos, Y. (2013). Detection of manipulation action consequences
(MAC). Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition
(pp. 2563–2570). Portland, OR: IEEE.
Yilmaz, A., & Shah, M. (2005). Actions sketch: A novel action representation. Proceedings of the
2005 IEEE International Conference on Computer Vision and Pattern Recognition (pp. 984–989).
San Diego, CA: IEEE.
Younger, D. H. (1967). Recognition and parsing of context-free languages in time n³. Information
and Control, 10, 189–208.