A Modular Cognitive System for Safe Human Robot Collaboration:
Design, Implementation and Evaluation
Frank Dittrich1, Stephan Puls1 and Heinz Wörn1
1 Institute for Process Control and Robotics (IPR), Karlsruhe Institute of Technology (KIT), 76133 Karlsruhe, Germany {frank.dittrich, stephan.puls, woern}@kit.edu
Abstract— In this paper we present the conceptual design
and details about the implementation of a system for human
robot collaboration in the industrial domain. As part of the
system, we allow for a shared workspace without spatial or
temporal separation, in order to fully utilize the possibilities of
collaboration process design. As the basis for the robot control
and the communication of the human worker with the cognitive
system, we developed an intuitive and natural interaction
concept. For the positioning of the robot we created a novel
generic concept, using solely the hands without the necessity for
any special hardware or robot specific input devices. We also
present in this paper a quantitative and qualitative evaluation of the overall system and specific elements, where information about the user acceptance and the applicability of the collaboration concept is discussed. The evaluation results are thereby based on several experiments, which were conducted with a heterogeneous group of people serving as test persons.
I. INTRODUCTION
Merging advantages of industrial robotics, e.g. strength,
accuracy and durability, with human capabilities like dexterity and problem solving is a challenging task. Due to safety
concerns, safety fences are installed to separate human co-workers and robots. Consequently, collaboration that shares time and space is inhibited in industrial robotics.
In some instances laser range finders are used to replace
safety fences in order to perform foreground detection. Still,
with such systems scene analysis and interpretation is not
efficiently feasible. Thus, no meaningful contribution for
challenging tasks like safe human-robot collaboration can
be achieved.
We are conducting research in the realm of human-centered production environments in order to enable interactive
and collaborative scenarios. This work presents our modular
cognitive system which enables intuitive control of and
natural interaction with an industrial robot. The system was
evaluated quantitatively and qualitatively in user experiments
in which, amongst others, Reis Robotics and KUKA Laboratories took part. The overall system and specific elements
were tested and experimentally analyzed.
The remainder of this paper is organized as follows. In
Section II, related work concerning human-robot collaboration is presented. The conceptual design of the modular
cognitive system is presented in Section III. In Section IV,
the system’s implementation is detailed, which is followed
by a description of our concept for intuitive human-robot
interaction in Section V. The conducted experiments and the
quantitative and qualitative evaluation will be discussed in
Section VI. Finally, in Section VII, a conclusion is drawn
and hints for future work are given.
Fig. 1. Left: Virtual system output showing parts of the components in our
collaborative scenario. Pictograms visualize the contextual and situational
information. Right: Real-time path modification dependent on the user's
position and orientation.
II. RELATED WORK
The idea of humans working together with robots and
achieving a common goal offers a wide range of possibilities
for robotic applications.
In [1], a time-of-flight camera is incorporated into a robot
cell which is used to observe dynamic safety zones. These
zones are defined in a virtual environment model of the
working cell. To reduce the risk for the human co-worker, the maximal velocity of the robot can be limited.
In [2], a system is presented which deals with direct
human-robot cooperation. The CCD-camera based vision
system is used for colour based hand tracking. The shortest
distance between hand position and the robot’s tool centre
point is determined and used in a fuzzy logic system for risk
recognition. Risk reduction is achieved through controlling
the maximal speed of the robot.
In [3], a cognitive approach to industrial robotics is presented. Besides modeling risks and achieving reactive robot
path planning, the focus is set on the reconstruction of human kinematics in the robot's working area.
The realm of personal and service robots allows for
development of human-robot interaction mechanisms and
cognitive robot systems. The application of knowledge processing in the context of robotic control is presented in [4].
The system is especially designed for use with personal
robots. Prolog is used for the underlying system to process
knowledge. The knowledge representation is based on Description Logics and is used as encyclopedic knowledge to
query the robot’s environmental model.
In [5], a household robot system is presented which allows
for dynamic interaction with its environment. The system is
used for learning from the environment, such as objects [6],
interaction mechanisms [7] and decision making [8].
In [9], a vision centric modular framework for cognitive
robotics research is presented. The robot resembles a humanoid child and was designed for research on object manipulation. Thus, the devised interaction mechanisms target passive objects rather than humans.
The Robot Operating System (ROS) [10] allows for the development of sophisticated robot applications. Still, in the safety-relevant context of industrial settings, ROS does not, to the best of our knowledge, offer algorithms which are optimized for efficiency.
III. CONCEPTUAL DESIGN
Our system for safe human-robot collaboration (SHRC) comprises multiple components interacting with each other. Fig. 2 shows a schematic diagram of the structure and the component classes on a high abstraction level. The interaction is thereby based on the exchange of information. Information that is essential for the safety of the human worker is generated in real-time.
A. System Setup and Components
The research and experiment environment of our system is designed to match collaboration scenarios in the industrial domain. The human worker and the robot thereby work and interact in a shared workspace which covers over 4 m². Multiple calibrated and spatially static
RGB-D sensors deliver basic low-level information about
the scene. These are the basis for the scene analysis approaches, which amongst others produce information about
the temporal evolution of the parameterized human body
model and the 2D and 3D optical flow (OF) in real-time.
Position, orientation and the line of sight of the user are
processed by approaches for the risk modeling. The task planner directives and the temporal and spatial risk distribution are the basis for the safe path planning algorithms, which are optimized for deployment in dynamic environments. The path is executed by a Reis RV6 industrial robot, which serves as the collaborative partner in all scenarios. Situational and work-flow-related high-level information is generated by a knowledge-based inference mechanism and serves the cognitive adaptation of the system components. On top of the scene analysis, the planning algorithms and the reasoning, we implement a gesture-based interaction concept, which is used as the main interface for the unidirectional human-robot communication and the robot guidance. With all elements we strive for real-time capability, especially with safety-critical system components.
Fig. 2. Schematic diagram of the structural elements of the cognitive SHRC system on a high abstraction level.
B. Modular System Design
One objective when designing the SHRC system was
the possibility to easily alter the deployed component setup
and the information transfer and processing. Such a system
allows for adaption of the implemented capabilities to individual applications in varying environments with changing
work processes, deployed hardware and time and quality
constraints. Therefore we designed a modular software architecture underlying the SHRC system. Compared to existing modular environments like ROS [10] with a large feature spectrum, our approach follows the “lean and mean”
design paradigm in order to enable the fast buildup of new
modules and the configuration of new system functionality,
using the Windows operating system and respective development tools.
The basic design idea is that functionality is encapsulated
in a single module, and all modules consume and produce
information during run-time, which defines the information
budget of the current module setup. Information is generalized by information classes, meaning that for instance
a 2D-OF field estimated by gradient-based methods [11]
or phase-based approaches is represented by the same 2D-OF information class. Different intra-class approaches are
encapsulated in separate modules, so that ideally one module
or approach is assigned to a specific information class
in a specific SHRC system configuration. All information
exchange is done by using a virtual information bus located
in the main memory, together with a signaling strategy. The
information demand and supply can be set up by the user by
defining the information IO per module using GUI elements.
Conflicts like missing information are detected and reported
automatically before system start.
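As a minimal sketch of this design idea (module names, information class names and the API below are hypothetical and not taken from the SHRC code base), the following Python fragment shows how modules could declare their information IO and how missing information can be detected before system start:

from dataclasses import dataclass, field

@dataclass
class Module:
    """A processing unit that consumes and produces information classes."""
    name: str
    consumes: set = field(default_factory=set)   # required information classes
    produces: set = field(default_factory=set)   # provided information classes

def check_information_budget(modules):
    """Return information classes that are consumed but never produced."""
    supply = set().union(*(m.produces for m in modules))
    return {m.name: m.consumes - supply for m in modules if m.consumes - supply}

modules = [
    Module("rgbd_sensor",   produces={"DepthImage", "IntensityImage"}),
    Module("of_estimator",  consumes={"IntensityImage"}, produces={"2D-OF"}),
    Module("body_tracking", consumes={"DepthImage", "2D-OF"}, produces={"BodyModel"}),
    Module("risk_model",    consumes={"BodyModel", "RobotState"}, produces={"RiskMap"}),
]
print(check_information_budget(modules))   # {'risk_model': {'RobotState'}}

In this hypothetical configuration the check would reject the setup, because the risk module demands a RobotState information class that no module supplies.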
As a result of the reduced design, the resource overhead
caused by the underlying modularity is negligible. Because
of the abstraction of information, planning the use and the
configuration of the SHRC system in new environments
is easy. This also supports the adaption of the system to
varying constraints regarding the temporal behaviour and
upper bounds of the precision of certain information classes.
IV. SYSTEM IMPLEMENTATION
The following section gives details on the main analysis,
planning and reasoning components, which are used for the
system realization. These are also partially evaluated in the
experimental collaborative scenarios described in section VI.
Fig. 3. Left: 2D-OF estimation encoded in color and intensity, based on
the intensity images from the RGB-D sensor. Right: 3D-OF based temporal
prediction of a distinct body point, depicted by the orange line segment.
A. Robust Full Body Tracking
The basis for safety, action recognition and situation
awareness is the temporal analysis of the human body
posture. It is therefore important to deploy robust full body tracking in order to maximize the safety of the human worker in the shared workspace and to optimize the recognition results. For our full body tracking we fuse two different approaches which process different information sources from separate sensors. The diversity of the approaches is intentional and is used to gain robustness. In addition, the tracking information is augmented by a separate estimation of the head's orientation using a statistical analysis.
Real-time estimates of the skeletal setup are provided by
the full body tracking functionality of the Microsoft KINECT
SDK, which delivers information about the position and
orientation of 20 body joints.
We also estimate the 2D-OF on the basis of intensity
images from a sensor centered above the shared workspace.
Because we use a gradient-based estimator [11] for the
estimation, and estimates are needed in real-time, we deploy
fast numerical Multigrid solvers which were ported to the
graphics adapter. For the calculation of the 3D-OF, we
process the depth information together with the 2D-OF field.
As a result, we can predict the motion of various body points
(Fig. 3 right).
For the fusion of the information from the skeleton tracking and the 3D-OF estimation, we use the Kalman filter [12]. For every body joint we use the skeleton tracking results as the measurement with fixed noise. In order to inject the 3D-OF into the optimization process of the Kalman filter, the system matrix of the prediction step is time variant and is adjusted using the 3D-OF prediction information [11]. In case no 3D-OF information is available for a certain body joint, e.g. because of occlusion, the adjustment of the system matrix is based on a simple linear movement assumption. Due to the use of heterogeneous approaches and information sources, the resulting fusion-based posture estimation is more robust against occlusion and noise in the RGB-D channels.
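The following sketch illustrates one possible fusion cycle for a single joint under a constant-velocity model; instead of adjusting the system matrix as described above, it simply overwrites the velocity part of the state with the 3D-OF prediction, which has a comparable effect. All dimensions, noise parameters and names are illustrative assumptions, not the values used in our system.

import numpy as np

def fuse_joint(x, P, z, of_velocity=None, dt=1.0 / 30.0, q=1e-3, r=5e-3):
    """One Kalman predict/update cycle for a single body joint.

    x: state [px, py, pz, vx, vy, vz], P: 6x6 covariance,
    z: measured joint position from the skeleton tracking (fixed noise r),
    of_velocity: 3D-OF based velocity estimate, or None if the joint is occluded.
    """
    if of_velocity is not None:
        x[3:] = of_velocity                  # inject the 3D-OF motion prediction
    F = np.eye(6)
    F[:3, 3:] = dt * np.eye(3)               # constant-velocity prediction model
    x = F @ x                                # predict
    P = F @ P @ F.T + q * np.eye(6)
    H = np.hstack([np.eye(3), np.zeros((3, 3))])   # only the position is measured
    S = H @ P @ H.T + r * np.eye(3)
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    x = x + K @ (z - H @ x)                  # update with the skeleton measurement
    P = (np.eye(6) - K @ H) @ P
    return x, P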
As described above, the risk modeling needs to know
about the line of sight of the user in order to deduce the
user’s degree of attention in regard to the robot. Fig. 4 (left
to right) shows a sequence of depth images from a RGBD sensor centered above the workspace. For the estimation
Fig. 4. The columns show a sequence of grey value encoded depth images,
where the estimated head pixels are encoded in pink. The top row shows
the blue search window and the bottom row shows the estimated head
orientation based on the statistical head point analysis.
of the head’s orientation, the head points are estimated in
the first step by using the head position from the robust
posture estimation in combination with a heuristic-based
search area (Fig. 4 top line). In the second step, a 2D
Principal Component Analysis (PCA) of the estimated head
points is used to infer the head’s orientation (Fig. 4 bottom
line).
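A simplified stand-in for this second step is sketched below: the dominant eigenvector of the 2D head-pixel distribution is taken as the head orientation in the image plane (the function name and the interpretation of the principal axis are assumptions of the sketch).

import numpy as np

def head_orientation_2d(head_pixels):
    """Estimate the head orientation angle (rad) from Nx2 head pixel coordinates via 2D PCA."""
    pts = np.asarray(head_pixels, dtype=float)
    centered = pts - pts.mean(axis=0)
    cov = np.cov(centered, rowvar=False)        # 2x2 covariance of the head points
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    principal = eigvecs[:, np.argmax(eigvals)]  # axis of the largest variance
    return float(np.arctan2(principal[1], principal[0]))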
B. Risk Modeling and Safe Motion Planning
The safety of the human co-worker cannot be determined
solely on distance measures if close collaboration of human
and robot is wanted. Thus, a risk estimation module is used
which models risk through a set of fuzzy logic rules [13].
These rules take into account the human’s head orientation
towards the robot, distance and relative velocity between
human and robot (see Fig. 5). Also, the reasoning results about the observable situation are merged into the risk estimation. Consequently, during collaboration less risk is applicable than during non-collaborative situations. The risk estimation can be invoked as a library by modules of the SHRC system. Thus, different parameter sets can be utilized by different modules.
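To make the rule structure concrete, the following heavily reduced sketch aggregates a scalar risk from ramp memberships over distance, relative velocity and head orientation. The actual rule base and parameters of [13] are more extensive, and all numbers below are placeholders.

import numpy as np

def ramp(x, a, b):
    """Linear ramp membership: 0 at or below a, 1 at or above b."""
    return float(np.clip((x - a) / (b - a), 0.0, 1.0))

def situational_risk(distance_m, rel_velocity_ms, gaze_angle_deg, collaborating=False):
    """Aggregate a scalar risk in [0, 1] from a few hand-written fuzzy rules."""
    near = 1.0 - ramp(distance_m, 0.5, 2.0)                    # human close to the robot
    fast = ramp(rel_velocity_ms, 0.2, 1.0)                     # high relative velocity
    attentive = 1.0 - ramp(abs(gaze_angle_deg), 30.0, 90.0)    # head oriented towards the robot
    r1 = min(near, fast)                 # rule 1: near AND fast           -> risk
    r2 = min(near, 1.0 - attentive)      # rule 2: near AND NOT attentive  -> risk
    risk = max(r1, r2)                   # max aggregation over rule activations
    return 0.5 * risk if collaborating else risk   # during collaboration, less risk applies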
The situational risk evaluation is used for detecting possibly impending collisions during robot motion. A look-ahead functionality evaluates the robot’s future trajectory of
a certain length and determines if the corresponding risk
exceeds a defined threshold. In such a case, the robot motion
is stopped and path re-planning is invoked. This functionality
is integrated into the motion planning module of the SHRC
system. The motion planning determines a jerk bounded
robot motion and interfaces with the robot’s controller.
The path planning is implemented in its own module and
enables planning for industrial robot arms with six degrees
of freedom. Internally, the planner utilizes an adapted A*
search in order to cover the robot’s configuration space [14].
During search, each analyzed node in configuration space is
evaluated according to the risk estimation. Consequently, a
path is found which satisfies the risk constraints. The modular design allows for online and reactive path traversal, while path planning can be invoked separately.
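The interplay of the look-ahead check and the risk-constrained search can be sketched as follows; risk_of, neighbors, cost and heuristic are placeholders for the modules and the configuration space discretization described above, and the code is a generic A* variant rather than the planner of [14].

import heapq
import itertools

def path_exceeds_risk(lookahead_configs, risk_of, threshold):
    """Look-ahead check: does any upcoming configuration exceed the risk threshold?"""
    return any(risk_of(q) > threshold for q in lookahead_configs)

def plan_risk_constrained(start, goal, neighbors, cost, heuristic, risk_of, threshold):
    """A*-style search over a discretized configuration space that skips high-risk nodes."""
    tie = itertools.count()                       # tie breaker for the priority queue
    open_set = [(heuristic(start, goal), next(tie), 0.0, start, None)]
    came_from, g_score = {}, {start: 0.0}
    while open_set:
        _, _, g, q, parent = heapq.heappop(open_set)
        if q in came_from:                        # already expanded
            continue
        came_from[q] = parent
        if q == goal:                             # reconstruct the risk-satisfying path
            path = [q]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            return path[::-1]
        for nb in neighbors(q):
            if risk_of(nb) > threshold:           # evaluate each node against the risk estimation
                continue
            g_nb = g + cost(q, nb)
            if g_nb < g_score.get(nb, float("inf")):
                g_score[nb] = g_nb
                heapq.heappush(open_set, (g_nb + heuristic(nb, goal), next(tie), g_nb, nb, q))
    return None                                   # no path satisfying the risk constraint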
Fig. 5. Risk evaluation based on fuzzy logic rules, showing evaluated configuration space nodes for a low-risk (left) and a high-risk situation (right), dependent on the attention of the human worker.
Fig. 6. A circular trajectory of the left hand in 3D space describes a circle on the 2D plane of the upper body.
C. Situation Awareness through Reasoning
Available observations about environment and performed
human actions can be used to further analyze the overall
situation. Thus, a higher cognitive functionality is built upon recognition and observation facilities.
Information about human pose and actions is used for logical reasoning about the situation. Different situations
are modeled, such as collaboration, process monitoring and
distraction, which describe in short the state in the robot’s
workspace. Thus, for example, if the human co-worker is
gesturing while looking at the robot it can be reasoned
that the human wants to communicate with the robot, e.g.
performing command gestures for the robot to comply.
The knowledge modeling uses Description Logics for
definition of the knowledge base. It comprises information
about all major aspects of the targeted human-robot collaboration, such as human actions, process locations, possible
situations, used tools, and human’s kinematics state. Actions
and situations are defined generally in the domain knowledge. Recognized actions, human kinematics information,
and environmental data build up the assertional knowledge.
During run-time, the assertional knowledge instantiates the
domain knowledge and is used for reasoning [15].
Furthermore, the reasoning is used for the extraction of behaviour objectives for the robot. If, for example, the situation allows for direct gesture-based robot positioning, the objective reflects that fact and is used as input by a task planning module [16]. Thus, gestures that are recognized in other situations can be ignored by the task planning.
The situational reasoning is incorporated in its own module in order to provide its output to other interested modules, such as task planning or safe motion planning.
V. INTUITIVE HUMAN-ROBOT INTERACTION
An important objective was the design and implementation of a concept for natural and intuitive human-robot interaction (HRI) as the basis for the interaction in HRC scenarios. The concept should thereby also be generic in the sense that the interaction elements could be used to represent any interaction scenario with an articulated robot. For the realization of this concept we identified two major gesture-based interaction classes: the command gestures for the input of commands and states, and the movement gestures for the positioning of the robot in the workspace. Both classes combined are considered by us as sufficient for the human-robot interaction in most collaboration scenario primitives.
A. Commanding the Robot
In order to communicate commands and states to the
robot, various, sometimes numerous gestures have to be
learned not only by the SHRC system, but also by the user.
Therefore it is important that they are easy to describe and easy to remember. Also, for the robustness of the gesture classification, these body gestures have to be discriminative
amongst each other. Considering that this approach is applied
in an industrial context with rather large workspaces, it is also
required that the gestures can be executed with one hand or
arm, and with high variability in matters of position and
orientation.
To meet the demands, we use the concept of a virtual
drawing board. Here, the user has to imagine a plane right
in front of him and parallel to the upper body, where he
can draw various signs with one hand. The hand is hereby
represented by one point in 3D space, allowing for a high variability with respect to the position and orientation of the user relative to the RGB-D sensor, since no additional information about the finger setup is needed. For the classification of various symbols, the 2D hand trajectory is examined (Fig. 6). Therefore, the gestures can be described to the user via simple 2D drawings (Fig. 7), which are easy to memorize because of the inherent ability and training of humans to use 2D symbols for handwriting.
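The projection of the 3D hand trajectory onto such a virtual drawing plane could look like the following sketch; spanning the plane by the shoulder axis and the vertical direction is an assumption of the sketch, not necessarily the construction used in our implementation.

import numpy as np

def project_to_drawing_plane(hand_points, left_shoulder, right_shoulder):
    """Project Nx3 hand positions onto a 2D plane parallel to the upper body.

    Returns Nx2 coordinates on the virtual drawing board.
    """
    left_shoulder = np.asarray(left_shoulder, dtype=float)
    right_shoulder = np.asarray(right_shoulder, dtype=float)
    origin = 0.5 * (left_shoulder + right_shoulder)   # centre between the shoulders
    u = right_shoulder - left_shoulder                # horizontal axis of the plane
    u = u / np.linalg.norm(u)
    up = np.array([0.0, 1.0, 0.0])                    # assumed vertical axis of the sensor frame
    v = up - np.dot(up, u) * u                        # vertical plane axis, orthogonal to u
    v = v / np.linalg.norm(v)
    pts = np.asarray(hand_points, dtype=float) - origin
    return np.stack([pts @ u, pts @ v], axis=1)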
Fig. 7. (a)-(d): Visualization of distinct hand and respective command gestures.
Because of the simplicity of some of the symbols (Fig. 7a), gestures can be executed by the user without any intention. Therefore, for robust gesture recognition, a two-stage Bayesian network approach is applied, which uses empirically driven Hidden Markov Models (HMM) for the reliable classification of 2D trajectories and symbols, and
a Bayesian network for the heuristic-based rejection of unintended gesture executions [17].
Fig. 8. Approximation of the intended transformation of the robot's TCP by the user, using both hands.
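For illustration, a heavily simplified classification stage is sketched below: the 2D trajectory is quantized into discrete stroke directions and scored against one discrete HMM per symbol with the forward algorithm. The empirically driven training and the Bayesian rejection stage of [17] are omitted, and all model parameters are placeholders.

import numpy as np

def quantize_directions(traj_2d, n_bins=8):
    """Convert a 2D trajectory (Nx2) into a sequence of discrete stroke directions."""
    deltas = np.diff(np.asarray(traj_2d, dtype=float), axis=0)
    angles = np.arctan2(deltas[:, 1], deltas[:, 0])          # angles in [-pi, pi]
    return (((angles + np.pi) / (2.0 * np.pi)) * n_bins).astype(int) % n_bins

def log_likelihood(obs, start_p, trans_p, emit_p):
    """Forward algorithm in log space for a discrete HMM (start_p: S, trans_p: SxS, emit_p: SxV)."""
    log_a, log_b = np.log(trans_p), np.log(emit_p)
    alpha = np.log(start_p) + log_b[:, obs[0]]
    for o in obs[1:]:
        alpha = log_b[:, o] + np.logaddexp.reduce(alpha[:, None] + log_a, axis=0)
    return np.logaddexp.reduce(alpha)

def classify_trajectory(traj_2d, models):
    """Pick the gesture whose HMM assigns the highest likelihood to the observed trajectory."""
    obs = quantize_directions(traj_2d)
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))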
B. Positioning of the Robot in Space
For the positioning of the robot in the collaboration
scenario, we developed a novel and generic gesture-based concept. It is generic in the sense that it can be applied independently of the deployed type of robot, which for instance could be an industrial robot like the Reis RV6 (Fig. 8) or a mobile platform like the KUKA omniRob. As part of the whole gesture-based interaction concept, we demand that the positioning of an articulated robot can be conducted solely with the user's hands, without the necessity to use robot-specific hardware like conventional input devices or tactile elements. Also, the user should not be burdened with specific coordinate systems during the positioning process.
When dealing with different types of robots with varying movement abilities, the meaning of the positioning depends on these abilities. Therefore, for the sake of portability, we
always consider the positioning as a transformation of an
entity, meaning describing the translation and orientation of
this entity in space. Having to deal with a new robot, this
entity has to be determined first and then the meaning of the
translation and orientation of the entity has to be defined.
For a conventional robot like the Reis RV6 industrial robot
this entity is the robot’s tool center point (TCP), and the user
describes the relative translation and the absolute orientation
of the TCP in space. With a mobile platform like the KUKA
omniRob, the entity is the robot itself or respectively its
center point, and the user describes the relative and absolute
translation, and the absolute orientation in the 2D plane.
1) Transformation Description with Gestures: For the
description of transformations, the user uses one or both
hands in combination with staying in a so called legal
posture, which is needed to infer the user’s intention to enter
a certain type of movement gesture.
a) Absolute Translation: When describing the translation, we distinguish between absolute and relative translation.
In case of the absolute translation the user defines a distinct
place in space where he or she wants the entity to be placed. In order to describe this location, the user has to use the
left or right arm which is represented by the positions of the
shoulder, elbow, wrist and hand joints $S_{a_t} = \{p_{s_t}, p_{e_t}, p_{w_t}, p_{h_t}\}$, which are contained in $S_t \supset S_{a_t}$, the set of all body joint positions at time step $t$. In order to take on a legal posture, the arm has to be stretched out so that the arm's joint elements almost form a straight line, and the hand has to be held still for a certain period of time.
Fig. 9. The robot's TCP is moving in the direction of the relative translation, which is approximated and described by the user.
To measure the degree of straightness, we perform a
Principal Component Analysis (PCA):
$$\tilde{\mu} = E\{P \mid S_{a_t}\} , \qquad \tilde{\Sigma} = \mathrm{Cov}\{P \mid S_{a_t}\} = E \Lambda E^{T} = \begin{pmatrix} e_1 & e_2 & e_3 \end{pmatrix} \begin{pmatrix} \kappa_1 & 0 & 0 \\ 0 & \kappa_2 & 0 \\ 0 & 0 & \kappa_3 \end{pmatrix} \begin{pmatrix} e_1^{T} \\ e_2^{T} \\ e_3^{T} \end{pmatrix} , \qquad \kappa_1 \geq \kappa_2 \geq \kappa_3 . \tag{1}$$
Here the random variable $P$ describes the arm joint positions conditioned on the arm joint set $S_{a_t}$ at time step $t$. The last equality in Eq. 1 describes the eigendecomposition of the estimated covariance matrix $\tilde{\Sigma}$ of the random variable $P$, with the associated eigenvectors $e_1, e_2, e_3$ and eigenvalues $\kappa_1 \geq \kappa_2 \geq \kappa_3$. After transformation of the points $S_{a_t}$ into the coordinate system described by the eigenvectors, the arm points are decorrelated and the eigenvalues describe the variance with respect to the single dimensions. If the arm points are aligned along a straight line, the variance $\kappa_1$ must be high and both variances $\kappa_2$ and $\kappa_3$ must be very small and similar. Therefore we use two thresholds $v_u$ and $v_l$ and check whether $\kappa_1 \geq v_u$ and $\kappa_2 \leq v_l$ hold, in order to conclude that the arm points follow a straight line. If $v_u$ and $v_l$ are chosen sufficiently far apart, this conclusion is distinct because of the anatomy of the human arm.
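A direct implementation of this straightness check could look like the following sketch, with the thresholds $v_u$ and $v_l$ as free parameters (the numerical defaults are placeholders).

import numpy as np

def arm_is_straight(arm_joints, v_u=0.02, v_l=0.001):
    """Check the straightness of the arm via the PCA eigenvalues of Eq. 1.

    arm_joints: 4x3 array with the shoulder, elbow, wrist and hand positions (metres);
    v_u, v_l: the two thresholds (placeholder values). Returns the decision and the
    pointing direction e1 belonging to the largest eigenvalue.
    """
    pts = np.asarray(arm_joints, dtype=float)
    cov = np.cov(pts, rowvar=False)              # 3x3 covariance of the joint positions
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    kappa_1, kappa_2 = eigvals[2], eigvals[1]    # kappa_1 >= kappa_2 >= kappa_3
    e1 = eigvecs[:, 2]                           # eigenvector associated with kappa_1
    return bool(kappa_1 >= v_u and kappa_2 <= v_l), e1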
For the detection of hand activity, we analyze the distance of the hand positions within a certain time frame $\Delta$:
$$d(p_{h_t}, p_{h_{t+\Delta}}) \leq d_{max} . \tag{2}$$
If $d(p_{h_t}, p_{h_{t+\Delta}})$, the distance metric induced by the $\ell_2$ norm, is smaller than the threshold $d_{max}$, and the arm points form a straight line, we assume an intended gesture for the description of an absolute translation. In order to determine the absolute position, we need to estimate the intersection of the line described by the arm with the environment. Due to the nature of the PCA, this line is given by the eigenvector $e_1$ associated with the highest variance. In order to estimate the intersection, one can use geometric models
of the environment or a 3D occupancy grid induced by the RGB-D data. For our experiments with the KUKA omniRob we used a simple geometric model of the floor in order to determine an absolute position on the floor, which had to be approached by the mobile platform. Section V-B.2 describes how the actual execution of the absolute translation gesture is conducted in case of both robots.
Fig. 10. Spherical area for the definition of a legal posture, when executing a movement gesture for relative translation or absolute orientation. A green sphere depicts a legal posture with respect to the hand positions, a red sphere depicts an illegal posture.
Fig. 11. The TCP's axis element is taking on the absolute orientation, which is approximated and described by the user.
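For the simple floor model used with the omniRob, the intersection reduces to a ray-plane computation, sketched below under the assumption of a horizontal floor at a known height and a vertical z-axis of the coordinate system.

import numpy as np

def pointing_target_on_floor(hand, shoulder, e1, floor_height=0.0):
    """Intersect the pointing line (origin at the hand, direction e1) with a horizontal floor plane.

    e1 is the principal arm direction from the PCA; its sign is chosen so that it points
    from the shoulder towards the hand. Returns the 3D target point or None.
    """
    hand = np.asarray(hand, dtype=float)
    shoulder = np.asarray(shoulder, dtype=float)
    d = np.asarray(e1, dtype=float)
    if np.dot(d, hand - shoulder) < 0.0:      # orient the eigenvector away from the body
        d = -d
    if abs(d[2]) < 1e-6:                      # pointing parallel to the floor
        return None
    s = (floor_height - hand[2]) / d[2]       # solve hand_z + s * d_z = floor_height
    if s <= 0.0:                              # intersection lies behind the hand
        return None
    return hand + s * d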
b) Relative Translation: The other form of translation
in our concept is the relative translation. Here the user uses
both hands in order to describe a direction in 2D or 3D space,
also in combination with staying in a legal posture, which
again is used to infer the user’s intention to enter a certain
type of movement gesture. The entity is then supposed to
follow this direction with its current position as the base.
Fig. 8 shows examples where the user describes the desired
direction with his hands. In our concept, the user first has to
imagine the direction where he wants the entity to go. In Fig.
8 this entity is the TCP of the Reis RV6 robot, and the desired
direction is depicted by a pink line attached to the TCP.
After envisioning the direction, the user tries to approximate it by describing a parallel vector with his hands (Fig. 8, pink line). Although this concept might sound hard to follow and execute, it is quite easy in practice. First, we believe that most people have the inherent ability to imagine and visualize a movement of any object in 3D space, like for instance a computer screen, a telephone or a cat. The imaginary movement is thereby independent of any kind of coordinate system; it just takes place in the surrounding space. Second, we believe that most people are also able to approximate the direction of this imaginary movement by describing a parallel vector with their hands. As we show later in Section VI, our assumptions turned out to be right for all the probands in the experiments, at least to some extent. It should be mentioned that it is also very
easy to try out and explain the concept without any actually
moving objects.
In order to take on a legal posture, one hand of the user
has to be inside and the other hand outside a certain spatial
area (Fig. 10). The position of this area is determined by the
user’s body position and posture. In addition both hands are
supposed to not move, as with the gesture for the absolute
translation. The activity recognition described in Eq.2 is
thereby applied to both hands. The moment a legal posture
is detected, the robot follows this relative translation (Fig.
9) as long as the user’s posture remains in a legal state. The
direction is thereby uniquely described as the vector starting
from the hand inside the legal area to the hand outside the
legal area. If the user moves one hand, the robot immediately
tries to stop. This gives the user a direct control over the
behaviour of the robot, compared to the relative translation
gesture were it takes about 1-2 seconds to start the movement
because of the activity analysis.
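The legal-posture test and the resulting direction vector for the relative translation can be summarized as in the following sketch; the sphere centre and radius are placeholders that would be derived from the user's body position and posture, and the hand-stillness check of Eq. 2 is assumed to be applied separately.

import numpy as np

def relative_translation_direction(hand_a, hand_b, sphere_center, sphere_radius):
    """Return the commanded unit direction if exactly one hand is inside the legal sphere.

    The direction points from the hand inside the legal area to the hand outside it;
    None is returned for an illegal posture.
    """
    hand_a = np.asarray(hand_a, dtype=float)
    hand_b = np.asarray(hand_b, dtype=float)
    center = np.asarray(sphere_center, dtype=float)
    inside = [np.linalg.norm(h - center) <= sphere_radius for h in (hand_a, hand_b)]
    if sum(inside) != 1:                      # both hands inside or outside: illegal posture
        return None
    inner, outer = (hand_a, hand_b) if inside[0] else (hand_b, hand_a)
    d = outer - inner
    n = np.linalg.norm(d)
    return d / n if n > 1e-6 else None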
Because the vector description of the desired direction is only an approximation, and because of the uncertainty in the hand tracking approaches, the actual movement of the robot might not exactly match the imagined direction. However, this is not really a problem because of the intuitive
design of the concept, where the user doesn’t have to think in
coordinate systems or numerical dimensions. If the robot is
following a slightly wrong direction, the user automatically
and intuitively corrects the hand positions and with it the
vector description accordingly, and can thereby iteratively
alter the robot’s path to the desired outcome. As mentioned
before, once the user moves his hands the robot stops the
movement execution, and therefore correcting the path would
take another 1-2 seconds in every correction step. In order
to make the robot more responsive, the activity recognition
allows for slight adjustments, meaning hand movements,
without stopping, so that minor path alterations can be done
with a direct response of the robot.
In our implementation of this movement concept we used
the Reis RV6 and the omniRob. As described before, the
entity in case of the Reis is the TCP and in case of the
omniRob its center point. For the Reis, the relative translation
is represented in 3D and for the omniRob in 2D. In Section
V-B.2 we describe how the actual execution of the relative
translation gesture is realized with both robots.
c) Absolute Orientation: The last type of gesture-based
transformation is the absolute orientation. The concept is
directly comparable to the concept for relative translation,
and the gesture-based procedure is exactly the same. The
user first has to imagine the orientation in space (Fig. 8)
and then take on a legal posture where he describes a vector
with both hands, which in this case approximates the target
orientation of the entity in 2D or 3D space. Fig. 11 shows
an example where the user uses a movement gesture in order to alter the orientation of the TCP. Here the entity is not the
TCP itself but the axis element it is mounted on. As long
as the user doesn’t move his hands, the entity tries to take
on the desired orientation. The process is stopped when the
target orientation is reached, or when the user moves the
hands. Equivalent to the gestures for the relative translation, marginal alterations of the target orientation also do not trigger
illegal states, so that the process is intuitive and iterative in
the same fashion.
In our implementation of this movement concept we used
the Reis RV6 and the omniRob. As described before, the
entity in case of the Reis is the TCP axis element and in case
of the omniRob its center point. For the Reis, the absolute
orientation is represented in 3D and for the omniRob in 2D.
In Section V-B.2 we describe how the actual execution of
the absolute orientation gesture is realized with both robots.
In case of the relative translation and the absolute orientation, the gestures also allow for a description of a measure for
the desired robot’s speed. For this the length of the direction
and orientation vector described by the user’s hands can be
processed.
2) Transformation Execution: In our implementation of
the gesture-based interaction concept we used the Reis RV6
industrial robot and the KUKA omniRob mobile platform
which is capable of moving omnidirectionally in 2D without
changing its orientation. In our application scenarios we
used RGB-D sensors with a fixed position in the workspace.
Both robot types were able to operate in a static coordinate
system, so that the transformations TSR between sensor
coordinate system and robot coordinate system were also
static and could be determined offline in advance. Especially
the KUKA omniRob was operating in a world coordinate
system which was related to the workspace environment.
For the execution of the absolute translation, the target
position pat , which was given in the sensor coordinate
system, had to be transformed to a point p̃at in the static
robot coordinate system. A command gesture, as described
in Section V-A, would then cause the robot to approach this
point in its coordinate system.
For the execution of the relative translation, a vector in
the robot’s coordinate system has to be determined which is
parallel to the one described with the user’s left and right
hand, depicted by $p_{h_l}$ and $p_{h_r}$ respectively. The direction $d = p_{h_l} - p_{h_r}$ described with the hands can be represented in different coordinate systems by transforming $p_{h_l}$ and $p_{h_r}$. The resulting transformed direction $\tilde{d}$ then has a different numerical representation but still describes the same movement in space. This can be easily shown by an example illustrated
in Fig. 12.
Fig. 12. A direction in space, depicted by two points a and b and represented in multiple coordinate systems.
Two points $a$ and $b$ are given in the sensor coordinate system $CS_S$ and describe a direction $d = b - a$. When transformed into the robot's coordinate system $CS_R$, the resulting representations are
$$\tilde{a} = R_{SR}^{-1}(a - t_{SR}) , \qquad \tilde{b} = R_{SR}^{-1}(b - t_{SR}) , \qquad \tilde{d} = \tilde{b} - \tilde{a} , \tag{3}$$
with $R_{SR}$ and $t_{SR}$ denoting the rotation matrix and the translation vector of the coordinate system transformation $T_{SR}$ from $CS_S$ to $CS_R$. For an observer with coordinate
system $CS_O$, the point representations $a$ and $\tilde{a}$ would have the following representations in $CS_O$:
$$a_O = R_{OS}\, a + t_{OS} , \qquad \tilde{a}_O = R_{OR}\, \tilde{a} + t_{OR} , \tag{4}$$
with rotation matrices $R_{OS}$ and $R_{OR}$ and translation vectors $t_{OS}$ and $t_{OR}$ from the coordinate system transformations $T_{OS}$ from $CS_O$ to $CS_S$ and $T_{OR}$ from $CS_O$ to $CS_R$. Because $R_{OR} = R_{OS} R_{SR}$ and $t_{OR} = t_{OS} + R_{OS}\, t_{SR}$, $\tilde{a}_O$ can be written as
$$\begin{aligned}
\tilde{a}_O &= R_{OR}\, \tilde{a} + t_{OR} \\
&= R_{OR} R_{SR}^{-1} (a - t_{SR}) + t_{OR} \\
&= R_{OS} R_{SR} R_{SR}^{-1} (a - t_{SR}) + t_{OS} + R_{OS}\, t_{SR} \\
&= R_{OS}\, a - R_{OS}\, t_{SR} + t_{OS} + R_{OS}\, t_{SR} \\
&= R_{OS}\, a + t_{OS} = a_O .
\end{aligned} \tag{5}$$
The same can be shown for point $b$. Therefore the directions $d_O$ and $\tilde{d}_O$ are equal for any observer, which shows that a direction in space, given by two points, is independent of its representation in distinct coordinate systems.
We therefore transform the hand positions from the sensor
to the robot coordinate system. In case of the Reis RV6,
the vector is used as the direction for the movement of the
TCP. In case of the KUKA omniRob, we used a simple
geometrical model of the floor for the projection of the vector
to the floor, resulting in a 2D direction vector.
In case of the absolute orientation, we also transform the hand points into the robot's coordinate system. In case of the
Reis RV6 we use the orientation as the target state of the
TCP’s axis element (Fig. 11). In case of the KUKA omniRob
we also project the vector on the floor which gives us the
2D target orientation of the omniRob.
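The transformation of the hand-defined vector into the robot frame and, for the omniRob, its projection onto the floor plane can be summarized as in the following sketch; $R_{SR}$ and $t_{SR}$ are the offline-calibrated extrinsics, and the vertical floor normal is an assumption of the sketch.

import numpy as np

def direction_in_robot_frame(p_hl, p_hr, R_SR, t_SR, project_to_floor=False):
    """Transform the hand-described direction d = p_hl - p_hr into the robot frame (Eq. 3).

    R_SR, t_SR: rotation and translation of the offline-calibrated transformation T_SR.
    For the mobile platform, the vector is additionally projected onto the floor plane.
    """
    R_inv = np.asarray(R_SR, dtype=float).T               # inverse of a rotation matrix
    a = R_inv @ (np.asarray(p_hl, dtype=float) - t_SR)    # transformed left-hand point
    b = R_inv @ (np.asarray(p_hr, dtype=float) - t_SR)    # transformed right-hand point
    d = a - b                                              # same direction, expressed in the robot frame
    if project_to_floor:
        floor_normal = np.array([0.0, 0.0, 1.0])           # assumed vertical axis of the robot frame
        d = d - np.dot(d, floor_normal) * floor_normal     # drop the vertical component
    n = np.linalg.norm(d)
    return d / n if n > 1e-6 else None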
C. Interaction Concept
In our gesture-based interaction concept and the collaboration scenarios, we use the previously described interaction gestures for commanding and moving the robot in combination. In particular, because of the identical gesture concept for
the relative translation and the absolute orientation, we use
command gestures to switch between input modes for the
respective movement type.
VI. EXPERIMENTS AND EVALUATION
For the evaluation of the whole SHRC system, with
emphasis on the gesture-based interaction concept, we conducted experiments with a group of probands. The experiments thereby comprise the quantitative evaluation of the
safe path planning as well as a qualitative evaluation using
representative collaboration scenarios. The experiments were
executed in a shared workspace using a multi RGB-D sensor
setup and a Reis RV6 industrial robot.
A. Safety and Path Planning
For evaluation purposes, different scenarios were set up
and tested. Generally, the reactive path planning (PP) method
was compared to point-to-point (PTP) robot motion with simple robot stopping in case of possibly impending collisions. Thus, the collision-detecting look-ahead functionality is used in both cases, either to trigger re-planning or to stop the robot's motion.
For the experiment, different types of human-system interaction were allowed: (1) The human was present in the cell but stayed outside of the robot's motion path, thus the robot's motion was not influenced. (2) The human interfered with the robot's path by deliberately walking towards the robot's base and staying there for at least ten seconds. Each interaction scenario was conducted with the PP method and with PTP motion.
The task for the robot was a simple pick and place task,
in which objects to pick were placed on one side of the
workspace and the drop-off containers were located on the
other side of the workspace. Thus, the human could interfere
easily with the robot’s motion trajectory in between.
In Fig. 13, the distance between human and robot is shown when using either PTP motion with standstill (Fig. 13, top) or the path planning method for collision avoidance (Fig. 13, bottom). The distance never reached a touching point, thus collisions were properly avoided. Also, when using PTP motion the robot has to wait each time until the human has left the circumference of the robot's motion trajectory, even if the robot is still far away. The average distance between robot and human is higher when using path planning than when using PTP motion and robot stopping. Also, due to the look-ahead functionality, a timely path re-planning can be invoked. This behavior results in generally safer robot motion due to increased safety margins.
B. Collaboration and Interaction
For the qualitative evaluation we designed and built three different human-robot collaboration scenarios with a varying interaction spectrum. In all experiments, a heterogeneous group of 15 people of varying sex, age and technical background served as test persons. To capture the impressions and opinions of the probands, we created experiment-dependent questionnaires. Prior to the execution of the experiments, the probands were briefly trained to use the gesture-based interaction concept described
in Section V. Also, for the single experiments, a training step preceded the experiment execution, where task-dependent interaction concepts were explained and briefly trained.
Fig. 13. The distance between human and robot while the human was interfering with the robot's path traversal. Top: PTP robot motion. Bottom: path planning method.
In the following we describe the objective and the
implementation of the single experiments, and subsequently
we present an inter-experimental evaluation based on the
information from the questionnaires.
1) Experiment Objective and Implementation: The objective of the first experiment was the evaluation of the gesture-based transformation. The basic element of the experiment was a wooden cube which was positioned in front of the robot. The cube had holes of three different sizes (10.0, 4.0 and 2.0 cm), with two orthogonal sides holding one instance of each size class, respectively. At the TCP, a plastic stick with a length of 15 cm was mounted, pointing along the axis element of the TCP's coordinate system (Fig. 11). The task for the probands was to position the stick in the different holes, switching between the orthogonal sides at each successful insertion of the stick. This scheme demanded that the probands use the full transformation of the TCP, meaning relative translation and absolute orientation, at each insertion attempt.
The objective of the second experiment was the evaluation
of gesture-based commanding of an industrial robot in a
shared workspace. For this experiment a scenario was chosen where the synergy of human-robot collaboration can be illustrated. Here the robot presents a very heavy object at a distinct position in the shared workspace, and the proband or human worker commands the robot in order to do sophisticated work on the object. Because not all areas of the object can be reached properly or comfortably by the human worker, he has to use command gestures in order to rotate the object. In our implementation of this scenario we use a wooden cube onto which the probands have to screw and unscrew little parts. The distinct mounting places for these parts are thereby distributed over all sides of
the cube. In order to rotate the cube in 90 degree steps, which
is necessary for the completion of the task, the proband has
to use various gestures using our command gesture concept.
The objective of the third experiment was the evaluation of the gesture-based programming of an industrial robot in a shared workspace. For this experiment a scenario was chosen which occurs in many applications in the industrial domain. Here the task is to apply the command and movement gesture concepts in order to teach the robot how to sort objects from two different classes into separate bins.
Fig. 14. Left: Rating of the probands whether the gestures were detected and classified reliably. The mean is 3.04 and the standard deviation is 0.852. Right: Rating whether the gestures were intuitive, with mean 3.36 and standard deviation 0.743.
Fig. 15. Left: Rating of the probands whether the tasks for the experiments were easy to solve using the gesture-based interaction. The mean is 3.36 and the standard deviation is 1.004. Right: Rating of how easy it was for the probands to learn the gesture-based interaction concept for the tasks in the experiments. The mean is 3.62 and the standard deviation is 0.576.
In
our implementation the object classes were represented by
Lego bricks in two different sizes, which were presented on
a table next to the robot. The bins into which the different classes were to be sorted were also located next to the robot. For the accomplishment of the task, the probands first had to use the gesture-based transformation concept in order to position the robot's gripper over a certain Lego brick. After the positioning step, the probands had to trigger the grasping process using a command gesture. When the Lego brick was picked up by the gripper, the probands had to use the movement gestures again in order to position the robot above the dropping site or bin. After the positioning, the probands again had to use a command gesture for the release of the Lego brick. This cycle was repeated in order to teach the other object class.
For the completion of the task, the probands had to enter
a certain command gesture to start the autonomous sorting
process.
2) Experiment Evaluation: Subsequent to the completion
of each experiment, the probands were asked to complete
the experiment dependent questionnaire. Here they answered
questions and filled out a table. Each line in the table corresponded to one question or statement, which the probands
had to rate. To do this, they chose from a discrete ordinal
scale, which can be represented by the whole numbers from 0
to 4. The number 0 depicts no agreement with the statement
or the worst possible rating, and the number 4 depicts full
agreement with the statement or the best possible rating.
Some of the questions and statements from the experiment-dependent questionnaires were designed for inter-experimental evaluation and can be analyzed independently of the respective experiment. Figures 14, 15 and 16
show the histograms which resulted from the accumulated
questionnaire data. In all figures, the red vertical line depicts
the mean value and the grey dashed line describes the normal
distribution of the selected data for the visualization of the variance.
Fig. 16. Left: Rating whether the probands felt safe in the presence of the robot. Mean is 3.71 and standard deviation is 0.589. Right: Rating whether the probands think that the interaction concept eases or allows the task execution for people with varying professional background and age. The mean is 3.29 and the standard deviation is 0.757.
In Fig. 14 (left), the rating for the statement whether the command and movement gestures were detected reliably has a mean around 3.0 with a high standard deviation. This rating thereby refers to the technical aspects of the implementation of the gesture-based interaction concepts. By improving the overall robustness of the command gesture detection and of the hand tracking for the movement gestures, the mean and especially the standard deviation would improve.
The overall rating of how intuitive the gesture-based concepts are (Fig. 14, right) has a satisfactory mean value of 3.4, but the standard deviation is relatively high. This shows that for some probands, gesture-based approaches are not the first choice for solving comparable tasks. The distribution reveals, though, that these are a minority.
The histogram in Fig. 15 (left) shows similar results for the rating of how easy it was to solve the tasks using the gesture-based concepts. Here the mean is also 3.4, but the standard deviation is even higher. The distribution shows almost a hard division into two groups of people. Supported by the
evaluation of the questions in the questionnaire, it can be said
that the first group felt that the gesture-based concepts are
not the best modality for such tasks, and the second group
felt confident that the concepts are a convenient way for the
execution of such tasks. The distribution shows a ratio of
approximately 2 to 1.
As mentioned before, preceding the whole experiment execution and each single experiment, the probands had to pass through a training step, which was deliberately chosen to be very short, in order to be able to evaluate whether the concepts
are intuitive and easy to learn. The mean and the standard
deviation of the rating of how hard it was to learn the
interaction concepts (Fig. 15 right), show that all probands
could learn the concepts to a certain degree in a very short
time.
All experiments were conducted in a shared workspace
without spatial or temporal separation. The degree of proximity of human and robot and the agility of the robot
thereby varied in the experiments. Fig. 16 (left) shows the
histogram of the accumulated data, where the probands rated
whether they felt safe during an experiment while using a shared workspace with an industrial robot. The high mean and the low variance show that not only was safety ensured by the safe path planning, but the probands also felt safe, which can be interpreted as a high degree of acceptance of workspace-sharing scenarios.
Finally, the probands were asked whether they think that this kind of concept benefits older people and people with no technical background or special knowledge about robots. The right histogram in Fig. 16 shows that the probands predominantly rated our concepts as convenient for such a workforce.
3) Evaluation Summary: In conclusion, the evaluation
indicates that our gesture-based interaction concept and
collaboration in a shared workspace is applicable and that
user acceptance exists.
The accumulated histograms and analyzed questions from the questionnaires also indicate that user opinions differ on whether interaction based solely on gestures is a convenient way to execute the presented or comparable tasks. Therefore we conclude that the gesture-based approach should be one option within a spectrum of interaction modalities for upcoming applications, and not an exclusive solution.
We were also surprised by the probands' acceptance of the proximity of the articulated robot during the experiment execution, especially considering that the robot is an industrial robot which exceeds the probands' height, and that the close collaboration in the first experiment required the probands to approach the moving robot very closely during the interaction in order to successfully complete the task.
We think that with our choice of collaboration scenarios and the interaction elements in the experiments, many
real-world human-robot collaboration applications are represented. Therefore the presented evaluation can help to promote the realization of similar human-robot collaboration applications in the industrial domain.
VII. CONCLUSION
In this paper we presented a modular system for human-robot collaboration, serving as the foundation for the realization of applications in the industrial domain. Our system allows for a shared workspace, which enables the design and implementation of more complex and sophisticated collaboration scenarios. The presented gesture-based interaction concept is thereby the basis for the natural and intuitive control of the robot in the collaboration process. The experimental evaluation showed the applicability of our concepts and the users' acceptance of the ideas of gesture-based interaction and close robot contact in a shared workspace.
R EFERENCES
[1] B. Winkler. Konzept zur sicheren Mensch-Roboter-Kooperation auf Basis von schnellen 3D-Time-of-Flight-Sensoren. In Proceedings of Robotik 2008, pages 147–151, 2008.
[2] S. Thiemermann. Direkte Mensch-Roboter-Kooperation in der Kleinteilmontage mit einem SCARA-Roboter. PhD thesis, Fakultät für
Maschinenbau, Universität Stuttgart, 2005.
[3] S. Puls, J. Graf, and H. Wörn. Cognitive Robotics in Industrial Environments. In Human Machine Interaction - Getting Closer, pages 213–234, 2012.
[4] M. Tenorth and M. Beetz. KnowRob – knowledge processing for
autonomous personal robots. In Proceedings of the IEEE International
Conference on Intelligent Robots and Systems (IROS), 2009.
[5] M. Prats, S. Wieland, T. Asfour, A.P. del Pobil, and R. Dillmann. Compliant interaction in household environments by the ARMAR-III humanoid robot. In Proceedings of the IEEE/RAS International Conference on Humanoid Robots, pages 475–480, 2008.
[6] D. Schiebener, A. Ude, J. Morimoto, T. Asfour, and R. Dillmann.
Segmentation and learning of unknown objects through physical
interaction. In 11th IEEE-RAS International Conference on Humanoid
Robots (Humanoids), pages 500–506, 2011.
[7] S.R. Schmidt-Rohr, M. Lösch, and R. Dillmann. Learning flexible, multi-modal human-robot interaction by observing human-human interaction. In Proceedings of the 19th IEEE International Symposium on Robot and Human Interactive Communication, 2010.
[8] S.R. Schmidt-Rohr, G. Dirschl, P. Meißner, and R. Dillmann. A knowledge base for learning probabilistic decision making from human demonstrations by a multimodal service robot. In Proceedings of the 15th International Conference on Advanced Robotics (ICAR 2011), 2011.
[9] J. Leitner, S. Harding, M. Frank, A. Förster, and J. Schmidhuber. An integrated, modular framework for computer vision and cognitive robotics research (ICVision). In A. Chella, R. Pirrone, R. Sorbello, and K. R. Jóhannsdóttir, editors, Biologically Inspired Cognitive Architectures 2012, volume 196 of Advances in Intelligent Systems and Computing, pages 205–210. Springer Berlin Heidelberg, 2013.
[10] Willow Garage. The robot operating system, 2013. http://wiki.ros.org/.
[11] J. Graf, F. Dittrich, and H. Wörn. High Performance Optical Flow Serves Bayesian Filtering for Safe Human-Robot Cooperation. In Joint 41st International Symposium on Robotics and 6th German Conference on Robotics, pages 325–332, 2010.
[12] R.E. Kalman. A new approach to linear filtering and prediction
problems. Journal of Basic Engineering, 82(1):35–45, 1960.
[13] S. Puls and H. Wörn. Situation Dependent Risk Estimation for
Workspace-Sharing Human-Robot Cooperation. In Proc. of IADIS
International Conference on Intelligent Systems and Agents, pages 51–
58, 2013.
[14] S. Puls, P. Betz, M. Wyden, and H. Wörn. Path Planning for Industrial Robots in Human-Robot Interaction. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Workshop on Robot Motion Planning: Online, Reactive and in Real-time, 2012.
[15] S. Puls and H. Wörn. Combining HMM-Based Continuous Human
Action Recognition and Spatio-Temporal Reasoning for Augmented
Situation Awareness. In IADIS International Conference on Interfaces
and Human Computer Interaction, pages 133–140, 2012.
[16] S. Puls, C. Giffhorn, and H. Wörn. Identifying Objectives and Execution Planning for Safe Human-Robot Cooperation. In Proceedings of the XVI Portuguese Conference on Artificial Intelligence (EPIA 2013), 2013.
[17] F. Dittrich, S. Puls, and H. Wörn. A Two-Stage Bayesian Network Approach for Robust Hand Trajectory Classification. In Robotics: Science and Systems (RSS), Workshop on Human-Robot Collaboration, 2013.