Preprint August 28, 2014
To appear in IROS 2014 Workshop on Compliant Manipulation: Challenges in Learning and Control
Learning Compliant Locomotion on a Quadruped Robot
Brahayam Pontón, Farbod Farshidian and Jonas Buchli
Abstract— Over the last decades, compliant locomotion and manipulation have become a very active field of research, due to the versatility that robots with such capabilities would offer in many applications. With very few exceptions, robotic applications and experiments generally take place in controlled environments. One of the reasons for this limited use is that in real-world scenarios, robots need to interact with an unknown environment. In order to interact safely with the environment while successfully performing a meaningful task, robots need to be compliant, as humans and animals are. In this paper, a framework for optimizing a stereotypical trotting gait of a quadruped robot through variable impedance control is proposed. The framework uses the Policy Improvement with Path Integrals algorithm (PI2) to optimize a parametrized gait generator and the robot's impedance during locomotion. As a result, it achieves an energetically efficient and robust trotting gait for different speeds while respecting the joint and actuator limits. The resulting controllers have been tested in a physics-based simulator.
I. INTRODUCTION
The fact that humans outperform robots in tasks that
involve contact and interaction with the real world, such
as manipulation or locomotion, is undeniable. There is
strong evidence that humans grasp and manipulate objects
by adapting the directional stiffness at the point of contact
[1]. The same principle has been experimentally observed in
locomotion [2]. In the same way, using variable impedance
control in robotics has been proven to be an effective method
for motion control of manipulators and legged robots [3], [4].
Variable impedance control enables robots to deal with imprecision of contact models [5] as well as disturbances in the environment during interactions. However, specifying a target impedance that suits the needs of a given task is not trivial. One of the available approaches to tackle this problem is to use demonstration samples to learn the task-specific desired impedance [6]. However, in more dynamic tasks, or tasks where the human teacher does not have enough intuition to design the required impedance profile, the Learning from Demonstration approach is not applicable.
Alternatively, a data-driven approach like Reinforcement Learning (RL) can be used. Unlike Learning from Demonstration, which requires a set of training data from the desired task execution, RL-based algorithms find the solution of a given task by only optimizing over a user-defined cost function. RL is a general learning framework, which gives full flexibility for learning compliant locomotion. As a model-free method, an RL algorithm needs to
Brahayam Pontón {[email protected]}, Farbod Farshidian {[email protected]} and Jonas Buchli {[email protected]} are with the Agile & Dexterous Robotics Lab at the Institute of Robotics and Intelligent Systems, ETH Zürich, Switzerland.
Fig. 1: Picture of HyQ. The robot is 1 meter tall, 1 meter long, and 0.5 meter wide. It weighs approximately 70 kg and is composed of 12 torque-controlled joints that use a hydraulic actuation system. Each leg has 3 joints (Hip Abduction-Adduction HAA, Hip Flexion-Extension HFE, and Knee Flexion-Extension KFE), each with a range of motion of 120 degrees [image courtesy of C. Semini].
retrieve samples from the learning policy implemented on the system. This can be quite costly on high-dimensional systems like legged robots. Therefore, in order to make the implementation of these methods computationally feasible, the underlying learning algorithm should be able to learn efficiently from a few expensive-to-evaluate samples.
One such sample-efficient RL algorithm is proposed in [7], which learns parametrized impedance and foot trajectory policies for energy-efficient hopping on a monopedal robot. Policy Improvement with Path Integrals (PI2) [8] is another sample-efficient RL algorithm which has proven its scalability in different robotic tasks. Buchli et al. [9] used this algorithm for learning the reference trajectory as well as the compliance on a humanoid robot for a door opening task. In [4], the PI2 algorithm is also used to learn dynamic tasks, such as jumping and hopping.
In this work, we use the PI2 algorithm for learning a stable
and energy-efficient trotting gait on a quadrupedal robot.
The structure of the trot controller is the same as described
in [10], which is parametrized over swing feet trajectories
and the whole-body impedance. Within this control framework, PI2 is used to simultaneously optimize the swing feet
trajectories and the locomotion impedance controller. The
main contributions of this work are as follows: 1) Using
the true system dynamics instead of a simplified model for
learning a stereotypical motion on quadrupeds. 2) Learning
a periodic impedance policy for a task that is characterized
by several instances of establishing and breaking contact
with the environment. 3) Learning a trotting policy that
generalizes to different trot speeds. The proposed method
has been successfully tested in a physics-based simulator for
optimizing a walking and a running trot in terms of energy
efficiency and robustness.
II. TROT CONTROLLER STRUCTURE
HyQ (Figure 1) is a hydraulically-powered quadruped
robot [11]. Each of its legs has three joints with a range of
motion of 120 degrees. Each joint is actuated by a hydraulic
system, which provides HyQ with high-performance force controllers for handling the fast dynamics of contact forces.
In order to learn variable impedance control for a trotting task
on HyQ, we use the Reactive Controller Framework (RCF)
as the parametrized control architecture [10]. The RCF has
been designed for robust quadrupedal locomotion and is a
model-based controller composed of two main modules: the
motion generator module and the trunk stabilization module.
The motion generator module is dedicated to generating elliptical trajectories for the feet, which are based on a
workspace-parametrized central pattern generator (WCPG).
The WCPG is a network of nonlinear oscillators, one for each foot, that generates elliptical trajectories in Cartesian coordinates. The outputs of these oscillators are filtered by a nonlinear filter which does not affect the output during the swing phase, but keeps the output constant during the stance phase. The Kinematic Adjustment sub-module cuts the elliptical trajectory of each foot when the foot makes contact with the ground. This feature is not used in this framework; instead, half ellipses are used as desired feet trajectories and
the parametrized impedance policy is optimized to deal with
disturbances at foot touch-down. The parameters that define
the shape of the generated elliptical trajectories are: the height H_wcpg and length L_wcpg of the ellipses, the duty cycle d_wcpg, the stride frequency f_wcpg, and the speed of locomotion V_wcpg.
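To illustrate how these parameters shape the desired foot motion, the following is a minimal sketch of a single WCPG-style foot trajectory. It replaces the oscillator network and nonlinear filter with an explicit phase parametrization, and all numeric values are placeholders, not the RCF defaults.

```python
import numpy as np

def wcpg_foot_trajectory(t, H=0.12, L=0.20, duty=0.65, freq=1.7):
    """Minimal sketch of a WCPG-style foot trajectory: a half ellipse of height H
    and length L during swing, with the foot held on the ground during stance.
    duty is the stance fraction of the stride and freq the stride frequency;
    the numeric defaults are illustrative placeholders."""
    T = 1.0 / freq                       # stride period [s]
    phase = (t % T) / T                  # normalized gait phase in [0, 1)
    if phase < duty:                     # stance: foot moves backwards under the body
        s = phase / duty
        x, z = L / 2.0 - s * L, 0.0
    else:                                # swing: half ellipse from lift-off to touch-down
        s = (phase - duty) / (1.0 - duty)
        x = -L / 2.0 + s * L
        z = H * np.sin(np.pi * s)
    return x, z
```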
The elliptical trajectories generated in Cartesian coordinates are mapped into desired joint space trajectories by using inverse kinematics transformations. A trajectory tracking
controller receives the desired joint space trajectories and
uses an inverse dynamics method to provide feed-forward
commands (based on the desired accelerations and current
joint positions and velocities). In order to react to disturbances, which cause deviations from the desired trajectories,
a linear feedback loop is used in the tracking controller. This
approach benefits from the model information to achieve
dexterous and accurate movements while the feedback gains
can be kept low to improve the motion compliance and
robustness.
The control law in this structure is defined as follows:
\[ \tau = \mathrm{InvDyn}(q, \dot{q}, \ddot{q}_d) + K_P\, S\, (q_d - q) + K_D\, S\, (\dot{q}_d - \dot{q}) \]
where q_d, q̇_d, q̈_d are the desired joint positions, velocities, and accelerations; q, q̇ are the current joint positions and velocities; τ is the vector of generalized forces; S is the selection matrix; and K_P and K_D are respectively the position and velocity feedback gain matrices of the impedance controller.
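As a minimal sketch of this control law, assuming an external `inverse_dynamics` model is available (the callable and gain values here are placeholders, not HyQ's implementation):

```python
import numpy as np

def impedance_control(q, qdot, q_des, qdot_des, qddot_des, Kp, Kd, inverse_dynamics, S=None):
    """Sketch of the control law above: model-based feed-forward plus low-gain
    joint-space feedback. `inverse_dynamics` is a stand-in for the robot's
    dynamics model; Kp and Kd are the (possibly time-scheduled) gain matrices
    and S the selection matrix (identity if None)."""
    if S is None:
        S = np.eye(q.size)
    tau_ff = inverse_dynamics(q, qdot, qddot_des)                 # feed-forward torques
    tau_fb = Kp @ S @ (q_des - q) + Kd @ S @ (qdot_des - qdot)    # impedance feedback
    return tau_ff + tau_fb
```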
The second module stabilizes the robot's trunk. This module corrects the robot's attitude by compensating for deviations of the roll and pitch angles from a reference horizontal frame. Trunk stabilization forces are mapped into joint torques without affecting the feet trajectories. The decoupling is achieved by mapping trunk stabilization torques into the null space of the Jacobians associated with the stance legs.
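A simple, purely kinematic sketch of such a projection is shown below; the actual RCF may use a different (e.g. dynamically consistent) projector, so this is only illustrative.

```python
import numpy as np

def project_trunk_torques(tau_attitude, J_stance):
    """Remove from the attitude-correction torques the component that would act
    on the stance feet, by projecting onto the null space of the stacked
    stance-leg Jacobian J_stance (rows: stance-feet coordinates, columns: joints)."""
    N = np.eye(J_stance.shape[1]) - np.linalg.pinv(J_stance) @ J_stance  # null-space projector
    return N @ tau_attitude
```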
Fig. 2: Learning and Adaptation Framework for HyQ (block diagram: Impedance Policy; Impedance Controller + Inverse Dynamics; Trunk Stabilization; State Estimation; Inverse Kinematics; WCPG Feet Trajectories; Frequency and Phase Estimation).
III. LEARNING IMPEDANCE CONTROL
The method presented in this section for learning compliant locomotion is based on the optimization of two sets of
parameters: the WCPG and the GAIN parameters. The first
set, WCPG parameters, defines the desired feet trajectories
and feed-forward commands. The set of GAIN parameters
comprises the parameters of the parametrized policy for
variable impedance (time-scheduled K_P and K_D) and the
PD parameters for trunk stabilization (feedback commands
for roll and pitch angle corrections).
In this work, the PI2 algorithm is used as the learning algorithm [8] for the optimization of the two sets of parameters.
PI2 performs policy improvements by means of iteratively
executing and evaluating rollouts, where a rollout is a single
execution of randomly perturbed policy parameters. PI2
optimizes a parametrized policy Ψ(t) = g(t)^T θ, where g(t)
is a set of basis functions (in general, nonlinear) and θ is the
vector of policy parameters to be optimized. The algorithm’s
goal is to minimize a user-defined cost function of the form
\[ J(\theta) = \Phi_{t_N} + \int_{t_i}^{t_N} \Big( \phi(t) + \tfrac{1}{2}\, \theta^T R\, \theta \Big)\, dt \]
This cost function consists of a terminal cost Φ_{t_N}, an intermediate state cost φ(t), and an intermediate control cost ½ θ^T R θ, where R is the control cost matrix.
During the learning procedure, exploration is performed by
injecting zero-mean Gaussian noise ε into the policy parameters. The perturbed policy Ψ(t) = g(t)^T (θ + ε) is executed
and its cost function evaluated. The policy parameter update
is performed as a reward-weighted sum of the evaluated
rollouts, where the reward is inversely proportional to the
rollout policy cost.
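The following is a simplified, episodic sketch of this reward-weighted update; it omits the per-time-step weighting of the full PI2 derivation [8], and the rollout count, noise level, and eliteness parameter h are illustrative.

```python
import numpy as np

def pi2_update(theta, cost_fn, n_rollouts=20, noise_std=0.05, h=10.0):
    """One episodic PI2-style update: perturb the policy parameters, run and
    evaluate each rollout, and combine the perturbations with weights that
    favour low-cost rollouts. cost_fn(theta) must execute one rollout and
    return its scalar cost."""
    eps = noise_std * np.random.randn(n_rollouts, theta.size)    # exploration noise
    costs = np.array([cost_fn(theta + e) for e in eps])          # evaluate perturbed policies
    c = (costs - costs.min()) / (costs.max() - costs.min() + 1e-12)
    w = np.exp(-h * c)                                           # reward ~ inverse of cost
    w /= w.sum()
    return theta + w @ eps                                       # reward-weighted parameter update
```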
Specifically for this locomotion task, the Von Mises
basis functions are selected as the basis functions of the
parametrized impedance policy. These basis functions are
periodic functions of a phase variable. Using a phase variable makes it easy to represent a policy for a periodic task like locomotion, and to directly parametrize the policy in terms of the gait phase. They are defined in the form
\[ g_i(\omega) = \exp\big( h_i (\cos(\omega - c_i) - 1) \big) \]
where h_i is the width and c_i the center of the basis function in phase space, ω ∈ [0, 2π]. All policy parameters are initialized
to the default values used in the RCF framework [10].
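A small sketch of evaluating such a periodic policy; the number of basis functions, their widths, and the weights below are hypothetical.

```python
import numpy as np

def von_mises_basis(omega, centers, widths):
    """Periodic basis functions g_i(omega) = exp(h_i (cos(omega - c_i) - 1))
    evaluated at gait phase omega in [0, 2*pi]."""
    return np.exp(widths * (np.cos(omega - centers) - 1.0))

def impedance_policy(omega, theta, centers, widths):
    """Time-scheduled gain Psi(omega) = g(omega)^T theta for one gain channel."""
    return von_mises_basis(omega, centers, widths) @ theta

# Example: 10 equally spaced basis functions over one gait cycle
centers = np.linspace(0.0, 2.0 * np.pi, 10, endpoint=False)
widths = np.full(10, 5.0)
theta = np.full(10, 400.0)   # hypothetical weights for one gain channel
print(impedance_policy(np.pi, theta, centers, widths))
```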
Fig. 3: Learning curve example for a Trotting Gait (noiseless cost contributions: speed, energy, joint limits, torque, WCPG, GAIN, control, and total, versus the number of updates).

Fig. 4: Impedance gains [Nm/rad] in the X, Y, and Z directions and their evolution along different numbers of iterations of the learning experience, plotted over the gait phase [0-1]. TD and TO represent the Touch-Down and Take-Off instants.
The next step is to design a cost function that captures
the features that define the gait quality and, therefore, need
to be optimized. For this locomotion task, the intermediate
and terminal costs, both for the WCPG and for the GAIN
parameters optimization, have been defined as follows
\[ \phi_{\text{wcpg}}(t) = c_s\, J_{\text{speed tracking}}(t) + c_e\, J_{\text{energy efficiency}}(t) + c_j\, J_{\text{closeness to joint limits}}(t) \]
\[ \phi_{\text{gain}}(t) = c_t\, J_{\text{torques}}(t) + \phi_{\text{wcpg}}(t) \]
\[ J_{\text{speed tracking}}(t) = 1 - \frac{v_{\text{HyQ}}}{v_{\text{desired}}} \]
\[ J_{\text{energy efficiency}}(t) = \frac{\sum_{i \in \text{joints}} |\omega_i \tau_i|\, dt}{v_{\text{HyQ}}\, dt} \]
\[ J_{\text{closeness to joint limits}}(t) = \sum_{i \in \text{joints}} \mathcal{N}(\text{limit} - \text{joint}_i,\ \sigma^2) \]
\[ J_{\text{torques}}(t) = \sum_{i \in \text{joints}} \tau_i^2 \]
\[ \Phi_{\text{wcpg}}(t_N) = c_{\text{cpg}_0} + J_{\text{fall}} \]
\[ \Phi_{\text{gain}}(t_N) = c_{\text{pr}}\, \mathrm{Var}_{\text{pitch-roll}} + \Phi_{\text{wcpg}}(t_N) \]
The intermediate cost function for the optimization of the WCPG parameters, φ_wcpg(t), captures three features: it penalizes the tracking error with respect to a desired locomotion speed, J_speed tracking(t) (search for an optimal speed V_wcpg); it favours parameter configurations with high energy efficiency, i.e. a low value of J_energy efficiency(t), which penalizes consumed energy over traveled distance (search for an optimal locomotion frequency f_wcpg and duty cycle d_wcpg); and it keeps the robot's joint motions within their limits (search for an optimal length L_wcpg and height H_wcpg of the ellipses). The final cost for the WCPG parameters, Φ_wcpg(t_N), heavily penalizes parameter configurations that lead the robot to fall (J_fall).
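As a rough illustration, the sketch below evaluates the intermediate cost terms defined above at one time step; the weights, the joint-limit width σ, and the interpretation of the Gaussian penalty are assumptions for the sake of the example.

```python
import numpy as np

def wcpg_intermediate_cost(v_robot, v_des, joint_vel, joint_tau, joint_pos, joint_limits,
                           c_s=1.0, c_e=1.0, c_j=1.0, sigma=0.1, dt=0.002):
    """Sketch of phi_wcpg(t): speed tracking error, energy per traveled distance,
    and a Gaussian penalty when joints approach their limits."""
    J_speed = 1.0 - v_robot / v_des
    J_energy = np.sum(np.abs(joint_vel * joint_tau)) * dt / (v_robot * dt + 1e-9)
    d = joint_limits - joint_pos                        # distance to the joint limits
    J_limits = np.sum(np.exp(-0.5 * (d / sigma) ** 2))  # one reading of N(limit - joint_i, sigma^2)
    return c_s * J_speed + c_e * J_energy + c_j * J_limits

def gain_intermediate_cost(joint_tau, phi_wcpg, c_t=1e-4):
    """Sketch of phi_gain(t) = c_t * sum(tau_i^2) + phi_wcpg(t)."""
    return c_t * np.sum(joint_tau ** 2) + phi_wcpg
```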
The GAIN parameters define both the disturbance rejection behaviour with respect to desired feet trajectories (robust
tracking performance), and the force interaction of the robot
with the environment (compliance). These are conflicting
objectives: low gains make the robot robust on unknown terrain and reduce energy consumption, but position tracking performance is reduced (nominal tracking is ensured by the feed-forward controller); for high gains it is the other way around. The cost function for the GAIN parameters captures this trade-off. The intermediate cost φ_gain(t) penalizes high applied torques (which are especially significant at touch-downs during highly dynamic motions), favouring compliance; it also takes into account the WCPG parameters cost φ_wcpg(t), which rewards good task performance. The final cost Φ_gain(t_N) imposes a penalty over the variance of the
pitch-roll angles trajectory. This guides the learning towards
discovering trajectories that form a stable limit cycle with
minimum torque requirements. This is important because in
a stable limit cycle, trajectories in the neighbourhood of the
nominal trajectory approach the nominal trajectory, making
the locomotion gait robust. Note that no pitch-roll angles
trajectory is imposed, only the variance is considered.
A trade-off between the different objectives is achieved as a result of the multi-criterion optimization. The achieved
Pareto optimal value depends on the weights (c_x); therefore,
they were selected so that the different objectives initially
contribute to the total cost with the same order of magnitude
(Figure 3), and in this way, all objectives are optimized.
Synchronization of the variable impedance policy with the locomotion gait is performed by using adaptive frequency oscillators [12] (which use measurements of the roll angle to estimate the frequency and phase of the gait) and a phase resetting mechanism [13] (which synchronizes and corrects the phase estimate based on feet contact information).
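For illustration, a minimal sketch of one integration step of an adaptive (Hopf-type) frequency oscillator in the spirit of [12] is given below; the gains and time step are placeholders, and the phase-resetting mechanism of [13] is not included.

```python
import numpy as np

def adaptive_hopf_step(x, y, omega, teaching, dt=0.002, gamma=20.0, mu=1.0, K=30.0):
    """One Euler step of an adaptive Hopf oscillator: the oscillator synchronizes
    to the periodic 'teaching' signal (e.g. the measured roll angle) and its
    frequency omega converges towards the signal's frequency."""
    r2 = x * x + y * y
    dx = gamma * (mu - r2) * x - omega * y + K * teaching
    dy = gamma * (mu - r2) * y + omega * x
    domega = -K * teaching * y / (np.sqrt(r2) + 1e-9)   # frequency adaptation rule
    return x + dt * dx, y + dt * dy, omega + dt * domega

# The gait phase estimate can then be read out, e.g., as np.arctan2(y, x).
```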
IV. RESULTS
This section presents an example of the learning process
for a trotting gait at 1m/s and some important results. The
optimization consists of 50 iterations. Figure 3 shows the
contributions of the different objectives to the total cost
during the learning. The first three areas from the bottom (blue, green, and red) represent the costs due to speed tracking error, energy efficiency, and closeness to joint limits. These
costs are used for the updates of the WCPG parameters.

Fig. 5: Impedance policy and Take-off and Touch-down instants in an optimization experiment (Z impedance [Nm/rad], mean and variance over several TD/TO events, and the LF KFE joint trajectory [rad], measured vs. reference, plotted over the normalized gait phase). The robot compliance can be seen in the reduced tracking performance of one of the joints (KFE).

Fig. 6: Generalization of the impedance in the Z direction for different speeds (0-2.1 m/s) by using a Gaussian process.

The purple area represents the cost due to applied torques, the
yellow area shows the cost due to the variance of the pitch-roll trajectory, and the pink area represents the costs for high
impedance gains.
As demonstrated in Figure 3, the WCPG parameters are
explored and updated during the first 25 iterations. This time
window has experimentally been shown to provide enough time to
find a set of WCPG parameters that produces a gait that
accomplishes the main goals for this phase, namely, energy
efficiency, good speed tracking and minimization of the cost
due to closeness to joint limits (Figure 3). The impedance
gains are simultaneously optimized, and after the WCPG
parameters have converged, they are finely optimized, which
reduces the cost even more.
Figure 4 shows the evolution of the impedance parameters over different numbers of iteration updates. For instance, the impedance policy after 20 updates (purple line), where the WCPG parameters are close to convergence, already has a similar shape to the final policy; but giving the algorithm the chance to finely tune these gains improves the gait quality by optimizing gait compliance and robustness (Figure 3).
Figure 5 shows how the learning algorithm modulates the
impedance profile in order to achieve a robust trot gait in the
presence of multiple contact instances with variable, uncertain durations. The graph shows how the learned impedance
policy stiffens for good trajectory tracking during the flight phase, pre-softens to prepare for the touch-down, and reduces the stiffness even more during the stance phase, to compliantly
interact with the environment. The effect of the compliant
policy can be seen in the trajectory tracking performance of
one of the joints, as shown in Figure 5 (second plot).
The described optimization procedure has been performed
at several trot speeds and the learned policies (WCPG and
GAIN parameters per speed) have been generalized for
HyQ’s speed range by using a Gaussian process to interpolate
the results of the policies learned at discrete speeds. Figure 6 shows an example of this generalization for the impedance in the Z direction; the generalized impedance highlights the common trend of the compliant policy at different locomotion speeds: the need for a compliant behaviour during contact, and the higher stiffness required during the swing phase for good trajectory tracking.
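A minimal sketch of this interpolation step, assuming the learned policy weights at a few discrete speeds are available (the speeds, weight values, and kernel below are placeholders; scikit-learn's GaussianProcessRegressor is used for brevity):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical training data: Z-direction impedance weights (10 basis weights)
# learned at four discrete trot speeds.
speeds = np.array([[0.5], [1.0], [1.5], [2.0]])                        # [m/s]
thetas = np.random.default_rng(0).normal(4300.0, 100.0, (4, 10))       # placeholder weights

# Fit one GP over speed (outputs treated independently) and query an unseen speed.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5) + WhiteKernel(1e-2),
                              normalize_y=True)
gp.fit(speeds, thetas)
theta_at_speed = gp.predict(np.array([[1.2]]))   # interpolated policy weights at 1.2 m/s
print(theta_at_speed.shape)                      # (1, 10)
```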
V. CONCLUSION
The presented method has successfully optimized, in simulation, a trotting gait on HyQ in terms of robustness and energy efficiency by learning a compliant impedance policy. The learning framework optimizes the impedance policy taking into account the true system dynamics, the task's constraints and uncertainties, and different contact conditions, and achieves high-quality locomotion gaits at different trot speeds.
ACKNOWLEDGEMENT
This research has been funded partially through a Swiss National Science
Foundation Professorship award to Jonas Buchli.
We would like to thank Victor Barasuol for providing the RCF control
framework and support in its integration with our learning framework.
REFERENCES
[1] J. Friedman and T. Flash, “Task-Dependent Selection of Grasp Kinematics and Stiffness in Human Object Manipulation,” Cortex, 2007.
[2] D. P. Ferris, M. Louie, and C. T. Farley, “Running in the real world:
adjusting leg stiffness for different surfaces,” 1998.
[3] M. Kalakrishnan, L. Righetti, P. Pastor, and S. Schaal, “Learning force
control policies for compliant manipulation,” in IROS, 2011.
[4] P. Fankhauser, M. Hutter, C. Gehring, M. Bloesch, M. A. Hoepflinger, and R. Siegwart, "Reinforcement learning of single legged locomotion," in IROS, 2013.
[5] C. Yang, G. Gowrishankar, S. Haddadin, S. Parusel, A. Albu-Schäffer, and E. Burdet, "Human-like adaptation of force and impedance in stable and unstable interactions," IEEE Transactions on Robotics, 2011.
[6] K. Kronander and A. Billard, “Learning compliant manipulation
through kinesthetic and tactile human-robot interaction,” IEEE Transactions on Haptics, 2013.
[7] J. Hwangbo, C. Gehring, R. Siegwart, and J. Buchli, “Variable
impedance control for legged robots,” Dynamic Walking, 2014.
[8] E. Theodorou, J. Buchli, and S. Schaal, “A generalized path integral
control approach to reinforcement learning,” JMLR, 2010.
[9] J. Buchli, F. Stulp, E. Theodorou, and S. Schaal, “Learning variable
impedance control,” International Journal of Robotics Research, 2011.
[10] V. Barasuol, J. Buchli, C. Semini, M. Frigerio, E. D. Pieri, and D. Caldwell, “A reactive controller framework for quadrupedal locomotion on
challenging terrain,” in ICRA. IEEE, 2013.
[11] C. Semini, N. G. Tsagarakis, E. Guglielmino, M. Focchi, F. Cannella, and D. G. Caldwell, "Design of HyQ - a hydraulically and electrically actuated quadruped robot," 2011.
[12] L. Righetti, J. Buchli, and A. J. Ijspeert, "Dynamic Hebbian learning in adaptive frequency oscillators," Physica D: Nonlinear Phenomena, 2006.
[13] J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and
M. Kawato, “Learning from demonstration and adaptation of biped
locomotion,” Robotics and Autonomous Systems, 2004.