Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods
Yaakov Engel
Joint work with Peter Szabo and Dmitry Volkinshtein (ex. Technion)
Gaussian Processes in Practice Workshop
Why use GPs in RL?
• A Bayesian approach to value estimation
• Forces us to make our assumptions explicit
• Non-parametric – priors are placed and inference is performed
directly in function space (kernels).
• Domain knowledge intuitively coded into priors
• Provides full posterior, not just point estimates
• Efficient, on-line implementation, suitable for large problems
Markov Decision Processes
$X$: state space
$U$: action space
$p: X \times X \times U \to [0, 1]$, with $x_{t+1} \sim p(\cdot \mid x_t, u_t)$
$q: \mathbb{R} \times X \times U \to [0, 1]$, with $R(x_t, u_t) \sim q(\cdot \mid x_t, u_t)$
A stationary policy: $\mu: U \times X \to [0, 1]$, with $u_t \sim \mu(\cdot \mid x_t)$
Discounted return: $D^{\mu}(x) = \sum_{i=0}^{\infty} \gamma^i R(x_i, u_i) \mid (x_0 = x)$
Value function: $V^{\mu}(x) = \mathbb{E}_{\mu}[D^{\mu}(x)]$
Goal: Find a policy $\mu^*$ maximizing $V^{\mu}(x)$ for all $x \in X$
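The discounted return is easy to state in code; a minimal sketch (the reward sequence and discount factor below are placeholder values):

import numpy as np

def discounted_return(rewards, gamma=0.95):
    # Sum of gamma^i * R(x_i, u_i) along one sampled trajectory
    rewards = np.asarray(rewards, dtype=float)
    return float(np.dot(gamma ** np.arange(len(rewards)), rewards))

# Averaging discounted returns over many trajectories sampled under a
# policy mu gives a Monte-Carlo estimate of V^mu(x_0).
print(discounted_return([-1.0, -1.0, -1.0, 10.0], gamma=0.95))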
Bellman’s Equation
For a fixed policy µ:
$$V^{\mu}(x) = \mathbb{E}_{x', u \mid x}\left[ R(x, u) + \gamma V^{\mu}(x') \right]$$
Optimal value and policy:
$$V^*(x) = \max_{\mu} V^{\mu}(x), \qquad \mu^* = \operatorname*{argmax}_{\mu} V^{\mu}(x)$$
How to solve it?
- Methods based on Value Iteration (e.g. Q-learning)
- Methods based on Policy Iteration (e.g. SARSA, OPI, Actor-Critic)
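For a small finite MDP, the fixed-policy Bellman equation above can be solved exactly as a linear system; a minimal sketch (not the GPTD approach of the later slides, and with made-up numbers):

import numpy as np

def evaluate_policy(P, R, gamma=0.95):
    # Solve V = R + gamma * P V, i.e. (I - gamma * P) V = R, where
    # P[i, j] = p(x_j | x_i) under the fixed policy and R[i] is the
    # expected immediate reward in state x_i.
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Tiny 2-state example with made-up transition probabilities and rewards:
P = np.array([[0.9, 0.1],
              [0.0, 1.0]])
R = np.array([-1.0, 10.0])
print(evaluate_policy(P, R))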
Solution Method Taxonomy
RL algorithms (taxonomy diagram):
- Purely policy-based (policy gradient)
- Value-function based:
  - Value Iteration type (Q-learning)
  - Policy Iteration type (Actor-Critic, OPI, SARSA)
PI methods need a “subroutine” for policy evaluation
Gaussian Process Temporal Difference Learning
Model Equations:
$$R(x_i) = V(x_i) - \gamma V(x_{i+1}) + N(x_i)$$
Or, in compact form:
$$R_t = H_{t+1} V_{t+1} + N_t, \qquad
H_t = \begin{bmatrix}
1 & -\gamma & 0 & \cdots & 0 \\
0 & 1 & -\gamma & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 & -\gamma
\end{bmatrix}.$$
Our (Bayesian) goal:
Find the posterior distribution of V(·), given a sequence of observed states and rewards.
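In code, the band matrix H follows directly from its definition above; a minimal sketch (shapes chosen for a trajectory of T visited states, one row per observed transition):

import numpy as np

def make_H(T, gamma=0.95):
    # (T-1) x T matrix with 1 on the diagonal and -gamma on the
    # superdiagonal: row i encodes R(x_i) = V(x_i) - gamma V(x_{i+1}) + N(x_i)
    H = np.zeros((T - 1, T))
    i = np.arange(T - 1)
    H[i, i] = 1.0
    H[i, i + 1] = -gamma
    return H

print(make_H(4, gamma=0.9))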
The Posterior
General noise covariance: $\mathrm{Cov}[N_t] = \Sigma_t$
Joint distribution:
$$\begin{bmatrix} R_{t-1} \\ V(x) \end{bmatrix} \sim
\mathcal{N}\!\left(
\begin{bmatrix} 0 \\ 0 \end{bmatrix},
\begin{bmatrix}
H_t K_t H_t^{\top} + \Sigma_t & H_t k_t(x) \\
k_t(x)^{\top} H_t^{\top} & k(x, x)
\end{bmatrix}
\right)$$
Invoke Bayes' rule:
$$\mathbb{E}[V(x) \mid R_{t-1}] = k_t(x)^{\top} \alpha_t$$
$$\mathrm{Cov}[V(x), V(x') \mid R_{t-1}] = k(x, x') - k_t(x)^{\top} C_t k_t(x')$$
where $k_t(x) = \left(k(x_1, x), \ldots, k(x_t, x)\right)^{\top}$.
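A direct batch implementation of these posterior formulas, as a sketch (the squared-exponential kernel and white observation noise are illustrative choices; the algorithm used in the talk is the recursive, sparsified online version):

import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel matrix between two sets of states
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gptd_posterior(X, r, x_query, gamma=0.95, noise=0.1):
    # X: (T, d) visited states, r: (T-1,) observed rewards along the trajectory
    T = X.shape[0]
    H = np.zeros((T - 1, T))
    H[np.arange(T - 1), np.arange(T - 1)] = 1.0
    H[np.arange(T - 1), np.arange(1, T)] = -gamma
    K = rbf_kernel(X, X)
    Sigma = noise ** 2 * np.eye(T - 1)            # simple white-noise covariance
    G = np.linalg.inv(H @ K @ H.T + Sigma)
    k_q = rbf_kernel(X, x_query[None, :])         # k_t(x), shape (T, 1)
    alpha = H.T @ G @ r                           # posterior mean weights
    C = H.T @ G @ H
    mean = (k_q.T @ alpha).item()
    var = (rbf_kernel(x_query[None, :], x_query[None, :]) - k_q.T @ C @ k_q).item()
    return mean, var

# Toy usage with random states and rewards:
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
r = rng.normal(size=4)
print(gptd_posterior(X, r, x_query=np.zeros(2)))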
The Octopus Arm
• Can bend and twist at any point
• Can do this in any direction
• Can be elongated and shortened
• Can change cross section
• Can grab using any part of the arm
• Virtually infinitely many DOF
The Muscular Hydrostat Mechanism
A constraint: Muscle fibers can only contract (actively)
In vertebrate limbs, two separate muscle groups - agonists and
antagonists - are used to control each DOF of every joint by exerting
opposite torques.
But the Octopus has no skeleton!
Balloon example
Muscle tissue is incompressible; therefore, if different muscle groups
are interleaved in perpendicular directions in the same region,
contraction in one direction results in extension in at least one of
the other directions.
This is the Muscular Hydrostat mechanism
Octopus Arm Anatomy 101
Our Arm Model
[Figure: schematic arm model: a chain of compartments C1 ... CN between the arm base and the arm tip, with muscle pairs #1 ... #N+1, longitudinal muscles along the dorsal and ventral sides, and transverse muscles between them.]
The Muscle Model
$$f(a) = \bigl(k_0 + a\,(k_{\max} - k_0)\bigr)(\ell - \ell_0) + \beta \frac{d\ell}{dt}, \qquad a \in [0, 1]$$
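A direct transcription of this muscle model, as a sketch (the constants k0, kmax, l0, and beta below are placeholders, not the values used in the simulator):

def muscle_force(a, length, dlength_dt, k0=1.0, kmax=10.0, l0=0.1, beta=0.5):
    # Activation-dependent spring (stiffness interpolates between k0 and kmax)
    # plus a damping term proportional to the rate of length change.
    a = min(max(a, 0.0), 1.0)   # activation is constrained to [0, 1]
    stiffness = k0 + a * (kmax - k0)
    return stiffness * (length - l0) + beta * dlength_dt

print(muscle_force(a=0.8, length=0.12, dlength_dt=-0.01))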
Other Forces
• Gravity
• Buoyancy
• Water drag
• Internal pressures (maintain constant compartmental volume)
Dimensionality
10 compartments ⇒ 22 point masses × (x, y, ẋ, ẏ) = 88 state variables
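The same count, spelled out (a sketch of the arithmetic; one pair of point masses per compartment boundary):

n_compartments = 10
n_mass_pairs = n_compartments + 1     # pairs #1 ... #N+1 along the arm
n_point_masses = 2 * n_mass_pairs     # dorsal and ventral mass in each pair
state_dim = 4 * n_point_masses        # (x, y, x_dot, y_dot) per point mass
print(state_dim)                      # -> 88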
The Control Problem
Starting from a random position, bring {any part, the tip} of the arm into contact with a goal region, optimally.
Optimality criteria:
Time, energy, obstacle avoidance
Constraint:
We only have access to sampled trajectories
Our approach:
Define the problem as an MDP
Apply Reinforcement Learning algorithms
The Task
[Figure: snapshot of the simulated arm in the task environment at t = 1.38.]
Actions
Each action specifies a set of fixed activations –
one for each muscle in the arm.
[Figure: diagrams of the six fixed activation patterns, Action #1 through Action #6.]
Base rotation adds duplicates of actions 1,2,4 and 5 with
positive and negative torques applied to the base.
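A hedged sketch of how this discrete action set could be represented (the number of muscles, the activation patterns, and the torque magnitude are placeholders):

import numpy as np

N_MUSCLES = 30          # placeholder: depends on the arm discretization
BASE_TORQUE = 1.0       # placeholder magnitude

# Six fixed activation patterns, each giving one activation per muscle
# (zeros here stand in for the actual patterns shown in the figure).
patterns = {i: np.zeros(N_MUSCLES) for i in range(1, 7)}

# Fixed-base action set: the six patterns with no base torque.
actions = [(patterns[i], 0.0) for i in range(1, 7)]

# Rotating base: duplicate actions 1, 2, 4 and 5 with +/- torque at the base.
for i in (1, 2, 4, 5):
    actions.append((patterns[i], +BASE_TORQUE))
    actions.append((patterns[i], -BASE_TORQUE))

print(len(actions))     # 6 + 8 = 14 actions for the rotating-base task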
Rewards
Deterministic rewards:
• +10 for reaching a goal state
• a large negative value for hitting an obstacle
• -1 otherwise
Energy economy:
A constant multiple of the energy expended by the muscles in
each action interval was deducted from the reward.
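Put together, the reward for one action interval could look like the following sketch (the obstacle penalty and the energy coefficient are placeholders):

def reward(reached_goal, hit_obstacle, energy_used,
           obstacle_penalty=-50.0, energy_coeff=0.01):
    # Deterministic reward rule plus the energy-economy deduction.
    if reached_goal:
        r = 10.0
    elif hit_obstacle:
        r = obstacle_penalty        # "large negative value" (placeholder)
    else:
        r = -1.0
    return r - energy_coeff * energy_used

print(reward(reached_goal=False, hit_obstacle=False, energy_used=3.2))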
Fixed Base Task I
Fixed Base Task II
Rotating Base Task I
Rotating Base Task II
To Wrap Up
• There’s more to GPs than regression and classification
• Online sparsification works
Challenges
• How to use value uncertainty?
• What’s a disciplined way to select actions?
• What’s the best noise covariance?
• More complicated tasks