
Weierstrass Institute for Applied Analysis and Stochastics
(Sub-) gradient formulae for probability functions with Gaussian distribution
PGMO-Days 2015
R. Henrion (Weierstrass Institute Berlin)
Wim van Ackooij (EDF R&D)
October 28, 2015
Probability functions
We consider probability functions of the type
ϕ(x) := P(g(x, ξ) ≤ 0),
where
x ∈ Rn is a decision vector
ξ is an m-dimensional random vector defined on a probability space (Ω, A, P)
g : Rn × Rm → Rp is a mapping defining the random inequality system g(x, ξ) ≤ 0
Probability functions occur in many optimization problems from engineering, e.g.
max{ϕ(x) | x ∈ X} reliability maximization
or
min{f (x) | ϕ(x) ≥ p} probabilistic constraints
Challenge: no analytic formula for ϕ and ∇ϕ if ξ has a continuous distribution.
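To make the challenge concrete: ϕ(x) itself is easy to approximate by crude Monte Carlo, but this gives no handle on ∇ϕ(x), which motivates the formulae below. A minimal sketch (not from the slides; it assumes numpy and uses a hypothetical example g):

import numpy as np

def phi_mc(g, x, xi_sample):
    # crude Monte Carlo estimate of phi(x) = P(g(x, xi) <= 0);
    # g maps (x, (N, m)-array of scenarios) to an (N, p)-array of constraint values
    return np.mean(np.all(g(x, xi_sample) <= 0.0, axis=1))

rng = np.random.default_rng(0)
xi = rng.standard_normal((100_000, 2))          # xi ~ N(0, I_2), 10^5 scenarios
g = lambda x, z: z - x                          # hypothetical example: g(x, z) = z - x
print(phi_mc(g, np.array([0.0, 0.0]), xi))      # approximates P(xi <= 0) = 0.25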
Nice data do not necessarily imply nice properties
ϕ(x) := P(Mx + Lξ ≥ b),   ξ ∼ 1-dimensional standard normal

[Figure: graphs of ϕ over (x1, x2) for two example data sets (M | L | b), showing that even such simple data need not yield a well-behaved probability function.]
Our setting
We consider the probability function
ϕ(x) := P(g(x, ξ) ≤ 0),
in the following special case:
ξ ∼ N (µ, Σ) has a multivariate Gaussian distribution
g is continuously differentiable
The components gi (x, ·) are convex
Typical prerequisites for ϕ being well behaved at some x̄:
{ξ | g(x̄, ξ) ≤ 0} is compact
the inequality system g(x̄, ξ) ≤ 0 satisfies some constraint qualification (e.g., Slater, LICQ)
For instance, the following condition is sufficient for continuity of ϕ:
gi(x̄, ξ) = 0 =⇒ ∇ξ gi(x̄, ξ) ≠ 0   ∀ξ ∀i
Probability functions in different disguises
One and the same probability function may have many different integral representations.
In the Gaussian case one has, for instance,

ϕ(x) = P(g(x, ξ) ≤ 0)
     = ∫_{g(x,z) ≤ 0} f(z) dz   (general formula; f = density of ξ)
     = ∫_{v ∈ S^{m−1}} Fη(ρ(x, v)) dµζ(v)   (spheric-radial decomposition)
Moreover, the gradients ∇ϕ(x) may differ in their representations as well.
Gradient formulae for probability functions have been derived, e.g., in:
Raik (1971), Uryasev (1994/95), Marti (1995), Kibzun/Uryasev (1998), Pflug/Weisshaupt (2005),
Royset/Polak (2007), Garnier/Omrane/Rouchdy (2009), v. Ackooij/H./Möller/Zorgati (2011), H./Möller (2012)
Exploiting special structure may yield more efficient gradient formulae.
Spheric-radial decomposition of a Gaussian distribution
Let ξ ∼ N(0, R) with R = LLᵀ. Then,

P(ξ ∈ M) = ∫_{v ∈ S^{m−1}} µη({r ≥ 0 : rLv ∩ M ≠ ∅}) dµζ(v),
where µη , µζ are the laws of η ∼ χ(m) and of the uniform distribution on Sm−1 .
Numerical evaluation requires sampling of the sphere (e.g., QMC, Deák's method).
[Figure: unit sphere S^{m−1} with a direction v, the transformed ray along Lv, and its intersection with the set M.]
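A minimal numerical sketch of this decomposition (not from the slides; it assumes numpy/scipy and uses a ball M = {z : ‖z‖ ≤ c} as a hypothetical test set, for which the radial χ(m)-measure is available in closed form):

import numpy as np
from scipy.stats import chi

def uniform_sphere(m, n, rng):
    # n directions, uniformly distributed on the unit sphere S^{m-1}
    v = rng.standard_normal((n, m))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# Hypothetical test set M = {z : ||z|| <= c}: along the ray r -> r*L*v one has
# r*L*v in M  iff  r <= c / ||L v||, so the chi(m)-measure of that set is
# F_eta(c / ||L v||), and averaging over sampled directions estimates P(xi in M).
m, c = 3, 2.0
R = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])
L = np.linalg.cholesky(R)                      # R = L L^T
V = uniform_sphere(m, 50_000, np.random.default_rng(1))
print(chi.cdf(c / np.linalg.norm(V @ L.T, axis=1), df=m).mean())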
Value of ϕ(x) := P(g(x, ξ) ≤ 0) for ξ ∼ N (0, R) and p = 1
Assume that
ξ ∼ N (0, R) with R = LLT .
g : Rn × Rm → R continuously differentiable
g(x, ·) convex for all x
g(x, 0) < 0
Then,

ϕ(x) = ∫_{v ∈ S^{m−1}} Fη(ρ(x, v)) dµζ(v).
ρ(x, v) is the (possibly infinite) length of the intersection of the ray {rLv : r ≥ 0} with the set {z | g(x, z) ≤ 0}.
Fη refers to the one-dimensional χ-distribution function.
If ρ were differentiable, one could differentiate under the integral sign.
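As an aside (a sketch only, not from the slides), the value formula can be approximated numerically: sample directions v, compute ρ(x, v) by bisection (possible since g(x, ·) is convex and g(x, 0) < 0), and average Fη(ρ(x, v)). The test function g(x, z) = ‖z‖² − x is a hypothetical example with a known answer.

import numpy as np
from scipy.stats import chi

def rho(g, x, d, r_max=1e6, tol=1e-8):
    # length of {r >= 0 : g(x, r*d) <= 0}; by convexity of g(x, .) and g(x, 0) < 0
    # this set is an interval [0, rho]; returns inf if the ray stays feasible up to r_max
    if g(x, r_max * d) <= 0.0:
        return np.inf
    lo, hi = 0.0, r_max
    while hi - lo > tol * max(1.0, lo):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(x, mid * d) <= 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

def phi_sr(g, x, L, V):
    # phi(x) ~ average of F_eta(rho(x, v)) over sampled directions v
    m = L.shape[0]
    r = np.array([rho(g, x, L @ v) for v in V])
    return chi.cdf(r, df=m).mean()             # chi.cdf(inf) = 1

g = lambda x, z: float(z @ z - x)              # hypothetical test: g(x, z) = ||z||^2 - x
rng = np.random.default_rng(2)
V = rng.standard_normal((20_000, 2))
V /= np.linalg.norm(V, axis=1, keepdims=True)
print(phi_sr(g, 3.0, np.eye(2), V))            # should be close to P(chi^2_2 <= 3) ~ 0.777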
Nondifferentiability of probability functions without compactness
Example
ξ ∼ N(0, I2)
g(x1, x2, z1, z2) := x1² e^{h(z1)} + x2 z2 − 1
h(t) := −1 − 2 log(1 − Φ(t))
Φ := distribution function of the 1-dimensional standard Gaussian distribution
ξ has a smooth density
g is continuously differentiable and convex in z
At x̄ := (1, 0), Slater's condition (a constraint qualification) is satisfied.
But: probability function ϕ(x) := P(g(x, ξ) ≤ 0) is not differentiable at x̄.
Reason for failure:
the set {z|g(x̄, z) ≤ 0} = {z|z2 ≤ 1} is unbounded.
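A purely illustrative way to probe this example numerically (not from the slides; crude Monte Carlo with numpy/scipy, so the output is noisy) is to estimate ϕ at points (1, t) for small t and compare the one-sided difference quotients in the x2-direction:

import numpy as np
from scipy.stats import norm

# g(x, z) = x1^2 * exp(h(z1)) + x2*z2 - 1 with exp(h(t)) = exp(-1) / (1 - Phi(t))^2
rng = np.random.default_rng(3)
z = rng.standard_normal((400_000, 2))
exp_h = np.exp(-1.0) / norm.sf(z[:, 0]) ** 2   # norm.sf = 1 - Phi, numerically stable

def phi_est(x1, x2):
    return np.mean(x1 ** 2 * exp_h + x2 * z[:, 1] - 1.0 <= 0.0)

for t in (0.0, 0.05, -0.05):
    print(t, phi_est(1.0, t))
# comparing the one-sided difference quotients in the x2-direction at (1, 0) probes the kink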
Gradient of ϕ(x) := P(g(x, ξ) ≤ 0) for ξ ∼ N (0, R) and p = 1
Theorem (v. Ackooij / H. 2014)
Suppose that g ∈ C¹(R^n × R^m, R), g(x, ·) convex, g(x̄, 0) < 0. Assume the growth condition:

‖∇x g(x, z)‖ ≤ δ0 e^{‖z‖}   ∀x ∈ U(x̄), ∀z : ‖z‖ ≥ C.
Then, ϕ ∈ C¹ and

∇ϕ(x̄) = − ∫_{v ∈ F(x̄)} [ χ(ρ(x̄, v)) / ⟨∇z g(x̄, ρ(x̄, v)Lv), Lv⟩ ] ∇x g(x̄, ρ(x̄, v)Lv) dµζ(v).

Here, F(x̄) := {v ∈ S^{m−1} | ρ(x̄, v) < ∞}.
Using the same sample of the sphere, one may compute ϕ(x̄) and ∇ϕ(x̄) simultaneously.
The growth condition is satisfied in most practically relevant situations.
It is trivially satisfied in the separated case g(x, z) = g1(x) + g2(z).
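A sketch of this sampling-based gradient (not from the slides; same assumptions and hypothetical test function as before, with the bisection helper repeated for self-containment; χ in the formula is read as the χ(m) density, available as scipy.stats.chi.pdf):

import numpy as np
from scipy.stats import chi

def rho(g, x, d, r_max=1e6, tol=1e-8):
    # exit radius of {r >= 0 : g(x, r*d) <= 0}; inf if the ray never leaves the set
    if g(x, r_max * d) <= 0.0:
        return np.inf
    lo, hi = 0.0, r_max
    while hi - lo > tol * max(1.0, lo):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(x, mid * d) <= 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

def grad_phi_sr(g, grad_x_g, grad_z_g, x, L, V):
    # sampling version of the gradient formula above (p = 1); chi.pdf is the chi(m) density
    m, total = L.shape[0], 0.0
    for v in V:
        d = L @ v
        r = rho(g, x, d)
        if np.isinf(r):                        # v not in F(x): no contribution
            continue
        z = r * d
        total = total - chi.pdf(r, df=m) / (grad_z_g(x, z) @ d) * grad_x_g(x, z)
    return total / len(V)

g        = lambda x, z: float(z @ z - x)       # hypothetical test: phi(x) = P(chi^2_2 <= x)
grad_x_g = lambda x, z: np.array([-1.0])
grad_z_g = lambda x, z: 2.0 * z
rng = np.random.default_rng(4)
V = rng.standard_normal((5_000, 2))
V /= np.linalg.norm(V, axis=1, keepdims=True)
print(grad_phi_sr(g, grad_x_g, grad_z_g, 3.0, np.eye(2), V))   # ~ 0.5*exp(-1.5) = 0.112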
Clarke subdifferential of ϕ(x) := P(g(x, ξ) ≤ 0), for ξ ∼ N (0, R), p ≥ 1
Theorem (v. Ackooij / H. 2015)
Suppose that g ∈ C 1 (Rn × Rm , Rp ), gi (x, ·) convex, g (x̄, 0) < 0. Assume the previous growth
condition for each component gi . Then, ϕ is locally Lipschitz and
∂^c ϕ(x̄) ⊆ ∫_{v ∈ F(x̄)} Co{ − [ χ(ρ(x̄, v)) / ⟨∇z gi(x̄, ρ(x̄, v)Lv), Lv⟩ ] ∇x gi(x̄, ρ(x̄, v)Lv) : i ∈ I(v) } dµζ(v)

Here, I(v) := {i | ρ(x̄, v) = ρi(x̄, v)}.
If µζ({v ∈ S^{m−1} | #I(v) ≥ 2}) = 0, then ϕ is strictly differentiable at x̄.

[Figure: rays issuing from 0 with the component ray lengths ρ1(x, v1), ρ2(x, v2) and the resulting minimal length ρ(x, v).]
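A small sketch (hypothetical helper, not from the slides) of the quantities entering this estimate: ρ(x, v) as the minimum of the component ray lengths and the active index set I(v):

import numpy as np

def active_components(rhos, tol=1e-10):
    # rho(x, v) = min_i rho_i(x, v) together with the active index set I(v)
    rhos = np.asarray(rhos, dtype=float)
    r = rhos.min()
    return r, np.flatnonzero(rhos <= r + tol * max(1.0, r))

# for mu_zeta-almost every direction one expects a single active component;
# directions with #I(v) >= 2 contribute the convex hull of the corresponding terms
print(active_components([2.0, 3.5, 2.0 + 1e-12]))   # -> (2.0, array([0, 2]))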
Differentiability of ϕ(x) := P(g(x, ξ) ≤ 0), for ξ ∼ N (0, R), p ≥ 1
Corollary
In addition to the assumptions of the previous theorem assume the following constraint qualification:
rank {∇z gi(x̄, z), ∇z gj(x̄, z)} = 2   ∀i ≠ j ∈ I(z), ∀z : g(x̄, z) ≤ 0,

where I(z) := {i | gi(x̄, z) = 0}.
Then, ϕ is strictly differentiable at x̄. If this condition holds locally around x̄, then ϕ is continuously
differentiable.
A similar result was shown by Uryasev (1994), but under a compactness assumption and for a general probability function (not yielding the special gradient formula displayed before).
Gaussian distribution functions
Classical reproducing gradient formula for Gaussian distribution functions:
Proposition
Let ξ ∼ N(µ, Σ) with Σ regular. Then, for any x ∈ R^n,

∂P(ξ ≤ x)/∂xi (x) = fi(xi) · P(ξ̃^(i) ≤ x_{−i})   ∀i = 1, . . . , n.

Here, fi is the one-dimensional density of ξi ∼ N(µi, Σii) and ξ̃^(i) ∼ N(µ̃^(i), Σ̃^(i)) is an (n − 1)-dimensional Gaussian random vector with Σ̃^(i) regular.
Reproducibility of the formula allows one to
apply same code for values and gradients of Gaussian distribution functions (e.g., Genz)
obtain higher order derivatives by induction
control the precision of gradients by that of values
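A sketch of this reproducing formula (not from the slides; it assumes numpy/scipy, uses standard Gaussian conditioning for the mean and covariance of ξ̃^(i), and evaluates the (n − 1)-dimensional probability with scipy's Genz-type CDF):

import numpy as np
from scipy.stats import norm, multivariate_normal

def grad_gauss_cdf(x, mu, Sigma):
    # gradient of x -> P(xi <= x), xi ~ N(mu, Sigma), via the reproducing formula:
    # dF/dx_i = f_i(x_i) * P(xi_{-i} <= x_{-i} | xi_i = x_i)  (Gaussian conditioning)
    n = len(mu)
    grad = np.empty(n)
    for i in range(n):
        j = [k for k in range(n) if k != i]
        f_i = norm.pdf(x[i], loc=mu[i], scale=np.sqrt(Sigma[i, i]))
        mu_c = mu[j] + Sigma[j, i] * (x[i] - mu[i]) / Sigma[i, i]
        Sig_c = Sigma[np.ix_(j, j)] - np.outer(Sigma[j, i], Sigma[i, j]) / Sigma[i, i]
        grad[i] = f_i * multivariate_normal.cdf(x[j], mean=mu_c, cov=Sig_c)
    return grad

mu = np.zeros(3)                               # hypothetical 3-dimensional example
Sigma = np.array([[1.0, 0.3, 0.2],
                  [0.3, 1.0, 0.1],
                  [0.2, 0.1, 1.0]])
print(grad_gauss_cdf(np.array([0.5, 0.0, -0.2]), mu, Sigma))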
The proposition generalizes to functions P(Aξ ≤ x) with ξ ∼ N(µ, Σ), Σ regular, and A surjective (then Aξ ∼ N(Aµ, AΣAᵀ) with AΣAᵀ regular).
In many applications, however (e.g., network design), A has more rows than columns and hence is not surjective, so the classical formula cannot be applied directly.
The linear case g(x, ξ) = Aξ − x under Gaussian distribution
Theorem
Consider the probability function ϕ(x) := P(Aξ ≤ x) for ξ ∼ N (µ, Σ) with Σ regular. Then,
ϕ is globally Lipschitz ⟺ ϕ is continuous ⟺ Ai ≠ 0 for all rows Ai of A   (H./Römisch 2010)
ϕ is continuously differentiable if rank {Ai, Aj} = 2 ∀i, j : i ≠ j   (follows from the previous result)
Formula from the spheric-radial decomposition:

∇ϕ(x̄) = − ∫_{v ∈ F(x̄), i ∈ I(v)} [ χ(ρ(x̄, v)) / ⟨Ai, Lv⟩ ] ei dµζ(v)
ϕ ∈ C^∞ if the Linear Independence Constraint Qualification (LICQ) is satisfied (H./Möller 2012)
Reproducing formula:

∂P(Aξ ≤ x)/∂xi (x) = fi(xi) · P(Ã^(i) ξ̃^(i) ≤ x̃^(i))
We are looking for a more precise subgradient formula in the Lipschitzian case (in which ϕ is non-regular in general).
Related work on weak directional derivatives: Pflug/Weisshaupt (2005).
Limiting subdifferential of ϕ(x) := P(Aξ ≤ x) for ξ ∼ N (µ, Σ)
Assume that Ai ≠ 0 ∀i and that no Ai is opposite to some Aj for i ≠ j (otherwise zero probability).
Ji := {j | ∃ λj,i > 0 : Aj = λj,i Ai}   (i = 1, . . . , n)
There exist i1, . . . , il ∈ {1, . . . , n} such that {J_{i_1}, . . . , J_{i_l}} is a partition of {1, . . . , n}. Then,
Aξ ≤ x ⟺ Ai ξ ≤ xi (i = 1, . . . , n) ⟺ A_{i_k} ξ ≤ min_{j ∈ J_{i_k}} λ_{j,i_k}^{−1} xj (k = 1, . . . , l) ⟺ Ãξ ≤ h(x),

where à has the rows A_{i_1}, . . . , A_{i_l} and h_k(x) := min_{j ∈ J_{i_k}} λ_{j,i_k}^{−1} xj.
Defining ϕ(x) := P(Aξ ≤ x) and ϕ̃(y) := P(Ãξ ≤ y), we observe that
ϕ(x) = ϕ̃(h(x)) and ϕ̃ ∈ C 1 .
The chain rule for the limiting subdifferential yields

∂ϕ(x̄) = ∂⟨∇ϕ̃(h(x̄)), h⟩(x̄).   (*)

The function ϕ̃ is evidently nondecreasing, so ∇ϕ̃(y) ∈ R^l_+. This allows us to continue (*) as

∂ϕ(x̄) = ∂( Σ_{k=1}^{l} [∂ϕ̃/∂yk](h(x̄)) · hk )(x̄) ⊆ Σ_{k=1}^{l} [∂ϕ̃/∂yk](h(x̄)) · ∂hk(x̄).

On the other hand,

∂hk(x̄) = { λ_{j,i_k}^{−1} ej | j ∈ J_{i_k} : λ_{j,i_k}^{−1} x̄j = hk(x̄) }.
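A sketch (hypothetical helper names, not from the slides) of the reduction step that groups positively proportional rows of A and evaluates h(x):

import numpy as np

def reduce_system(A):
    # group positively proportional rows of A: returns representative indices "reps",
    # for each row j its group k = group[j], and lam[j] with A_j = lam[j] * A_{reps[k]}
    reps, group, lam = [], np.empty(len(A), dtype=int), np.empty(len(A))
    for j, Aj in enumerate(A):
        for k, i in enumerate(reps):
            c = (Aj @ A[i]) / (A[i] @ A[i])
            if c > 0 and np.allclose(Aj, c * A[i]):
                group[j], lam[j] = k, c
                break
        else:
            reps.append(j)
            group[j], lam[j] = len(reps) - 1, 1.0
    return reps, group, lam

def h(x, reps, group, lam):
    # h_k(x) = min over rows j in the k-th group of lam[j]^{-1} * x_j
    y = np.full(len(reps), np.inf)
    for j, xj in enumerate(x):
        y[group[j]] = min(y[group[j]], xj / lam[j])
    return y

A = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])   # hypothetical: rows 0 and 2 proportional
reps, group, lam = reduce_system(A)
print(A[reps], h(np.array([1.0, 0.5, 1.0]), reps, group, lam))   # A~ and h(x)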