Wasserstein GAN
Martin Arjovsky 1, Soumith Chintala 2, Léon Bottou 1,2
1 Courant Institute of Mathematical Sciences
2 Facebook AI Research
Presented by Chunyuan Li
Preliminaries: “vanilla” GAN
1. Real data distribution Pr ; generator's distribution Pg , implemented as x = G(z), z ∼ P (z).
   The two players solve minG maxD V (D, G).
   Discriminator loss:
       −Ex∼Pr [log D(x)] − Ex∼Pg [log(1 − D(x))]          (1)
   D(x): the probability that x comes from the real data rather than from the generator.
   Generator loss, two variants (compared numerically at the end of this slide):
       GAN0 :  Ex∼Pg [log(1 − D(x))]                       (2)
       GAN1 :  Ex∼Pg [− log D(x)]                          (3)
2. Problems
[Goodfellow et al., 2014]:
P1: “In practice, GAN0 may not provide sufficient gradient
for G to learn well”, GAN1 is used instead. (log D trick)
P2: “G collapses too many values of z to the same value of
x” (Mode collapse in GAN1 )
What is the principled interpretation? [Arjovsky and Bottou, 2017]
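To see the log D trick concretely, here is a minimal NumPy sketch comparing the two generator losses (2) and (3); the discriminator outputs below are made-up values standing in for a confident discriminator.

```python
import numpy as np

def generator_loss_gan0(d_fake):
    """GAN0 (Eq. 2): E_{x~Pg}[log(1 - D(x))] -- nearly flat when D(x) is close to 0."""
    return np.mean(np.log(1.0 - d_fake))

def generator_loss_gan1(d_fake):
    """GAN1 (Eq. 3, the 'log D trick'): E_{x~Pg}[-log D(x)] -- stays steep when D(x) is close to 0."""
    return np.mean(-np.log(d_fake))

# Hypothetical discriminator outputs on generated samples (a strong D gives values near 0).
d_fake = np.array([1e-3, 5e-3, 1e-2])
print(generator_loss_gan0(d_fake))  # ~ -0.005: almost flat region, little gradient for G
print(generator_loss_gan1(d_fake))  # ~ 5.6: far from its minimum, G still gets a useful signal
```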
P1 on GAN0
In GAN0 , the better the discriminator, the more severely the generator's gradient vanishes.
Q: Why are GANs difficult to train?
A: Either our updates to the discriminator will be inaccurate, or they will vanish. This leaves it up to the user to decide the precise amount of training dedicated to the discriminator, which can make GAN training hard.
P1 on GAN0
Proof Sketch:
1. Minimizing the generator loss amounts to minimizing the JS divergence when the discriminator is optimal.
   For a given x, the optimal discriminator is
       D∗ (x) = Pr (x) / (Pr (x) + Pg (x))                  (4)
   The generator loss (after adding a term independent of Pg ) is
       L = Ex∼Pr [log D(x)] + Ex∼Pg [log(1 − D(x))]         (5)
   Plugging (4) into (5) gives
       2 JS(Pr ||Pg ) − 2 log 2                             (6)
2. If the supports (underlying low-dimensional manifolds) of Pr and Pg (almost) have no overlap, then JS(Pr ||Pg ) = log 2 (Theorem 2.3), and thus the gradient of (5) w.r.t. Pg vanishes (Theorem 2.4 and Corollary 2.1); this saturation is illustrated numerically below.
3. The probability that the supports of Pr and Pg have almost zero overlap is 1 (Lemma 2, Lemma 3 and Theorem 2.2).
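A minimal NumPy sketch of this saturation, using two made-up discrete distributions with disjoint supports: JS equals log 2 regardless of how far apart the supports are, so the objective in (6) carries no information about their distance.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p||q) for discrete distributions on a shared grid."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps)))

def js(p, q):
    """JS(p||q) = 0.5*KL(p||m) + 0.5*KL(q||m), with m = (p+q)/2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Disjoint supports: whether the mass sits "near" or "far", JS is stuck at log 2.
p      = np.array([0.5, 0.5, 0.0, 0.0, 0.0, 0.0])
q_near = np.array([0.0, 0.0, 0.5, 0.5, 0.0, 0.0])
q_far  = np.array([0.0, 0.0, 0.0, 0.0, 0.5, 0.5])
print(js(p, q_near), js(p, q_far), np.log(2))  # all ~0.6931: (6) is constant, so its gradient is zero
```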
P2 on GAN1
GAN1 optimizes a conflicting/asymmetric objective, which leads to (1) unstable gradients and (2) mode collapse.
P2 on GAN1
Proof Sketch:
1. (Theorem 2.5) GAN1 is equivalent to optimizing
       KL(Pg ||Pr ) − 2 JS(Pg ||Pr )                        (7)
2. The KL and JS terms enter with opposite signs (Theorem 2.6: instability of generator gradient updates).
3. The KL term is KL(Pg ||Pr ), NOT KL(Pr ||Pg ) (see the numerical sketch below):
   KL(Pg ||Pr ) assigns a high cost to generating fake-looking samples, and a low cost to mode dropping;
   KL(Pr ||Pg ) assigns a high cost to not covering parts of the data, and a low cost to generating fake-looking samples.
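A minimal NumPy sketch of this asymmetry on a made-up three-bin example: mode dropping is cheap under KL(Pg ||Pr ) while placing mass off the data support (fake-looking samples) is expensive, and KL(Pr ||Pg ) behaves the opposite way.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p||q) for discrete distributions on a shared grid."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps)))

# Real data Pr has two equally likely modes; the last bin is "off the data manifold".
pr      = np.array([0.5, 0.5, 0.0])
pg_drop = np.array([1.0, 0.0, 0.0])   # mode dropping: covers only the first mode
pg_fake = np.array([0.4, 0.4, 0.2])   # spreads mass onto fake-looking samples

print(kl(pg_drop, pr))  # log 2 ~ 0.69: mode dropping is cheap under KL(Pg||Pr)
print(kl(pg_fake, pr))  # ~ 5.0 (unbounded as eps -> 0): fake samples are heavily penalized
print(kl(pr, pg_drop))  # ~ 13.1 (unbounded as eps -> 0): KL(Pr||Pg) instead punishes the dropped mode
```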
Preliminaries: distance measures for distributions
1. KL divergence
       KL(P ||Q) = EP [log(P/Q)]
2. JS divergence
       JS(P ||Q) = (1/2) KL(P || (P + Q)/2) + (1/2) KL(Q || (P + Q)/2)
3. Wasserstein distance (a 1-D numerical example follows the list)
       W (P, Q) = inf_{γ∈Π(P,Q)} E(x,y)∼γ [||x − y||]
   Π(P, Q) denotes the set of all joint distributions γ(x, y) whose marginals are P and Q, respectively.
   γ(x, y) indicates a plan to transport “mass” from x to y when deforming P into Q.
   The Wasserstein (or Earth-Mover) distance is then the “cost” of the optimal transport plan.
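A quick numerical illustration of the earth-mover view in 1-D, assuming SciPy is available: scipy.stats.wasserstein_distance computes W1 between two empirical samples, and for two equal-variance Gaussians it recovers the distance between the means.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D earth-mover (W1) distance between samples

rng = np.random.default_rng(0)
p_samples = rng.normal(loc=0.0, scale=1.0, size=10_000)
q_samples = rng.normal(loc=3.0, scale=1.0, size=10_000)

# For two 1-D Gaussians with equal variance, W1 equals the difference of the means.
print(wasserstein_distance(p_samples, q_samples))  # ~ 3.0
```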
Examples
P0 : distribution of (0, Z), where Z ∼ U [0, 1]
Pθ : distribution of (θ, Z), where θ is a single real parameter
    KL(P0 ||Pθ ) = KL(Pθ ||P0 ) = +∞ if θ ≠ 0, and 0 if θ = 0
    JS(P0 ||Pθ ) = log 2 if θ ≠ 0, and 0 if θ = 0
    W (P0 , Pθ ) = |θ|   (a short derivation follows below)
[Figure (a) Distributions: the supports of P0 and Pθ are two parallel vertical segments, at horizontal positions 0 and θ, with Z ∈ [0, 1].]
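A short derivation of the last identity, written out here as a sketch since the slide only states the result: the coupling that slides each point of P0 horizontally onto Pθ is optimal.

```latex
% Upper bound: couple (0, z) with (\theta, z), i.e. transport each point horizontally.
% Then E_{(x,y)\sim\gamma}\,\|x - y\| = \int_0^1 \|(0, z) - (\theta, z)\|\, dz = |\theta|.
% Lower bound: any coupling must move mass from the line x_1 = 0 to the line x_1 = \theta,
% and every such move covers horizontal distance at least |\theta|, so E_\gamma\|x - y\| \ge |\theta|.
% Hence
W(P_0, P_\theta) \;=\; \inf_{\gamma \in \Pi(P_0, P_\theta)} \mathbb{E}_{(x,y)\sim\gamma}\big[\|x - y\|\big] \;=\; |\theta|.
```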
Insight
[Figure (b): Output of W and JS]
1. Observations
   When the distributions are supported on low-dimensional manifolds (as Pr and Pg are in GANs):
   KL and JS are essentially binary and provide no meaningful gradient;
   W is continuous and differentiable, hence always provides a sensible signal.
2. Theoretical support
   Theorem 1 supports the above statement.
   Corollary 1 says Theorem 1 remains true when the mapping is a neural network.
   Theorem 2 implies the TV distance has the same problem as KL and JS.
Wasserstein GAN
The infimum in the primal definition is highly intractable.
The Wasserstein distance has a dual form:
    W (Pr , Pg ) = sup_{||f ||L ≤ 1} Ex∼Pr [f (x)] − Ex∼Pg [f (x)]               (8)
                 = (1/K) sup_{||f ||L ≤ K} Ex∼Pr [f (x)] − Ex∼Pg [f (x)]         (9)
where in (9) the supremum is over all K-Lipschitz functions.
Consider a w-parameterized family of functions {fw }w∈W that are all K-Lipschitz; then, up to a multiplicative constant,
    W (Pr , Pg ) ≈ max_{w∈W} Ex∼Pr [fw (x)] − Ex∼Pg [fw (x)]                     (10)
For example, W = [−c, c]^l (weight clipping).
Wasserstein GAN
The critic (discriminator) objective, maximized w.r.t. w:
    Ex∼Pr [fw (x)] − Ex∼Pg [fw (x)]                                              (11)
The generator loss, minimized w.r.t. θ:
    −Ex∼Pg [fw (x)] = −Ez∼p(z) [fw (gθ (z))]                                     (12)
Algorithm
[Algorithm box: the WGAN training procedure (Algorithm 1 in the paper)]
Main differences to vanilla GAN (see the training-loop sketch below):
    Remove the sigmoid of the last layer in D.
    Remove the log in the losses of D and G.
    Clip the parameters of D to an interval centered at 0.
    Momentum-based optimization is avoided (the paper uses RMSProp).
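A minimal PyTorch sketch of a training loop with these modifications; the toy MLP architectures, dimensions, hyper-parameters and the sample_real placeholder are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 2          # assumed toy dimensions
critic = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))       # no sigmoid
generator = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))

opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)      # momentum-free optimizer
opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)
n_critic, clip_c, batch = 5, 0.01, 64

def sample_real(n):
    """Placeholder for the real data loader (here: a shifted Gaussian)."""
    return torch.randn(n, data_dim) + 3.0

for step in range(1000):
    # --- train the critic n_critic times per generator update ---
    for _ in range(n_critic):
        x_real = sample_real(batch)
        x_fake = generator(torch.randn(batch, latent_dim)).detach()
        # maximize E[f_w(real)] - E[f_w(fake)], i.e. minimize its negative (Eq. 11)
        loss_c = -(critic(x_real).mean() - critic(x_fake).mean())
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        # clip critic weights so f_w stays (roughly) K-Lipschitz: W = [-c, c]^l
        for p in critic.parameters():
            p.data.clamp_(-clip_c, clip_c)
    # --- generator update: minimize -E_z[f_w(g_theta(z))] (Eq. 12) ---
    z = torch.randn(batch, latent_dim)
    loss_g = -critic(generator(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```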
Meaningful loss metric
A meaningful loss metric that correlates with the generator's convergence and sample quality.
The WGAN algorithm trains the critic relatively well before each generator update, so the loss at that point is an estimate of the EM distance.
This is NOT claimed as a method to quantitatively evaluate generative models.
[Figure. Top: DCGAN discriminator; Bottom: MLP discriminator]
Improved stability
WGAN allows the critic to be trained till optimality, so there is no longer a need to carefully balance the capacities of the generator and the discriminator.
[Samples: a DCGAN generator without batch normalization]
In no experiment did the authors see evidence of mode collapse.
[Samples: a generator constructed as an MLP]
Integral Probability Metrics (IPMs)
    dF (Pr , Pg ) = sup_{f ∈F} Ex∼Pr [f (x)] − Ex∼Pg [f (x)]                     (13)
1. Wasserstein distance: F is the set of K-Lipschitz functions.
2. Total variation distance: F is the set of all measurable functions bounded between −1 and 1.
3. Energy-based GANs: a generative approach to the total variation distance.
4. Maximum Mean Discrepancy (MMD): F = {f ∈ H : ||f ||H ≤ 1} for some RKHS H [Sutherland et al., 2016] (a small estimator sketch follows the list).
5. Kernelized Stein Discrepancy: a special case of MMD, with “Steinalized” kernels depending on Pg , i.e., κ(x, x′) = T_{Pg}^x (T_{Pg}^{x′} ⊗ k(x, x′)) [Wang and Liu, 2016].
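For concreteness, a minimal NumPy sketch of an MMD² estimate between two sample sets; the Gaussian RBF kernel, its bandwidth, and the biased V-statistic form are choices made for this illustration.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel matrix between sample sets x (n, d) and y (m, d)."""
    sq = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_biased(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of MMD^2 between the distributions of x and y."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=(500, 2))
y = rng.normal(1, 1, size=(500, 2))
print(mmd2_biased(x, y))  # clearly > 0: the two sample distributions differ; ~0 when they match
```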
References I
[Arjovsky and Bottou, 2017] Arjovsky, M. and Bottou, L. (2017).
Towards principled methods for training generative adversarial networks.
[Goodfellow et al., 2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley,
D., Ozair, S., Courville, A., and Bengio, Y. (2014).
Generative adversarial nets.
In Advances in neural information processing systems, pages 2672–2680.
[Sutherland et al., 2016] Sutherland, D. J., Tung, H.-Y., Strathmann, H., De, S., Ramdas, A.,
Smola, A., and Gretton, A. (2016).
Generative models and model criticism via optimized maximum mean discrepancy.
arXiv preprint arXiv:1611.04488.
[Wang and Liu, 2016] Wang, D. and Liu, Q. (2016).
Learning to draw samples: With application to amortized MLE for generative adversarial learning.
arXiv preprint arXiv:1611.01722.