NTHU AI Reading Group: Improved Training of Wasserstein GANs Mark Chang 2017/6/6
Outlines
• Wasserstein GANs
• Derivation of Kantorovich-Rubinstein Duality
• Improved Training of WGANs
• Experiments
Outlines
• Wasserstein GANs
  • Regular GANs
  • Source of Instability
  • Earth Mover's Distance
  • Kantorovich-Rubinstein Duality
  • Wasserstein GANs
  • Weight Clipping
• Derivation of Kantorovich-Rubinstein Duality
• Improved Training of WGANs
• Experiments
Regular GANs
• Generator Network G(z): maps a prior z \sim P_z(z) to generated data
• Discriminator Network D(x): ends with a sigmoid function; distinguishes real data x \sim P_r(x) from generated data
• Objective: \min_G \max_D V(D, G), where

V(D, G) = \mathbb{E}_{x \sim P_r(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]
Source of Instability
• Disjoint distributions: the real data distribution P_r(x) and the generated data distribution P_g(x) have (near-)disjoint supports
• The optimal discriminator D^*(x) then separates them perfectly
• Vanishing gradient: the term \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))] saturates, so the generator receives almost no gradient from

V(D, G) = \mathbb{E}_{x \sim P_r(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]
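The saturation can be seen numerically: with D = sigmoid(a) for logit a, the gradient of log(1 - D) with respect to a is -sigmoid(a), which vanishes as the discriminator grows confident a sample is fake. A toy sketch (the helper names are mine, not from the slides):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def generator_loss_grad(logit):
    """d/da of log(1 - sigmoid(a)) = -sigmoid(a): the gradient the
    generator receives through the discriminator's logit."""
    return -sigmoid(logit)
```

When the discriminator is undecided (logit 0) the gradient magnitude is 0.5, but once it confidently rejects a fake (logit -10) the gradient is nearly zero.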
Earth Mover's Distance
• Cost function of WGAN: Earth Mover's Distance, replacing the regular GAN objective V(D, G)

EMD(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \sum_{x,y} \|x - y\| \, \gamma(x, y) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y) \sim \gamma} \|x - y\|
Earth Mover's Distance
photo credit: https://vincentherrmann.github.io/blog/wasserstein/
• \gamma(x, y) is a transport plan: the probability mass moved from point x of the real data to point y of the generated data, e.g. \gamma(x_1, y_2) is the mass moved from x_1 to y_2
• Marginal constraints:

\sum_y \gamma(x, y) = P_r(x), \quad \sum_x \gamma(x, y) = P_\theta(y)

EMD(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \sum_{x,y} \|x - y\| \, \gamma(x, y) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y) \sim \gamma} \|x - y\|
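The infimum over transport plans has a simple closed form in one dimension: W_1 equals the sum of absolute differences between the two CDFs. A minimal sketch (the helper name and the unit-spaced-support assumption are mine, not from the slides):

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two discrete distributions on the
    same unit-spaced 1-D support: W1 = sum of |CDF differences|."""
    assert len(p) == len(q)
    cum_p = cum_q = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cum_p += pi
        cum_q += qi
        total += abs(cum_p - cum_q)
    return total
```

For example, moving all mass from point 0 to point 2 costs 2, since the mass travels distance 2.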
Kantorovich-Rubinstein Duality
• The primal formula is highly intractable:

EMD(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \sum_{x,y} \|x - y\| \, \gamma(x, y) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y) \sim \gamma} \|x - y\|

• The Kantorovich-Rubinstein duality gives an equivalent form over 1-Lipschitz functions:

EMD(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r} f(x) - \mathbb{E}_{x \sim P_\theta} f(x)
Wasserstein GANs
• Generator Network g_\theta: maps a prior z \sim P_z(z) to generated data g_\theta(z)
• Critic Network f_w: no sigmoid function; scores real data x \sim P_r(x) as f_w(x) and generated data as f_w(g_\theta(z))
• Objective:

\min_\theta \max_{w \in [-c, c]^l} \mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{z \sim P_z}[f_w(g_\theta(z))]

• Restricting w to [-c, c]^l enforces a k-Lipschitz constraint on f_w
Wasserstein GANs
• k-Lipschitz continuity: a real function f(x) is k-Lipschitz continuous if there exists k such that for all x_1, x_2:

|f(x_1) - f(x_2)| \le k |x_1 - x_2|

i.e. \frac{|f(x_1) - f(x_2)|}{|x_1 - x_2|} \le k, with the bound attained by g(x) = kx.
photo credit: https://en.wikipedia.org/wiki/Lipschitz_continuity
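The definition can be probed empirically; a small sketch (hypothetical helper, not from the slides) that estimates a 1-D function's Lipschitz constant by sampling pairwise slopes:

```python
import random

def estimate_lipschitz(f, lo=-10.0, hi=10.0, n_pairs=10000, seed=0):
    """Empirically estimate the Lipschitz constant of f on [lo, hi]
    as the largest sampled slope |f(x1) - f(x2)| / |x1 - x2|."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_pairs):
        x1, x2 = rng.uniform(lo, hi), rng.uniform(lo, hi)
        if x1 != x2:
            best = max(best, abs(f(x1) - f(x2)) / abs(x1 - x2))
    return best
```

For the linear function g(x) = 3x every sampled slope is 3, matching the slide's g(x) = kx example of a function that attains its Lipschitz bound.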
Weight Clipping
• Enforce a k-Lipschitz constraint by weight clipping: w \in [-c, c]^l
• f(x) is a multi-layer neural network, so bounding every weight bounds its Lipschitz constant
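Weight clipping itself is one line per parameter tensor; a minimal framework-free sketch (assuming NumPy arrays for the weights; the WGAN paper's reported clipping constant is c = 0.01):

```python
import numpy as np

def clip_weights(weights, c=0.01):
    """Clamp every entry of every weight array into [-c, c], as done
    after each critic update in WGAN training."""
    return [np.clip(w, -c, c) for w in weights]
```

Entries already inside [-c, c] are left untouched; only out-of-range weights saturate at the boundary.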
Outlines
• Wasserstein GANs
• Derivation of Kantorovich-Rubinstein Duality
  • Earth Mover's Distance
  • Linear Programming
  • Dual Form
• Improved Training of WGANs
• Experiments
Derivation of Kantorovich-Rubinstein Duality
• Wasserstein GAN and the Kantorovich-Rubinstein Duality
  • https://vincentherrmann.github.io/blog/wasserstein/
• Optimal Transportation: Continuous and Discrete
  • http://smat.epfl.ch/~zemel/vt/pdm.pdf
• Optimal Transport: Old and New
  • http://www.springer.com/br/book/9783540710493
Earth Mover's Distance
photo credit: https://vincentherrmann.github.io/blog/wasserstein/
• Arrange the transport plan and the distances as matrices; the transport cost is the sum of all the element-wise products, i.e. the Frobenius inner product:

EMD(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \sum_{x,y} \|x - y\| \, \gamma(x, y) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \langle D, \Gamma \rangle_F

\Gamma = \begin{bmatrix} \gamma(x_1, y_1) & \gamma(x_1, y_2) & \cdots & \gamma(x_1, y_n) \\ \gamma(x_2, y_1) & \gamma(x_2, y_2) & \cdots & \gamma(x_2, y_n) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma(x_n, y_1) & \gamma(x_n, y_2) & \cdots & \gamma(x_n, y_n) \end{bmatrix}, \quad
D = \begin{bmatrix} \|x_1 - y_1\| & \|x_1 - y_2\| & \cdots & \|x_1 - y_n\| \\ \|x_2 - y_1\| & \|x_2 - y_2\| & \cdots & \|x_2 - y_n\| \\ \vdots & \vdots & \ddots & \vdots \\ \|x_n - y_1\| & \|x_n - y_2\| & \cdots & \|x_n - y_n\| \end{bmatrix}

• The row sums of \Gamma are P_r(x) and the column sums are P_\theta(y)
Linear Programming
• A linear program in standard form:
  • Objective function: minimize z = c^T x
  • Constraint: Ax = b, x \ge 0
• The EMD is such a program:
  • Objective function: EMD(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \langle D, \Gamma \rangle_F
  • Constraint: \sum_y \gamma(x, y) = P_r(x), \quad \sum_x \gamma(x, y) = P_\theta(y), \quad \forall x, y \; \gamma(x, y) \ge 0
Linear Programming
• Objective function: z = c^T x with c = vec(D) and x = vec(\Gamma):

c = vec(D) = \begin{bmatrix} \|x_1 - y_1\| \\ \|x_1 - y_2\| \\ \vdots \\ \|x_2 - y_1\| \\ \|x_2 - y_2\| \\ \vdots \\ \|x_n - y_1\| \\ \|x_n - y_2\| \\ \vdots \end{bmatrix}, \quad
x = vec(\Gamma) = \begin{bmatrix} \gamma(x_1, y_1) \\ \gamma(x_1, y_2) \\ \vdots \\ \gamma(x_2, y_1) \\ \gamma(x_2, y_2) \\ \vdots \\ \gamma(x_n, y_1) \\ \gamma(x_n, y_2) \\ \vdots \end{bmatrix}

EMD(P_r, P_\theta) = \inf_{\gamma \in \Pi} \langle D, \Gamma \rangle_F
Linear Programming
• Constraint: Ax = b encodes the marginal constraints \sum_y \gamma(x, y) = P_r(x) and \sum_x \gamma(x, y) = P_\theta(y); the first n rows of A sum each row of \Gamma, the next n rows sum each column:

A \, vec(\Gamma) = \begin{bmatrix} 1 & 1 & \cdots & 0 & 0 & \cdots & 0 & 0 & \cdots \\ 0 & 0 & \cdots & 1 & 1 & \cdots & 0 & 0 & \cdots \\ \vdots & & & & & & & & \vdots \\ 0 & 0 & \cdots & 0 & 0 & \cdots & 1 & 1 & \cdots \\ 1 & 0 & \cdots & 1 & 0 & \cdots & 1 & 0 & \cdots \\ 0 & 1 & \cdots & 0 & 1 & \cdots & 0 & 1 & \cdots \\ \vdots & & \ddots & & & \ddots & & & \vdots \end{bmatrix} \begin{bmatrix} \gamma(x_1, y_1) \\ \gamma(x_1, y_2) \\ \vdots \\ \gamma(x_2, y_1) \\ \gamma(x_2, y_2) \\ \vdots \\ \gamma(x_n, y_1) \\ \gamma(x_n, y_2) \\ \vdots \end{bmatrix} = \begin{bmatrix} P_r(x_1) \\ P_r(x_2) \\ \vdots \\ P_r(x_n) \\ P_\theta(y_1) \\ P_\theta(y_2) \\ \vdots \\ P_\theta(y_n) \end{bmatrix} = b
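The LP data above can be assembled mechanically; a sketch assuming 1-D support points (the function `transport_lp` and its signature are hypothetical, not from the slides):

```python
import numpy as np

def transport_lp(points_x, points_y, p_r, p_theta):
    """Build (A, b, c) of the transport LP: minimize c^T vec(Gamma)
    subject to A vec(Gamma) = b, vec(Gamma) >= 0."""
    n, m = len(points_x), len(points_y)
    # c = vec(D): pairwise distances, row-major in x.
    c = np.array([abs(xi - yj) for xi in points_x for yj in points_y])
    A = np.zeros((n + m, n * m))
    for i in range(n):
        A[i, i * m:(i + 1) * m] = 1.0   # sum_y gamma(x_i, y) = P_r(x_i)
    for j in range(m):
        A[n + j, j::m] = 1.0            # sum_x gamma(x, y_j) = P_theta(y_j)
    b = np.concatenate([p_r, p_theta])
    return A, b, c
```

For P_r = [1, 0] and P_\theta = [0, 1] on points {0, 1}, the only feasible plan is vec(\Gamma) = [0, 1, 0, 0], which satisfies A vec(\Gamma) = b with cost c^T vec(\Gamma) = 1.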
Dual Form
• Primal Problem:
  • minimize: z = c^T x
  • constraint: Ax = b, x \ge 0
• Dual Problem:
  • maximize: \tilde{z} = b^T y
  • constraint: A^T y \le c
• Weak Duality: z \ge \tilde{z}, since z = c^T x \ge y^T A x = y^T b = \tilde{z}; any dual-feasible \tilde{z} is a lower bound of z
• Strong Duality: at the optimum, z = \tilde{z}
Dual Form
• Write the dual variable as y = [f; g]; the objective \tilde{z} = b^T y becomes:

y = \begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \\ g(x_1) \\ g(x_2) \\ \vdots \\ g(x_n) \end{bmatrix}, \quad
b = \begin{bmatrix} P_r(x_1) \\ P_r(x_2) \\ \vdots \\ P_r(x_n) \\ P_\theta(x_1) \\ P_\theta(x_2) \\ \vdots \\ P_\theta(x_n) \end{bmatrix} \;\Rightarrow\; EMD(P_r, P_\theta) = \tilde{z} = b^T y = f^T P_r + g^T P_\theta
Dual Form
• constraint: A^T y \le c with c = vec(D); each row of A^T picks out one f(x_i) and one g(x_j):

A^T y = \begin{bmatrix} f(x_1) + g(x_1) \\ f(x_1) + g(x_2) \\ \vdots \\ f(x_2) + g(x_1) \\ f(x_2) + g(x_2) \\ \vdots \\ f(x_n) + g(x_1) \\ f(x_n) + g(x_2) \\ \vdots \end{bmatrix} \le \begin{bmatrix} \|x_1 - x_1\| \\ \|x_1 - x_2\| \\ \vdots \\ \|x_2 - x_1\| \\ \|x_2 - x_2\| \\ \vdots \\ \|x_n - x_1\| \\ \|x_n - x_2\| \\ \vdots \end{bmatrix} = c \;\Rightarrow\; f(x_i) + g(x_j) \le \|x_i - x_j\| \;\; \forall i, j
Dual Form
• constraint: f(x_i) + g(x_j) \le \|x_i - x_j\| \;\forall i, j
• if i = j: f(x_i) + g(x_i) \le \|x_i - x_i\| = 0
• Since we maximize EMD(P_r, P_\theta) = f^T P_r + g^T P_\theta, the constraint is tight at i = j, so:

f(x_i) = -g(x_i)
Dual Form
• Substituting f(x_i) = -g(x_i) into the constraint f(x_i) + g(x_j) \le \|x_i - x_j\| \;\forall i, j gives:

\begin{cases} f(x_i) - f(x_j) \le \|x_i - x_j\| \\ f(x_i) - f(x_j) \ge -\|x_i - x_j\| \end{cases} \;\Rightarrow\; -1 \le \frac{f(x_i) - f(x_j)}{\|x_i - x_j\|} \le 1 \;\Rightarrow\; \|f\|_L \le 1

• 1-Lipschitz Constraint: the slope of f should be between -1 and 1
• Therefore:

EMD(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r} f(x) - \mathbb{E}_{x \sim P_\theta} f(x)
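The duality can be verified numerically on a two-point support {0, 1} (a hand-built illustration, not from the slides):

```python
# All real mass at point 0, all generated mass at point 1.
p_r = [1.0, 0.0]
p_theta = [0.0, 1.0]

# Primal: the only feasible transport plan moves mass 1 over distance 1.
primal_emd = 1.0

# Dual: maximize E_{P_r} f - E_{P_theta} f over 1-Lipschitz f.
# The objective is shift-invariant, so fix f(1) = 0; the constraint
# |f(0) - f(1)| <= |0 - 1| then means f(0) in [-1, 1].
dual_emd = max(
    f0 * p_r[0] - f0 * p_theta[0]   # = E_{P_r} f - E_{P_theta} f
    for f0 in [-1.0 + 0.01 * k for k in range(201)]
)
```

The grid search attains the supremum at f(0) = 1, f(1) = 0, and the dual value matches the primal EMD.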
Outlines
• Wasserstein GANs
• Derivation of Kantorovich-Rubinstein Duality
• Improved Training of WGANs
  • Difficulties with weight constraints
  • Gradient penalty
• Experiments
Difficulties with weight constraints
• Capacity underuse
  • Weights attain their maximum or minimum values
  • The critic can only learn simple functions
• Exploding and vanishing gradients
  • Clipping parameter is too large -> exploding gradient
  • Clipping parameter is too small -> vanishing gradient
Gradient penalty
• The optimal critic has gradients with norm 1 almost everywhere under P_r and P_g
• For x \sim P_r, y \sim P_g, and interpolates x_t = (1 - t)x + ty:

\nabla f^*(x_t) = \frac{y - x_t}{\|y - x_t\|} \;\Rightarrow\; \|\nabla f^*(x_t)\| = 1

• New critic loss = original critic loss + gradient penalty:

L = \mathbb{E}_{\tilde{x} \sim P_g}[f(\tilde{x})] - \mathbb{E}_{x \sim P_r}[f(x)] + \lambda \, \mathbb{E}_{x_t \sim P_t}[(\|\nabla_{x_t} f(x_t)\| - 1)^2]
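A framework-free numeric sketch of the penalty term, using an analytically known critic gradient in place of autograd (the helper is hypothetical; a real implementation would use e.g. torch.autograd.grad, and the paper's setting λ = 10 is used as the default):

```python
import numpy as np

def gradient_penalty(critic_grad, x_real, x_fake, lam=10.0, seed=0):
    """lam * E[(||grad f(x_t)|| - 1)^2] over random interpolates
    x_t = (1 - t) * x_real + t * x_fake."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(size=(len(x_real), 1))
    x_t = (1.0 - t) * x_real + t * x_fake
    grads = critic_grad(x_t)                 # gradient of f at each x_t
    norms = np.linalg.norm(grads, axis=1)
    return lam * np.mean((norms - 1.0) ** 2)
```

For a linear critic f(x) = w \cdot x the gradient is the constant w, so the penalty is zero exactly when \|w\| = 1.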
Outlines
• Wasserstein GANs
• Derivation of Kantorovich-Rubinstein Duality
• Improved Training of WGANs
• Experiments
  • Architecture robustness on LSUN bedrooms
  • Character-level language modeling
Architecture robustness on LSUN bedrooms
Character-level language modeling
Reference
• Towards Principled Methods for Training Generative Adversarial Networks
  • https://arxiv.org/abs/1701.04862
• Wasserstein GAN
  • https://arxiv.org/abs/1701.07875
• Wasserstein GAN and the Kantorovich-Rubinstein Duality
  • https://vincentherrmann.github.io/blog/wasserstein/
• Improved Training of Wasserstein GANs
  • https://arxiv.org/abs/1704.00028
About the Speaker
Mark Chang
• HTC Research & Healthcare, Deep Learning Algorithms Research Engineer
• Email: ckmarkoh at gmail dot com
• Blog: https://ckmarkoh.github.io/
• Github: https://github.com/ckmarkoh
• Slideshare: http://www.slideshare.net/ckmarkohchang
• Youtube: https://www.youtube.com/channel/UCckNPGDL21aznRhl3EijRQw