NTHU AI Reading Group: Improved Training of Wasserstein GANs Mark Chang 2017/6/6
Outlines
• Wasserstein GANs
• Derivation of Kantorovich-Rubinstein Duality
• Improved Training of WGANs
• Experiments
Outlines
• Wasserstein GANs
  • Regular GANs
  • Source of Instability
  • Earth Mover's Distance
  • Kantorovich-Rubinstein Duality
  • Wasserstein GANs
  • Weight Clipping
• Derivation of Kantorovich-Rubinstein Duality
• Improved Training of WGANs
• Experiments
Regular GANs
• Generator Network G(z): maps a prior z \sim P_z(z) to generated data
• Discriminator Network D(x): ends with a sigmoid function; distinguishes real data x \sim P_r(x) from generated data
• Objective: \min_G \max_D V(D, G), where

V(D, G) = \mathbb{E}_{x \sim P_r(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]
Source of Instability
• Disjoint distributions: the real data distribution P_r(x) and the generated data distribution P_g(x) have (near-)disjoint supports
• The optimal discriminator D^*(x) then separates them perfectly
• Vanishing gradient: the term \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))] saturates, so the generator receives almost no gradient from

V(D, G) = \mathbb{E}_{x \sim P_r(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]
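The saturation can be seen numerically: with D = sigmoid(a) for logit a, the gradient of log(1 - D) with respect to a is -sigmoid(a), which vanishes as the discriminator grows confident a sample is fake. A toy sketch (the helper names are mine, not from the slides):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def generator_loss_grad(logit):
    """d/da of log(1 - sigmoid(a)) = -sigmoid(a): the gradient the
    generator receives through the discriminator's logit."""
    return -sigmoid(logit)
```

When the discriminator is undecided (logit 0) the gradient magnitude is 0.5, but once it confidently rejects a fake (logit -10) the gradient is nearly zero.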
Earth Mover's Distance
• Cost function of WGAN: Earth Mover's Distance, replacing the regular GAN objective V(D, G)

EMD(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \sum_{x,y} \|x - y\| \, \gamma(x, y) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y) \sim \gamma} \|x - y\|
Earth Mover's Distance
photo credit: https://vincentherrmann.github.io/blog/wasserstein/
• \gamma(x, y) is a transport plan: the probability mass moved from point x of the real data to point y of the generated data, e.g. \gamma(x_1, y_2) is the mass moved from x_1 to y_2
• Marginal constraints:

\sum_y \gamma(x, y) = P_r(x), \quad \sum_x \gamma(x, y) = P_\theta(y)

EMD(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \sum_{x,y} \|x - y\| \, \gamma(x, y) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y) \sim \gamma} \|x - y\|
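The infimum over transport plans has a simple closed form in one dimension: W_1 equals the sum of absolute differences between the two CDFs. A minimal sketch (the helper name and the unit-spaced-support assumption are mine, not from the slides):

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two discrete distributions on the
    same unit-spaced 1-D support: W1 = sum of |CDF differences|."""
    assert len(p) == len(q)
    cum_p = cum_q = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cum_p += pi
        cum_q += qi
        total += abs(cum_p - cum_q)
    return total
```

For example, moving all mass from point 0 to point 2 costs 2, since the mass travels distance 2.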
Kantorovich-Rubinstein Duality
• The primal formula is highly intractable:

EMD(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \sum_{x,y} \|x - y\| \, \gamma(x, y) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y) \sim \gamma} \|x - y\|

• The Kantorovich-Rubinstein duality gives an equivalent form over 1-Lipschitz functions:

EMD(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r} f(x) - \mathbb{E}_{x \sim P_\theta} f(x)
Wasserstein GANs
• Generator Network g_\theta: maps a prior z \sim P_z(z) to generated data g_\theta(z)
• Critic Network f_w: no sigmoid function; scores real data x \sim P_r(x) as f_w(x) and generated data as f_w(g_\theta(z))
• Objective:

\min_\theta \max_{w \in [-c, c]^l} \mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{z \sim P_z}[f_w(g_\theta(z))]

• Restricting w to [-c, c]^l enforces a k-Lipschitz constraint on f_w
Wasserstein GANs
• k-Lipschitz continuity: a real function f(x) is k-Lipschitz continuous if there exists k such that for all x_1, x_2:

|f(x_1) - f(x_2)| \le k |x_1 - x_2|

i.e. \frac{|f(x_1) - f(x_2)|}{|x_1 - x_2|} \le k, with the bound attained by g(x) = kx.
photo credit: https://en.wikipedia.org/wiki/Lipschitz_continuity
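The definition can be probed empirically; a small sketch (hypothetical helper, not from the slides) that estimates a 1-D function's Lipschitz constant by sampling pairwise slopes:

```python
import random

def estimate_lipschitz(f, lo=-10.0, hi=10.0, n_pairs=10000, seed=0):
    """Empirically estimate the Lipschitz constant of f on [lo, hi]
    as the largest sampled slope |f(x1) - f(x2)| / |x1 - x2|."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_pairs):
        x1, x2 = rng.uniform(lo, hi), rng.uniform(lo, hi)
        if x1 != x2:
            best = max(best, abs(f(x1) - f(x2)) / abs(x1 - x2))
    return best
```

For the linear function g(x) = 3x every sampled slope is 3, matching the slide's g(x) = kx example of a function that attains its Lipschitz bound.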
Weight Clipping
• Enforce a k-Lipschitz constraint by weight clipping: w \in [-c, c]^l
• f(x) is a multi-layer neural network, so bounding every weight bounds its Lipschitz constant
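Weight clipping itself is one line per parameter tensor; a minimal framework-free sketch (assuming NumPy arrays for the weights; the WGAN paper's reported clipping constant is c = 0.01):

```python
import numpy as np

def clip_weights(weights, c=0.01):
    """Clamp every entry of every weight array into [-c, c], as done
    after each critic update in WGAN training."""
    return [np.clip(w, -c, c) for w in weights]
```

Entries already inside [-c, c] are left untouched; only out-of-range weights saturate at the boundary.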
Outlines
• Wasserstein GANs
• Derivation of Kantorovich-Rubinstein Duality
  • Earth Mover's Distance
  • Linear Programming
  • Dual Form
• Improved Training of WGANs
• Experiments
Derivation of Kantorovich-Rubinstein Duality
• Wasserstein GAN and the Kantorovich-Rubinstein Duality
  • https://vincentherrmann.github.io/blog/wasserstein/
• Optimal Transportation: Continuous and Discrete
  • http://smat.epfl.ch/~zemel/vt/pdm.pdf
• Optimal Transport: Old and New
  • http://www.springer.com/br/book/9783540710493
Earth Mover's Distance
photo credit: https://vincentherrmann.github.io/blog/wasserstein/
• Arrange the transport plan and the distances as matrices; the transport cost is the sum of all the element-wise products, i.e. the Frobenius inner product:

EMD(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \sum_{x,y} \|x - y\| \, \gamma(x, y) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \langle D, \Gamma \rangle_F

\Gamma = \begin{bmatrix} \gamma(x_1, y_1) & \gamma(x_1, y_2) & \cdots & \gamma(x_1, y_n) \\ \gamma(x_2, y_1) & \gamma(x_2, y_2) & \cdots & \gamma(x_2, y_n) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma(x_n, y_1) & \gamma(x_n, y_2) & \cdots & \gamma(x_n, y_n) \end{bmatrix}, \quad
D = \begin{bmatrix} \|x_1 - y_1\| & \|x_1 - y_2\| & \cdots & \|x_1 - y_n\| \\ \|x_2 - y_1\| & \|x_2 - y_2\| & \cdots & \|x_2 - y_n\| \\ \vdots & \vdots & \ddots & \vdots \\ \|x_n - y_1\| & \|x_n - y_2\| & \cdots & \|x_n - y_n\| \end{bmatrix}

• The row sums of \Gamma are P_r(x) and the column sums are P_\theta(y)
Linear Programming
• A linear program in standard form:
  • Objective function: minimize z = c^T x
  • Constraint: Ax = b, x \ge 0
• The EMD is such a program:
  • Objective function: EMD(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \langle D, \Gamma \rangle_F
  • Constraint: \sum_y \gamma(x, y) = P_r(x), \quad \sum_x \gamma(x, y) = P_\theta(y), \quad \forall x, y \; \gamma(x, y) \ge 0
Linear Programming
• Objective function: z = c^T x with c = vec(D) and x = vec(\Gamma):

c = vec(D) = \begin{bmatrix} \|x_1 - y_1\| \\ \|x_1 - y_2\| \\ \vdots \\ \|x_2 - y_1\| \\ \|x_2 - y_2\| \\ \vdots \\ \|x_n - y_1\| \\ \|x_n - y_2\| \\ \vdots \end{bmatrix}, \quad
x = vec(\Gamma) = \begin{bmatrix} \gamma(x_1, y_1) \\ \gamma(x_1, y_2) \\ \vdots \\ \gamma(x_2, y_1) \\ \gamma(x_2, y_2) \\ \vdots \\ \gamma(x_n, y_1) \\ \gamma(x_n, y_2) \\ \vdots \end{bmatrix}

EMD(P_r, P_\theta) = \inf_{\gamma \in \Pi} \langle D, \Gamma \rangle_F
Linear Programming
• Constraint: Ax = b encodes the marginal constraints \sum_y \gamma(x, y) = P_r(x) and \sum_x \gamma(x, y) = P_\theta(y); the first n rows of A sum each row of \Gamma, the next n rows sum each column:

A \, vec(\Gamma) = \begin{bmatrix} 1 & 1 & \cdots & 0 & 0 & \cdots & 0 & 0 & \cdots \\ 0 & 0 & \cdots & 1 & 1 & \cdots & 0 & 0 & \cdots \\ \vdots & & & & & & & & \vdots \\ 0 & 0 & \cdots & 0 & 0 & \cdots & 1 & 1 & \cdots \\ 1 & 0 & \cdots & 1 & 0 & \cdots & 1 & 0 & \cdots \\ 0 & 1 & \cdots & 0 & 1 & \cdots & 0 & 1 & \cdots \\ \vdots & & \ddots & & & \ddots & & & \vdots \end{bmatrix} \begin{bmatrix} \gamma(x_1, y_1) \\ \gamma(x_1, y_2) \\ \vdots \\ \gamma(x_2, y_1) \\ \gamma(x_2, y_2) \\ \vdots \\ \gamma(x_n, y_1) \\ \gamma(x_n, y_2) \\ \vdots \end{bmatrix} = \begin{bmatrix} P_r(x_1) \\ P_r(x_2) \\ \vdots \\ P_r(x_n) \\ P_\theta(y_1) \\ P_\theta(y_2) \\ \vdots \\ P_\theta(y_n) \end{bmatrix} = b
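The LP data above can be assembled mechanically; a sketch assuming 1-D support points (the function `transport_lp` and its signature are hypothetical, not from the slides):

```python
import numpy as np

def transport_lp(points_x, points_y, p_r, p_theta):
    """Build (A, b, c) of the transport LP: minimize c^T vec(Gamma)
    subject to A vec(Gamma) = b, vec(Gamma) >= 0."""
    n, m = len(points_x), len(points_y)
    # c = vec(D): pairwise distances, row-major in x.
    c = np.array([abs(xi - yj) for xi in points_x for yj in points_y])
    A = np.zeros((n + m, n * m))
    for i in range(n):
        A[i, i * m:(i + 1) * m] = 1.0   # sum_y gamma(x_i, y) = P_r(x_i)
    for j in range(m):
        A[n + j, j::m] = 1.0            # sum_x gamma(x, y_j) = P_theta(y_j)
    b = np.concatenate([p_r, p_theta])
    return A, b, c
```

For P_r = [1, 0] and P_\theta = [0, 1] on points {0, 1}, the only feasible plan is vec(\Gamma) = [0, 1, 0, 0], which satisfies A vec(\Gamma) = b with cost c^T vec(\Gamma) = 1.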
Dual Form
• Primal Problem:
  • minimize: z = c^T x
  • constraint: Ax = b, x \ge 0
• Dual Problem:
  • maximize: \tilde{z} = b^T y
  • constraint: A^T y \le c
• Weak Duality: z \ge \tilde{z}, since z = c^T x \ge y^T A x = y^T b = \tilde{z}; any dual-feasible \tilde{z} is a lower bound of z
• Strong Duality: at the optimum, z = \tilde{z}
Dual Form
• Write the dual variable as y = [f; g]; the objective \tilde{z} = b^T y becomes:

y = \begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \\ g(x_1) \\ g(x_2) \\ \vdots \\ g(x_n) \end{bmatrix}, \quad
b = \begin{bmatrix} P_r(x_1) \\ P_r(x_2) \\ \vdots \\ P_r(x_n) \\ P_\theta(x_1) \\ P_\theta(x_2) \\ \vdots \\ P_\theta(x_n) \end{bmatrix} \;\Rightarrow\; EMD(P_r, P_\theta) = \tilde{z} = b^T y = f^T P_r + g^T P_\theta
Dual Form
• constraint: A^T y \le c with c = vec(D); each row of A^T picks out one f(x_i) and one g(x_j):

A^T y = \begin{bmatrix} f(x_1) + g(x_1) \\ f(x_1) + g(x_2) \\ \vdots \\ f(x_2) + g(x_1) \\ f(x_2) + g(x_2) \\ \vdots \\ f(x_n) + g(x_1) \\ f(x_n) + g(x_2) \\ \vdots \end{bmatrix} \le \begin{bmatrix} \|x_1 - x_1\| \\ \|x_1 - x_2\| \\ \vdots \\ \|x_2 - x_1\| \\ \|x_2 - x_2\| \\ \vdots \\ \|x_n - x_1\| \\ \|x_n - x_2\| \\ \vdots \end{bmatrix} = c \;\Rightarrow\; f(x_i) + g(x_j) \le \|x_i - x_j\| \;\; \forall i, j
Dual Form
• constraint: f(x_i) + g(x_j) \le \|x_i - x_j\| \;\forall i, j
• if i = j: f(x_i) + g(x_i) \le \|x_i - x_i\| = 0
• Since we maximize EMD(P_r, P_\theta) = f^T P_r + g^T P_\theta, the constraint is tight at i = j, so:

f(x_i) = -g(x_i)
Dual Form
• Substituting f(x_i) = -g(x_i) into the constraint f(x_i) + g(x_j) \le \|x_i - x_j\| \;\forall i, j gives:

\begin{cases} f(x_i) - f(x_j) \le \|x_i - x_j\| \\ f(x_i) - f(x_j) \ge -\|x_i - x_j\| \end{cases} \;\Rightarrow\; -1 \le \frac{f(x_i) - f(x_j)}{\|x_i - x_j\|} \le 1 \;\Rightarrow\; \|f\|_L \le 1

• 1-Lipschitz Constraint: the slope of f should be between -1 and 1
• Therefore:

EMD(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r} f(x) - \mathbb{E}_{x \sim P_\theta} f(x)
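The duality can be verified numerically on a two-point support {0, 1} (a hand-built illustration, not from the slides):

```python
# All real mass at point 0, all generated mass at point 1.
p_r = [1.0, 0.0]
p_theta = [0.0, 1.0]

# Primal: the only feasible transport plan moves mass 1 over distance 1.
primal_emd = 1.0

# Dual: maximize E_{P_r} f - E_{P_theta} f over 1-Lipschitz f.
# The objective is shift-invariant, so fix f(1) = 0; the constraint
# |f(0) - f(1)| <= |0 - 1| then means f(0) in [-1, 1].
dual_emd = max(
    f0 * p_r[0] - f0 * p_theta[0]   # = E_{P_r} f - E_{P_theta} f
    for f0 in [-1.0 + 0.01 * k for k in range(201)]
)
```

The grid search attains the supremum at f(0) = 1, f(1) = 0, and the dual value matches the primal EMD.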
Outlines
• Wasserstein GANs
• Derivation of Kantorovich-Rubinstein Duality
• Improved Training of WGANs
  • Difficulties with weight constraints
  • Gradient penalty
• Experiments
Difficulties with weight constraints
• Capacity underuse
  • Weights attain their maximum or minimum values
  • The critic can only learn simple functions
• Exploding and vanishing gradients
  • Clipping parameter is too large -> exploding gradient
  • Clipping parameter is too small -> vanishing gradient
Gradient penalty
• The optimal critic has gradients with norm 1 almost everywhere under P_r and P_g
• For x \sim P_r, y \sim P_g, and interpolates x_t = (1 - t)x + ty:

\nabla f^*(x_t) = \frac{y - x_t}{\|y - x_t\|} \;\Rightarrow\; \|\nabla f^*(x_t)\| = 1

• New critic loss = original critic loss + gradient penalty:

L = \mathbb{E}_{\tilde{x} \sim P_g}[f(\tilde{x})] - \mathbb{E}_{x \sim P_r}[f(x)] + \lambda \, \mathbb{E}_{x_t \sim P_t}[(\|\nabla_{x_t} f(x_t)\| - 1)^2]
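A framework-free numeric sketch of the penalty term, using an analytically known critic gradient in place of autograd (the helper is hypothetical; a real implementation would use e.g. torch.autograd.grad, and the paper's setting λ = 10 is used as the default):

```python
import numpy as np

def gradient_penalty(critic_grad, x_real, x_fake, lam=10.0, seed=0):
    """lam * E[(||grad f(x_t)|| - 1)^2] over random interpolates
    x_t = (1 - t) * x_real + t * x_fake."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(size=(len(x_real), 1))
    x_t = (1.0 - t) * x_real + t * x_fake
    grads = critic_grad(x_t)                 # gradient of f at each x_t
    norms = np.linalg.norm(grads, axis=1)
    return lam * np.mean((norms - 1.0) ** 2)
```

For a linear critic f(x) = w \cdot x the gradient is the constant w, so the penalty is zero exactly when \|w\| = 1.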
Outlines
• Wasserstein GANs
• Derivation of Kantorovich-Rubinstein Duality
• Improved Training of WGANs
• Experiments
  • Architecture robustness on LSUN bedrooms
  • Character-level language modeling
Architecture robustness on LSUN bedrooms
Character-level language modeling
Reference
• Towards Principled Methods for Training Generative Adversarial Networks
  • https://arxiv.org/abs/1701.04862
• Wasserstein GAN
  • https://arxiv.org/abs/1701.07875
• Wasserstein GAN and the Kantorovich-Rubinstein Duality
  • https://vincentherrmann.github.io/blog/wasserstein/
• Improved Training of Wasserstein GANs
  • https://arxiv.org/abs/1704.00028
About the Speaker
Mark Chang
• HTC Research & Healthcare, Deep Learning Algorithms Research Engineer
• Email: ckmarkoh at gmail dot com
• Blog: https://ckmarkoh.github.io/
• Github: https://github.com/ckmarkoh
• Slideshare: http://www.slideshare.net/ckmarkohchang
• Youtube: https://www.youtube.com/channel/UCckNPGDL21aznRhl3EijRQw