Loss Function
\[ L_{total}(c, s, x) = \alpha L_{content}(c, x) + \beta L_{style}(s, x) \]
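Content loss at layer l:
\[ L_{content}(c, x, l) = \frac{1}{2} \sum_{i,j,k} \left( F^{l}_{i,j,k} - C^{l}_{i,j,k} \right)^{2} \]
C: target content, output from a VGG mid-layer
F: white noise, output from the same layer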
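The content loss can be sketched in NumPy as follows. This is a minimal illustration, not the slides' implementation: the arrays `F` and `C` are random stand-ins for VGG activations (no network is run), the function name `content_loss` is hypothetical, and the (n_H, n_W, n_C) axis order is an assumption.

```python
import numpy as np


def content_loss(F, C):
    """Half the sum of squared differences between two feature maps.

    F, C: arrays of shape (n_H, n_W, n_C), the layer-l activations of the
    white-noise image and of the content image (assumed axis order).
    """
    F = np.asarray(F, dtype=float)
    C = np.asarray(C, dtype=float)
    if F.shape != C.shape:
        raise ValueError("feature maps must have the same shape")
    return 0.5 * np.sum((F - C) ** 2)


# Random stand-ins for VGG activations, using the sizes shown on the slide.
rng = np.random.default_rng(0)
n_H, n_W, n_C = 4, 4, 5
C = rng.normal(size=(n_H, n_W, n_C))  # target content features
F = rng.normal(size=(n_H, n_W, n_C))  # white-noise image features

print(content_loss(F, C))  # positive: the two feature maps differ
print(content_loss(C, C))  # 0.0: identical feature maps
```

The loss compares activations position by position, so it is zero only when the two feature maps match everywhere.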
[Figure: feature maps of the content image and of the white-noise image, each with \(n_C = 5\), \(n_H = 4\), \(n_W = 4\). Minimize the difference between target content and white noise.]
Style loss at layer l:
\[ L_{style}(s, x, l) = \frac{1}{(2 n_W n_H n_C)^{2}} \sum_{k,k'} \left( G^{l}_{k,k'} - S^{l}_{k,k'} \right)^{2} \]
G: white noise
S: target style, output from a VGG CNN layer
\[ G^{l}_{k,k'} = \sum_{i,j} a^{l}_{i,j,k} \, a^{l}_{i,j,k'} \]
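The Gram matrix definition maps directly onto a single `einsum` call. The sketch below assumes activations stored as an (n_H, n_W, n_C) array and uses random numbers in place of real VGG outputs; `gram_matrix` is an illustrative name.

```python
import numpy as np


def gram_matrix(a):
    """G[k, k'] = sum over i, j of a[i, j, k] * a[i, j, k'].

    a: activations of shape (n_H, n_W, n_C) (assumed axis order).
    Returns an (n_C, n_C) matrix.
    """
    a = np.asarray(a, dtype=float)
    return np.einsum("ijk,ijm->km", a, a)


# Random stand-in for a style image's activations at one layer.
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 4, 5))  # n_H = 4, n_W = 4, n_C = 5
G = gram_matrix(a)

print(G.shape)  # (5, 5)
```

Flattening the spatial axes gives the same result: with `A = a.reshape(-1, n_C)`, the Gram matrix equals `A.T @ A`.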
[Figure: style image feature maps with \(n_C = 5\), \(n_H = 4\), \(n_W = 4\). The \(n_C \times n_C\) Gram matrix is filled in one channel pair at a time, with example correlations of 0.89, 0.2, and 0.7.]
In the style loss, \(G^{l}\) is the white noise Gram matrix and \(S^{l}\) is the target style Gram matrix, both at layer l.
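Putting the two Gram matrices together, the style loss for one layer might be computed as below. This is a sketch under assumptions: both activation arrays are random stand-ins with the same (n_H, n_W, n_C) shape, and the function names are illustrative rather than taken from the slides.

```python
import numpy as np


def gram_matrix(a):
    """(n_C, n_C) matrix of channel-pair sums over all spatial positions."""
    return np.einsum("ijk,ijm->km", a, a)


def style_loss(a_noise, a_style):
    """Squared Gram-matrix difference, scaled by 1 / (2 n_W n_H n_C)^2.

    Both inputs are layer-l activations of shape (n_H, n_W, n_C);
    equal shapes are assumed here for simplicity.
    """
    a_noise = np.asarray(a_noise, dtype=float)
    a_style = np.asarray(a_style, dtype=float)
    if a_noise.shape != a_style.shape:
        raise ValueError("activations must have the same shape")
    n_H, n_W, n_C = a_noise.shape
    G = gram_matrix(a_noise)  # white noise Gram matrix
    S = gram_matrix(a_style)  # target style Gram matrix
    return np.sum((G - S) ** 2) / (2 * n_W * n_H * n_C) ** 2


rng = np.random.default_rng(0)
a_style = rng.normal(size=(4, 4, 5))
a_noise = rng.normal(size=(4, 4, 5))
print(style_loss(a_noise, a_style))

# Shuffling spatial positions leaves the Gram matrix unchanged (up to rounding).
perm = rng.permutation(16)
a_shuffled = a_style.reshape(-1, 5)[perm].reshape(4, 4, 5)
print(style_loss(a_shuffled, a_style))
```

Because the sum runs over all positions (i, j), the Gram matrix discards where a feature occurs. That is why the shuffled activations give a loss of essentially zero, unlike the position-by-position content loss.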
[Figure: content loss is minimized between target content and white noise; style loss is minimized (Min) between pairs of Gram matrices (G) at five layers.]
Deeper layers for style
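Increasing α/β ratio
\[ L_{total} = \alpha L_{content} + \beta L_{style} \]
[Figure: outputs for α/β from \(10^{-5}\) to \(10^{-2}\).]
[Figure: outputs at epoch 100, epoch 300, and epoch 500.]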
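The effect of the α/β ratio can be seen in a toy problem. The sketch below is not style transfer: it replaces both losses with simple quadratics, 0.5·||x − c||² for content and 0.5·||x − s||² for style, where `c` and `s` are made-up 2-D targets, and runs plain gradient descent from random noise. The function name, step size, and step count are all assumptions chosen for illustration.

```python
import numpy as np


def minimize_total(c, s, alpha, beta, steps=500, lr=0.1, seed=0):
    """Gradient descent on alpha * 0.5*||x - c||^2 + beta * 0.5*||x - s||^2.

    Toy quadratic stand-ins for the content and style losses; x starts
    from random noise, mirroring the white-noise initialisation.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=c.shape)
    for _ in range(steps):
        grad = alpha * (x - c) + beta * (x - s)
        x = x - lr * grad
    return x


c = np.array([1.0, 1.0])    # toy "content" target
s = np.array([-1.0, -1.0])  # toy "style" target

ratios = (1e-5, 1e-4, 1e-3, 1e-2)
results = [minimize_total(c, s, alpha=r, beta=1.0) for r in ratios]
for r, x in zip(ratios, results):
    print(r, np.linalg.norm(x - c))  # distance to the content target
```

In this toy setting the minimiser is (αc + βs) / (α + β), so a larger α/β pulls the result toward the content target and a smaller one toward the style target.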
[Figure: \(n_C \times n_C\) Gram matrix with example entries 0.89, 0.2, 0.7.]
• Diagonal: which channels are most active
• Off-diagonal: which channel pairs co-occur
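The two readings of the Gram matrix can be checked on a small constructed example. The activations below are hypothetical: channels 0 and 1 are built to fire together, while channel 2 is weak and unrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
base = 0.5 + 0.5 * rng.random(size=(4, 4))  # shared activation pattern
weak = 0.1 * rng.random(size=(4, 4))        # weak, unrelated channel

# Stack into (n_H, n_W, n_C) with n_C = 3; channel 1 is a scaled copy of channel 0.
a = np.stack([base, 0.9 * base, weak], axis=-1)

G = np.einsum("ijk,ijm->km", a, a)

# Diagonal: G[k, k] is the summed squared activation of channel k.
energy = np.diag(G)
most_active = int(np.argmax(energy))

# Off-diagonal: G[k, k'] is large when channels k and k' are active together.
off = G - np.diag(energy)
k, kp = np.unravel_index(np.argmax(off), off.shape)

print(most_active)       # 0
print(sorted((k, kp)))   # channels 0 and 1 form the strongest pair
```

Matching Gram matrices therefore matches both how strongly each channel responds and which channels respond together.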