Semantic Image Synthesis with Spatially-Adaptive …kucg.korea.ac.kr/new/seminar/2019/ppt/ppt-2019-05-16.pdf2019/05/16 · Related Work - Conditional Image Synthesis Text to Image

Semantic Image Synthesis with Spatially-Adaptive Normalization

Byeong-Sun Hong

2019-05-16

Computer Graphics @ Korea University

Copyright of figures and other materials in the paper belongs to original authors.

Taesung Park(UC Berkeley) et al. / CVPR 2019


Title

• Semantic Image Synthesis

Semantic Segmentation Mask Image -> Photorealistic Image

• Spatially-Adaptive Normalization

SPatially Adaptive (DE)normalization

• SPADE


Index

• Introduction

• Related Work

• Model

• Semantic Image Synthesis

• Experiments

• Conclusion

Introduction


Introduction

Champion Scene


Introduction

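• Neural-network-based methods for generating photorealistic images have been appearing recently.

• Previous methods give poor results because of the "wash away" effect.

• Contribution

Applies spatially-adaptive normalization to eliminate the wash-away effect and obtains better results than previous methods.

The output can be controlled through both the segmentation mask and the style.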

Related Work


• “Generative Adversarial Nets”

[Ian J.Goodfellow(Google) et al. / NIPS 2014]

Related Work

Deep Generative Model


Related Work

Conditional Image Synthesis

• labels

• Text to Image

• Image to Image

• Segmentation Mask to Image


Related Work - Conditional Image Synthesis

labels

• “Conditional image synthesis with auxiliary classifier GANs”

[Augustus Odena(Google Brain) et al. / ICML 2017]

• “cGANs with projection discriminator”

[Takeru Miyato(Preferred Networks, Inc.) and Masanori Koyama / ICLR 2018]



Text to Image

• “Generative adversarial text to image synthesis.”

[Scott Reed(University of Michigan) et al. / ICML 2016]

• “Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks”

[Han Zhang(Rutgers Univ.) et al. / ICCV 2017]



Image to Image

• “Image to image translation with conditional adversarial networks”

[Phillip Isola(Berkeley AI Research) et al. / CVPR 2017]

• “Multimodal unsupervised image-to-image translation”

[Xun Huang(Cornell Univ.) et al. / ECCV 2018]



Segmentation Mask to Image(1/3)

• “Photographic Image Synthesis with Cascaded Refinement Networks”

[Qifeng Chen and Vladlen Koltun(Intel Labs) / ICCV 2017]




• “Semi-parametric image synthesis”

[Xiaojuan Qi (CUHK) et al. / CVPR 2018]

SIMS is not feasible for datasets with many classes.

• It requires too much computation.




• “High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs”

[Ting-Chun Wang(NVIDIA Corp.) et al. / CVPR 2018]


Related Work

Normalization

• Unconditional Normalization Layers

Batch Normalization

Layer Normalization

Instance Normalization

Group Normalization

• Conditional Normalization Layers

Conditional Batch Normalization

Adaptive Instance Normalization

The methods differ in how 𝛾 and 𝛽 are learned.


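• Even when the input distribution is normalized to mean 0 and variance 1, the distribution of the inputs to each layer shifts as the network gets deeper, which makes training unstable.

Related Work - Normalization

Covariate Shift

Source: Google Images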



• Vanishing / exploding gradient problems arise.


Problems with DNNs



• Remedies used before normalization

Changing the activation function (ReLU, etc.)

Careful initialization

Small learning rate

• Instead of such indirect remedies, solve the problem by stabilizing the training process itself as a whole.

Normalization methods


Various remedies



• “Batch Normalization : Accelerating deep network training by reducing internal covariate shift”

[Sergey Ioffe and Christian Szegedy(Google Inc.) / ICML 2015]

Related Work - Unconditional Normalization Layers

Batch Normalization(1/2)



Batch Normalization(2/2)

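Why the 𝛾 and 𝛽 values (scale and shift factors) exist: once normalization forces the activations to mean 0 and variance 1, the nonlinearity of the activation function can be lost. The learnable scale and shift let the network restore it.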
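As a rough illustration of the normalize-then-scale-and-shift step described above, here is a minimal NumPy sketch of a training-mode batch normalization forward pass. The function name `batch_norm`, the `(N, C, H, W)` layout, and the `eps` value are assumptions made for this example; running statistics for inference are omitted.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch-normalize x of shape (N, C, H, W) per channel, then scale and shift."""
    # Statistics are taken over the batch and spatial axes: one value per channel.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # gamma and beta are learnable per-channel parameters of shape (C,).
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 3, 8, 8))

# With gamma = 1 and beta = 0 the output has (approximately) zero mean, unit variance.
y = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))

# Other gamma / beta values move the output to a different mean and variance.
y_shifted = batch_norm(x, gamma=2.0 * np.ones(3), beta=np.ones(3))
```

With `gamma = 2` and `beta = 1`, each channel of `y_shifted` has mean close to 1 and variance close to 4, which is how the layer can undo the strict zero-mean, unit-variance constraint when that helps.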


• “Layer Normalization”

[Jimmy Lei Ba et al.(Univ. of Toronto) / NIPS 2016]

• “Instance Normalization: The Missing Ingredient for Fast Stylization”

[Dmitry Ulyanov et al.(Computer Vision Group Skoltech) / arXiv 2016]

• “Group Normalization”

[Yuxin Wu and Kaiming He(Facebook AI Research)/ ECCV 2018]


Layer, Instance, Group Normalization

N = number of samples in the batch

C = channels

(H, W) = height x width



Unconditional Normalization

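R = 255, G = 0, B = 0

Batch Norm = 2 images, 8 pixels in total; R, G, and B are each normalized separately

Layer Norm = 1 image, 4 pixels in total; R, G, and B are all normalized together

Instance Norm = 1 image, 4 pixels in total; R, G, and B are each normalized separately

Batch Size = 2

N = number of groups after dividing by the batch size = 4 / 2 = 2

C = number of RGB channels = 3

H, W = number of pixels = 2 x 2 = 4

Mini Batch

R = 0, G = 0, B = 0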
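The toy mini-batch above (two 2 x 2 RGB images, one pure red and one black) can be reproduced in a few lines of NumPy to see which axes each method averages over. This is only a sketch: the helper `normalize`, the `(N, C, H, W)` layout, and `eps` are assumptions for illustration, and the learnable scale and shift are left out.

```python
import numpy as np

# Toy mini-batch in (N, C, H, W) layout: N = 2 images, C = 3 (R, G, B), 2 x 2 pixels.
batch = np.zeros((2, 3, 2, 2))
batch[0, 0] = 255.0  # image 0 is pure red; image 1 stays black

def normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the standard deviation over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

bn = normalize(batch, axes=(0, 2, 3))   # per channel, over both images (8 values each)
ln = normalize(batch, axes=(1, 2, 3))   # per image, over all channels (12 values each)
inn = normalize(batch, axes=(2, 3))     # per image and per channel (4 values each)
```

Batch norm maps the red channel to about +1 for the red image and -1 for the black one, layer norm maps the red pixels of image 0 to about 1.414, and instance norm returns all zeros because every single channel of every image is constant.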


Related Work

Conditional Normalization Layers

• “A learned representation for artistic style”

[V. Dumoulin(Google Brain) et al. / ICLR 2017]

• “Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization”

[Xun Huang and Serge Belongie(Cornell Univ.) / ICCV 2017]

𝑥 = content input

𝑦 = style input

Introduces Conditional Batch Normalization

Introduces Adaptive Instance Normalization
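For reference, Adaptive Instance Normalization aligns the per-channel mean and standard deviation of the content features 𝑥 with those of the style features 𝑦: AdaIN(𝑥, 𝑦) = σ(𝑦) (𝑥 − μ(𝑥)) / σ(𝑥) + μ(𝑦). A minimal NumPy sketch follows; the function name `adain`, the `(N, C, H, W)` layout, `eps`, and the random stand-in features are assumptions for illustration.

```python
import numpy as np

def adain(x, y, eps=1e-5):
    """x: content features, y: style features, both of shape (N, C, H, W)."""
    # Statistics are computed per sample and per channel over the spatial axes.
    mu_x = x.mean(axis=(2, 3), keepdims=True)
    sigma_x = x.std(axis=(2, 3), keepdims=True)
    mu_y = y.mean(axis=(2, 3), keepdims=True)
    sigma_y = y.std(axis=(2, 3), keepdims=True)
    # Normalize the content features, then re-scale and shift with the style statistics.
    return sigma_y * (x - mu_x) / (sigma_x + eps) + mu_y

rng = np.random.default_rng(0)
content = rng.normal(0.0, 1.0, size=(1, 4, 16, 16))  # stand-in for encoder features
style = rng.normal(5.0, 3.0, size=(1, 4, 16, 16))
out = adain(content, style)
```

Unlike batch normalization, there are no learned 𝛾 and 𝛽 here: the style input supplies them, but only as one scalar pair per channel, not per pixel.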

Model


Model

Total Architecture


Model

Generator, Discriminator, Encoder



Model

SPADE ResBlk, SPADE

Semantic Image Synthesis



SPADE

m = semantic segmentation mask

𝐻, 𝑊, 𝐶 = height, width, channels

𝑁 = total data size / number of batches (i.e., the number of samples per batch)

ℎ𝑖 = activation map output by the 𝑖-th layer

𝛾, 𝛽 = scaling and shift factor maps
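Using the symbols above, SPADE normalizes ℎ𝑖 per channel over the batch and spatial axes and then modulates it with 𝛾 and 𝛽 maps that depend on the mask m at every pixel. The NumPy sketch below is a simplification: in the paper the maps come from a small convolutional network applied to the mask, whereas here a hypothetical per-pixel linear map (the `(K, C)` matrices `w_gamma` and `w_beta`, equivalent to a 1 x 1 convolution) stands in for it. All names, shapes, and `eps` are assumptions.

```python
import numpy as np

def spade(h, mask_onehot, w_gamma, w_beta, eps=1e-5):
    """
    h:            activations, shape (N, C, H, W)
    mask_onehot:  one-hot segmentation mask, shape (N, K, H, W), same H x W as h
    w_gamma/beta: hypothetical (K, C) weights of a per-pixel linear map on the mask
    """
    # Parameter-free normalization: one mean / std per channel over N, H, W.
    mu = h.mean(axis=(0, 2, 3), keepdims=True)
    sigma = h.std(axis=(0, 2, 3), keepdims=True)
    h_hat = (h - mu) / (sigma + eps)
    # gamma and beta are full (N, C, H, W) maps, so they vary with the label per pixel.
    gamma = np.einsum("nkhw,kc->nchw", mask_onehot, w_gamma)
    beta = np.einsum("nkhw,kc->nchw", mask_onehot, w_beta)
    return gamma * h_hat + beta

rng = np.random.default_rng(0)
N, C, K, H, W = 2, 4, 3, 8, 8
labels = rng.integers(0, K, size=(N, H, W))             # random label map, K classes
mask_onehot = np.eye(K)[labels].transpose(0, 3, 1, 2)   # (N, K, H, W)
h = rng.normal(size=(N, C, H, W))
w_gamma = rng.normal(size=(K, C))
w_beta = rng.normal(size=(K, C))
out = spade(h, mask_onehot, w_gamma, w_beta)
```

The key difference from the conditional normalization layers earlier in the deck is that `gamma` and `beta` are tensors with spatial extent, not one value per channel.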



SPADE Generator

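• With SPADE, the segmentation mask no longer needs to be fed in at the start of the generator.

Each SPADE ResBlk receives the mask information directly.

Since the encoder part is no longer needed, the number of parameters drops and computation time is reduced.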



Why does SPADE work better?

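• Information tends to be washed away when it passes through normalization.

If the input consists of a single uniform mask, normalization turns it into all zeros.

• SPADE does not normalize the segmentation mask itself, so the information in the data is preserved.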
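The wash-away argument can be checked numerically. In this sketch two uniform inputs (say "all sky" and "all grass", labels chosen only for illustration) both collapse to zeros under instance normalization, while a SPADE-style modulation keeps them apart. The per-label values in `gamma_of` and `beta_of` are hypothetical stand-ins for the learned 𝛾 and 𝛽 maps.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """x: (C, H, W); normalize each channel over its spatial positions."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    return (x - mu) / (sigma + eps)

# Two uniform inputs: every pixel carries the same label value.
sky = np.full((1, 4, 4), 1.0)
grass = np.full((1, 4, 4), 2.0)

washed_sky = instance_norm(sky)
washed_grass = instance_norm(grass)  # both are all zeros: the label is gone

# SPADE-style: the mask is not normalized; it modulates the normalized activations.
gamma_of = {1: 0.5, 2: 1.5}   # hypothetical per-label scale
beta_of = {1: -1.0, 2: 1.0}   # hypothetical per-label shift
spade_sky = gamma_of[1] * washed_sky + beta_of[1]
spade_grass = gamma_of[2] * washed_grass + beta_of[2]
```

After plain normalization the two inputs are indistinguishable, whereas the modulated outputs are -1 everywhere for sky and +1 everywhere for grass.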



Multi Modal Synthesis

• A random vector is used as the input to the generator.

• Slightly modifying the random vector produces a different result.

Experiments


Experiments

Implementation Details

• Training time

Depends on the dataset; roughly 100 to 200 epochs

• Learning rate

Generator = 0.0001, Discriminator = 0.0004

• Hardware

NVIDIA DGX1

Source: NVIDIA website


Experiments

Datasets

• COCO-Stuff: 118,000 training images, 5,000 validation images, 182 classes

• ADE20K: 20,210 training images, 2,000 validation images, 150 classes

• ADE20K-outdoor: subset of the ADE20K dataset that only contains outdoor scenes

• Cityscapes: street scene images in German cities. 3,000 training images, 500 validation images

• Flickr Landscapes: landscape photos collected from Flickr, with segmentation masks generated by DeepLab v2. 40,000 training images, 1,000 validation images, 182 classes


Experiments

Performance Metrics

• Evaluation method

Run a well-trained semantic segmentation model on the generated image, then compare the predicted segmentation with the ground truth.

Segmentation Model

• COCO-Stuff : DeepLabV2

• ADE20K : UperNet 101

• Cityscapes : DRN-D-105

• Metrics

mIoU - Mean Intersection-over-Union

accu – Pixel Accuracy

FID – Fréchet Inception Distance

• Baselines – CRN, SIMS, pix2pixHD

For a fair comparison, CRN and pix2pixHD were provided directly by their authors.
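The two segmentation-based metrics listed above (mIoU and pixel accuracy) can be sketched as follows, given a predicted label map and a ground-truth label map. The function names and the tiny 3 x 3 label maps are made up for illustration. Benchmarks usually accumulate a confusion matrix over the whole validation set; this example scores a single image.

```python
import numpy as np

def pixel_accuracy(pred, gt):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return float((pred == gt).mean())

def mean_iou(pred, gt, num_classes):
    """Average intersection-over-union across the classes that appear."""
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = pred == c, gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:  # class absent from both maps: skip it
            continue
        ious.append(np.logical_and(pred_c, gt_c).sum() / union)
    return float(np.mean(ious))

# Hypothetical 3 x 3 label maps with 3 classes.
gt = np.array([[0, 0, 1],
               [0, 1, 1],
               [2, 2, 2]])
pred = np.array([[0, 0, 1],
                 [0, 0, 1],
                 [2, 2, 1]])

acc = pixel_accuracy(pred, gt)            # 7 of 9 pixels agree
miou = mean_iou(pred, gt, num_classes=3)  # per-class IoUs: 3/4, 2/4, 2/3
```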


Experiments

Quantitative Comparisons

• mIoU - Mean Intersection-over-Union

• accu – Pixel Accuracy

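• FID – Fréchet Inception Distance

$$\mathrm{FID} = \lVert m - m_g \rVert_2^2 + \mathrm{Tr}\left(C + C_g - 2\,(C\,C_g)^{1/2}\right)$$

$m$ = mean of the real data distribution

$C$ = covariance of the real data distribution

$m_g$ = mean of the generated data distribution

$C_g$ = covariance of the generated data distribution

$\mathrm{Tr}$ = trace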
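A minimal NumPy sketch of the Fréchet distance, using the symbols defined above (m and C for the real data, m_g and C_g for the generated data). In practice the feature vectors are Inception network activations; here random vectors stand in for them, and the function name `frechet_distance` is an assumption. The trace of the matrix square root is computed from the eigenvalues of C C_g, which avoids a dedicated matrix square-root routine.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """feats_*: arrays of shape (num_samples, feature_dim)."""
    m, m_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    C = np.cov(feats_real, rowvar=False)
    C_g = np.cov(feats_gen, rowvar=False)
    # Tr((C C_g)^{1/2}) is the sum of the square roots of the eigenvalues of C C_g,
    # which are real and non-negative for covariance matrices (up to round-off).
    eigvals = np.linalg.eigvals(C @ C_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((m - m_g) ** 2) + np.trace(C) + np.trace(C_g) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))   # stand-in features, not real Inception activations

same = frechet_distance(real, real)           # identical sets: distance near 0
shifted = frechet_distance(real, real + 3.0)  # same covariance, mean moved by 3 per dim
```

Shifting every feature by 3 in each of the 4 dimensions leaves the covariance term at zero and gives a squared mean difference of 4 x 9 = 36. Lower values mean the generated distribution is closer to the real one.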



Experiments

Qualitative Results(1/3)


Experiments


• Results of training on the Flickr Landscapes dataset


Experiments


• Results of training on the COCO dataset


Experiments

Human Evaluation

• Amazon Mechanical Turk(AMT)

A survey (crowdsourcing) site provided by Amazon

• Two results and the segmentation mask are shown, and answers are collected from 5 people.

• 500 results are presented for the comparison.

Source: AMT


Experiments

The Effectiveness of SPADE

• ++ - the strongest possible baseline, with every available improvement except SPADE

• Concat - the segmentation mask is concatenated to the input channel-wise

• Compact - the number of filters is reduced


Experiments

Variations of SPADE Generator

• Generator performance is measured under several variations

Input to the Generator, Kernel Size, Number of Filters, Normalization Method


Experiments

Multi-Modal Synthesis

• The same segmentation mask can produce different appearances when given different random noise inputs.


Experiments

Final Results

• Both the style and the segmentation map can be edited.

Conclusion


Conclusion

• Using SPADE yields good results on a wide variety of images.

• The result can be changed through the segmentation mask and the style.


Conclusion

Demo Video - GauGAN(1/3)

• The GauGAN software is named after Gauguin.




Q & A


Backpropagation


Residual Block


Experiments

Segmentation Model

• DeepLabV2

“DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs”

• [Liang-Chieh Chen(Google Inc.) et al. / TPAMI 2018]



• UperNet101

“Unified Perceptual Parsing for Scene Understanding”

• [Tete Xiao(Peking University) et al. / ECCV 2018]



• DRN-D-105

“Dilated Residual Networks”

• [Fisher Yu(Princeton University) et al. / CVPR 2017]


Additional Results


