SEAN: Image Synthesis with Semantic Region-Adaptive Normalization
Peihao Zhu1 Rameen Abdal1 Yipeng Qin2 Peter Wonka1
1KAUST 2Cardiff University
Figure 1: Face image editing controlled via style images and segmentation masks. (a) Source images. (b) Reconstruction of the source image; the segmentation mask is shown as a small inset. (c-f) Four separate edits; the image that provides the new style information is shown on top, and the part of the segmentation mask that gets edited is shown as a small inset. The results of the successive edits are shown in rows two and three. The four edits change hair, mouth and eyes, skin tone, and background, respectively.
Abstract
We propose semantic region-adaptive normalization (SEAN), a simple but effective building block for Generative Adversarial Networks conditioned on segmentation masks that describe the semantic regions in the desired output image. Using SEAN normalization, we can build a network architecture that can control the style of each semantic region individually, e.g., we can specify one style reference image per region. SEAN is better suited to encode, transfer, and synthesize style than the best previous method in terms of reconstruction quality, variability, and visual quality. We evaluate SEAN on multiple datasets and report better quantitative metrics (e.g., FID, PSNR) than the current state of the art. SEAN also pushes the frontier of interactive image editing. We can interactively edit images by changing segmentation masks or the style for any given region. We can also interpolate styles from two reference images per region. Code: https://github.com/ZPdesu/SEAN
1. Introduction
In this paper we tackle the problem of synthetic image generation using conditional generative adversarial networks (cGANs). Specifically, we would like to control the layout of the generated image using a segmentation mask that has labels for each semantic region, and "add" realistic styles to each region according to its label. For example, a face generation application would use region labels like eyes, hair, nose, mouth, etc., and a landscape painting application would use labels like water, forest, sky, clouds, etc. While multiple very good frameworks exist to tackle this problem [22, 8, 39, 44], the currently best architecture is SPADE [38] (also called GauGAN). We therefore use SPADE as the starting point for our research. By analyzing the SPADE results, we found two shortcomings that we would like to improve upon in our work.

Figure 2: Editing sequence on the ADE20K dataset. (a) source image, (b) reconstruction of the source image, (c-f) various edits using style images shown in the top row. The regions affected by the edits are shown as small insets.
First, SPADE uses only one style code to control the entire style of an image, which is not sufficient for high-quality synthesis or detailed control. For example, it is easily possible that the segmentation mask of the desired output image contains a labeled region that is not present in the segmentation mask of the input style image. In this case, the style of the missing region is undefined, which yields low-quality results. Further, SPADE does not allow using a different style input image for each region in the segmentation mask. Our first main idea is therefore to control the style of each region individually, i.e., our proposed architecture accepts one style image per region (or per region instance) as input.
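To make the per-region idea concrete, one simple way to obtain a style code for every labeled region is masked average pooling: pool the features of a style-image encoder under each region of the segmentation mask. The sketch below is only illustrative and is not the paper's encoder; the feature extractor, the code dimension, and all function names are our assumptions.

import torch
import torch.nn.functional as F

def per_region_style_codes(features, mask, num_regions):
    """Masked average pooling: one style code per semantic region (sketch).

    features: (B, C, H, W) activations from some style-image encoder
    mask:     (B, H, W) long tensor of region labels, aligned with features
    returns:  (B, num_regions, C) style codes; zeros for absent regions
    """
    # One binary channel per region label: (B, num_regions, H, W).
    onehot = F.one_hot(mask, num_regions).permute(0, 3, 1, 2).float()
    # Sum the features under each region mask ...
    pooled = torch.einsum('bchw,brhw->brc', features, onehot)
    # ... and normalize by the region area (clamped to avoid division by zero).
    area = onehot.sum(dim=(2, 3)).clamp(min=1.0)
    return pooled / area.unsqueeze(-1)

A region that is absent from the style image simply yields a zero code here; a real network would need a more deliberate fallback for such regions, which is exactly the shortcoming discussed above.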
Second, we believe that inserting style information only at the beginning of a network is not a good architectural choice. Recent architectures [25, 31, 2] have demonstrated that higher quality results can be obtained if style information is injected as normalization parameters in multiple layers of the network, e.g., using AdaIN [19]. However, none of these previous networks use style information to generate spatially varying normalization parameters. To alleviate this shortcoming, our second main idea is to design a normalization building block, called SEAN, that can use style input images to create spatially varying normalization parameters per semantic region. An important aspect of this work is that the spatially varying normalization parameters depend on the segmentation mask as well as the style input images.
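For intuition, AdaIN [19] replaces the channel-wise mean and standard deviation of the content features x with those of the style features y, i.e., AdaIN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y), so its modulation is constant over the whole image. The sketch below illustrates the region-adaptive alternative under stated assumptions: per-region style codes are mapped to per-region scale and shift parameters and broadcast over the label map, so the modulation varies spatially with both the mask and the style inputs. The layer choices and all names here are our assumptions, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAdaptiveNorm(nn.Module):
    """Illustrative region-adaptive normalization (a SEAN-like sketch)."""

    def __init__(self, channels, style_dim, num_regions):
        super().__init__()
        self.num_regions = num_regions
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        # Map each region's style code to a per-channel scale and shift.
        self.to_gamma = nn.Linear(style_dim, channels)
        self.to_beta = nn.Linear(style_dim, channels)

    def forward(self, x, mask, style_codes):
        # x: (B, C, H, W); mask: (B, H, W) long labels;
        # style_codes: (B, R, style_dim), e.g. from masked average pooling.
        onehot = F.one_hot(mask, self.num_regions).permute(0, 3, 1, 2).float()
        onehot = F.interpolate(onehot, size=x.shape[2:], mode='nearest')
        gamma = self.to_gamma(style_codes)  # (B, R, C)
        beta = self.to_beta(style_codes)    # (B, R, C)
        # Scatter each region's parameters onto its own pixels. Since the
        # mask is one-hot, the sum over regions selects exactly one
        # (gamma, beta) pair per pixel, giving spatially varying parameters.
        gamma_map = torch.einsum('brc,brhw->bchw', gamma, onehot)
        beta_map = torch.einsum('brc,brhw->bchw', beta, onehot)
        return self.norm(x) * (1 + gamma_map) + beta_map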
Empirically, we provide an extensive evaluation of our method on several challenging datasets: CelebAMask-HQ [28, 24, 32], CityScapes [10], ADE20K [50], and our own Facades dataset. Quantitatively, we evaluate our work on a wide range of metrics including FID, PSNR, RMSE, and segmentation performance; qualitatively, we show examples of synthesized images that can be evaluated by visual inspection. Our experimental results demonstrate a large improvement over the current state-of-the-art methods. In summary, we introduce a new architectural building block, SEAN, that has the following advantages:
1. SEAN improves the quality of the synthesized images for conditional GANs. We compare to the state-of-the-art methods SPADE and Pix2PixHD and achieve clear improvements in quantitative metrics (e.g., FID score) and under visual inspection.
2. SEAN improves the per-region style encoding, so that reconstructed images can be made more similar to the input style images, as measured by PSNR and visual inspection.
3. SEAN allows the user to select a different style input image for each semantic region. This enables image editing capabilities that produce much higher quality results and provide better control than the current state-of-the-art methods. Example image editing capabilities are interactive region-by-region style transfer and per-region style interpolation (see Figs. 1, 2, and 5, and the sketch below).
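As a usage illustration of the last capability, per-region style interpolation amounts to linearly blending the per-region codes extracted from two reference images before they enter the normalization layers. The generator call and all names below are hypothetical, not the paper's API.

import torch

def interpolate_region_styles(codes_a, codes_b, weights):
    """Blend per-region style codes from two reference images (sketch).

    codes_a, codes_b: (B, R, style_dim) codes from two style images
    weights:          (R,) blend factor in [0, 1], one per region
    """
    w = weights.view(1, -1, 1)  # broadcast over batch and style dimensions
    return (1.0 - w) * codes_a + w * codes_b

# Hypothetical usage: blend only the hair region (say, label 3) halfway,
# leaving every other region's style at the first reference image.
# weights = torch.zeros(num_regions); weights[3] = 0.5
# mixed = interpolate_region_styles(codes_a, codes_b, weights)
# image = generator(mask, mixed)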
2. Related Work
Generative Adversarial Networks. Since their introduction in 2014, Generative Adversarial Networks (GANs) [14] have been successfully applied to various image synthesis tasks, e.g., image inpainting [48, 11], image manipulation [52, 5, 1], and texture synthesis [29, 43, 12]. With continuous improvements to GAN architectures [40, 25, 38], loss functions [33, 4], and regularization [16, 36, 34], the images synthesized by GANs are becoming more and more realistic. For example, the human face images generated by StyleGAN [25] are of very high quality and are almost indistinguishable from photographs to untrained viewers. A traditional GAN uses noise vectors as the input and thus provides little user control. This motivates the development of conditional GANs (cGANs) [35], where users can control the synthesis by feeding the generator with conditioning information. Examples include class labels [37, 34, 6], text [41, 18, 46], and images [22, 53, 30, 44, 38]. Our work builds on conditional GANs with image inputs, which aim to tackle image-to-image translation problems.