Recovering Realistic Texture in Image Super-resolution by
Deep Spatial Feature Transform
Xintao Wang, Ke Yu, Chao Dong, Chen Change Loy
Problem
Low-resolution image → high-resolution image (enlarged 4 times)
Previous work
• Contemporary SR algorithms are mostly CNN-based methods[1].
• Most CNN-based methods use a pixel-wise loss function (MSE-based models):
  • good at recovering edges and smooth areas
  • not good at texture recovery
• Adversarial loss is introduced in SRGAN[2] and EnhanceNet[3] (GAN-based models):
  • encourages the network to favor solutions that look more like natural images
  • visual quality of reconstruction is significantly improved
SRCNN SRGAN Ground-truth
[1] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In ECCV, 2014.
[2] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
[3] M. S. Sajjadi, B. Schölkopf, and M. Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In ICCV, 2017.
Motivation
[Figure: swapping categorical priors — a building patch (×4) restored with the plant prior vs. the building prior yields different textures; example classes: animal, building, water, sky, grass, mountain, plant]
Semantic categorical prior
Issues
1. How to represent the semantic categorical prior?
2. How can the categorical prior be incorporated into the reconstruction process effectively?
Our approach: explore semantic segmentation probability maps as the categorical prior at the pixel level.
Our approach: propose a novel Spatial Feature Transform (SFT) that can alter the network behavior conditioned on other information.
Represent categorical prior
• Contemporary CNN segmentation network[1] (ResNet-101), fine-tuned on LR images
• The network outputs K probability maps; a per-pixel argmax over them gives the semantic categorical prior
[1] Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In ICCV, 2015.
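The prior representation above (K probability maps followed by a per-pixel argmax) can be sketched in a few lines. The logits below are random toy values, a stand-in for the outputs of the segmentation network; shapes are illustrative assumptions:

```python
import numpy as np

# Toy sketch (not the paper's network): convert per-pixel class logits
# from a segmentation network into K probability maps and a label map.
K, H, W = 3, 4, 4                       # K categories, an H x W image
rng = np.random.default_rng(0)
logits = rng.normal(size=(K, H, W))     # stand-in for network outputs

# Softmax over the category axis yields probability maps P_1 .. P_K
exp = np.exp(logits - logits.max(axis=0, keepdims=True))
prob_maps = exp / exp.sum(axis=0, keepdims=True)

# Per-pixel argmax gives the semantic categorical prior as a label map
labels = prob_maps.argmax(axis=0)
```

Note that the SFT-Net itself consumes the soft probability maps rather than the hard label map, so ambiguous pixels retain their uncertainty.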
Examples on segmentation
[Figure: segmentation maps predicted on input LR images and on HR images vs. ground truth; classes: animal, sky, grass, building, mountain, plant, water, background]
Incorporate conditions
• CNN for SR: 𝒚 = G_θ(𝒙), where the network G, parametrized by θ, maps the input LR image 𝒙 to the restored image 𝒚
• Categorical prior: the segmentation probability maps Ψ = (P_1, P_2, …, P_K)
• Goal: condition the restoration on the prior, 𝒚 = G_θ(𝒙 | Ψ)
Spatial Feature Transform
• By learning a mapping function ℳ, the prior Ψ is modeled by a pair of affine transformation parameters (γ, β): ℳ: Ψ ↦ (γ, β)
• The modulation is then carried out by an affine transformation on the feature maps 𝑭: SFT(𝑭 | γ, β) = γ ⊙ 𝑭 + β
• Overall, 𝒚 = G_θ(𝒙 | Ψ) becomes 𝒚 = G_θ(𝒙 | γ, β)
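A minimal numerical sketch of the transform: the mapping ℳ below is a toy per-pixel linear layer (the paper's condition network is a small CNN), applied to the probability maps to produce (γ, β); all shapes and weights are illustrative assumptions:

```python
import numpy as np

# Sketch of the Spatial Feature Transform. M: Psi -> (gamma, beta) is a
# stand-in per-pixel linear layer here, not the paper's condition network.
C, K, H, W = 8, 3, 4, 4                 # feature channels, categories, size
rng = np.random.default_rng(0)
F = rng.normal(size=(C, H, W))          # intermediate feature maps
psi = rng.normal(size=(K, H, W))        # probability maps (toy values)

Wg = 0.1 * rng.normal(size=(C, K))      # stand-in weights producing gamma
Wb = 0.1 * rng.normal(size=(C, K))      # stand-in weights producing beta
gamma = np.einsum('ck,khw->chw', Wg, psi) + 1.0  # scale, centred at 1
beta = np.einsum('ck,khw->chw', Wb, psi)         # shift

# The modulation itself: SFT(F | gamma, beta) = gamma ⊙ F + beta
out = gamma * F + beta
```

Because γ and β keep the spatial dimensions of the feature maps, the modulation can differ at every pixel, which is what lets different semantic regions receive different texture treatment.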
Spatial Feature Transform
[Figure: SFT-Net architecture — a condition network maps the segmentation probability maps to shared SFT conditions; the main branch stacks residual blocks whose SFT layers and convolutions are followed by upsampling and final convolutions. Inside an SFT layer, small convolutions turn the conditions into (γ_i, β_i), which modulate the features via ⊙ and +.]
Loss function
• Adversarial loss[1]
  min_θ max_η  E_{y∼p_HR}[log D_η(y)] + E_{x∼p_LR}[log(1 − D_η(G_θ(x)))]
  The generator G and discriminator D compete with each other.
• Perceptual loss[2]
  ‖φ_VGG(ŷ) − φ_VGG(y)‖₂²
  • encourages the network to generate images that reside on the manifold of natural images
  • uses a pre-trained 19-layer VGG network (features before conv5_4)
  • optimizes the super-resolution model in a feature space
[1] I. Goodfellow, et al. Generative adversarial nets. In NIPS, 2014.
[2] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
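A toy sketch of the perceptual-loss idea: the distance between restored and ground-truth patches is measured between feature maps rather than pixels. `phi` below is a stand-in average-pooling "extractor", not the pre-trained VGG network the paper uses; the patches are random toy arrays:

```python
import numpy as np

def phi(img):
    """Stand-in feature extractor (2x2 average pooling), not VGG conv5_4."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(0)
y = rng.random((8, 8))                      # ground-truth HR patch (toy)
y_hat = y + 0.01 * rng.normal(size=(8, 8))  # restored patch (toy)

# Perceptual loss: squared L2 distance in feature space,
# || phi(y_hat) - phi(y) ||_2^2
perceptual_loss = np.sum((phi(y_hat) - phi(y)) ** 2)
```

The point of the feature-space comparison is that small pixel-level deviations (which a natural-looking texture inevitably has) are penalized less than deviations in perceptually salient structure.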
Spatial condition
• The modulation parameters (γ, β) have a close relationship with the probability maps 𝑷 and contain spatial information.
[Figure: input and restored LR patches alongside P_building, P_grass and P_plant maps, and γ/β maps of several channels (e.g. γ of C6, C14, C51; β of C1, C5, C7), showing delicate, spatially varying modulation]
Results
GT  SRCNN  SRGAN  EnhanceNet  SFT-Net (ours)
PSNR: 24.83dB PSNR: 23.36dB PSNR: 22.71dB PSNR: 22.90dB
Bicubic SRCNN VDSR LapSRN DRRN MemNet EnhanceNet SRGAN SFT-Net (ours) GT
Results
[Charts: user study — Part I: pairwise preference of SFT-Net (ours) over SRGAN and over EnhanceNet across categories (sky, building, grass, animal, plant, water, mountain); Part II: rank distribution (Rank-1 to Rank-4) among GT, Ours (GAN-based), MemNet and SRCNN (MSE-based)]
Impact of different priors
[Figure: the same regions (building, sky, grass, mountain, water, plant, animal) restored under each of the categorical priors, plus bicubic, showing how the chosen prior changes the generated texture]
Other conditioning methods
Comparison with input concatenation, compositional mapping[1], and FiLM[2] against SFT-Net (ours).
[1] S. Zhu, S. Fidler, R. Urtasun, D. Lin, and C. C. Loy. Be your own Prada: Fashion synthesis with structural coherence. In ICCV, 2017.
[2] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville. FiLM: Visual reasoning with a general conditioning layer. arXiv preprint arXiv:1709.07871, 2017.
Robustness to out-of-category regions
[Figure: comparison with SRGAN on out-of-category examples]
Conclusion
• Explore semantic segmentation maps as categorical prior for realistic texture recovery.
• Propose a novel Spatial Feature Transform layer to efficiently incorporate the categorical conditions into a CNN-based SR network.
• Extensive comparisons and a user study demonstrate the capability of SFT-Net in generating realistic and visually pleasing textures.
Crafting a Toolchain for Image Restoration by
Deep Reinforcement Learning
Ke Yu Chao Dong Liang Lin Chen Change Loy
Image Restoration
• There are many individual tasks
  • Denoising
  • Deblurring
  • JPEG Deblocking
  • Super-Resolution
  • …
• Towards more complicated distortions
  • Address multiple levels of degradation in one task[1, 2]
  • Address multiple individual tasks[3]
Image Restoration – A New Setting
• Consider multiple distortions simultaneously
  • Real-world: image capture and storage
  • Synthetic: Gaussian blur, Gaussian noise and JPEG compression
[Figure: the real-world scenario (capture → storage) vs. our synthetic setting (Gaussian blur → Gaussian noise → JPEG compression) — our new task]
Motivation
• Can we use a single CNN to address multiple distortions?
  • Inefficient: requires a huge network to handle all the possibilities
  • Inflexible: all kinds of distorted images are processed with the same structure
• Find a more efficient and flexible approach!
  • Process different distortions in different ways
Method – Decision Making
• Progressively restore the image quality
• Treat image restoration as a decision making process
Noisy! Try a denoising tool → Blurry! Try a deblurring tool → Artifacts! Try a deblocking tool → Good enough :)
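The decision process above can be sketched as a loop that keeps applying a chosen tool until a stopping action. The toy "tools" and the hand-written policy below are illustrative stand-ins for the paper's learned toolbox and agent:

```python
# Toy sketch of restoration as sequential decision making. The tools and
# the policy are hand-written stand-ins, not the learned agent/toolbox.
def toolchain_restore(image, tools, policy, max_steps=5):
    """Apply tools chosen by `policy` until it returns None (stop)."""
    for _ in range(max_steps):
        tool = policy(image)
        if tool is None:            # the stopping action
            break
        image = tools[tool](image)  # apply the chosen tool
    return image

# Here the "image" is just a scalar distortion level that tools reduce
tools = {"denoise": lambda x: x * 0.5, "deblur": lambda x: x - 1.0}

def policy(x):                      # hand-crafted stand-in policy
    if x > 1.0:
        return "denoise"
    if x > 0.1:
        return "deblur"
    return None                     # good enough: stop

restored = toolchain_restore(8.0, tools, policy)
```

In the actual method the policy is a recurrent network trained with reinforcement learning, so the toolchain is formed dynamically per input image rather than hand-crafted.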
Method – Overview
• Our framework requires a toolbox and an agent
[Figure: the agent repeatedly selects a tool from the toolbox and applies it to the current image]
Method – Toolbox
• We design 12 tools, each of which addresses a simple task
  • 3-layer CNN[4]
  • 8-layer CNN
Method – Agent
• Use reinforcement learning to address tool selection
• State: the current distorted image + the action taken at the last step
• Action: the 12 tools + a stopping action
• Reward: PSNR gain at each step
• Structure: a feature extractor on the input image 𝐈_t and a one-hot encoder of the last action feed an LSTM, which outputs the action distribution 𝒗_t
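The stepwise reward (PSNR gain) can be sketched as follows; the images are toy arrays, and the noise on the "after" image is half that on the "before" image so the gain comes out exactly as 10·log10(4) ≈ 6.02 dB:

```python
import numpy as np

# Sketch of the stepwise reward: the PSNR gain of the current result over
# the previous step's result, both measured against the ground truth.
def psnr(img, ref, peak=1.0):
    mse = np.mean((img - ref) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
gt = rng.random((8, 8))                 # toy ground-truth image
noise = rng.normal(size=(8, 8))
prev = gt + 0.10 * noise                # result before this step
curr = gt + 0.05 * noise                # result after applying a tool

reward = psnr(curr, gt) - psnr(prev, gt)   # positive: the tool helped
```

A tool that degrades the image yields a negative reward, which is what teaches the agent when to take the stopping action.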
Method – Joint Training
• Challenge of the 'middle state'
  • Intermediate results after several steps of processing
  • None of the tools has seen these intermediate results
• Joint Training
[Figure: sampled toolchains (toolchain 1, toolchain 2, …) are unrolled; each forward pass produces a restored image and an MSE loss, and the backward pass updates all tools along that toolchain]
Experimental Results
• Dataset: DIV2K[5]
• Comparison with generic models for image restoration
  • VDSR[1]
  • DnCNN[3]
Experimental Results
• Quantitative results on DIV2K: competitive performance, better generality
• Runtime analyses: more efficient
Experimental Results
• Qualitative results on DIV2K
[Figure: input, 1st/2nd/3rd step of our toolchain, VDSR-s and VDSR[1], under mild (unseen), moderate, and severe (unseen) distortions]
Experimental Results
• Qualitative results on real-world images
[Figure: input, 1st/2nd/3rd step of our toolchain, VDSR[1]]
Experimental Results
• Ablation study: joint training and the stopping action
Conclusion
• Contributions
  • Address image restoration in a reinforcement learning framework
  • Propose joint learning to cope with the middle processing state
  • The dynamically formed toolchain performs competitively against human-designed networks with less computational complexity
• Future work
  • Incorporate more tools (trained with GAN loss)
  • Handle spatially variant distortions
Thanks! Q & A
Reference
[1] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.
[2] Y. Tai, J. Yang, X. Liu, and C. Xu. MemNet: A persistent memory network for image restoration. In ICCV, 2017.
[3] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. TIP, 2017.
[4] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. TPAMI, 38(2):295–307, 2016.
[5] E. Agustsson and R. Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In CVPR Workshop, 2017.