Hallucinated-IQA: No-Reference Image Quality Assessment via … · 2018. 6. 11. · Hallucinated-IQA: No-Reference Image Quality Assessment via Adversarial Learning Kwan-Yee Lin1

Hallucinated-IQA: No-Reference Image Quality Assessment via Adversarial Learning

Kwan-Yee Lin1 and Guanxiang Wang2

1 Department of Information Science, School of Mathematical Sciences, Peking University

2 Department of Mathematics, School of Mathematical Sciences, Peking University

[email protected]

[email protected]

Abstract

No-reference image quality assessment (NR-IQA) is a fundamental yet challenging task in the low-level computer vision community. The difficulty stems from the limited information available: the reference image needed for comparison is typically absent. Although previous methods have employed various feature extraction mechanisms, from natural scene statistics to deep neural networks, a performance bottleneck remains.

In this work, we propose a hallucination-guided quality regression network to address this issue. We first generate a hallucinated reference conditioned on the distorted image to compensate for the absence of the true reference. We then pair the hallucinated reference with the distorted image and forward both to the regressor, which learns the perceptual discrepancy under the guidance of an implicit ranking relationship within the generator and thereby produces a precise quality prediction. To demonstrate the effectiveness of our approach, we conduct comprehensive experiments on four popular image quality assessment benchmarks. Our method significantly outperforms all previous state-of-the-art methods by large margins. The code and model are publicly available on the project page https://kwanyeelin.github.io/projects/HIQA/HIQA.html.

1. Introduction

Image quality assessment (IQA) refers to the challenging task of automatically predicting the perceptual quality of a distorted image. IQA serves as a key component in the low-level computer vision community and has a wide range of applications [13, 26, 49].

IQA algorithms can be classified into three categories: full-reference IQA (FR-IQA) [50, 24, 19], reduced-reference IQA (RR-IQA) [11], and general-purpose no-reference IQA (NR-IQA) [46, 17, 42, 51, 44, 20, 25]. Although FR-IQA and RR-IQA metrics have achieved remarkable results over the decades, they require a corresponding non-distorted reference image for comparison during quality prediction. This precondition makes them infeasible in practical applications, since it is hard, and in most cases impossible, to obtain an ideal reference image. In contrast, NR-IQA, which takes only the distorted image to be assessed as input without any additional information, is more realistic and has therefore received substantial attention in recent years. However, its ill-posed definition makes it highly challenging for NR-IQA to predict image quality well.

[Figure 1 image. Rows: (a), (b), (c). Columns: Ground-truth Reference, Distorted Image, Hallucinated Reference, Discrepancy Map.]

Figure 1: An illustration of our motivation. The first column is the undistorted ground-truth reference image. The second column shows several kinds of distortion that commonly occur. The third column shows the hallucinated reference images generated by our approach. The fourth column is the discrepancy map, which captures rich information that can guide the learning of the quality regression network toward highly accurate results.

The ill-posed nature of the underdetermined NR-IQA problem is particularly pronounced because of the limited information: both the form of distortion and the corresponding non-distorted reference image are typically absent. This is counter-intuitive, since the human visual system (HVS) needs a reference to quantify the perceptual discrepancy, comparing the distorted image either directly with the original undistorted image or implicitly with a hallucinated scene in mind, as demonstrated in Fig. 1(a). The ill-posed definition has become the most essential issue of the NR-IQA task and has led to the performance bottleneck of the last decade.

Numerous efforts have been made to ease this problem by designing powerful feature representation models. Traditional methods commonly use manually designed statistical representations and hence lack the diversity and flexibility needed to model the multiple complex distortion types (see footnote 1) and the large span of image contents (e.g., human, animal, plant, cityscape, transportation, etc.) in NR-IQA. In recent years, the promising results of Deep Neural Networks (DNNs) in many computer vision tasks [8, 5, 43] have encouraged researchers to apply their formidable feature representation power to the NR-IQA task. Nevertheless, the extremely limited number of annotated samples in public datasets greatly limits the advantage of DNNs in NR-IQA. To better leverage the power of DNNs, previous works usually employ sophisticated multi-task and data augmentation strategies that rely on extra annotated rankings, proxy quality scores, or distortion information. Such information is unavailable in practical NR-IQA applications, so these methods are ill-suited to modeling multiple unknown distortion types. Some works attempt to transfer general image feature representations from a model pre-trained on ImageNet [6] to quality prediction. However, the low correlation and similarity between NR-IQA and image classification reduce the effectiveness of transfer learning.

In this work, a Hallucination-Guided Quality Regression Network is proposed to simulate the behaviour of the human visual system (HVS); it makes precise predictions by leveraging the perceptual discrepancy between the distorted image and a hallucinated reference. As shown in Fig. 2, a high-resolution scene hallucination is first generated from the distorted image. Then the discrepancy map, which naturally encodes the difference between the distorted image and the hallucinated reference, is obtained to guide the learning of the regression network. With this strong and clearly defined discrepancy information incorporated, the ill-posed nature of NR-IQA can be substantially alleviated. Therefore, even with only common data augmentation, our approach achieves better performance than all of the conventional, more sophisticated methods.

A straightforward way to generate the hallucinated reference is to leverage state-of-the-art image super-resolution [23, 22, 40], blind deblurring [33], or inpainting [34] methods to reconstruct images from the distorted ones. However, an image may be distorted by multiple unknown distortions, which breaks the basic assumptions of these related fields (see footnote 2), so it is impractical to use them to obtain a reconstructed image that qualifies as the agent reference for the NR-IQA task. To this end, a Quality-Aware Generative Network is proposed to generate the hallucinated reference with a novel quality-aware perceptual term designed specifically for the NR-IQA task at hand.

Footnote 1: An image can be distorted at any stage of its lifecycle, from acquisition to storage, and may therefore undergo diverse distortions such as noise corruption, compression artifacts, transmission errors, and under-/over-exposure. For more details, please refer to [35].

While the Quality-Aware Generative Network is robust to most distortion types and levels, it is still very challenging for a DCNN-based method to reconstruct high-frequency details with realistic texture when the distorted image lacks structural information, as shown in Fig. 1(c). Since the hallucinated reference is crucial for the final prediction, a bad hallucination introduces large bias and leads the regression results to sub-optimal values. We tackle this problem with the following two mechanisms, from low level to high level: (1) We introduce adversarial learning to hallucinated reference generation and quality prediction with a novel IQA-discriminator that, on the one hand, encourages the generated hallucinated scene to be perceptually hard to distinguish from the true reference image and, on the other hand, constrains at a low semantic level the influence of bad hallucinations on the quality regression network. (2) A novel high-level semantic fusion mechanism is introduced to further reduce the instability of the quality regression network caused by the hallucination model. It exploits the implicit ranking relationship within the hallucination network as guidance to help the regression network adjust the image quality prediction adaptively. The quality-aware generative network, the hallucination-guided quality regression network, and the IQA-discriminator can be jointly optimised in an end-to-end manner.

The main contributions of this work are threefold:

• A novel Hallucination-Guided Quality Regression Network is proposed that incorporates perceptual discrepancy information into network learning to overcome the ill-posed nature of NR-IQA, significantly improving prediction precision and robustness.

• A Quality-Aware Generative Network together with a quality-aware perceptual loss is proposed, in which texture feature similarity and quality feature similarity are considered in a complementary manner to help generate qualified hallucinated references.

• Since the hallucinated reference is crucial for the final prediction, an IQA-Discriminator and an implicit ranking relationship fusion scheme are introduced to better guide the learning of the generator and to suppress the influence of negative scene hallucinations on quality regression, from low level to high level.

Footnote 2: For example, super-resolution methods usually assume that the blur kernel or its form is known.

We evaluate the proposed method on four widely used image quality assessment benchmarks: LIVE [41], CSIQ [21], TID2008 [36], and TID2013 [35]. Our approach outperforms all state-of-the-art NR-IQA methods by significant margins. A comprehensive ablation study further demonstrates the effectiveness of each component.

2. Related Work

No-reference Image Quality Assessment. In the NR-IQA literature, besides classic methods [31, 38, 29, 47] and their improved versions [51, 44, 27], significant progress has recently been made by exploring DNNs for better feature representation [17, 18, 45, 20, 25, 48]. For example, Kang et al. [17] introduce a shallow ConvNet to model quality prediction. This approach is refined into a multi-task CNN [18], in which the network learns the distortion type and the quality score simultaneously. Bianco et al. [2] use a pre-trained DCNN fine-tuned on an IQA dataset to extract features and then map them to IQA scores with an SVR model. Hui et al. [48] also propose to extract features with a pre-trained ResNet [14]; instead of learning IQA scores directly, they fine-tune the network to learn a probabilistic representation of distorted images. Based on the distortion types and levels in particular datasets, Liu et al. [25] synthesize large numbers of ranked images to train a Siamese network that learns rankings for NR-IQA. Liang et al. [24] propose to use a non-aligned similar scene as a reference. Kim and Lee [20] apply state-of-the-art FR-IQA methods to generate proxy scores on patches as the ground truth, pre-train the model on them, and then fine-tune it for NR-IQA.

In this work, we propose a unique approach that addresses the ill-posed problem by compensating for the absent reference information without any extra data annotation or prior knowledge, which makes it more flexible and feasible than other methods.

Generative Adversarial Network. GANs [12] and their variants [37, 1, 28] flourish in generating natural images such as human faces [3] and indoor scenes [7]. However, generating high-resolution images (e.g., 256×256) makes GAN training unstable and sometimes yields nonsensical outputs, as shown in [16]. Since our ultimate goal is NR-IQA, and the performance of the quality regression network is closely tied to the output of the generator, we do not apply the original discriminator; instead, we tailor the adversarial learning scheme to image quality assessment by introducing an effective IQA-discriminative network.

3. Our Approach

In this section, we introduce our approach to NR-IQA. An overview of our framework is illustrated in Fig. 2. The model consists of three parts: the quality-aware generative network G, the IQA-discriminative network D, and the hallucination-guided quality regression network R. The generative network produces a hallucinated reference as compensatory information for the distorted image. The discriminative network is trained with G in an adversarial manner to help G produce more qualified results and to constrain the negative effects of bad ones on R. We define the objective discrepancy (i.e., the pixel-wise difference) between a distorted image and the corresponding scene hallucination as the discrepancy map (see footnote 3). The quality regression network takes the distorted images and the corresponding discrepancy maps as inputs and, guided by the implicit ranking relationships in G, exploits the perceptual discrepancy to produce the predicted quality scores.

3.1. Quality-Aware Generative Network

As mentioned in the previous sections, the hallucinated reference compensates for the absence of the true reference image, and the smaller the gap between the hallucination and the true reference, the more precise the quality regression network will be. Therefore, the aim of G is to generate a high-resolution hallucinated image $I_{sh}$ conditioned on the distorted image $I_d$. To this end, we adopt a stacked hourglass [32] as the baseline of the generative network.

A straightforward way to learn the generating function $G(I_d)$ is to enforce the output of the generator to be close to the true reference both pixel-wise and perception-wise. Therefore, given a set of distorted images $\{I_d^i, i = 1, 2, \ldots, N\}$ and the corresponding true reference images $\{I_r^i, i = 1, 2, \ldots, N\}$, we solve

$$\hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \left( l_p(G_\theta(I_d^i), I_r^i) + l_s(G_\theta(I_d^i), I_r^i) \right), \tag{1}$$

where $l_p$ penalizes the pixel-wise differences between the output and the ground truth with pixel-level error measurements, such as MSE, to generate holistic content, and $l_s$ penalizes the perception-wise differences to achieve sharper local results. We adopt a feature space loss term [9] as the perception constraint, which is defined as

$$l_s(G_\theta(I_d^i), I_r^i) = \| \phi(G_\theta(I_d^i)) - \phi(I_r^i) \|^2, \tag{2}$$

Footnote 3: This is different from the concept of the error map, which is used in FR-IQA to represent the pixel-wise error between the distorted image and the true reference.

[Figure 2 diagram. Panels: Quality-Aware Generative Network (Stack 1 ... Stack n), Hallucination-Guided Quality Regression Network, IQA-Discriminative Network. Legend: Convolution, Residual Unit, Down Sampling, Up Sampling, Hourglass Module, Fully Connection, Entrywise Sum, Subtraction, Fusion, High-Level Feature. Data: Distorted Image, Generated Hallucinated Reference, Ground Truth Reference, Discrepancy Map, Real/Fake, Predicted Quality Score.]

Figure 2: An illustration of our proposed Hallucinated-IQA framework. It consists of three strongly related subnets. (a) The Quality-Aware Generative Network generates hallucinated reference images; to obtain high-resolution hallucinated images, a quality-aware loss is introduced into the learning process. (b) The Hallucination-Guided Quality Regression Network incorporates the discrepancy between the hallucinated image and the distorted image, encoded in the discrepancy map. This discrepancy information, together with high-level semantic fusion from the generative network, supplies the regression network with rich information and greatly guides its learning. (c) Since the hallucinated image is crucial for the final prediction, the IQA-Discriminator is proposed to further refine the hallucinated image.

[Figure 3 image. Columns: Distorted Image, Baseline Generator, Baseline + Quality-Aware Loss, Baseline + Quality-Aware Loss + IQA-GAN, Discrepancy Map.]

Figure 3: An illustration of the effectiveness of the quality-aware loss and IQA-GAN. As the quality-aware loss and the IQA-GAN scheme are added, the hallucinated images become progressively clearer and more plausible. The last column shows the discrepancy map obtained from our model, which captures the type and location of the distortion well and proves very helpful for our IQA task.

where $\phi(\cdot)$ represents a feature transformation. Intuitively, a pre-trained network such as VGG-19 can be used to calculate the perception term. This is reasonable in most cases because VGG-19 is trained for semantic classification, so the features of its intermediate layers are invariant to noise in the input [4, 10]. Consequently, these layers provide structure and texture information that helps the generator infer more accurate results. However, this invariance also causes the perception term to ignore hard cases in which the output of the generator still contains a certain degree of distortion, as demonstrated in Fig. 3. To ease this problem, we propose a quality-aware perceptual loss, which dynamically incorporates the features of the deep regression network R. The loss function in Eq. (2) becomes

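$$l_s(G_\theta(I_d^i), I_r^i) = \lambda_1 l_v(G_\theta(I_d^i), I_r^i) + \lambda_2 l_q(G_\theta(I_d^i), I_r^i), \tag{3}$$

where

$$l_v = \sum_{c_v=1}^{C_v} \frac{1}{W_j H_j} \sum_{x=1}^{W_j} \sum_{y=1}^{H_j} \| \phi_j(G_\theta(I_d^i))_{x,y} - \phi_j(I_r^i)_{x,y} \|^2, \tag{4}$$

and

$$l_q = \sum_{c_q=1}^{C_q} \frac{1}{W_k H_k} \sum_{x=1}^{W_k} \sum_{y=1}^{H_k} \| \pi_k(G_\theta(I_d^i))_{x,y} - \pi_k(I_r^i)_{x,y} \|^2, \tag{5}$$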

where $\phi_j(\cdot)$ denotes the feature map at the j-th layer of VGG-19 and $\pi_k(\cdot)$ denotes the feature map at the k-th layer of R; W and H are the dimensions of the feature map, and C is the number of feature maps at a particular layer. Since the VGG-19 network and R are trained for different tasks, the kernels of the two networks tend to preserve different information. The activations from the layers of a pre-trained NR-IQA regression network (see footnote 4) capture the distortion information of the input, which provides the quality similarity measurement between the output of G and the ground truth. The activations from the layers of the VGG-19 network provide the semantic similarity measurement. Given the respective representational capabilities of the two networks, incorporating both $l_v$ and $l_q$ into the perception term lets them complement each other and jointly helps the generator produce better results.
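To make the structure of Eqs. (3) to (5) concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes feature maps are already extracted and stored as nested lists indexed `[channel][row][column]`; the function names and the default weights `lam1` and `lam2` are illustrative assumptions, since the values of $\lambda_1$ and $\lambda_2$ are not stated here.

```python
def feature_distance(feat_out, feat_ref):
    """Form shared by Eq. (4) and Eq. (5): for each channel, average the
    squared difference over all spatial positions, then sum over channels."""
    total = 0.0
    for chan_out, chan_ref in zip(feat_out, feat_ref):
        height, width = len(chan_out), len(chan_out[0])
        squared = sum(
            (a - b) ** 2
            for row_out, row_ref in zip(chan_out, chan_ref)
            for a, b in zip(row_out, row_ref)
        )
        total += squared / (width * height)
    return total


def quality_aware_perceptual_loss(vgg_out, vgg_ref, reg_out, reg_ref,
                                  lam1=1.0, lam2=1.0):
    """Eq. (3): weighted sum of the semantic term l_v (VGG-19 features)
    and the quality term l_q (features of the regression network R).
    lam1 and lam2 are placeholder weights (assumption)."""
    l_v = feature_distance(vgg_out, vgg_ref)
    l_q = feature_distance(reg_out, reg_ref)
    return lam1 * l_v + lam2 * l_q
```

In this sketch the two terms differ only in which network supplied the features, which mirrors the complementary role the text assigns to VGG-19 and R.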

Footnote 4: Note that the pre-trained quality regression model refers to one trained from scratch on the IQA dataset.

3.2. IQA-Discriminative Network

To ensure that the generator produces perceptually high-quality outputs with realistic high-frequency details, especially for samples that seriously lack structure and texture information because of the distortion type (e.g., local block-wise distortions of different intensity, transmission errors) or the distortion level, we introduce an adversarial learning mechanism into our work.

In the original form of adversarial learning, G is trained to generate images that fool D, while D is trained to distinguish fake reference images $I_{sh}$ from real reference images $I_r$. However, GANs are limited by the resolution of the generator, and the distorted images forwarded to a quality network are usually large in order to retain sufficient contextual information. Directly providing $I_{sh}$ as fake images to the discriminator therefore destabilizes the optimization procedure and sometimes leads to nonsensical results. More importantly, our ultimate goal is to improve the performance of the deep regression network R. Even when G fails to generate high-resolution hallucination images, the scores predicted by R should still be reasonable, so the influence of bad hallucination images on R should be suppressed. We therefore propose an IQA-Discriminator (i.e., D) that eases these problems by discriminating fake samples from real samples according to their positive or negative influence on R. If G generates a hallucinated reference that helps improve the precision of R, the hallucination is treated as a real sample by D; otherwise, it is a fake sample. This can be formulated as

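$$\max_{\omega} \; \mathbb{E}[\log D_\omega(I_r)] + \mathbb{E}[\log(1 - |D_\omega(G_\theta(I_d)) - d_{fake}|)], \tag{6}$$

where $d_{fake}$ denotes the ground truth influence label with the definition

$$d_{fake}^i = \begin{cases} 1 & \text{if } \|R(I_d^i, I_{sh}^i) - s^i\|_F < \epsilon \\ 0 & \text{if } \|R(I_d^i, I_{sh}^i) - s^i\|_F \geq \epsilon \end{cases}, \tag{7}$$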

where $s^i$ is the ground truth quality score of $I_d^i$ and $\epsilon$ is the threshold parameter. The general idea behind this formulation is to leverage a property of the quality regression loss: the loss is an explicit index that directly reflects the impact of G on R, so it can be used to make D penalize only samples with negative influence. It can therefore also be regarded as a relaxation strategy that stabilizes the adversarial learning process.

Thus, G is eventually optimised to fool the discriminator D by generating qualified hallucinated scenes that are beneficial for R. The adversarial loss of G is formulated as

$$L_{adv} = \mathbb{E}[\log(1 - D_\omega(G_\theta(I_d)))], \tag{8}$$

and the overall loss function of G for all training samples is given by

$$L_G = \mu_1 L_p + \mu_2 L_s + \mu_3 L_{adv}, \tag{9}$$

where $\mu_1$, $\mu_2$ and $\mu_3$ are parameters that trade off the three loss components.
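The per-sample quantities in Eqs. (6) to (9) can be sketched as follows. This is an illustrative reading under stated assumptions, not the released code: scores and discriminator outputs are treated as scalars (so the Frobenius norm in Eq. (7) reduces to an absolute value), the function names are hypothetical, the default weights `mu1`, `mu2`, `mu3` are placeholders, and the small clamp inside the logarithm is an added numerical safeguard that does not appear in the paper.

```python
import math


def _safe_log(value, floor=1e-12):
    # Clamp to avoid log(0); this safeguard is an assumption, not part of Eq. (6).
    return math.log(max(value, floor))


def influence_label(predicted_score, true_score, eps):
    """Eq. (7): label 1 if R's prediction with this hallucination is within
    eps of the ground truth score, otherwise 0."""
    return 1 if abs(predicted_score - true_score) < eps else 0


def discriminator_objective(d_real, d_hallucinated, label):
    """Per-sample value of Eq. (6), which D is trained to maximise.
    d_real = D(I_r), d_hallucinated = D(G(I_d)), label = d_fake."""
    return _safe_log(d_real) + _safe_log(1.0 - abs(d_hallucinated - label))


def generator_loss(l_p, l_s, d_hallucinated, mu1=1.0, mu2=1.0, mu3=1.0):
    """Eq. (9) with the adversarial term of Eq. (8) for a single sample."""
    l_adv = _safe_log(1.0 - d_hallucinated)
    return mu1 * l_p + mu2 * l_s + mu3 * l_adv
```

With `label = 1` (a helpful hallucination), the objective rewards D for outputs near 1; with `label = 0` it rewards outputs near 0, which matches the statement that D penalizes only samples with negative influence.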

3.3. Hallucination-Guided Quality Regression Network

Given the hallucinated scene generated by G, we can provide agent references to the quality regression network to compensate for the absence of true reference information. To incorporate the hallucinated reference information effectively, we introduce the concept of the discrepancy map. To further stabilize the optimization of R, a high-level semantic fusion scheme is proposed.

Discrepancy Map. Given a set of distorted images to be assessed, previous CNN-based NR-IQA methods learn a mapping function $R(I_d)$ to predict the quality scores. In contrast, we treat the distorted images and their discrepancy maps as pairs $\{I_d^i, I_{map}^i\}_{i=1}^{N}$ and train a deep regression network by solving

$$\hat{\gamma} = \arg\min_{\gamma} \frac{1}{N} \sum_{i=1}^{N} l_r(R(I_d^i, I_{map}^i), s^i), \tag{10}$$

where $I_{map} = |I_d - G_\theta(I_d)|$ denotes the discrepancy map. This formulation shows that the discrepancy map can be regarded as prior information that tells the network what the distortion looks like.
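As a small illustration of the definition $I_{map} = |I_d - G_\theta(I_d)|$, the sketch below computes the element-wise absolute difference for single-channel images stored as nested lists. The function name and the nested-list image layout are assumptions made for the example; the hallucinated image is passed in directly rather than produced by a generator.

```python
def discrepancy_map(distorted, hallucinated):
    """Element-wise |I_d - G(I_d)| for two images of identical size,
    each given as a list of rows of pixel intensities."""
    return [
        [abs(d - h) for d, h in zip(row_d, row_h)]
        for row_d, row_h in zip(distorted, hallucinated)
    ]


# Toy 2x2 example: large values mark where the hallucination departs
# from the distorted input, i.e. where distortion is likely located.
distorted = [[10, 200], [30, 40]]
hallucinated = [[12, 180], [30, 45]]
example_map = discrepancy_map(distorted, hallucinated)
```

The map is zero wherever the hallucination agrees with the input, so a nearly undistorted image yields a nearly empty map.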

Interestingly, the overall mechanism works in a mutually reinforcing way: during the training stage of R, G produces auxiliary hallucinated references, while during the training stage of G, R is in turn used to help generate better hallucinations. In essence, G and R are mutually correlated and can thus reinforce each other.

High-level Semantic Fusion. As mentioned in the previous sections, the precision of R depends greatly on the quality of the hallucinated scene. Specifically, a qualified hallucination used as the agent reference helps R capture the correct perceptual discrepancy of the distorted image, whereas an unqualified one introduces large bias into R by improperly narrowing the distortion information. Hence, a constraining scheme is needed to stabilize the quality regression process.

Assuming G has been trained, we denote the feature maps after the m-th residual block in the encoder part of its n-th stack by $\{H_{mn}^{c_{mn}}(I_d)\}_{c_{mn}=1}^{C_{mn}}$. We fuse the feature maps after the last encoder residual block of the second stack with the feature maps after the last block of R, which gives the fusion term

$$F = f(H_{5,2}(I_d)) \otimes R_1(I_d, I_{map}), \tag{11}$$

where f is a linear projection that ensures the dimensions of H and $R_1$ are equal, $R_1$ denotes the feature extraction before the fully connected layers ($R_2$) of R, and $\otimes$ denotes the concatenation operation. Thus, the loss of R can be formulated as

$$L_R = \frac{1}{T} \sum_{t=1}^{T} \| R_2(f(H_{5,2}(I_d)) \otimes R_1(I_d, I_{map})) - s^t \|_{\ell_1}. \tag{12}$$

The form of the loss $L_R$ allows the high-level semantic information of G to participate in the optimization of R. As discussed in the introduction, the fusion term F exploits the implicit ranking relationship within G (see footnote 5) as guidance to help R adjust the quality prediction adaptively. Specifically, if G is optimal, the solver may simply drive the weights of the neurons in $R_2$ that connect to f toward zero, approaching an identity mapping. Otherwise, the quality of the hallucinated scene essentially reflects the quality of the input distorted image, which can be leveraged as guidance to correct the prediction and thereby improve the precision of R at a high semantic level. Meanwhile, the IQA-discriminator can be regarded as a low-level semantic scheme for R, since it encourages G to generate hallucinations that are useful inputs to R. Our model therefore has schemes at multiple semantic levels to stabilize the quality regression process.
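The fusion term of Eq. (11) and the $\ell_1$ loss of Eq. (12) can be sketched on flattened feature vectors as below. This is a simplified illustration under several assumptions: real feature maps are multi-dimensional tensors, the projection weights would be learned rather than fixed, and the function names are hypothetical.

```python
def linear_projection(features, weights):
    """f in Eq. (11): multiply a feature vector by a weight matrix whose
    rows correspond to output dimensions."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def fuse(generator_features, regressor_features, weights):
    """Eq. (11): project the generator's encoder features to the size of the
    regressor features, then concatenate the two vectors."""
    projected = linear_projection(generator_features, weights)
    return projected + regressor_features  # list concatenation


def l1_regression_loss(predicted_scores, true_scores):
    """Eq. (12): mean absolute error between the outputs of R_2 and the
    ground truth quality scores s^t."""
    return sum(abs(p - s) for p, s in zip(predicted_scores, true_scores)) / len(true_scores)


# Toy example: a 3-dimensional generator feature is projected to 2 dimensions
# so that it matches the 2-dimensional regressor feature before concatenation.
fused = fuse([1.0, 2.0, 3.0], [0.5, 0.5], [[1, 0, 0], [0, 1, 1]])
```

If the weights in $R_2$ attached to the projected half of the fused vector were driven to zero, the prediction would depend on the regressor features alone, which is the behaviour the text describes for an optimal G.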

3.4. Training Strategy

Since all of the operations in G and R are differentiable, these two sub-networks can be trained in an end-to-end manner. To better optimize generation and quality regression in a mutually reinforcing way, we adopt an alternating training strategy in practice. Please refer to the supplementary material, where Algorithm 1 gives the whole training process of our approach as pseudocode.

3.5. Weakly-Supervised Quality Assessment

In this section, we discuss some extensions that further uncover the potential of our framework.

To advance the development of the IQA task, various benchmarks have been released in recent years. However, a significant issue has arisen as well. As shown in Table 1, there are large gaps among the datasets in the definition of quality scores and in distortion types and levels. Since NR-IQA models are commonly trained on one specific dataset, these gaps easily cause the models to over-fit and generalize poorly. Learning across datasets is one way to ease the problem. Previous methods usually transfer the definition of quality scores through non-linear mappings learned from the distributions of the datasets, which may introduce bias into the models.

Footnote 5: G serves not only as a generator but also as an encoder-decoder mechanism. Thus, the difference information between images distorted to different degrees is encoded compactly at the end of the encoder part. We refer to this "difference information" as the "implicit ranking relationship" of distorted images in this work.

In contrast, as a by-product of our work, the hallucinated scene can be regarded as a universal medium among different datasets that helps the training process of a particular one without losing precision, since the hallucination is conditioned only on the distorted image and serves as the fundamental agent reference for image quality. Meanwhile, the detachable training process of our framework allows the R used when training G to differ from the R used in the final quality regression model. Consequently, once a hallucination generator has been trained, either on one specific dataset with multiple complex distortions or on multiple datasets at once, it can be used as a plug-and-play module to help the training process on any other dataset in a weakly-supervised manner. Moreover, the module can also be used as a data augmentation or initialization mechanism without any extra annotation or hand-crafted prior knowledge. We evaluate the above discussion in Sec. 4.1.

3.6. Implementation Details

All training samples are 256 × 256 pixel patches randomly sampled from the original images. Common data augmentation is then performed with random rotation (±20°) and flipping. We train our models with Caffe [15] on Titan X GPUs with a mini-batch size of 32, and all of them are trained from scratch. Stochastic gradient descent (SGD) is used to optimise the networks with an initial learning rate of $10^{-5}$ for the generation network and $10^{-2}$ for the regression network, dropped by a factor of 0.1 every 20K iterations. The weight decay is 0.0005, and the momentum is 0.9. During testing, we extract overlapping image patches at a fixed stride from each test image and simply average all predicted scores to obtain the final whole-image quality score.
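The test-time patch averaging and the step learning-rate schedule described above can be sketched as follows. The patch size of 256, the decay factor of 0.1 and the 20K-iteration step come from the text; the stride of 128 is an assumption (the text only says "a fixed stride"), and the function names are illustrative.

```python
def patch_origins(height, width, patch=256, stride=128):
    """Top-left corners of overlapping patches on a regular grid.
    stride=128 is an assumed value. Images smaller than the patch yield
    no origins, and border regions not reached by the grid are skipped."""
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(y, x) for y in rows for x in cols]


def image_score(patch_scores):
    """Whole-image quality score as the plain average of patch predictions."""
    return sum(patch_scores) / len(patch_scores)


def step_lr(base_lr, iteration, gamma=0.1, step=20000):
    """Learning rate multiplied by gamma once every `step` iterations."""
    return base_lr * gamma ** (iteration // step)
```

For example, a 512 × 384 image gives a 3 × 2 grid of patch origins with these settings, and the regression network's rate of $10^{-2}$ has been multiplied by 0.1 twice by iteration 45,000.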

Databases   # of Ref. Images   # of Dist. Images   # of Dist. Types   Score Type   Score Range

LIVE        29                 779                 5                  DMOS         [1, 100]

CSIQ        30                 866                 6                  DMOS         [0, 1]

TID2008     25                 1700                17                 MOS          [0, 9]

TID2013     25                 3000                24                 MOS          [0, 9]

Table 1: Summary of the databases evaluated in the experiments.

4. Experiments

Datasets. We perform experiments on four widely used benchmark datasets: LIVE [41], CSIQ [21], TID2008 [36], and TID2013 [35]. Detailed information is summarized in Table 1.

Evaluation Metrics. Following most previous works, we adopt two evaluation criteria: Spearman's Rank Order Correlation Coefficient (SROCC) and the Linear Correlation Coefficient (LCC). SROCC measures the monotonic relationship between the ground truth and the model prediction, and LCC measures the linear correlation between them. The detailed definitions are given in the supplementary material.
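For readers without the supplementary material, the two metrics can be sketched with their standard textbook definitions: LCC as the Pearson correlation, and SROCC as the Pearson correlation of the ranks. This is an assumption about the exact formulation used in the paper, and the function names are illustrative.

```python
import math


def lcc(x, y):
    """Pearson linear correlation coefficient between two sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def srocc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return lcc(ranks(x), ranks(y))
```

A prediction that is a monotonic but non-linear function of the ground truth (for instance, its square) reaches an SROCC of 1 while its LCC stays below 1, which is why the two metrics are reported together.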


Methods # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 # 11 # 12 # 13

BLIINDS-II [39] 0.714 0.728 0.825 0.358 0.852 0.664 0.780 0.852 0.754 0.808 0.862 0.251 0.755

CORNIA-10K [47] 0.341 -0.196 0.689 0.184 0.607 -0.014 0.673 0.896 0.787 0.875 0.911 0.310 0.625

HOSA [44] 0.853 0.625 0.782 0.368 0.905 0.775 0.810 0.892 0.870 0.893 0.932 0.747 0.701

RankIQA [25] 0.667 0.620 0.821 0.365 0.760 0.736 0.783 0.809 0.767 0.866 0.878 0.704 0.810

Ours 0.923 0.880 0.945 0.673 0.955 0.810 0.855 0.832 0.957 0.914 0.624 0.460 0.782

Ours+Oracle 0.952 0.890 0.976 0.831 0.931 0.773 0.898 0.812 0.910 0.929 0.735 0.638 0.739

Methods # 14 # 15 # 16 # 17 # 18 # 19 # 20 # 21 # 22 # 23 # 24 ALL

BLIINDS-II [39] 0.081 0.371 0.159 -0.082 0.109 0.699 0.222 0.451 0.815 0.568 0.856 0.550

CORNIA-10K [47] 0.161 0.096 0.008 0.423 -0.055 0.259 0.606 0.555 0.592 0.759 0.903 0.651

HOSA [44] 0.199 0.327 0.233 0.294 0.119 0.782 0.532 0.835 0.855 0.801 0.905 0.728

RankIQA [25] 0.512 0.622 0.268 0.613 0.662 0.619 0.644 0.800 0.779 0.629 0.859 0.780

Ours 0.664 0.122 0.182 0.376 0.156 0.850 0.614 0.852 0.911 0.381 0.616 0.879

Ours+Oracle 0.834 0.457 0.823 0.850 0.539 0.893 0.695 0.859 0.910 0.655 0.712 0.935

Table 2: Performance evaluation (SROCC) on the entire TID2013 database.

4.1. Comparisons with the state-of-the-art

To validate our approach, we conduct extensive evaluations against ten state-of-the-art NR-IQA methods. We follow the experimental protocol used in the three most recent algorithms (i.e., HOSA [44], BIECON [20], and RankIQA [25]): the reference images are randomly divided into two subsets, 80% for training and 20% for testing, and the corresponding distorted images are divided in the same way so that the two sets share no image content. Every experiment is repeated over ten random train-test splits, and the median SROCC and LCC values are reported as the final statistics.
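The protocol above can be sketched as follows. This is an illustrative outline only: `split_by_reference`, `median_over_splits`, the seeding scheme, and the `evaluate` callback are hypothetical names and choices, not the authors' code. The key point is that the split is drawn over reference images, and each distorted image follows its reference.

```python
import random
import statistics

def split_by_reference(ref_ids, train_fraction=0.8, seed=0):
    """Split sample indices so that no reference image spans both sets.

    ref_ids[i] is the reference-image id of distorted image i.
    """
    refs = sorted(set(ref_ids))
    rng = random.Random(seed)
    rng.shuffle(refs)
    n_train = int(round(train_fraction * len(refs)))
    train_refs = set(refs[:n_train])
    train_idx = [i for i, r in enumerate(ref_ids) if r in train_refs]
    test_idx = [i for i, r in enumerate(ref_ids) if r not in train_refs]
    return train_idx, test_idx

def median_over_splits(evaluate, ref_ids, n_splits=10):
    """Run evaluate(train_idx, test_idx) on n_splits random splits.

    evaluate is a user-supplied callback that trains a model and returns
    one score (e.g. SROCC); the median over all splits is returned.
    """
    scores = []
    for seed in range(n_splits):
        train_idx, test_idx = split_by_reference(ref_ids, seed=seed)
        scores.append(evaluate(train_idx, test_idx))
    return statistics.median(scores)

# Toy layout: 25 reference images with 5 distorted versions each.
toy_ref_ids = [i // 5 for i in range(125)]
toy_train, toy_test = split_by_reference(toy_ref_ids, seed=3)
```

Splitting at the reference level, rather than over distorted images directly, is what keeps the image content of the two sets disjoint.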

Single dataset evaluations. We first analyze the experimental results on TID2013. Table 2 reports the SROCC of our approach and the compared state-of-the-art methods on the entire TID2013 dataset. Our method outperforms previous methods by a large margin: with all distortion types considered at once, we achieve a 13% relative improvement over the strongest previous method, RankIQA. For individual distortions, due to the normalization operation in the network, the performance on a small number of types, such as intensity shift and change of colour saturation, is lower than that of some methods. We nevertheless achieve the highest accuracy on most distortion types (over 60% of the subsets). Specifically, the significant improvements on distortion types such as #4 (masked noise) and #14 (non-eccentricity pattern noise) quantitatively demonstrate the effectiveness of our hallucinated reference compensation mechanism, and the improvements on types such as #9 (image denoising) and #22 (multiplicative Gaussian noise) verify the capacity of our G component as a single model that effectively hallucinates images under multiple distortions.
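As a quick check of the 13% figure, assuming it is a relative gain computed from the ALL column of Table 2 (0.879 for Ours, 0.780 for RankIQA):

```latex
\[
\frac{\mathrm{SROCC}_{\text{Ours}} - \mathrm{SROCC}_{\text{RankIQA}}}{\mathrm{SROCC}_{\text{RankIQA}}}
= \frac{0.879 - 0.780}{0.780} \approx 0.127 \approx 13\%
\]
```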

Table 3 shows the performance evaluation on the entire LIVE database. Our method outperforms all of the state-of-the-art methods in both SROCC and LCC. Among the compared methods, the three most advanced ones explore different strategies to better leverage the power of DNNs and achieve promising results: BIECON uses FR-IQA methods to generate proxy quality scores, RankIQA synthesizes masses of ranked images to train the network, and PQR takes advantage of a pre-trained Res-50 network. Trained from scratch, our method improves on BIECON by 2%, on PQR by 2% in SROCC and 1% in LCC, and on RankIQA by a slight 0.1%. These observations demonstrate that our mechanisms increase the model capacity effectively from a new perspective.

On the TID2008 dataset, our approach also achieves the highest performance among all state-of-the-art methods, and it likewise reaches the best performance on the CSIQ dataset. To save space, the detailed results and discussion for these two datasets are given in the supplementary material.

We also list the results obtained with the ground-truth reference in the above experiments as theoretical bounds, referred to as “Ours+Oracle”, to further verify the effectiveness and potential of the proposed hallucinated references for NR-IQA. The oracle outperforms all the methods on all datasets by large margins. These results demonstrate the effectiveness of hallucinated information and show the large performance gain that is possible if the hallucinated information is well generated.

Cross-dataset evaluations. We perform two types of cross-dataset evaluations to further verify the merits of our approach. Table 4 shows the results of the cross-dataset test in which the models are trained on the LIVE dataset and tested on the TID2008 dataset. We follow the common experimental setting and test on the subset of TID2008 that contains four distortion types (i.e., JPEG, JPEG2K, WN, and BLUR), and a logistic regression is applied to match the predicted DMOS to the MOS values. The promising results demonstrate the generalization ability of our approach.
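The text does not give the functional form of the logistic mapping. The sketch below assumes a four-parameter logistic, MOS ≈ scale · sigmoid(slope · (DMOS − center)) + offset, which is one common choice in IQA evaluation. To stay dependency-free it picks slope and center from a small grid and solves scale and offset in closed form; a real evaluation would more likely use a nonlinear least-squares solver. All names, grid values, and data here are hypothetical.

```python
import math

def sigmoid(x, slope, center):
    """Standard logistic curve evaluated at x."""
    return 1.0 / (1.0 + math.exp(-slope * (x - center)))

def fit_logistic_mapping(pred, target, slopes, centers):
    """Fit target ~ scale * sigmoid(pred; slope, center) + offset.

    slope and center come from a small grid; for each grid point the
    optimal scale and offset follow from ordinary least squares.
    Returns (scale, offset, slope, center) with the lowest squared error.
    """
    n = len(pred)
    t_mean = sum(target) / n
    best = None
    for slope in slopes:
        for center in centers:
            s = [sigmoid(x, slope, center) for x in pred]
            s_mean = sum(s) / n
            var = sum((v - s_mean) ** 2 for v in s)
            if var == 0.0:
                continue  # flat curve, nothing to fit
            cov = sum((v - s_mean) * (t - t_mean) for v, t in zip(s, target))
            scale = cov / var
            offset = t_mean - scale * s_mean
            err = sum((scale * v + offset - t) ** 2 for v, t in zip(s, target))
            if best is None or err < best[0]:
                best = (err, scale, offset, slope, center)
    return best[1:]

# Hypothetical predicted DMOS values (higher = worse) and toy MOS targets
# (higher = better) generated from a known curve, purely for illustration.
pred_dmos = [10.0, 25.0, 40.0, 55.0, 70.0, 85.0]
toy_mos = [9.0 - 8.0 * sigmoid(x, 0.1, 50.0) for x in pred_dmos]

scale, offset, slope, center = fit_logistic_mapping(
    pred_dmos, toy_mos, slopes=[0.05, 0.1, 0.2], centers=[30.0, 40.0, 50.0, 60.0]
)
mapped = [scale * sigmoid(x, slope, center) + offset for x in pred_dmos]
```

Because the fitted curve is monotonic, it changes LCC but leaves the magnitude of SROCC untouched; the negative scale simply accounts for DMOS and MOS running in opposite directions.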

We also evaluate a by-product of our work: the model can be used in a weakly-supervised manner to handle cross-dataset quality assessment. We train the generator on different datasets and use the LIVE dataset to train the regression network. Table 5 reports the results. “L”, the plain setting, denotes that the hallucination generator is trained on the training set of LIVE. “T08” denotes training the generator on TID2008, and “T08+T13” denotes training it on both TID2008 and TID2013. As more IQA datasets are aggregated to train the generator, the regression network reaches higher SROCC and LCC values, approaching the oracle.

SROCC JP2K JPEG WN BLUR FF ALL

BRISQUE [30] 0.914 0.965 0.979 0.951 0.877 0.940

CORNIA [47] 0.943 0.955 0.976 0.969 0.906 0.942

CNN [17] 0.952 0.977 0.978 0.962 0.908 0.956

SOM [51] 0.947 0.952 0.984 0.976 0.937 0.964

BIECON [20] 0.952 0.974 0.980 0.956 0.923 0.961

RankIQA [25] 0.970 0.978 0.991 0.988 0.954 0.981

PQR [48] - - - - - 0.965

Ours 0.983 0.961 0.984 0.983 0.989 0.982

Ours+Oracle 0.978 0.960 0.993 0.988 0.968 0.983

LCC JP2K JPEG WN BLUR FF ALL

BRISQUE [30] 0.923 0.973 0.985 0.951 0.903 0.942

CORNIA [47] 0.951 0.965 0.987 0.968 0.917 0.935

CNN [17] 0.953 0.981 0.984 0.953 0.933 0.953

SOM [51] 0.952 0.961 0.991 0.974 0.954 0.962

BIECON [20] 0.965 0.987 0.970 0.945 0.931 0.962

RankIQA [25] 0.975 0.986 0.994 0.988 0.960 0.982

PQR [48] - - - - - 0.971

Ours 0.977 0.984 0.993 0.990 0.960 0.982

Ours+Oracle 0.989 0.985 0.997 0.992 0.988 0.989

Table 3: Performance evaluation (both SROCC and LCC) on the entire LIVE database.

Metric CORNIA [47] CNN [17] SOM [51] Ours Ours+Oracle

SROCC 0.892 0.920 0.923 0.934 0.939

LCC 0.880 0.903 0.899 0.917 0.920

Table 4: Cross-dataset evaluation (SROCC and LCC). The models are trained on the LIVE database and tested on the subset of TID2008.

Metric L T08 T08+T13 Ours+Oracle

SROCC 0.982 0.982 0.983 0.983

LCC 0.982 0.985 0.988 0.989

Table 5: SROCC and LCC results on the LIVE database with the generator trained on different datasets.

4.2. Ablation study

To investigate the efficacy of the key components of our model, we conduct ablation experiments on the TID2008 dataset. The overall results are shown in Figure 4. Our baseline model (BL) is a modified Res-18 network that takes only distorted images as inputs, and we analyze each proposed component on top of this baseline by comparing both SROCC and LCC results.

Hallucinated reference compensation. We first evaluate the hallucinated reference compensation mechanism. By adding a holistic hallucination model that provides hallucinated references, paired with the distorted images, as the inputs to the Res-18 network (“BL+HCM”), we obtain an SROCC of 0.859 and an LCC of 0.870, improvements of up to 14% and 8% over the baseline model, respectively.

[Figure 4 is a bar chart with two groups, SROCC and LCC, and a vertical axis from 0.500 to 1.000. Bar labels (SROCC, LCC) in legend order: BL (0.755, 0.800), BL+HCM (0.859, 0.870), BL+HCM+QSL (0.864, 0.887), BL+HCM+QSL+ADV (0.894, 0.910), BL+HCM+QS+QADV (0.910, 0.918), BL+HCM+QS+QADV+HSF (0.941, 0.949).]

Figure 4: Ablation results on the entire TID2008 dataset.

Quality-aware perceptual loss. By adding the feature matching loss w.r.t. quality similarity to the training process of the hallucination model (“BL+HCM+QPL”), our model obtains a further 0.5% improvement in SROCC and 2% in LCC.

Adversarial learning. To explore the effect of the proposed IQA-Discriminative network on quality assessment, we further compare models that use the adversarial learning mechanism under its original definition (“BL+HCM+QPL+ADV”) and under our definition (“BL+HCM+QPL+QADV”). Adding the original adversarial learning mechanism leads to a 3% improvement in SROCC and 3% in LCC, while our definition obtains a further 2% and about 1% improvement in SROCC and LCC, respectively.

Multi-level semantic fusion. We also show the improvements brought by the multi-level semantic fusion mechanism. We fuse the feature maps from stack two of the generator with those of the same size in the quality regression network, and obtain the highest values, an SROCC of 0.941 and an LCC of 0.949.
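The step-by-step gains quoted in this section can be recomputed from the bar labels of Figure 4. The sketch below assumes that the labels pair up as (SROCC, LCC) per configuration, in the order the text introduces them; this pairing matches the two values the text states explicitly (0.859/0.870 and 0.941/0.949). It also uses the "QPL" naming from the text, and `relative_gain` is an illustrative helper, not something defined in the paper.

```python
# (label, SROCC, LCC) triples; values are the bar labels of Figure 4,
# matched to configurations in the order the text introduces them (assumption).
ablation = [
    ("BL", 0.755, 0.800),
    ("BL+HCM", 0.859, 0.870),
    ("BL+HCM+QPL", 0.864, 0.887),
    ("BL+HCM+QPL+ADV", 0.894, 0.910),
    ("BL+HCM+QPL+QADV", 0.910, 0.918),
    ("BL+HCM+QPL+QADV+HSF", 0.941, 0.949),
]

def relative_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return 100.0 * (new - old) / old

# Gain of each configuration over the one listed just before it.
steps = []
for (_, s0, l0), (name, s1, l1) in zip(ablation, ablation[1:]):
    steps.append((name, relative_gain(s1, s0), relative_gain(l1, l0)))

for name, d_srocc, d_lcc in steps:
    print(f"{name}: SROCC +{d_srocc:.1f}%, LCC +{d_lcc:.1f}%")
```

Under this reading, the percentages in the text correspond to relative gains over the preceding configuration, rounded to the nearest reported figure.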

5. Conclusion

In this paper, we propose to address the ill-posed nature of NR-IQA from a new perspective. We introduce a hallucination-guided quality regression network that captures the perceptual discrepancy between the distorted images and the hallucinated images, and therefore predicts precise perceptual quality. We generate the hallucinations with a novel quality-aware generation network, aided by a specially designed IQA-discriminator in an adversarial learning manner. The proposed network does not require any extra annotations or artificial prior knowledge for training and can be trained end-to-end. Extensive experiments demonstrate its superior performance on the NR-IQA task.


References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.

[2] S. Bianco, L. Celona, P. Napoletano, and R. Schettini. On the use of deep learning for blind image quality assessment. CoRR, 2016.

[3] X. Chen, X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.

[4] D. Cho, J. Park, T. Oh, Y. Tai, and I. S. Kweon. Weakly- and self-supervised learning for content-aware deep image retargeting. In ICCV, 2017.

[5] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: object detection via region-based fully convolutional networks. In NIPS, 2016.

[6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

[7] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 2015.

[8] J. Fu, H. Zheng, and T. Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, 2017.

[9] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In NIPS, 2015.

[10] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.

[11] S. A. Golestaneh and L. J. Karam. Reduced-reference quality assessment based on the entropy of DWT coefficients of locally weighted gradient magnitudes. TIP, 25(11):5293–5303, 2016.

[12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial networks. CoRR, 2014.

[13] J. Guo and H. Chao. Building an end-to-end spatial-temporal convolutional network for video super-resolution. In AAAI, 2017.

[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[16] C. Kaae Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszar. Amortised MAP Inference for Image Super-resolution. In ICLR, 2017.

[17] L. Kang, P. Ye, Y. Li, and D. Doermann. Convolutional neural networks for no-reference image quality assessment. In CVPR, 2014.

[18] L. Kang, P. Ye, Y. Li, and D. S. Doermann. Simultaneous estimation of image quality and distortion via multi-task convolutional neural networks. In ICIP, 2015.

[19] J. Kim and S. Lee. Deep learning of human visual sensitivity in image quality assessment framework. In CVPR, 2017.

[20] J. Kim and S. Lee. Fully deep blind image quality predictor. J. Sel. Topics Signal Processing, 11(1):206–220, 2017.

[21] E. C. Larson and D. M. Chandler. Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of Electronic Imaging, 19(1):011006, 2010.

[22] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.

[23] Y. Li, W. Dong, X. Xie, G. Shi, X. Li, and D. Xu. Learning parametric sparse models for image super-resolution. In NIPS, 2016.

[24] Y. Liang, J. Wang, X. Wan, Y. Gong, and N. Zheng. Image quality assessment using similar scene as reference. In ECCV, 2016.

[25] X. Liu, J. van de Weijer, and A. D. Bagdanov. Rankiqa: Learning from rankings for no-reference image quality assessment. In ICCV, 2017.

[26] Y. Liu, J. Yan, and W. Ouyang. Quality aware network for set to set recognition. In CVPR, 2017.

[27] K. Ma, W. Liu, T. Liu, Z. Wang, and D. Tao. dipiq: Blind image quality assessment by learning-to-rank discriminable image pairs. TIP, pages 3951–3964, 2017.

[28] M. Mirza and S. Osindero. Conditional generative adversarial nets. CoRR, 2014.

[29] A. Mittal, A. K. Moorthy, and A. C. Bovik. No-reference image quality assessment in the spatial domain. TIP, pages 4695–4708, 2012.

[30] A. Mittal, A. K. Moorthy, and A. C. Bovik. No-reference image quality assessment in the spatial domain. TIP, pages 4695–4708, 2012.

[31] A. K. Moorthy and A. C. Bovik. Blind image quality assessment: From natural scene statistics to perceptual quality. TIP, 20(12):3350–3364, 2011.

[32] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.

[33] J. Pan, Z. Lin, Z. Su, and M.-H. Yang. Robust kernel estimation with outliers handling for image deblurring. In CVPR, 2016.

[34] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.

[35] N. Ponomarenko, O. Ieremeiev, V. Lukin, K. Egiazarian, L. Jin, J. Astola, B. Vozel, K. Chehdi, M. Carli, and F. Battisti. Color image database tid2013: Peculiarities and preliminary results. In European Workshop on Visual Information Processing, pages 106–111, 2013.

[36] N. Ponomarenko, V. Lukin, A. Zelensky, K. Egiazarian, M. Carli, and F. Battisti. Tid2008 - a database for evaluation of full-reference visual quality assessment metrics. Adv Modern Radioelectron, 10:30–45, 2004.

[37] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, 2015.

[38] M. A. Saad, A. C. Bovik, and C. Charrier. Dct statistics model-based blind image quality assessment. In ICIP, 2011.

[39] M. A. Saad, A. C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. TIP, pages 3339–3352, 2012.


[40] M. S. M. Sajjadi, B. Scholkopf, and M. Hirsch. Enhancenet: Single image super-resolution through automated texture synthesis. In ICCV, 2017.

[41] H. R. Sheikh, M. F. Sabir, and A. C. Bovik. A statistical evaluation of recent full reference image quality assessment algorithms. TIP, 15(11):3440–3451, 2006.

[42] H. Tang, N. Joshi, and A. Kapoor. Blind image quality assessment using semi-supervised rectifier networks. In CVPR, 2014.

[43] S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015.

[44] J. Xu, P. Ye, Q. Li, H. Du, Y. Liu, and D. Doermann. Blind image quality assessment based on high order statistics aggregation. TIP, pages 4444–4457, 2016.

[45] L. Xu, J. Li, W. Lin, Y. Zhang, L. Ma, Y. Fang, and Y. Yan. Multi-task rank learning for image quality assessment. IEEE Trans. Circuits Syst. Video Techn., pages 1833–1843, 2017.

[46] P. Ye, J. Kumar, and D. S. Doermann. Beyond human opinion scores: Blind image quality assessment based on synthetic scores. In CVPR, 2014.

[47] P. Ye, J. Kumar, L. Kang, and D. Doermann. Unsupervised feature learning framework for no-reference image quality assessment. In CVPR, 2012.

[48] H. Zeng, L. Zhang, and A. C. Bovik. A probabilistic quality representation approach to deep blind image quality prediction. CoRR, 2017.

[49] K. Zhang, W. Zuo, S. Gu, and L. Zhang. Learning deep cnn denoiser prior for image restoration. In CVPR, 2017.

[50] L. Zhang and H. Li. Sr-sim: A fast and high performance iqa index based on spectral residual. In ICIP, 2012.

[51] P. Zhang, W. Zhou, L. Wu, and H. Li. Som: Semantic obviousness metric for image quality assessment. In CVPR, 2015.
