Hallucinated-IQA: No-Reference Image Quality Assessment
via Adversarial Learning
Kwan-Yee Lin¹ and Guanxiang Wang²
¹Department of Information Science, School of Mathematical Sciences, Peking University
²Department of Mathematics, School of Mathematical Sciences, Peking University
[email protected]
[email protected]
Abstract
No-reference image quality assessment (NR-IQA) is a fundamental yet challenging task in the low-level computer vision community. The difficulty is particularly pronounced because the available information is limited: the corresponding reference for comparison is typically absent. Although various feature extraction mechanisms, from natural scene statistics to deep neural networks, have been leveraged in previous methods, the performance bottleneck still exists.
In this work, we propose a hallucination-guided quality regression network to address the issue. We first generate a hallucinated reference conditioned on the distorted image to compensate for the absence of the true reference. Then, we pair the information of the hallucinated reference with the distorted image and forward them to the regressor, which learns the perceptual discrepancy under the guidance of an implicit ranking relationship within the generator and thereby produces a precise quality prediction. To demonstrate the effectiveness of our approach, comprehensive experiments are conducted on four popular image quality assessment benchmarks. Our method significantly outperforms all previous state-of-the-art methods by large margins. The code and model are publicly available on the project page https://kwanyeelin.github.io/projects/HIQA/HIQA.html.
1. Introduction
Image quality assessment (IQA) refers to the challenging
task of automatically predicting the perceptual quality of a
distorted image. IQA serves as a key component in the low-
level computer vision community and has a wide range of
applications [13, 26, 49].
IQA algorithms can be classified into three categories: full-reference IQA (FR-IQA) [50, 24, 19], reduced-reference IQA (RR-IQA) [11], and general-purpose no-reference IQA (NR-IQA) [46, 17, 42, 51, 44, 20, 25]. Although FR-IQA and RR-IQA metrics have achieved remarkable results over the decades, they require a corresponding non-distorted reference image for comparison during quality prediction. This precondition makes them infeasible in practical applications, since it is hard, and in most cases impossible, to obtain an ideal reference image. In contrast, NR-IQA, which takes only the distorted image to be assessed as input without any additional information, is more realistic and has therefore received substantial attention in recent years. However, its ill-posed definition makes it highly challenging for NR-IQA to predict image quality well.
[Figure 1 panels (a), (b), (c); column labels: Ground-truth Reference, Distorted Image, Hallucinated Reference, Discrepancy Map]
Figure 1: An illustration of our motivation. The first column is the ground-truth reference image, which is undistorted. The second column shows several commonly occurring kinds of distortion. The third column shows the hallucinated reference images generated by our approach. The fourth column is the discrepancy map, which captures rich information that can be used to guide the learning of the quality regression network toward highly accurate results.
The ill-posed nature of the underdetermined NR-IQA problem is particularly pronounced because information is limited: both the form of the distortion and the corresponding non-distorted reference image are typically absent. This setting is counter-intuitive, since the human visual system (HVS) needs a reference to quantify the perceptual discrepancy, comparing the distorted image either directly with the original undistorted image or implicitly with a hallucinated scene in mind, as demonstrated in Fig. 1(a). The ill-posed definition is the most essential issue of the NR-IQA task and has led to the performance bottleneck over the last decade.
Numerous efforts have been made to ease this problem by designing powerful feature representation models. Traditional methods commonly use manually designed statistical representations and hence lack the diversity and flexibility needed to model the multiple complex distortion types¹ and the large span of image contents (e.g., human, animal, plant, cityscape, transportation) in NR-IQA. In recent years, the promising results of Deep Neural Networks (DNNs) in many computer vision tasks [8, 5, 43] have encouraged researchers to apply their formidable feature representation power to the NR-IQA task. Nevertheless, the extremely limited number of annotated samples in public datasets greatly limits the advantage of DNNs in NR-IQA. To better leverage the power of DNNs, previous works usually rely on sophisticated multi-task and data augmentation strategies that require extra annotated rankings, proxy quality scores, or distortion information. Such information is unavailable in practical NR-IQA applications, so these methods are poorly suited to modeling unknown, multiple distortion types. Some works attempt to transfer general image feature representations from a model pre-trained on ImageNet [6] to quality prediction. However, the weak correlation and similarity between NR-IQA and image classification reduce the effectiveness of transfer learning.
In this work, a Hallucination-Guided Quality Regression Network is proposed to simulate the behaviour of the human visual system (HVS). It makes precise predictions by leveraging the perceptual discrepancy information between the distorted image and a hallucinated reference. As shown in Fig. 2, a high-resolution scene hallucination is first generated from the distorted image. Then, a discrepancy map, which naturally encodes the difference between the distorted image and the hallucinated reference, is obtained to guide the learning of the regression network. With this strong and clearly defined discrepancy information incorporated, the ill-posed nature of NR-IQA can be greatly alleviated. Therefore, even with only common data augmentation, our approach achieves better performance than all of the conventional, more sophisticated methods.
A straightforward way to generate the hallucinated reference is to leverage state-of-the-art image super-resolution [23, 22, 40], blind deblurring [33], or inpainting [34] methods to reconstruct images from the distorted ones. However, an image may be corrupted by multiple unknown distortions, which breaks the basic assumptions² of these related fields, so it is impractical to use them to obtain a reconstructed image that qualifies as the agent reference for the NR-IQA task. To this end, a Quality-Aware Generative Network is proposed to generate the hallucinated reference with a novel quality-aware perceptual term designed specifically for the NR-IQA task at hand.
¹An image can be distorted at any stage of its lifecycle, from acquisition to storage, and may therefore undergo diverse distortions such as noise corruption, compression artifacts, transmission errors, and under-/over-exposure. For more details, please refer to [35].
While the Quality-Aware Generative Network is robust to most distortion types and levels, it remains very challenging for a DCNN-based method to reconstruct high-frequency details with realistic texture when the distorted image lacks structure information, as shown in Fig. 1(c). Since the hallucinated reference is crucial for the final prediction, a bad hallucination will introduce a large bias and therefore drive the regression results to sub-optimal values. We propose to tackle this problem with the following two mechanisms, from low level to high level: (1) We introduce adversarial learning into hallucinated reference generation and quality prediction with a novel IQA-Discriminator that, on the one hand, encourages the generated hallucinated scene to be perceptually hard to distinguish from the true reference images and, on the other hand, constrains at a low semantic level the influence of bad hallucinations on the quality regression network. (2) A novel high-level semantic fusion mechanism is introduced to further reduce the instability of the quality regression network caused by the hallucination model. It exploits the implicit ranking relationship within the hallucination network as guidance to help the regression network adjust the image quality prediction in an adaptive manner. The quality-aware generative network, the hallucination-guided quality regression network, and the IQA-discriminator can be jointly optimised in an end-to-end manner.
Our main contributions are threefold:
• A novel Hallucination-Guided Quality Regression Network is proposed to incorporate perceptual discrepancy information into network learning, which overcomes the ill-posed nature of NR-IQA and significantly improves prediction precision and robustness.
• A Quality-Aware Generative Network together with a quality-aware perceptual loss is proposed, in which both texture feature similarity and quality feature similarity are taken into consideration in a complementary manner to help generate qualified hallucinated references.
• Since the hallucinated reference is crucial for the final prediction, an IQA-Discriminator and an implicit ranking relationship fusion scheme are introduced to better guide the learning of the generator and to suppress the negative influence of poor scene hallucinations on quality regression in a low-level to high-level manner.
²For example, super-resolution methods usually assume that the blur kernel or its form is known.
We evaluate the proposed method on four widely used image quality assessment benchmarks: LIVE [41], CSIQ [21], TID2008 [36], and TID2013 [35]. Our approach shows superior performance over all of the state-of-the-art NR-IQA methods by significant margins. A comprehensive ablation study further demonstrates the effectiveness of each component.
2. Related Work
No-reference Image Quality Assessment. In the NR-IQA literature, besides classic methods [31, 38, 29, 47] and their improved versions [51, 44, 27], significant progress has recently been made by exploring DNNs for better feature representation [17, 18, 45, 20, 25, 48]. For example, Kang et al. [17] introduce a shallow ConvNet to model quality prediction. This approach is refined into a multi-task CNN [18], where the network learns both distortion type and quality score simultaneously. Bianco et al. [2] use a pre-trained DCNN fine-tuned on an IQA dataset to extract features and then map them to IQA scores with an SVR model. Hui et al. [48] also propose to extract features with a pre-trained ResNet [14]. Instead of learning IQA scores directly, they fine-tune the network to learn a probabilistic representation of distorted images. Based on the distortion types and levels in particular datasets, Liu et al. [25] synthesize masses of ranked images to train a Siamese network that learns rankings for NR-IQA. Liang et al. [24] propose to use a non-aligned similar scene as a reference. Kim and Lee [20] apply state-of-the-art FR-IQA methods to generate proxy scores on patches as ground truth for pre-training the model, which is then fine-tuned for NR-IQA.
In this work, we propose a unique approach that addresses the ill-posed problem by compensating for the absent reference information without any extra data annotation or prior knowledge, which makes it more flexible and feasible than other methods.
Generative Adversarial Network. GANs [12] and their variants [37, 1, 28] excel at generating natural images such as human faces [3] and indoor scenes [7]. However, generating high-resolution images (e.g., 256×256) leads GANs to training instability and sometimes nonsensical outputs, as shown in [16]. Since our ultimate goal is NR-IQA, and the performance of the quality regression network is closely related to the output of the generator, we do not apply the original discriminator; instead, we tailor the adversarial learning scheme to image quality assessment by introducing an effective IQA-discriminative network.
3. Our Approach
In this section, we introduce our approach for NR-IQA. An overview of our framework is illustrated in Fig. 2. The model consists of three parts: the quality-aware generative network G, the IQA-discriminative network D, and the hallucination-guided quality regression network R. The generative network produces a hallucinated reference as compensatory information for the distorted image. The discriminative network is trained with G in an adversarial manner to help G produce more qualified results and to constrain the negative effects of bad ones on R. We define the objective discrepancy (i.e., the pixel-wise differences) between a distorted image and the corresponding scene hallucination as the discrepancy map³. The quality regression network takes the distorted images and the corresponding discrepancy maps as inputs and, with the guidance of the implicit ranking relationships in G, exploits the perceptual discrepancy to produce the predicted quality scores as outputs.
3.1. Quality-Aware Generative Network
As mentioned in the previous sections, the function of the hallucinated reference is to compensate for the absence of the true reference image: the smaller the gap between the hallucination and the true reference, the more precisely the quality regression network will perform. Therefore, the aim of G is to generate a high-resolution hallucinated image $I_{sh}$ conditioned on the distorted image $I_d$. To this end, we adopt a stacked hourglass [32] as the baseline of the generative network.
A straightforward way to learn the generating function $G(I_d)$ is to enforce the output of the generator to be close to the true reference both pixel-wise and perception-wise. Therefore, given a set of distorted images $\{I_d^i, i = 1, 2, \ldots, N\}$ and the corresponding true reference images $\{I_r^i, i = 1, 2, \ldots, N\}$, we solve

$$\hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \left( l_p(G_\theta(I_d^i), I_r^i) + l_s(G_\theta(I_d^i), I_r^i) \right), \tag{1}$$

where $l_p$ penalizes the pixel-wise differences between the output and the ground truth with pixel-level error measurements, such as MSE, to generate holistic content; and $l_s$ penalizes the perception-wise differences to achieve sharper local results. We adopt a feature space loss term [9] as the perception constraint, which is defined as

$$l_s(G_\theta(I_d^i), I_r^i) = \| \phi(G_\theta(I_d^i)) - \phi(I_r^i) \|^2, \tag{2}$$
³This is different from the concept of the error map, which is used in FR-IQA to represent the pixel-wise error between the distorted image and the true reference.
[Figure 2: diagram of the three subnets (Quality-Aware Generative Network, Hallucination-Guided Quality Regression Network, IQA-Discriminative Network); legend: Convolution, Residual Unit, Down Sampling, Up Sampling, Fully Connection, Entrywise Sum, Subtraction, Fusion]
Figure 2: An illustration of our proposed Hallucinated-IQA framework. It consists of three strongly related subnets. (a) The Quality-Aware Generative Network generates hallucinated reference images. To obtain high-resolution hallucinated images, a quality-aware loss is introduced into the learning process. (b) The Hallucination-Guided Quality Regression Network incorporates the discrepancy information between the hallucinated image and the distorted image, encoded in the discrepancy map. This discrepancy information, together with high-level semantic fusion from the generative network, supplies the regression network with rich information and greatly guides its learning. (c) Since the hallucinated image is crucial for the final prediction, the IQA-Discriminator is proposed to further refine it.
[Figure 3 columns: Distorted Image | Baseline Generator | Baseline + Quality-Aware Loss | Baseline + Quality-Aware Loss + IQA-GAN | Discrepancy Map]
Figure 3: An illustration of the effectiveness of the quality-aware loss and IQA-GAN. As the quality-aware loss and the IQA-GAN scheme are added, the hallucinated images become progressively clearer and more plausible. The last column shows the discrepancy map obtained from our model, which captures the type and location of the distortion well. The map proves very helpful for our IQA task.
where $\phi(\cdot)$ represents a feature transformation. Intuitively, a pre-trained network such as VGG-19 could be used to calculate the perception term. This is reasonable in most cases because VGG-19 is trained for semantic classification, and the features of its intermediate layers are therefore invariant to noise in the input [4, 10]. Consequently, these layers provide structure and texture information to the generator for inferring more accurate results. However, the invariance property also causes the perception term to ignore the hard cases where the output of the generator still contains a certain degree of distortion, as demonstrated in Fig. 3. To ease this problem, we propose a quality-aware perceptual loss, which dynamically incorporates the features of the deep regression network R. The loss function in equation (2) becomes
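
$$l_s(G_\theta(I_d^i), I_r^i) = \lambda_1 l_v(G_\theta(I_d^i), I_r^i) + \lambda_2 l_q(G_\theta(I_d^i), I_r^i), \tag{3}$$

where

$$l_v = \sum_{c_v=1}^{C_v} \frac{1}{W_j H_j} \sum_{x=1}^{W_j} \sum_{y=1}^{H_j} \| \phi_j(G_\theta(I_d^i))_{x,y} - \phi_j(I_r^i)_{x,y} \|^2, \tag{4}$$

and

$$l_q = \sum_{c_q=1}^{C_q} \frac{1}{W_k H_k} \sum_{x=1}^{W_k} \sum_{y=1}^{H_k} \| \pi_k(G_\theta(I_d^i))_{x,y} - \pi_k(I_r^i)_{x,y} \|^2, \tag{5}$$
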
where $\phi_j(\cdot)$ denotes the feature map at the j-th layer of VGG-19, and $\pi_k(\cdot)$ denotes the feature map at the k-th layer of R; W and H represent the dimensions of the feature map, and C represents the number of feature maps at a particular layer. Since the VGG-19 network and R are trained for different tasks, the kernels within the two networks tend to preserve different information. The activations from the layers of a pre-trained⁴ NR-IQA regression network capture the distortion information of the input, which ensures the quality similarity measurement between the output of G and the ground truth. The activations from the layers of the VGG-19 network ensure the semantic similarity measurement. Based on the respective representation capabilities of the two networks, incorporating both the $l_v$ and $l_q$ losses into the perception term lets them complement each other and therefore jointly helps the generator produce better results.
⁴Note that the pre-trained quality regression model refers to one trained from scratch on an IQA dataset.
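To make Eqs. (3)-(5) concrete, the following is a minimal sketch in plain Python rather than the authors' Caffe implementation. It assumes the feature maps have already been extracted and are passed as nested lists shaped [channels][height][width]. The function names and the default weights `lambda_1 = lambda_2 = 1.0` are hypothetical placeholders; the weights actually used are not stated in this section.

```python
# Sketch of Eqs. (3)-(5): a weighted sum of two feature-space distances.
# Feature maps are nested lists shaped [channels][height][width]. In practice
# they would come from VGG-19 (phi_j) and from the regression network R (pi_k).


def feature_distance(feat_gen, feat_ref):
    """Sum over channels of the spatially averaged squared difference (Eq. 4 / Eq. 5)."""
    total = 0.0
    for chan_gen, chan_ref in zip(feat_gen, feat_ref):
        height, width = len(chan_gen), len(chan_gen[0])
        sq_sum = sum(
            (chan_gen[y][x] - chan_ref[y][x]) ** 2
            for y in range(height)
            for x in range(width)
        )
        total += sq_sum / (width * height)
    return total


def quality_aware_perceptual_loss(vgg_gen, vgg_ref, iqa_gen, iqa_ref,
                                  lambda_1=1.0, lambda_2=1.0):
    """Eq. (3): lambda_1 * l_v + lambda_2 * l_q (default weights are placeholders)."""
    l_v = feature_distance(vgg_gen, vgg_ref)  # semantic similarity term
    l_q = feature_distance(iqa_gen, iqa_ref)  # quality similarity term
    return lambda_1 * l_v + lambda_2 * l_q
```

Keeping the two terms in one helper reflects that Eqs. (4) and (5) share the same form and differ only in which network supplies the features.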
3.2. IQA-Discriminative Network
To ensure that the generator produces perceptually convincing outputs with realistic high-frequency details, especially for samples that seriously lack structure and texture information due to the distortion type (e.g., local block-wise distortions of different intensity, transmission errors) or the distortion level, we introduce an adversarial learning mechanism into our work.
The original manner of adversarial learning is to train G to generate images that fool D, while D is trained to distinguish fake reference images $I_{sh}$ from real reference images $I_r$. However, since GANs are limited by the resolution of the generator, and the distorted images forwarded to a quality network are usually large in order to maintain sufficient contextual information, directly providing $I_{sh}$ as fake images to the discriminator introduces instability into the optimization procedure and sometimes leads to nonsensical results. More importantly, our ultimate goal is to improve the performance of the deep regression network R. Even when G fails to generate high-resolution hallucination images, the scores predicted by R should still be reasonable, so the influence of bad hallucination images on R should be suppressed. We therefore propose an IQA-Discriminator (i.e., D) to ease the above problems by discriminating fake samples from real samples according to their positive or negative influence on R. If G generates a hallucinated reference that helps improve the precision of R, the hallucination is defined as a real sample for D; otherwise it is a fake sample. This can be formulated as
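
$$\max_{\omega} \; \mathbb{E}[\log D_\omega(I_r)] + \mathbb{E}[\log (1 - |D_\omega(G_\theta(I_d)) - d_{fake}|)], \tag{6}$$

where $d_{fake}$ denotes the ground truth influence label with the definition

$$d_{fake}^i = \begin{cases} 1 & \text{if } \|R(I_d^i, I_{sh}^i) - s^i\|_F < \epsilon \\ 0 & \text{if } \|R(I_d^i, I_{sh}^i) - s^i\|_F \ge \epsilon \end{cases}, \tag{7}$$
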
where $s^i$ is the ground truth quality score of $I_d^i$, and $\epsilon$ denotes the threshold parameter. The general idea behind this formulation is to leverage the quality regression loss, an explicit index that directly reflects the impact of G on R, so that D penalizes only samples with negative influence. Therefore, it can also be regarded as a relaxation strategy that stabilizes the adversarial learning process.
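The labelling rule in Eq. (7) and the objective in Eq. (6) can be sketched as follows. This is an illustrative reading under stated assumptions, not the released code: quality scores are treated as scalars, so the Frobenius norm reduces to an absolute value; expectations are replaced by batch means; and the small constant `tiny`, added only to avoid `log(0)`, is not part of the original formulation. All function names are hypothetical.

```python
import math


def influence_label(predicted_score, true_score, epsilon):
    """Eq. (7): 1 if the regression error is below epsilon, otherwise 0."""
    return 1 if abs(predicted_score - true_score) < epsilon else 0


def discriminator_objective(d_real, d_gen, labels, tiny=1e-12):
    """Batch estimate of Eq. (6), the quantity D is trained to maximise.

    d_real: D outputs in (0, 1) for true reference images.
    d_gen:  D outputs in (0, 1) for hallucinated references.
    labels: influence labels d_fake from Eq. (7), one per hallucination.
    """
    real_term = sum(math.log(p + tiny) for p in d_real) / len(d_real)
    gen_term = sum(
        math.log(1.0 - abs(p - d) + tiny) for p, d in zip(d_gen, labels)
    ) / len(d_gen)
    return real_term + gen_term
```

With a label of 0 the second term reduces to the usual GAN term log(1 - D(G(I_d))), while a label of 1 rewards D for scoring a helpful hallucination close to 1.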
Thus, G is eventually optimised to fool the discriminator D by generating a qualified hallucinated scene that is beneficial for R. The adversarial loss of G is formulated as

$$L_{adv} = \mathbb{E}[\log(1 - D_\omega(G_\theta(I_d)))], \tag{8}$$

and the overall loss function of G for all training samples is given by

$$L_G = \mu_1 L_p + \mu_2 L_s + \mu_3 L_{adv}, \tag{9}$$

where $\mu_1$, $\mu_2$ and $\mu_3$ represent the parameters that trade off the three loss components.
3.3. Hallucination-Guided Quality Regression Network
Given the hallucinated scene generated by G, we are able to provide agent references to the quality regression network to compensate for the absence of true reference information. To incorporate the hallucinated reference information effectively, we introduce the concept of the discrepancy map. To further stabilize the optimization procedure of R, a high-level semantic fusion scheme is proposed.
Discrepancy Map. Given a set of distorted images to be assessed, previous CNN-based NR-IQA methods learn a mapping function $R(I_d)$ to predict the quality scores. In contrast, we treat the distorted images and their discrepancy maps as pairs $\{I_d^i, I_{map}^i\}_{i=1}^{N}$ to train a deep regression network by solving

$$\hat{\gamma} = \arg\min_{\gamma} \frac{1}{N} \sum_{i=1}^{N} l_r(R(I_d^i, I_{map}^i), s^i), \tag{10}$$

where $I_{map} = |I_d - G_\theta(I_d)|$ denotes the discrepancy map. The formulation shows that the discrepancy map can effectively be regarded as prior information that tells the network what the distortion looks like.
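The discrepancy map itself is a simple element-wise operation. A minimal sketch, assuming single-channel images stored as nested lists of equal shape (the function name is hypothetical; a real pipeline would apply the same operation per channel on tensors):

```python
def discrepancy_map(distorted, hallucinated):
    """Element-wise |I_d - G(I_d)| for two single-channel images of identical shape."""
    return [
        [abs(d - h) for d, h in zip(row_d, row_h)]
        for row_d, row_h in zip(distorted, hallucinated)
    ]


# Toy 2x2 example: only pixels that differ from the hallucination are non-zero.
distorted = [[0.25, 0.75], [0.5, 0.5]]
hallucinated = [[0.5, 0.75], [0.25, 1.0]]
toy_map = discrepancy_map(distorted, hallucinated)
```

Pixels where the hallucination agrees with the distorted input map to zero, so the non-zero entries localise where the distortion is assumed to be.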
Interestingly, the overall mechanism is mutually reinforcing: during the training stage of R, G is used to produce auxiliary hallucinated references, while during the training stage of G, R is in turn used to help generate better hallucinations. In essence, G and R are mutually correlated and thus can reinforce each other.
High-level Semantic Fusion. As mentioned in previous sections, the precision of R depends greatly on the eligibility of the hallucinated scene. Specifically, a qualified hallucination serving as the agent reference helps R explore the correct perceptual discrepancy of the distorted image, while an unqualified one introduces a large bias to R by improperly narrowing the distortion information. Hence, a constraint scheme is needed to stabilize the quality regression process.
Assume G has been trained. The feature maps after the m-th residual block in the encoder part of its n-th stack are denoted $\{H_{mn}^{c_{mn}}(I_d)\}_{c_{mn}=1}^{C_{mn}}$. We fuse the feature maps after the last encoder residual block of the second stack with the feature maps after the last block of R, which gives the fusion term

$$F = f(H_{5,2}(I_d)) \otimes (R_1(I_d, I_{map})), \tag{11}$$

where $f$ is a linear projection that ensures the dimensions of H and $R_1$ are equal, $R_1$ denotes the feature extraction before the fully connected layers ($R_2$) of R, and $\otimes$ denotes the concatenation operation. Thus, the loss of R can be formulated as

$$L_R = \frac{1}{T} \sum_{t=1}^{T} \| R_2(f(H_{5,2}(I_d)) \otimes (R_1(I_d, I_{map}))) - s^t \|_{\ell_1}. \tag{12}$$

The form of the loss $L_R$ allows the high-level semantic information of G to participate in the optimization procedure of R. As discussed in the introduction, the fusion term F exploits the implicit ranking relationship⁵ within G as guidance to help R adjust the quality prediction in an adaptive manner. Specifically, if G is optimal, the solvers may simply drive the weights of the neurons in $R_2$ that connect to $f$ toward zero to approach identity mappings. Otherwise, the eligibility of the hallucinated scene is essentially a reflection of the quality of the input distorted image, which can be leveraged as guidance to correct the prediction and thereby improve the precision of R in a high-level semantic manner. Meanwhile, the IQA-discriminator can be regarded as a low-level semantic scheme for R, since it encourages G to generate useful hallucination input to R. Therefore, our model has schemes at multiple semantic levels to stabilize the quality regression process.
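The fusion in Eq. (11) and the loss in Eq. (12) can be illustrated with a toy sketch. It makes strong simplifying assumptions that are not from the paper: feature maps are flattened into vectors, the projection `f` is a plain weight matrix, and the fully connected part `R_2` is collapsed into a single linear layer. All names are hypothetical.

```python
def linear_project(vector, weights):
    """f: project a feature vector with a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def fused_prediction(gen_feat, reg_feat, proj_weights, head_weights, head_bias=0.0):
    """R_2(f(H) concat R_1): project, concatenate, then apply a one-layer linear head."""
    fused = linear_project(gen_feat, proj_weights) + reg_feat  # list concatenation
    return sum(w * x for w, x in zip(head_weights, fused)) + head_bias


def l1_regression_loss(predictions, targets):
    """Eq. (12): mean absolute error between predicted and ground truth scores."""
    return sum(abs(p - s) for p, s in zip(predictions, targets)) / len(targets)


gen_feat = [1.0, 2.0, 3.0]            # stands in for H_{5,2}(I_d)
reg_feat = [0.5, 0.5]                 # stands in for R_1(I_d, I_map)
proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Head weights of zero on the projected part ignore the generator features,
# which mirrors the "weights driven toward zero" case described above.
ignoring_g = fused_prediction(gen_feat, reg_feat, proj, [0.0, 0.0, 1.0, 1.0])
using_g = fused_prediction(gen_feat, reg_feat, proj, [1.0, 1.0, 1.0, 1.0])
```

Because the head sees the concatenated vector, training is free to weight the generator features anywhere between zero and full use, which is the adaptive behaviour the text describes.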
3.4. Training Strategy
Since all of the operations in G and R are differentiable, these two sub-networks can be trained in an end-to-end manner. To better optimize generation and quality regression in a mutually reinforcing way, we adopt an alternating training strategy in practice. Please refer to the supplementary material, where Algorithm 1 gives pseudocode for the whole training process of our approach.
3.5. Weakly-Supervised Quality Assessment
In this section, we discuss some extensions to further un-
cover the potential of our framework.
To advance the development of the IQA task, various benchmarks have been released in recent years. However, a significant issue has followed. As shown in Table 1, there are huge gaps in the definition of quality, the distortion types, and the distortion levels among the datasets. Since NR-IQA models are commonly trained on one specific dataset, these gaps easily cause the models to over-fit and to lack generalization ability. Learning across datasets is one way to ease the problem. Previous methods usually transfer the definition of quality scores through non-linear mappings learned from the distributions of the datasets, which may introduce bias into the models.
⁵G serves not only as a generator but also as an encoder-decoder mechanism. Thus, the difference information between images distorted to different degrees is encoded compactly at the end of the encoder part. We refer to this “difference information” as the “implicit ranking relationship” of distorted images in this work.
In contrast, as a by-product of our work, the hallucinated scene can be regarded as a universal medium among different datasets that helps the training process on a particular one without losing precision, since the hallucination is conditioned only on the distorted image and serves as the fundamental agent reference information of image quality. Meanwhile, the detachable training process of our framework allows the R used when training G to differ from the final quality regression model. Consequently, once a hallucination generator has been trained, either on one specific dataset with multiple complex distortions or on multiple datasets at once, it can be used to help the training process on any other dataset as a plug-and-play module in a weakly-supervised manner. Moreover, the module could also be used as a data augmentation or initialization mechanism without any extra annotation or artificial prior knowledge. We evaluate the above discussion in Sec. 4.1.
3.6. Implementation Details
All training samples are 256 × 256 pixel patches randomly sampled from the original images. A common data augmentation is then performed with random rotation (±20°) and flipping. We train our models with Caffe [15] on Titan X GPUs with a mini-batch size of 32, and all of them are trained from scratch. Stochastic gradient descent (SGD) is used to optimise the networks, with an initial learning rate of $10^{-5}$ for the generation network and $10^{-2}$ for the regression network, dropped by a factor of 0.1 every 20K iterations. The weight decay is 0.0005, and the momentum is 0.9. During testing, we extract overlapping image patches at a fixed stride from each test image and simply average all predicted scores as the final whole-image quality score.
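The test-time procedure and the step learning-rate schedule described above can be sketched as follows. The patch size of 256 and the schedule constants come from the text; the stride of 128 is purely an illustrative assumption, since the actual stride is not given, and the function names are hypothetical.

```python
def patch_origins(height, width, patch=256, stride=128):
    """Top-left corners of overlapping patches that fit entirely inside the image.

    stride=128 is an assumed value. Border regions not reached by the stride are
    skipped, and an image smaller than the patch yields an empty list.
    """
    ys = range(0, height - patch + 1, stride)
    xs = range(0, width - patch + 1, stride)
    return [(y, x) for y in ys for x in xs]


def image_score(patch_scores):
    """Whole-image quality score as the plain average of per-patch predictions."""
    return sum(patch_scores) / len(patch_scores)


def step_learning_rate(base_lr, iteration, gamma=0.1, step=20000):
    """Learning rate dropped by a factor of gamma every `step` iterations."""
    return base_lr * gamma ** (iteration // step)


origins = patch_origins(512, 512)
```

For a 512 × 512 image with the assumed stride, this yields a 3 × 3 grid of patch positions whose predicted scores would then be averaged by `image_score`.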
Databases    # of Ref. Images    # of Dist. Images    # of Dist. Types    Score Type    Score Range
LIVE         29                  779                  5                   DMOS          [1, 100]
CSIQ         30                  866                  6                   DMOS          [0, 1]
TID2008      25                  1700                 17                  MOS           [0, 9]
TID2013      25                  3000                 24                  MOS           [0, 9]
Table 1: Summary of the databases evaluated in the experiments.
4. Experiments
Datasets. We perform experiments on four widely used benchmark datasets: LIVE [41], CSIQ [21], TID2008 [36], and TID2013 [35]. Detailed information is summarized in Table 1.
Evaluation Metrics. Following most previous works,
two evaluation criteria are adopted in our paper: the Spear-
man’s Rank Order Correlation Coefficient (SROCC) and the
Linear Correlation Coefficient (LCC). SROCC is a measure
of the monotonic relationship between the ground-truth and
model prediction. LCC is a measure of the linear correlation
between the ground-truth and model prediction. The de-
tailed definitions are formulated in the supplementary ma-
terial.
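For readers without access to the supplementary material, the two criteria can be sketched using their standard textbook definitions: LCC as the Pearson correlation, and SROCC as the Pearson correlation of the ranks with ties sharing an average rank. This is an assumption about the exact formulation used; the function names are hypothetical, and the code divides by zero if either input is constant.

```python
import math


def pearson(xs, ys):
    """Linear correlation coefficient (LCC) between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def srocc(xs, ys):
    """Spearman rank order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


ground_truth = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.0, 4.0, 9.0, 16.0, 25.0]  # monotonic but non-linear in ground_truth
```

In this toy example the predictions preserve the ordering exactly, so SROCC is 1 while LCC falls below 1, which illustrates why the two criteria are reported together.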
Methods # 1 # 2 # 3 # 4 # 5 # 6 # 7 # 8 # 9 # 10 # 11 # 12 # 13
BLIINDS-II [39] 0.714 0.728 0.825 0.358 0.852 0.664 0.780 0.852 0.754 0.808 0.862 0.251 0.755
CORNIA-10K [47] 0.341 -0.196 0.689 0.184 0.607 -0.014 0.673 0.896 0.787 0.875 0.911 0.310 0.625
HOSA [44] 0.853 0.625 0.782 0.368 0.905 0.775 0.810 0.892 0.870 0.893 0.932 0.747 0.701
RankIQA [25] 0.667 0.620 0.821 0.365 0.760 0.736 0.783 0.809 0.767 0.866 0.878 0.704 0.810
Ours 0.923 0.880 0.945 0.673 0.955 0.810 0.855 0.832 0.957 0.914 0.624 0.460 0.782
Ours+Oracle 0.952 0.890 0.976 0.831 0.931 0.773 0.898 0.812 0.910 0.929 0.735 0.638 0.739
Methods # 14 # 15 # 16 # 17 # 18 # 19 # 20 # 21 # 22 # 23 # 24 ALL
BLIINDS-II [39] 0.081 0.371 0.159 -0.082 0.109 0.699 0.222 0.451 0.815 0.568 0.856 0.550
CORNIA-10K [47] 0.161 0.096 0.008 0.423 -0.055 0.259 0.606 0.555 0.592 0.759 0.903 0.651
HOSA [44] 0.199 0.327 0.233 0.294 0.119 0.782 0.532 0.835 0.855 0.801 0.905 0.728
RankIQA [25] 0.512 0.622 0.268 0.613 0.662 0.619 0.644 0.800 0.779 0.629 0.859 0.780
Ours 0.664 0.122 0.182 0.376 0.156 0.850 0.614 0.852 0.911 0.381 0.616 0.879
Ours+Oracle 0.834 0.457 0.823 0.850 0.539 0.893 0.695 0.859 0.910 0.655 0.712 0.935
Table 2: Performance evaluation (SROCC) on the entire TID2013 database.
4.1. Comparisons with the state-of-the-arts
To validate our approach, we conduct extensive evaluations in which ten state-of-the-art NR-IQA methods are compared. We follow the experimental protocol used in the three most recent algorithms (i.e., HOSA [44], BIECON [20], and RankIQA [25]), where the reference images are randomly divided into two subsets, 80% for training and 20% for testing, and the corresponding distorted images are divided in the same way to ensure that there is no overlap in image content between the two sets. All experiments are repeated over ten random train-test splits, and the median SROCC and LCC values are reported as the final statistics.
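The splitting protocol above can be sketched as follows. The 80/20 ratio, the split by reference image, and the median over ten splits come from the text; the seeding scheme, the function names, and the `evaluate` callback are hypothetical stand-ins for a real train-and-test run.

```python
import random
import statistics


def split_by_reference(ref_ids, train_fraction=0.8, seed=0):
    """Randomly split reference image ids into disjoint train and test sets."""
    rng = random.Random(seed)
    shuffled = list(ref_ids)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return set(shuffled[:cut]), set(shuffled[cut:])


def assign_distorted(distorted_to_ref, train_refs):
    """Place every distorted image in the same subset as its reference image."""
    train = [name for name, ref in distorted_to_ref.items() if ref in train_refs]
    test = [name for name, ref in distorted_to_ref.items() if ref not in train_refs]
    return train, test


def median_over_splits(evaluate, ref_ids, num_splits=10):
    """Median of evaluate(train_refs, test_refs) over num_splits random splits."""
    scores = [evaluate(*split_by_reference(ref_ids, seed=s)) for s in range(num_splits)]
    return statistics.median(scores)


# Toy data: 25 references with 3 distorted versions each.
refs = list(range(25))
distorted_to_ref = {"img%d_%d" % (r, k): r for r in refs for k in range(3)}
train_refs, test_refs = split_by_reference(refs)
train_imgs, test_imgs = assign_distorted(distorted_to_ref, train_refs)
```

Splitting on reference ids first, and only then assigning distorted images, is what guarantees that no image content appears in both subsets.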
Single dataset evaluations. We first analyze the experimental results on TID2013. The SROCC values of our approach and the compared state-of-the-art methods on the entire TID2013 dataset are reported in Table 2. Our method outperforms previous methods by a large margin: we achieve a 13% relative improvement over the strongest prior method, RankIQA, on the entire dataset with all distortion types considered at once. For individual distortions, due to the normalization operation in the network, the performance on a small number of types, such as intensity shift and change of colour saturation, is lower than that of some methods. Nevertheless, we achieve the highest accuracy on most distortion types (over 60% of the subsets). Specifically, the significant improvements on distortion types such as #4 (masked noise) and #14 (non-eccentricity pattern noise) quantitatively demonstrate the effectiveness of our hallucinated reference compensation mechanism, and the improvements on types such as #9 (image denoising) and #22 (multiplicative Gaussian noise) verify the capacity of our G component as a single model that effectively hallucinates images under multiple distortions.
Table 3 shows the performance evaluation on the entire LIVE
database. On the full database, our method matches or
outperforms all of the state-of-the-art methods on both SROCC
and LCC. Among the compared methods, the three most recent ones
explore different strategies to better leverage the power of
DNNs and achieve promising results: BIECON uses FR-IQA methods
to generate proxy quality scores, RankIQA synthesizes masses of
ranked images to train the network, and PQR takes advantage of a
pre-trained Res-50 network. Our method achieves a 2% improvement
over BIECON, 2% SROCC and 1% LCC improvements over PQR, and a
slight 0.1% SROCC improvement over RankIQA, while being trained
from scratch. These observations demonstrate that our mechanisms
increase the model capacity effectively from a new perspective.
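The percentages quoted for TID2013 and LIVE are consistent with reading them as relative gains over the "ALL" columns of Tables 2 and 3. The worked arithmetic below is a sketch under that assumption; the symbols $s_{\mathrm{ours}}$ and $s_{\mathrm{other}}$ are notation introduced here for the two correlation scores being compared.

```latex
\[
\Delta_{\mathrm{rel}} = \frac{s_{\mathrm{ours}} - s_{\mathrm{other}}}{s_{\mathrm{other}}}
\]
\begin{align*}
\text{TID2013, RankIQA (SROCC):} \quad & \frac{0.879 - 0.780}{0.780} \approx 12.7\% \\
\text{LIVE, BIECON (SROCC):} \quad & \frac{0.982 - 0.961}{0.961} \approx 2.2\% \\
\text{LIVE, PQR (SROCC):} \quad & \frac{0.982 - 0.965}{0.965} \approx 1.8\% \\
\text{LIVE, PQR (LCC):} \quad & \frac{0.982 - 0.971}{0.971} \approx 1.1\% \\
\text{LIVE, RankIQA (SROCC):} \quad & \frac{0.982 - 0.981}{0.981} \approx 0.1\%
\end{align*}
```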
On the TID2008 dataset, our approach also achieves the highest
performance among all of the state-of-the-art methods, and it
reaches the best performance on the CSIQ dataset as well. To
save space, the detailed results and discussion for these two
datasets are given in the supplementary material.
We also list the results obtained with the ground-truth
reference in the above experiments as theoretical bounds,
referred to as “Ours+Oracle”, to further verify the
effectiveness and potential of the proposed hallucinated
references for NR-IQA. The oracle outperforms all the methods on
all datasets by large margins. These results demonstrate the
effectiveness of the hallucinated information and show that a
large performance gain remains available if the hallucinated
information can be generated well.
Cross-dataset evaluations. Here, we perform two types of
cross-dataset evaluations to further verify the merits of our
approach. Table 4 shows the results of the cross-dataset test,
where the models are trained on the LIVE dataset and tested on
the TID2008 dataset. We follow the common experimental setting
and test on the subset of TID2008 that contains four distortion
types (i.e., JPEG, JPEG2K, WN, and BLUR), and a logistic
regression is applied to map the predicted DMOS to MOS values.
The promising results demonstrate the generalization ability of
our approach.
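The text does not state which logistic function is used for the DMOS-to-MOS mapping. As an illustration only, one form commonly used in the IQA literature is the four-parameter logistic below, where $q$ is the predicted score and $\beta_1, \ldots, \beta_4$ are fitted on the target dataset (all symbols here are assumptions, not taken from the paper).

```latex
\[
\widehat{\mathrm{MOS}}(q) = \beta_2 + \frac{\beta_1 - \beta_2}{1 + \exp\!\left(-\dfrac{q - \beta_3}{|\beta_4|}\right)}
\]
```

Because DMOS falls as quality rises while MOS rises, the fit can absorb the inversion through $\beta_1 < \beta_2$, which makes the map decreasing. Any such monotonic mapping leaves the magnitude of SROCC unchanged and mainly affects LCC.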
To evaluate a by-product of our work, namely that the model can
be leveraged in a weakly-supervised manner to handle
cross-dataset quality assessment, we train the generator on
different datasets and use the LIVE dataset to train the
regression network. Table 5 reports the results. “L”, the plain
setting of the experiment, denotes that the hallucination
generator is trained on the training set of LIVE. “T08” denotes
training the generator on TID2008, and “T08+T13” denotes
training it on both TID2008 and TID2013. It can be observed that
with more IQA datasets aggregated for training the generator,
the regression network reaches higher SROCC and LCC values and
approaches the oracle.
SROCC JP2K JPEG WN BLUR FF ALL
BRISQUE [30] 0.914 0.965 0.979 0.951 0.877 0.940
CORNIA [47] 0.943 0.955 0.976 0.969 0.906 0.942
CNN [17] 0.952 0.977 0.978 0.962 0.908 0.956
SOM [51] 0.947 0.952 0.984 0.976 0.937 0.964
BIECON [20] 0.952 0.974 0.980 0.956 0.923 0.961
RankIQA [25] 0.970 0.978 0.991 0.988 0.954 0.981
PQR [48] - - - - - 0.965
Ours 0.983 0.961 0.984 0.983 0.989 0.982
Ours+Oracle 0.978 0.960 0.993 0.988 0.968 0.983
LCC JP2K JPEG WN BLUR FF ALL
BRISQUE [30] 0.923 0.973 0.985 0.951 0.903 0.942
CORNIA [47] 0.951 0.965 0.987 0.968 0.917 0.935
CNN [17] 0.953 0.981 0.984 0.953 0.933 0.953
SOM [51] 0.952 0.961 0.991 0.974 0.954 0.962
BIECON [20] 0.965 0.987 0.970 0.945 0.931 0.962
RankIQA [25] 0.975 0.986 0.994 0.988 0.960 0.982
PQR [48] - - - - - 0.971
Ours 0.977 0.984 0.993 0.990 0.960 0.982
Ours+Oracle 0.989 0.985 0.997 0.992 0.988 0.989
Table 3: Performance evaluation (both SROCC and LCC) on the
entire LIVE database.
Metric CORNIA [47] CNN [17] SOM [51] Ours Ours+Oracle
SROCC 0.892 0.920 0.923 0.934 0.939
LCC 0.880 0.903 0.899 0.917 0.920
Table 4: Cross-dataset evaluation (SROCC and LCC). The models are
trained on the LIVE database and tested on the subset of TID2008.
Metric L T08 T08+T13 Ours+Oracle
SROCC 0.982 0.982 0.983 0.983
LCC 0.982 0.985 0.988 0.989
Table 5: SROCC and LCC results of models on the LIVE
database with training generator on different datasets.
4.2. Ablation study
To investigate the efficacy of the key components of our model,
we conduct ablation experiments on the TID2008 dataset. The
overall results are shown in Figure 4. We use a modified Res-18
network that takes only distorted images as inputs as our
baseline model (BL), and analyze each proposed component on top
of it by comparing both SROCC and LCC results.
Hallucinated reference compensation. We first eval-
uate the hallucinated reference compensation mechanism.
Configuration SROCC LCC
BL 0.755 0.800
BL+HCM 0.859 0.870
BL+HCM+QSL 0.864 0.887
BL+HCM+QSL+ADV 0.894 0.910
BL+HCM+QS+QADV 0.910 0.918
BL+HCM+QS+QADV+HSF 0.941 0.949
Figure 4: Ablation results on the entire TID2008 dataset (bar
chart of SROCC and LCC; vertical axis from 0.500 to 1.000).
By adding a holistic hallucination model that provides
hallucinated references, paired with the distorted images as
inputs to the Res-18 network (“BL+HCM”), we obtain an SROCC of
0.859 and an LCC of 0.870, improvements of up to 14% and 8%
over the baseline model, respectively.
Quality-aware perceptual loss. By adding the feature matching
loss w.r.t. quality similarity to the training process of the
hallucination model (“BL+HCM+QPL”), our model obtains a further
0.5% improvement on SROCC and 2% on LCC.
Adversarial learning. To explore the effect of the proposed
IQA-Discriminative network on quality assessment, we further
compare models that use the adversarial learning mechanism under
its original definition (“BL+HCM+QPL+ADV”) and under our
definition (“BL+HCM+QPL+QADV”). Adding the original adversarial
learning mechanism leads to a 3% improvement on both SROCC and
LCC, while our definition obtains a further 2% and about 1%
improvement on SROCC and LCC, respectively.
Multi-level semantic fusion. We also show the improvements
brought by the multi-level semantic fusion mechanism. We fuse
the feature maps of the generator from stack two with those of
the same size in the quality regression network, and obtain the
highest values, 0.941 SROCC and 0.949 LCC.
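The step-by-step gains quoted in this section can be re-derived from the bar values of Figure 4. The sketch below assumes that the twelve plotted values are SROCC/LCC pairs listed per configuration, which agrees with the 0.859/0.870 and 0.941/0.949 values stated in the text; the configuration labels follow the running text, and the helper names are illustrative.

```python
# (name, SROCC, LCC) per configuration, as read from Figure 4.
# Assumption: the plotted values are ordered as SROCC/LCC pairs.
ABLATION = [
    ("BL", 0.755, 0.800),
    ("BL+HCM", 0.859, 0.870),
    ("BL+HCM+QPL", 0.864, 0.887),
    ("BL+HCM+QPL+ADV", 0.894, 0.910),
    ("BL+HCM+QPL+QADV", 0.910, 0.918),
    ("BL+HCM+QPL+QADV+HSF", 0.941, 0.949),
]


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


def stepwise_gains(rows):
    """Gain of each configuration over the one listed before it."""
    gains = []
    for (_, s0, l0), (name, s1, l1) in zip(rows, rows[1:]):
        gains.append((name, relative_gain(s1, s0), relative_gain(l1, l0)))
    return gains


if __name__ == "__main__":
    for name, d_srocc, d_lcc in stepwise_gains(ABLATION):
        print(f"{name:<22} SROCC +{d_srocc:.1f}%  LCC +{d_lcc:.1f}%")
```

Under this reading, the first step (adding HCM) accounts for the largest share of the total gain over the baseline.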
5. Conclusion
In this paper, we propose to address the ill-posed nature of
NR-IQA from a new perspective. We introduce a
hallucination-guided quality regression network that captures
the perceptual discrepancy between the distorted images and the
hallucinated images, and therefore predicts perceptual quality
precisely. We generate the hallucinations with a novel
quality-aware generation network, aided by a specially designed
IQA-discriminator, in an adversarial learning manner. The
proposed network does not require any extra annotations or
artificial prior knowledge for training and can be trained
end-to-end. Extensive experiments demonstrate its superior
performance on the NR-IQA task.
References
[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gener-
ative adversarial networks. In ICML, 2017.
[2] S. Bianco, L. Celona, P. Napoletano, and R. Schettini. On
the use of deep learning for blind image quality assessment.
CoRR, 2016.
[3] X. Chen, Y. Duan, R. Houthooft, J. Schulman,
I. Sutskever, and P. Abbeel. InfoGAN: Interpretable
representation learning by information maximizing generative
adversarial nets. In NIPS, 2016.
[4] D. Cho, J. Park, T. Oh, Y. Tai, and I. S. Kweon. Weakly-
and self-supervised learning for content-aware deep image
retargeting. In ICCV, 2017.
[5] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: object detection via
region-based fully convolutional networks. In NIPS, 2016.
[6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li.
ImageNet: A large-scale hierarchical image database. In CVPR,
2009.
[7] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep
generative image models using a Laplacian pyramid of
adversarial networks. In NIPS, 2015.
[8] J. Fu, H. Zheng, and T. Mei. Look closer to see better: Recur-
rent attention convolutional neural network for fine-grained
image recognition. In CVPR, 2017.
[9] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis
using convolutional neural networks. In NIPS, 2015.
[10] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer
using convolutional neural networks. In CVPR, 2016.
[11] S. A. Golestaneh and L. J. Karam. Reduced-reference qual-
ity assessment based on the entropy of DWT coefficients of
locally weighted gradient magnitudes. TIP, 25(11):5293–
5303, 2016.
[12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu,
D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio.
Generative adversarial networks. CoRR, 2014.
[13] J. Guo and H. Chao. Building an end-to-end spatial-temporal
convolutional network for video super-resolution. In AAAI,
2017.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
for image recognition. In CVPR, 2016.
[15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir-
shick, S. Guadarrama, and T. Darrell. Caffe: Convolu-
tional architecture for fast feature embedding. arXiv preprint
arXiv:1408.5093, 2014.
[16] C. Kaae Sønderby, J. Caballero, L. Theis, W. Shi, and
F. Huszar. Amortised MAP Inference for Image Super-
resolution. ICLR, 2017.
[17] L. Kang, P. Ye, Y. Li, and D. Doermann. Convolutional neu-
ral networks for no-reference image quality assessment. In
CVPR, 2014.
[18] L. Kang, P. Ye, Y. Li, and D. S. Doermann. Simultaneous es-
timation of image quality and distortion via multi-task con-
volutional neural networks. In ICIP, 2015.
[19] J. Kim and S. Lee. Deep learning of human visual sensitivity
in image quality assessment framework. In CVPR, 2017.
[20] J. Kim and S. Lee. Fully deep blind image quality predictor.
J. Sel. Topics Signal Processing, 11(1):206–220, 2017.
[21] E. C. Larson and D. M. Chandler. Most apparent distortion:
full-reference image quality assessment and the role of strat-
egy. Journal of Electronic Imaging, 19(1):011006, 2010.
[22] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunning-
ham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and
W. Shi. Photo-realistic single image super-resolution using a
generative adversarial network. In CVPR, 2017.
[23] Y. Li, W. Dong, X. Xie, G. Shi, X. Li, and D. Xu.
Learning parametric sparse models for image super-resolution.
In NIPS, 2016.
[24] Y. Liang, J. Wang, X. Wan, Y. Gong, and N. Zheng. Im-
age quality assessment using similar scene as reference. In
ECCV, 2016.
[25] X. Liu, J. van de Weijer, and A. D. Bagdanov. RankIQA:
Learning from rankings for no-reference image quality
assessment. In ICCV, 2017.
[26] Y. Liu, J. Yan, and W. Ouyang. Quality aware network for
set to set recognition. In CVPR, 2017.
[27] K. Ma, W. Liu, T. Liu, Z. Wang, and D. Tao. dipIQ: Blind
image quality assessment by learning-to-rank discriminable
image pairs. TIP, pages 3951–3964, 2017.
[28] M. Mirza and S. Osindero. Conditional generative adversar-
ial nets. CoRR, 2014.
[29] A. Mittal, A. K. Moorthy, and A. C. Bovik. No-reference
image quality assessment in the spatial domain. TIP, pages
4695–4708, 2012.
[30] A. Mittal, A. K. Moorthy, and A. C. Bovik. No-reference
image quality assessment in the spatial domain. TIP, pages
4695–4708, 2012.
[31] A. K. Moorthy and A. C. Bovik. Blind image quality as-
sessment: From natural scene statistics to perceptual quality.
TIP, 20(12):3350–3364, 2011.
[32] A. Newell, K. Yang, and J. Deng. Stacked hourglass net-
works for human pose estimation. In ECCV, 2016.
[33] J. Pan, Z. Lin, Z. Su, and M.-H. Yang. Robust kernel estima-
tion with outliers handling for image deblurring. In CVPR,
2016.
[34] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A.
Efros. Context encoders: Feature learning by inpainting. In
CVPR, 2016.
[35] N. Ponomarenko, O. Ieremeiev, V. Lukin, K. Egiazarian,
L. Jin, J. Astola, B. Vozel, K. Chehdi, M. Carli, and
F. Battisti. Color image database TID2013: Peculiarities and
preliminary results. In European Workshop on Visual Information
Processing, pages 106–111, 2013.
[36] N. Ponomarenko, V. Lukin, A. Zelensky, K. Egiazarian,
M. Carli, and F. Battisti. TID2008 - a database for evaluation
of full-reference visual quality assessment metrics. Adv
Modern Radioelectron, 10:30–45, 2009.
[37] A. Radford, L. Metz, and S. Chintala. Unsupervised repre-
sentation learning with deep convolutional generative adver-
sarial networks. CoRR, 2015.
[38] M. A. Saad, A. C. Bovik, and C. Charrier. DCT statistics
model-based blind image quality assessment. In ICIP, 2011.
[39] M. A. Saad, A. C. Bovik, and C. Charrier. Blind image
quality assessment: A natural scene statistics approach in the
DCT domain. TIP, pages 3339–3352, 2012.
[40] M. S. M. Sajjadi, B. Scholkopf, and M. Hirsch. Enhancenet:
Single image super-resolution through automated texture
synthesis. In ICCV, 2017.
[41] H. R. Sheikh, M. F. Sabir, and A. C. Bovik. A statistical
evaluation of recent full reference image quality assessment
algorithms. TIP, 15(11):3440–3451, 2006.
[42] H. Tang, N. Joshi, and A. Kapoor. Blind image quality as-
sessment using semi-supervised rectifier networks. In CVPR,
2014.
[43] S. Xie and Z. Tu. Holistically-nested edge detection. In
ICCV, 2015.
[44] J. Xu, P. Ye, Q. Li, H. Du, Y. Liu, and D. Doermann. Blind
image quality assessment based on high order statistics ag-
gregation. TIP, pages 4444–4457, 2016.
[45] L. Xu, J. Li, W. Lin, Y. Zhang, L. Ma, Y. Fang, and Y. Yan.
Multi-task rank learning for image quality assessment. IEEE
Trans. Circuits Syst. Video Techn., pages 1833–1843, 2017.
[46] P. Ye, J. Kumar, and D. S. Doermann. Beyond human opin-
ion scores: Blind image quality assessment based on syn-
thetic scores. In CVPR, 2014.
[47] P. Ye, J. Kumar, L. Kang, and D. Doermann. Unsupervised
feature learning framework for no-reference image quality
assessment. In CVPR, 2012.
[48] H. Zeng, L. Zhang, and A. C. Bovik. A probabilistic quality
representation approach to deep blind image quality predic-
tion. CoRR, 2017.
[49] K. Zhang, W. Zuo, S. Gu, and L. Zhang. Learning deep CNN
denoiser prior for image restoration. In CVPR, 2017.
[50] L. Zhang and H. Li. SR-SIM: A fast and high performance IQA
index based on spectral residual. In ICIP, 2012.
[51] P. Zhang, W. Zhou, L. Wu, and H. Li. SOM: Semantic
obviousness metric for image quality assessment. In CVPR,
2015.