Image Quality Scaling for Electrophotographic Prints
Garrett M. Johnson, Rohit Patil, Ethan D. Montag, and Mark D. Fairchild
Munsell Color Science Laboratory, Chester F. Carlson Center for Imaging Science,
Rochester Institute of Technology, Rochester, NY, USA 14623-5604
ABSTRACT
Two psychophysical experiments were performed scaling overall image quality of black-and-white electrophotographic
(EP) images. Six different printers were used to generate the images. There were six different scenes included in the
experiment, representing photographs, business graphics, and test-targets. The two experiments were split into a paired-
comparison experiment examining overall image quality, and a triad experiment judging overall similarity and
dissimilarity of the printed images. The paired-comparison experiment was analyzed using Thurstone’s Law, to generate
an interval scale of quality, and with dual scaling, to determine the independent dimensions used for categorical scaling.
The triad experiment was analyzed using multidimensional scaling to generate a psychological stimulus space. The
psychophysical results indicated that the image quality was judged mainly along one dimension and that the
relationships among the images can be described with a single dimension in most cases. Regression of various physical
measurements of the images to the paired comparison results showed that a small number of physical attributes of the
images could be correlated with the psychophysical scale of image quality. However, global image difference metrics
did not correlate well with image quality.
Keywords: Image quality, image measurement, psychophysics, vision modeling
1. INTRODUCTION
There are many factors that can influence the perceived quality of an image. These factors include, but are not limited to,
the tone reproduction, noise, sharpness, graininess, and contrast of the image. To add confusion into the mix, the
perceived quality of an image is often mistaken for the “quality” of an imaging system. The parameters of an imaging
system certainly can affect the output quality, and are often measured and supplied as metrics for overall image quality.
These parameters might include system modulation transfer functions (MTF), pixel addressability, paper brightness, and
system gamut. Engeldrum1 gives an excellent overview of the confounding terminology that is often used when
describing image quality.
Image quality becomes an even trickier problem when trying to determine scales of quality among a series of images.
Most people can be handed an image and are able to come up with a mental judgment as to whether the image is good or
bad. This task becomes increasingly difficult when they are handed two (or more) images and asked to determine not
only whether each image is good or bad, but also its relationship with the others. This concept is known as image quality
scaling. Do people use the same perceptual clues to determine general quality of a single image as they use when
determining the relationships among images? If so, what are the most important perceptions that govern overall image
quality?
These questions are typically studied by designing an experiment that alters a single perception, or “ness” from the
Engeldrum framework, and determining the effect of that “ness” on overall image quality. Examples might include
scaling contrast to determine the relationship between contrast and quality. This type of scaling can be performed on
many individual perceptions, and then combined to form an overall image quality metric. Keelan has described methods
and techniques for this type of systematic image quality modeling in detail.2 In this particular experiment we have taken
the opposite approach. An experiment was performed on a series of images with no a priori knowledge of the
differences between the images. We then used several psychological scaling techniques in an attempt to determine what
the observers were using to judge image quality.3 A series of measurements from the images themselves, as well as from
test targets generated from the same imaging systems, were then correlated against the psychological scales of image
quality. From these measurements we attempt to predict the results of the psychophysical experiments.
www.cis.rit.edu/mcsl
2. EXPERIMENTAL DESIGN
Two psychophysical experiments were performed to scale the image quality of black-and-white electrophotographic
prints. The prints were generated on six different commercially available high-speed printers (Printers A-F). All of the
printers were black-and-white printers rated at 600 dpi. Six scenes were chosen: three pictorial images, two business-type
graphics, and a printer test-target. The six scenes are shown in Figure 1 and are referred to as Candle, Flower, Building,
Arch, Business, and Target. The two business graphics were example images from Adobe Illustrator, and were vector
graphics converted to EPS files. The test target was also a vector EPS, combined with a bitmap image. The photographic
images were PhotoCD images at 300 pixels-per-inch.
Figure 1. Six scenes (clockwise from top left) Target, Candle, Arch, Business, Flower, Building used in psychophysical experiments.
The six scenes used in the experiment, along with several additional printer test targets, were then electronically sent to
the various printer operators as “originals.” The operators were allowed to use whatever traditional means (including the
RIP) they would normally use, but were instructed not to alter the original images in any manner. Where possible, the
operators were asked to print on a standard 20-pound paper stock, although one operator used a 24-pound stock. The
returned prints were then mounted in a cardboard frame with several pieces of blank paper behind to ensure opacity. The
frame provided a degree of protection from handling and warping.
The experiment was performed in a special viewing room configured with Macbeth fluorescent illuminators
approximating D50 at 2000 lux. The experimental setup is shown in Figure 2. The experiments were divided into two
parts. The first part was a paired comparison experiment in which the observers were shown two prints and asked to choose
the print that was of higher quality. The prints were placed on the viewing table by the experimenter and the observers
were not allowed to handle the prints. The observers maintained a typical viewing distance of 18 inches. Assuming 600
dpi and 8.5x11 inch prints, this corresponds to approximately 180 printer dots per degree of visual angle, that is, a
limiting spatial frequency of roughly 90 cycles per degree. The observers
selected the print of higher quality using a Wacom tablet attached to a computer. The tablet showed two large buttons
(left and right) that the observer “pressed” depending on their selection. The computer software recorded
which images were presented, the order, and which image was selected to be of higher quality. The order of presentation,
as well as the layout of the images, was randomized for each observer. The six scenes and six printers resulted in 90
observations, n = 6*(6*5)/2. The experiment took approximately 1 hour to complete for each observer. A total of 19
color-normal observers participated in the experiment.
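The viewing geometry quoted above can be checked with a short calculation. The sketch below assumes a flat print viewed on-axis and is only an illustration; the function name is ours. It reports the number of printer dots subtended by one degree of visual angle and the corresponding Nyquist limit in cycles per degree, which is half the dot count because one cycle needs at least two samples.

```python
import math

def samples_per_degree(dpi, viewing_distance_in):
    """Printer dots subtended by one degree of visual angle (on-axis, flat print)."""
    inches_per_degree = 2.0 * viewing_distance_in * math.tan(math.radians(0.5))
    return dpi * inches_per_degree

# Values taken from the text: 600 dpi printers, 18 inch viewing distance.
spd = samples_per_degree(600, 18.0)
nyquist_cpd = spd / 2.0  # one cycle needs at least two samples

print(f"{spd:.1f} dots per degree, Nyquist limit about {nyquist_cpd:.0f} cycles per degree")
```

Either figure is well beyond the resolution limit usually quoted for human vision, so individual dots should not be resolvable at this distance.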
The second experiment was a triad experiment, designed to judge the similarity and differences between the prints. The
observers were presented with three images and asked to choose which pair of the three images appeared most similar in
quality, and which pair appeared the most dissimilar. The observers again used the tablet to record their choices, which
were recorded by the computer software. For the triad experiment there were 120 trials, resulting in 240 judgments of
similarity and dissimilarity. Again 19 color normal observers participated, with sessions taking about 1 hour each.
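The trial counts for both experiments follow from simple combinatorics. The sketch below reproduces them under the assumption, not stated explicitly in the text, that every possible pair and every possible triad of the six printers was shown once per scene.

```python
from math import comb

N_SCENES = 6
N_PRINTERS = 6

# Paired comparison: every pair of printers, once per scene.
pairs_per_scene = comb(N_PRINTERS, 2)
paired_trials = N_SCENES * pairs_per_scene

# Triad experiment: every triple of printers, once per scene (assumed).
triads_per_scene = comb(N_PRINTERS, 3)
triad_trials = N_SCENES * triads_per_scene
# Each triad yields one "most similar" and one "most dissimilar" judgment.
triad_judgments = 2 * triad_trials

print(paired_trials, triad_trials, triad_judgments)
```

The results agree with the 90 paired-comparison observations and the 120 triad trials (240 judgments) reported above.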
Figure 2. Experimental setup.
3. EXPERIMENTAL ANALYSIS
The paired comparison data were first analyzed using Thurstone’s Law of Comparative Judgment (Case V).4 This
analysis results in an interval scale of image quality. The average results across the six scenes are shown in Figure 3, with
the error bars representing the 95% confidence interval.
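As an illustration of the Case V computation, the following minimal sketch converts a matrix of preference counts into an interval scale. It is not the analysis code used in this study: the function name, the clipping of unanimous proportions, and the small example matrix are all assumptions made for the example. The scale is shifted so that the lowest stimulus sits at zero, the convention used in Table I.

```python
from statistics import NormalDist

def thurstone_case_v(wins, n_observers, clip=0.01):
    """Interval scale from paired-comparison counts (Thurstone Case V).

    wins[i][j] is the number of observers who preferred stimulus i over j.
    Proportions are clipped to [clip, 1 - clip] so that unanimous results
    do not map to infinite z-scores (an assumption of this sketch).
    """
    n = len(wins)
    inverse_cdf = NormalDist().inv_cdf
    scale = []
    for i in range(n):
        z_sum = 0.0
        for j in range(n):
            if i == j:
                continue  # the diagonal contributes z = 0
            p = wins[i][j] / n_observers
            p = min(max(p, clip), 1.0 - clip)
            z_sum += inverse_cdf(p)
        scale.append(z_sum / n)  # row mean of the z-score matrix
    lowest = min(scale)
    return [s - lowest for s in scale]

# Hypothetical counts for three stimuli judged by 10 observers.
example_wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
example_scale = thurstone_case_v(example_wins, n_observers=10)
print([round(s, 3) for s in example_scale])
```

Only differences between scale values are meaningful, which is why the zero point can be placed at the lowest stimulus.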
Figure 3. Interval scale of image quality for the average of six scenes.
From Figure 3 it appears that there is a significant difference between most of the printers, with Printer A and Printer C
the only exceptions. The individual scene scales show that there is scene dependence, however, when judging image
quality. The individual scales are shown in Figure 4.
Figure 4. Interval scales of quality for the six individual scenes.
The results shown in Figures 3 and 4 along with the respective rank orderings are summarized in Tables I and II. From
these data it is clear that there are distinct scene dependencies when judging image quality. For instance, printers A, B,
and C were each judged to be the best printer for two scenes. This seems to indicate that an image quality model that is
based entirely on scene independent measurements might not be adequate for predicting the overall image quality for
any given scene.
Table I. Individual scales of image quality (columns are Printers A-F)
Scene      A       B       C       D       E       F
CANDLE     0.7338  1.6499  0.2755  0.2835  0.9249  0
BUILDING   2.3707  2.1114  1.6602  1.2130  1.4133  0
ARCH       1.9402  1.4455  1.7785  0.9383  0.8174  0
FLOWER     1.2989  2.5154  2.5710  1.1719  1.0952  0
BUSINESS   0.7398  1.5948  1.8855  0       0.7518  0.0521
TARGET     1.5131  2.3299  0.8846  1.2199  1.3014  0
Overall    1.3532  1.7811  1.2998  0.7297  0.9809  0
Table II. Rank order of image quality (columns are Printers A-F; 1 = best)
Scene         A     B     C     D     E     F
CANDLE        3     1     5     4     2     6
BUILDING      1     2     3     5     4     6
ARCH          1     3     2     4     5     6
FLOWER        3     2     1     4     5     6
BUSINESS      4     2     1     6     3     5
TARGET        2     1     5     4     3     6
Mean rank     2.33  1.83  2.83  4.5   3.67  5.83
Overall rank  2     1     3     5     4     6
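The rank orders in Table II follow directly from the scale values in Table I. The short sketch below, with helper names of our own choosing, ranks the printers within each scene and averages the ranks across scenes.

```python
# Scale values copied from Table I (Printers A-F, in order).
SCALES = {
    "CANDLE":   [0.7338, 1.6499, 0.2755, 0.2835, 0.9249, 0.0],
    "BUILDING": [2.3707, 2.1114, 1.6602, 1.2130, 1.4133, 0.0],
    "ARCH":     [1.9402, 1.4455, 1.7785, 0.9383, 0.8174, 0.0],
    "FLOWER":   [1.2989, 2.5154, 2.5710, 1.1719, 1.0952, 0.0],
    "BUSINESS": [0.7398, 1.5948, 1.8855, 0.0, 0.7518, 0.0521],
    "TARGET":   [1.5131, 2.3299, 0.8846, 1.2199, 1.3014, 0.0],
}
PRINTERS = "ABCDEF"

def rank_descending(values):
    """Rank 1 goes to the largest value (no ties occur in Table I)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

ranks = {scene: rank_descending(values) for scene, values in SCALES.items()}
mean_rank = [
    sum(scene_ranks[k] for scene_ranks in ranks.values()) / len(ranks)
    for k in range(len(PRINTERS))
]

for printer, value in zip(PRINTERS, mean_rank):
    print(printer, round(value, 2))
```

Averaging ranks discards the interval information in Table I, so the mean rank is best read as a summary of scene dependence rather than as a quality scale in its own right.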
The paired comparison data was also analyzed using dual scaling, which is a multidimensional technique that can be
performed on categorical data.5 The dual scaling analysis determines the number of independent dimensions that
characterizes the observers’ judgments of image quality, and the percent of the variance that each dimension accounts
for in the resulting scale. Thurstone’s law relies on the assumption of a one-dimensional scale, so dual scaling can be
used to test this assumption. The results of the dual scaling analysis for two scenes, Candle and Flower, are shown in
Figure 5. These two scenes represent the extremes in the analysis.
[Figure 5 plots: two panels, “Flower” and “Candle”; x-axis: Dimensions (1-5); y-axis: % Variance (0-100); series: % VAR and CUMU.]
Figure 5. The percent and cumulative percent of the variance accounted for in each dimension
of the dual scaling analysis for two scenes.
From Figure 5 we can see that the first dimension in the Candle image accounts for approximately 55 percent of the
variance in the judgments, while the first dimension in the Flower image accounts for 82 percent. The remaining scenes
are between these two, with the first dimension accounting for around 65 percent. This analysis indicates that for all
images there is a single primary dimension that accounts for the majority of variance. This single dimension supports the
underlying assumptions used when creating the interval scale using Thurstone’s law.
The similarity and dissimilarity data were then analyzed using multidimensional scaling (MDS) techniques.6 MDS
creates a representation of the psychological space that captures the relationships among the prints in an n-dimensional
configuration, where n can be one or more. The similarity and dissimilarity data (proximities) were analyzed
in one through three dimensions using the MDS algorithm. Stress is a metric of the difference between the input
proximities and the final configuration, and it is the quantity minimized by the fit. The final stress at each
dimensionality therefore indicates how many psychological dimensions were used to judge the similarities and
dissimilarities among the prints. The stress of the MDS
configurations for the six scenes is shown in Figure 6.
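To make the role of stress concrete, the sketch below evaluates a Kruskal-style stress-1 value for a given configuration. The text does not say which stress formula or MDS program was used, so this is an assumption. The sketch also uses the input dissimilarities directly rather than the monotonically regressed disparities of nonmetric MDS, and it scores configurations that are supplied by hand rather than optimized. The example dissimilarities are hypothetical.

```python
import math
from itertools import combinations

def stress_1(points, dissimilarities):
    """Kruskal-style stress-1 of a configuration.

    points: list of coordinate tuples, one per stimulus.
    dissimilarities: dict mapping (i, j) with i < j to the target dissimilarity.
    """
    numerator = 0.0
    denominator = 0.0
    for i, j in combinations(range(len(points)), 2):
        distance = math.dist(points[i], points[j])
        numerator += (distance - dissimilarities[(i, j)]) ** 2
        denominator += distance ** 2
    return math.sqrt(numerator / denominator)

# Hypothetical example: three stimuli that are all equally dissimilar.
equal_dissimilarities = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}

# A one-dimensional configuration cannot honour all three distances.
line_stress = stress_1([(0.0,), (1.0,), (2.0,)], equal_dissimilarities)

# A two-dimensional equilateral triangle reproduces them.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3.0) / 2.0)]
plane_stress = stress_1(triangle, equal_dissimilarities)

print(round(line_stress, 3), round(plane_stress, 3))
```

The drop in stress when the second dimension is added is the kind of behavior used to decide how many dimensions a scene requires.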
It is interesting to note that one-dimensional configurations characterize the psychological space for these images quite
well, with the exception of the “Business” image, which is better characterized in two dimensions. This analysis
indicates that the space representing the salient psychological relationship among these stimuli is one-dimensional. It
should be noted, however, that these dimensions do not necessarily have any obvious physical correlate. Thus neither the
MDS nor the dual scaling indicates that there should be a single measurable parameter that observers look for
when judging image quality. The need for additional dimensions in the business graphic scenes suggests that observers judge
quality based on different criteria for pictorial scenes and for scenes that convey other types of information.
[Figure 6 plot: x-axis: Number of Dimensions; y-axis: Final Stress (0-0.4); series: Candle, Building, Arch, Flower, Business, Target.]
Figure 6. Stress of the MDS configurations. The four scenes with pictorial information need only a single dimension to minimize
stress, while the business graphics need two and three dimensions.
4. MODELING IMAGE QUALITY
While the psychophysical experiments and analysis described above are interesting in their own right, they do not
provide total insight into the actual criteria observers use to scale image quality. This information is especially important
for imaging system designers, who might want to know what they can do to improve image quality. Since this
experiment was not designed to measure single perceptual dimensions that affect quality, this analysis was done without
knowledge of what was different among the printed images. Essentially all printers were given the identical input, and
we are modeling the black-box that created the output. By measuring various parameters from the output images
themselves, as well as from test-targets printed concurrently with the images, we can attempt to determine what factors
were used to scale the image quality.
The first series of measurements can be considered printer system measurements. These measurements are performed on
test-targets and can be considered generic attributes. Examples used in this analysis are tone reproduction, absolute black
level, system MTF, print grain, print noise, and solid line fill. These types of measurements are described in detail in
Grice & Allebach.7 Most of these measurements were performed using an Epson Perfection 2450 flatbed scanner. All
prints were scanned in black and white at 2400 optical DPI. The tone reproduction scale was measured using a Macbeth
4500 spectrophotometer with a 1cm aperture, specular excluded, and a UV filter to avoid paper fluorescence as much as
possible.
There are several measurements that should be described in slightly more detail. The “MTF” of the printer system was
calculated using the slant edge technique described by Burns.8 This MTF actually represents the spatial frequency response
of the printer, paper, and scanner cascaded together; since the same scanner and scanning resolution were used
for all prints, the scanner contribution is the same for all printers. Graininess was calculated using both the Wiener
spectra techniques described by Grice and Allebach,7 and the RMS of the scans of solid patches. These solid patches were
first filtered using the modified S-CIELAB luminance filter for the experimental viewing condition.9 The average grain
measurement was defined to be the average of the RMS values of the modified S-CIELAB filtered uniform patches (light gray
and dark gray). Including the area of the MTF, or spatial frequency response, of the printers slightly improves upon the
predictions. Banding artifacts were measured by examining the RMS of the filtered solid patches in both the x and y
directions, with a large difference between the two being attributed to banding. The tone reproduction error was defined as
the mean squared error between the measured L* data and a straight line fit to the data, that is, the departure from a
linear tone-scale.
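The patch-based measurements described above can be sketched in a few lines. This is a simplified illustration rather than the measurement code used in the study: the patches are assumed to be already filtered with the luminance filter, the directional RMS is taken here over row-averaged and column-averaged profiles (one plausible reading of “RMS in the x and y directions”), and all function names are ours.

```python
import math

def rms_about_mean(values):
    """Root-mean-square deviation of a sequence about its own mean."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def grain_rms(patch):
    """RMS fluctuation of a uniform patch (list of rows of pixel values)."""
    return rms_about_mean([v for row in patch for v in row])

def directional_rms(patch):
    """RMS of the column-averaged (x) and row-averaged (y) profiles.

    A large difference between the two values points to banding.
    """
    row_profile = [sum(row) / len(row) for row in patch]
    column_profile = [sum(col) / len(col) for col in zip(*patch)]
    return rms_about_mean(column_profile), rms_about_mean(row_profile)

def tone_reproduction_error(inputs, lstar):
    """Mean squared error between L* data and its least-squares line."""
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(lstar) / n
    sxx = sum((x - mean_x) ** 2 for x in inputs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, lstar))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (slope * x + intercept) for x, y in zip(inputs, lstar)]
    return sum(r * r for r in residuals) / n

# Synthetic patch with horizontal bands: rows alternate between two levels.
banded_patch = [[0.4] * 4, [0.6] * 4, [0.4] * 4, [0.6] * 4]
rms_x, rms_y = directional_rms(banded_patch)
print(round(grain_rms(banded_patch), 3), round(rms_x, 3), round(rms_y, 3))
```

For the synthetic banded patch all of the variation appears in the y profile and none in the x profile, which is the signature attributed to banding.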
A stepwise regression was then performed on the experimental interval scales from the paired comparison with the
printer measurements. This was performed for the average of all the scenes, and independently for each scene. The top
four significant measurements were then linearly combined and used to predict the experimental scale of image quality.
This analysis was repeated with the top three, top two, and single significant measurements as well. The scales for all
scenes, as well as the average scale, could be fit with four measurements. This should not come as a surprise, as there are
only six data points in each scale. What is more interesting is that the scales can also be fit with fewer measurements.
The average scale of the six scenes can be predicted very well using two measurements (MTF area and Average Grain)
and can also be fit well using only a single measurement (Average Grain). This is illustrated in Figure 7.
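The remark about six data points can be made concrete by counting residual degrees of freedom and by penalizing R2 for the number of predictors. The adjusted R2 below is a standard statistic that is not computed in this paper; it is shown only as an illustration. The R2 values are those printed with Figure 7, and the sketch assumes that the higher value belongs to the two-measurement fit.

```python
def residual_dof(n_points, n_predictors):
    """Degrees of freedom left after fitting an intercept and the predictors."""
    return n_points - n_predictors - 1

def adjusted_r2(r2, n_points, n_predictors):
    """R2 penalized for the number of predictors in a linear regression."""
    dof = residual_dof(n_points, n_predictors)
    if dof <= 0:
        raise ValueError("not enough data points for this many predictors")
    return 1.0 - (1.0 - r2) * (n_points - 1) / dof

N_PRINTERS = 6  # one scale value per printer

# A four-measurement fit leaves a single degree of freedom.
print("four measurements:", residual_dof(N_PRINTERS, 4), "residual degree of freedom")

# R2 values from Figure 7 (assignment to the two fits is assumed).
for label, r2, n_predictors in (
    ("MTF area + average grain", 0.9638, 2),
    ("average grain only", 0.875, 1),
):
    print(label, round(adjusted_r2(r2, N_PRINTERS, n_predictors), 3))
```

With only one residual degree of freedom, a four-measurement fit is almost guaranteed to track six points closely, which is why the fits with one or two measurements are the more informative result.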
[Figure 7 plot: “Average Scene”; x-axis: Experimental Scale; y-axis: Model Fit; series: MTF + Grain, Grain, each with a linear fit; fitted lines: y = 1.0864x - 0.0877 (R2 = 0.9638) and y = 0.7327x + 0.3168 (R2 = 0.875).]
Figure 7. Average experiment scale estimation using two (MTF+Grain) and one (Grain) measurements.
The paired comparison results were almost completely predicted by a linear combination of two image-independent
measurements. Perhaps more surprising is that a single measurement was able to predict these psychophysical results
almost as well. The only discrepancy is between Printers A and C, which show a reversal in rank order. Figure 3 shows
that the difference between these two printers is not statistically significant, indicating that a single measurement is
able to predict the average experimental rank ordering within the error of the experiment.
While a single measurement of grain is capable of predicting the average scene data, it is of interest to determine
whether that is the case for any individual scene. At least two measurements must be used to predict the individual scales
for all but the Candle scene for which the grain measurement alone was deemed significant. Figure 8 shows the model fit
for two of the scenes, using two and three measurements respectively. The measurements needed to accurately predict
the scales are listed in Table III.
There are some interesting results that can be seen. For the flower image and the test-target image, the line fill was
considered important along with the grain measurement. Both of these scenes contain many fine lines that were
generated from vector-based EPS files. The business graphic was best fit with a combination of grain and noise in the
dark areas. The building and arch images were modeled with some form of banding metric; the building was also
fit with MTF area, and the arch with the tone reproduction error. This suggests that different criteria were used
for the different scenes, as suggested by the dual scaling analysis, yet the images themselves are related to each other
along a single dimension, as suggested by the MDS analysis.
[Figure 8 plots: two panels; x-axis: Experimental Scale; y-axis: Model Fit. “Business Scene”: Grain+RMS Black, linear fit y = 0.9329x + 0.0894 (R2 = 0.9392). “Candle Scene”: MTF+Noise+yGrain, linear fit y = 0.9942x + 0.0072 (R2 = 0.9995).]
Figure 8. Model fits for individual scenes. The business graphic required two measurements to predict scale, while the candle scene
required three.
Table III. Measurements necessary to fit experimental data for individual scenes.
Image      Measurements Needed
CANDLE     MTF, Noise, Grain in Y Direction
BUILDING   Tone Reproduction, Grain, Noise
ARCH       MTF, Average Grain, Banding
FLOWER     Grain, Line Fill
BUSINESS   Average Grain, RMS Black
TARGET     Grain, Line Fill, Stroke Width
5. IMAGE QUALITY: A MATTER OF DIFFERENCE?
It has been shown that the experimental quality scales can be modeled on a “system” level, via the measurement of
various parameters such as average grain (which in turn is actually a system parameter coupled with a modeling of the
human visual system). Another interesting question is whether the image quality scale can be modeled by simply looking
at the differences in the images themselves. The paired comparison paradigm creates an interval scale comparing any
image against all of the other images. One possible way to analyze the data is to examine the difference between each
image and a standard image. All of the prints in this experiment are binary halftone images printed from a continuous
“original.” It is of interest to see if the quality scales can be predicted by examining the differences between each print
and the original digital image file.
These differences should not be computed on a strict pixel-by-pixel basis, as that does not consider the human
observer who ultimately views the images. The images should first be filtered according to the behavior of the visual
system, as described by Zhang and Wandell10 and Johnson & Fairchild.9,11 Before calculating the image differences the
images were linearized using the scanner characterization. The linearized images were then filtered using only the
luminance CSF filter described by Johnson and Fairchild.9 A pixel-by-pixel difference was then calculated between the
filtered original and the filtered prints. Care was taken to align the images as well as possible prior to this calculation. The
pixel differences were calculated after applying a compressive response function similar to that of CIELAB (an exponent
of 1/3) to approximate differences in perceived lightness. The mean and standard deviation of the output error image can then be
compared with the image quality scale generated from the paired comparison experiment. Figure 9 shows this
comparison for the average image difference and quality scale.
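The difference calculation described above can be sketched as follows. The sketch assumes that both images have already been linearized, spatially filtered, and registered, and that pixel values are linear luminance factors between 0 and 1. The function names and the tiny example images are hypothetical, and the cube-root nonlinearity stands in for the “compressive response function similar to CIELAB.”

```python
import math

def compressive_difference(original, printed, exponent=1.0 / 3.0):
    """Per-pixel absolute difference after a compressive nonlinearity.

    Both inputs are equal-sized lists of rows of linear luminance in [0, 1].
    """
    if len(original) != len(printed):
        raise ValueError("images must have the same number of rows")
    error = []
    for row_o, row_p in zip(original, printed):
        if len(row_o) != len(row_p):
            raise ValueError("images must have the same number of columns")
        error.append([abs(o ** exponent - p ** exponent)
                      for o, p in zip(row_o, row_p)])
    return error

def error_statistics(error_image):
    """Mean, standard deviation, and RMS of an error image."""
    values = [v for row in error_image for v in row]
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    rms = math.sqrt(sum(v * v for v in values) / n)
    return mean, std, rms

# Hypothetical two-pixel images, used only to exercise the functions.
example_original = [[0.0, 1.0]]
example_print = [[0.0, 0.125]]
example_error = compressive_difference(example_original, example_print)
example_mean, example_std, example_rms = error_statistics(example_error)
print(round(example_mean, 3), round(example_std, 3), round(example_rms, 3))
```

Because the statistics are pooled over the whole image, a small registration error along high-contrast edges can contribute as much as a genuine tone or noise difference.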
[Figure 9 plot: “Image Difference Calculation”; x-axis: Experimental Quality Scale; y-axis: Image Difference; series: Mean Difference, RMS Difference, each with a linear fit; fitted lines: y = -0.0118x + 0.0926 (R2 = 0.3488) and y = -0.0098x + 0.1395 (R2 = 0.4303).]
Figure 9. Comparison of filtered image difference between prints and original and the experimental quality scale.
While a slight trend is suggested, it seems that the average image difference does not relate well with the experimental
quality scales. This is also the case when comparing the individual scenes. This suggests that the quality of the prints is
not directly related to the difference between the print and the original digital file used to make the print. One possible
reason is that the observers never viewed the original image, and therefore could not use that as a benchmark for judging
quality. It might be more appropriate to compare the differences between each image and the image judged to be the best
for that scene. The resulting scales for the best and worst predictions are shown in Figure 10.
[Figure 10 plots: two panels; x-axis: Experimental Quality Scale; y-axis: Difference From Best; series: Mean, with a linear fit. “Candle”: y = -0.095x + 0.1751 (R2 = 0.6716). “Arch”: y = -0.0113x + 0.106 (R2 = 0.0188).]
Figure 10. Comparison of image differences between each image and the “best” for each scene.
The candle scene shows a trend towards decreasing quality as the difference from the “best” increases, though there is an
outlying point. The arch scene actually shows a decrease in quality as the difference increases, followed by an increase.
This suggests that simple image difference does not account for the differences in quality of the image. It is important to
note that some of the calculated differences were a result of image misregistration, though care was taken to minimize
these differences. It is somewhat surprising that the calculated differences between the images were not better correlated
with the experimental data, while a small number of printer measurements were. Perhaps this was because the
calculations were performed using a simple pixel-by-pixel difference. It may be necessary to alter the luminance
filter (CSF) for use with these types of printed images. The use of more complete image difference models might better
fit the experimental results.11,12
6. CONCLUSIONS
Two psychophysical experiments were performed to scale the image quality of six black-and-white electrophotographic
printers. A total of six identical digital scenes were printed using the default printer settings and the resulting prints were
used in a paired comparison experiment and a triad similarity-dissimilarity experiment. Thurstone’s law was used to
generate interval scales of quality. Dual scaling analysis was performed on the paired comparison data. This analysis
indicated that there was a single perceptual dimension that accounted for most of the variability of the image quality
judgments. This supports the underlying assumptions behind Thurstone’s law. Multidimensional scaling (MDS) was
performed on the similarity-dissimilarity data. This analysis revealed that a single psychological dimension described the
pictorial scenes, while the business graphics required two dimensions.
The image quality scales were then modeled using image independent “system” variables. It was determined that the
average quality scale could be modeled well with a single measurement of “average grain.” This measurement was
calculated by filtering uniform gray patches using the modified S-CIELAB luminance filters, and taking the RMS of the
filtered patches. The individual scenes required more measurements to model, though grain was important for most of
them. Finally the quality scales were modeled using a simple filtered image difference. This technique did not prove to
be capable of predicting the quality scales.
7. REFERENCES
1. P. G. Engeldrum, “Extending Image Quality Models,” Proc. IS&T PICS Conference, 65-69 (2002).
2. B. W. Keelan, Handbook of Image Quality: Characterization and Prediction, Marcel Dekker, New York, NY (2002).
3. E. D. Montag and H. Kasahara, “Multidimensional Analysis Reveals Importance of Color for Image Quality,” IS&T/SID 9th Color Imaging Conference, Scottsdale, AZ, 17-21 (2001).
4. L. L. Thurstone, “Psychophysical Analysis,” Am. J. Psych. 38, 368-389 (1927).
5. S. Nishisato, Elements of Dual Scaling, Lawrence Erlbaum Assoc., Hillsdale, NJ (1994).
6. S. Schiffman, Introduction to Multidimensional Scaling, Academic Press, NY (1981).
7. J. Grice and J. P. Allebach, “The Print Quality Toolkit: An Integrated Print Quality Assessment Tool,” Journal of Imaging Science and Technology, 43, 187-199 (1999).
8. P. Burns, “Slant edge MTF for digital camera and scanner analysis,” Proc. IS&T PICS Conference, Portland, OR, 135-138 (2000).
9. G. M. Johnson and M. D. Fairchild, “A top-down description of S-CIELAB and CIEDE2000,” Color Research and Application, 28, 425-435 (2003).
10. X. M. Zhang and B. A. Wandell, “A spatial extension to CIELAB for digital color image reproduction,” Proceedings of the SID Symposiums (1996).
11. G. M. Johnson and M. D. Fairchild, “Darwinism of color image difference models,” IS&T/SID 9th Color Imaging Conference, Scottsdale, AZ, 108-112 (2001).
12. M. D. Fairchild and G. M. Johnson, “Meet iCAM: A next-generation color appearance model,” IS&T/SID 10th Color Imaging Conference, Scottsdale, AZ, 33-38 (2002).