Image Quality Scaling for Electrophotographic Prints

Garrett M. Johnson, Rohit Patil, Ethan D. Montag, and Mark D. Fairchild

Munsell Color Science Laboratory, Chester F. Carlson Center for Imaging Science, Rochester Institute of Technology, Rochester, NY, USA 14623-5604

ABSTRACT

Two psychophysical experiments were performed to scale the overall image quality of black-and-white electrophotographic (EP) images. Six different printers were used to generate the images. Six different scenes were included in the experiments, representing photographs, business graphics, and test targets. The first was a paired-comparison experiment examining overall image quality, and the second was a triad experiment judging the overall similarity and dissimilarity of the printed images. The paired-comparison experiment was analyzed using Thurstone’s Law, to generate an interval scale of quality, and with dual scaling, to determine the independent dimensions used for categorical scaling. The triad experiment was analyzed using multidimensional scaling to generate a psychological stimulus space. The psychophysical results indicated that image quality was judged mainly along one dimension and that the relationships among the images can be described with a single dimension in most cases. Regression of various physical measurements of the images against the paired-comparison results showed that a small number of physical attributes of the images could be correlated with the psychophysical scale of image quality. However, global image difference metrics did not correlate well with image quality.

Keywords: Image quality, image measurement, psychophysics, vision modeling

1. INTRODUCTION

There are many factors that can influence the perceived quality of an image. These factors include, but are not limited to, the tone reproduction, noise, sharpness, graininess, and contrast of the image. To add confusion to the mix, the perceived quality of an image is often mistaken for the “quality” of an imaging system. The parameters of an imaging system certainly can affect the output quality, and are often measured and supplied as metrics for overall image quality. These parameters might include system modulation transfer functions (MTF), pixel addressability, paper brightness, and system gamut. Engeldrum [1] gives an excellent overview of the confounding terminology that is often used when describing image quality.

Image quality becomes an even trickier problem when trying to determine scales of quality among a series of images. Most people can be handed an image and come up with a mental judgment as to whether the image is good or bad. This task becomes increasingly difficult when they are handed two (or more) images and asked to determine not only whether each image is good or bad, but also its relationship with the others. This concept is known as image quality scaling. Do people use the same perceptual cues to determine the general quality of a single image as they use when determining the relationships among images? If so, what are the most important perceptions that govern overall image quality?

These questions are typically studied by designing an experiment that alters a single perception, or “ness” in the Engeldrum framework, and determining the effect of that “ness” on overall image quality. An example is scaling contrast to determine the relationship between contrast and quality. This type of scaling can be performed on many individual perceptions, and the results then combined to form an overall image quality metric. Keelan has described methods and techniques for this type of systematic image quality modeling in detail [2]. In this experiment we have taken the opposite approach. An experiment was performed on a series of images with no a priori knowledge of the differences between the images. We then used several psychological scaling techniques in an attempt to determine what the observers were using to judge image quality [3]. A series of measurements from the images themselves, as well as from test targets generated from the same imaging systems, were then correlated against the psychological scales of image quality. From these measurements we attempt to predict the results of the psychophysical experiments.

Contact: www.cis.rit.edu/mcsl

2. EXPERIMENTAL DESIGN

Two psychophysical experiments were performed to scale the image quality of black-and-white electrophotographic prints. The prints were generated on six different commercially available high-speed printers (Printers A-F). All of the printers were black-and-white printers rated at 600 dpi. Six scenes were chosen: three pictorial images, two business-type graphics, and a printer test target. The six scenes are shown in Figure 1 and are referred to as Candle, Flower, Building, Arch, Business, and Target. The two business graphics were example images from Adobe Illustrator, and were vector graphics converted to EPS files. The test target was also a vector EPS, combined with a bitmap image. The photographic images were PhotoCD images at 300 pixels per inch.

Figure 1. Six scenes (clockwise from top left) Target, Candle, Arch, Business, Flower, Building used in psychophysical experiments.

The six scenes used in the experiment, along with several additional printer test targets, were then sent electronically to the various printer operators as “originals.” The operators were allowed to use whatever traditional means (including the RIP) they would normally use, but were instructed not to alter the original images in any manner. Where possible, a standard 20-pound paper stock was requested, although one operator used a 24-pound stock. The returned prints were then mounted in a cardboard frame with several pieces of blank paper behind them to ensure opacity. The frame provided a degree of protection from handling and warping.

The experiment was performed in a special viewing room configured with Macbeth fluorescent illuminators approximating D50 at 2000 lux. The experimental setup is shown in Figure 2. The experiments were divided into two parts. The first part was a paired comparison experiment in which the observers were shown two prints and asked to choose the print that was of higher quality. The prints were placed on the viewing table by the experimenter, and the observers were not allowed to handle the prints. The observers maintained a typical viewing distance of 18 inches. Assuming 600 dpi and 8.5x11 inch prints, this corresponds to approximately 180 printed dots per degree of visual angle, or roughly 90 cycles per degree. The observers selected the print of higher quality using a Wacom tablet attached to a computer. The tablet showed two large buttons (left and right) that the observer “pressed” depending on their selection. The computer software recorded which images were presented, the order, and which image was selected to be of higher quality. The order of presentation, as well as the layout of the images, was randomized for each observer. The six scenes and six printers resulted in 90 observations, n = 6*(6*5)/2. The experiment took approximately 1 hour to complete for each observer. A total of 19 color-normal observers participated in the experiment.
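The viewing geometry can be checked with a short calculation. The sketch below is an illustration added here, not part of the original experiment software; the function name and the choice of inches as the unit are assumptions. For 600 dpi at 18 inches it gives a little under 190 printed dots per degree of visual angle. Since one light-dark cycle needs at least two dots, the highest spatial frequency the print can carry is about half that number in cycles per degree.

```python
import math


def dots_per_degree(dpi: float, viewing_distance_in: float) -> float:
    """Number of printed dots subtended by one degree of visual angle."""
    # Width on the page (in inches) covered by a one-degree field of view.
    inches_per_degree = 2.0 * viewing_distance_in * math.tan(math.radians(0.5))
    return dpi * inches_per_degree


dots = dots_per_degree(600, 18)
# One full cycle needs one dark and one light dot (Nyquist limit).
cycles = dots / 2.0
print(f"{dots:.1f} dots per degree, {cycles:.1f} cycles per degree")
```

The print dimensions (8.5x11 inches) do not enter this calculation; only the addressability and the viewing distance matter.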

The second experiment was a triad experiment, designed to judge the similarities and differences between the prints. The observers were presented with three images and asked to choose which pair of the three images appeared most similar in quality, and which pair appeared the most dissimilar. The observers again used the tablet to record their choices, which were recorded by the computer software. For the triad experiment there were 120 trials, resulting in 240 judgments of similarity and dissimilarity. Again 19 color-normal observers participated, with sessions taking about 1 hour each.
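The trial counts quoted for the two experiments follow from simple combinatorics. The sketch below assumes that every unordered pair (and every unordered triad) of the six printers was shown once per scene. That assumption reproduces the stated totals, although the text does not spell out the triad design explicitly.

```python
from itertools import combinations

printers = ["A", "B", "C", "D", "E", "F"]
scenes = ["Candle", "Flower", "Building", "Arch", "Business", "Target"]

pairs = list(combinations(printers, 2))   # unordered pairs of printers per scene
triads = list(combinations(printers, 3))  # unordered triads of printers per scene

paired_trials = len(scenes) * len(pairs)
triad_trials = len(scenes) * len(triads)
# Each triad yields one "most similar" and one "most dissimilar" judgment.
triad_judgments = 2 * triad_trials

print(paired_trials, triad_trials, triad_judgments)
```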

Figure 2. Experimental setup.

3. EXPERIMENTAL ANALYSIS

The paired comparison data were first analyzed using Thurstone’s Law of Comparative Judgment (Case V) [4]. This analysis results in an interval scale of image quality. The average results across the six scenes are shown in Figure 3, with the error bars representing the 95% confidence interval.
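A minimal sketch of a Case V analysis is given below to make the scaling step concrete. It is not the authors' software. The preference counts are invented purely for illustration, and the clipping of unanimous proportions (`clip`) is an assumption, since the paper does not say how proportions of 0 or 1 were handled. The scale is shifted so that the lowest stimulus sits at zero, matching the convention used in Table I.

```python
from statistics import NormalDist


def thurstone_case_v(wins, clip=0.01):
    """Interval scale values from a square matrix of preference counts.

    wins[i][j] is the number of times stimulus i was preferred over
    stimulus j. Every pair is assumed to have been judged at least once.
    Returns one scale value per stimulus, shifted so the lowest is 0.
    """
    n = len(wins)
    inv_cdf = NormalDist().inv_cdf
    scale = []
    for i in range(n):
        z_sum = 0.0
        for j in range(n):
            if i == j:
                continue  # the diagonal contributes z = 0
            p = wins[i][j] / (wins[i][j] + wins[j][i])
            p = min(max(p, clip), 1.0 - clip)  # avoid infinite z-scores
            z_sum += inv_cdf(p)
        scale.append(z_sum / n)
    lowest = min(scale)
    return [s - lowest for s in scale]


# Hypothetical counts for three stimuli judged by 19 observers (not from the paper).
example_wins = [
    [0, 14, 17],
    [5, 0, 13],
    [2, 6, 0],
]
example_scale = thurstone_case_v(example_wins)
print([round(s, 3) for s in example_scale])
```

Under Case V each scale value is the mean z-score of the proportions with which a stimulus was preferred over the others, so the differences between scale values are expressed in units of the (assumed equal) discriminal dispersion.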

Figure 3. Interval scale of image quality for the average of six scenes.

From Figure 3 it appears that there is a significant difference between most of the printers, with Printer A and Printer C the only exceptions. The individual scene scales show, however, that there is scene dependence when judging image quality. The individual scales are shown in Figure 4.

Figure 4. Interval scales of quality for the six individual scenes.

The results shown in Figures 3 and 4, along with the respective rank orderings, are summarized in Tables I and II. From these data it is clear that there are distinct scene dependencies when judging image quality. For instance, printers A, B, and C were each judged to be the best printer for two scenes. This seems to indicate that an image quality model that is based entirely on scene-independent measurements might not be adequate for predicting the overall image quality for any given scene.

Table I. Individual scales of image quality

Scene       A        B        C        D        E        F
CANDLE      0.7338   1.6499   0.2755   0.2835   0.9249   0
BUILDING    2.3707   2.1114   1.6602   1.2130   1.4133   0
ARCH        1.9402   1.4455   1.7785   0.9383   0.8174   0
FLOWER      1.2989   2.5154   2.5710   1.1719   1.0952   0
BUSINESS    0.7398   1.5948   1.8855   0        0.7518   0.0521
TARGET      1.5131   2.3299   0.8846   1.2199   1.3014   0
Overall     1.3532   1.7811   1.2998   0.7297   0.9809   0

Table II. Rank order of image quality

Scene          A      B      C      D      E      F
CANDLE         3      1      5      4      2      6
BUILDING       1      2      3      5      4      6
ARCH           1      3      2      4      5      6
FLOWER         3      2      1      4      5      6
BUSINESS       4      2      1      6      3      5
TARGET         2      1      5      4      3      6
Mean rank      2.33   1.83   2.83   4.5    3.67   5.83
Overall rank   2      1      3      5      4      6
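The rank orders in Table II can be reproduced directly from the scale values in Table I. The sketch below uses the Table I numbers as printed. The helper name `rank_descending` is introduced here for illustration, and the overall rank is assumed to be the ordering of the mean ranks, which matches the last row of Table II.

```python
printers = ["A", "B", "C", "D", "E", "F"]

# Scale values copied from Table I (one list per scene, printers A to F).
table_i = {
    "CANDLE":   [0.7338, 1.6499, 0.2755, 0.2835, 0.9249, 0.0],
    "BUILDING": [2.3707, 2.1114, 1.6602, 1.2130, 1.4133, 0.0],
    "ARCH":     [1.9402, 1.4455, 1.7785, 0.9383, 0.8174, 0.0],
    "FLOWER":   [1.2989, 2.5154, 2.5710, 1.1719, 1.0952, 0.0],
    "BUSINESS": [0.7398, 1.5948, 1.8855, 0.0, 0.7518, 0.0521],
    "TARGET":   [1.5131, 2.3299, 0.8846, 1.2199, 1.3014, 0.0],
}


def rank_descending(values):
    """Rank 1 goes to the highest value. No tie handling is needed here,
    because no scene in Table I has two equal scale values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


ranks = {scene: rank_descending(vals) for scene, vals in table_i.items()}
mean_rank = [
    round(sum(r[k] for r in ranks.values()) / len(ranks), 2)
    for k in range(len(printers))
]
# A lower mean rank is better, so negate before ranking in descending order.
overall_rank = rank_descending([-m for m in mean_rank])

for scene, r in ranks.items():
    print(scene, r)
print("mean rank", mean_rank)
print("overall rank", overall_rank)
```

The scene dependence discussed in the text shows up in the first column of ranks: printers A, B, and C each take rank 1 in exactly two scenes.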

The paired comparison data were also analyzed using dual scaling, which is a multidimensional technique that can be performed on categorical data [5]. The dual scaling analysis determines the number of independent dimensions that characterize the observers’ judgments of image quality, and the percent of the variance that each dimension accounts for in the resulting scale. Thurstone’s law relies on the assumption of a one-dimensional scale, so dual scaling can be used to test this assumption. The results of the dual scaling analysis for two scenes, Candle and Flower, are shown in Figure 5. These two scenes represent the extremes in the analysis.

[Figure 5: two panels, Flower and Candle. x-axis: Dimensions (1 to 5); y-axis: % Variance (0 to 100); series: % VAR and CUMU.]

Figure 5. The percent and cumulative percent of the variance accounted for in each dimension of the dual scaling analysis for two scenes.

From Figure 5 we can see that the first dimension in the Candle image accounts for approximately 55 percent of the variance in the judgments, while the first dimension in the Flower image accounts for 82 percent. The remaining scenes fall between these two, with the first dimension accounting for around 65 percent. This analysis indicates that for all images there is a single primary dimension that accounts for the majority of the variance. This single dimension supports the underlying assumptions used when creating the interval scale using Thurstone’s law.

The similarity and dissimilarity data were then analyzed using multidimensional scaling (MDS) techniques [6]. MDS creates a representation of the psychological space that captures the relationships among the prints in an n-dimensional configuration. This space can have one or more dimensions. The similarity and dissimilarity data (proximities) were analyzed in one through three dimensions using the MDS algorithm. Stress is a measure of the mismatch between the input proximities and the distances in the final configuration, and it is the quantity the algorithm minimizes. The final stress obtained for each number of dimensions therefore indicates how many psychological dimensions are needed to represent the similarities and dissimilarities among the prints. The stress of the MDS configurations for the six scenes is shown in Figure 6.
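To make the notion of stress concrete, the sketch below computes one common variant, Kruskal's stress-1, for a given configuration. The paper does not state which stress formula or MDS program was used, so this choice is an assumption. For simplicity the target values are compared with the configuration distances directly, without the monotone (ordinal) transformation that a non-metric MDS would apply first.

```python
import math


def stress_1(coords, targets):
    """Kruskal's stress-1 between configuration distances and target values.

    coords:  list of points, each a list of floats (any number of dimensions).
    targets: dict mapping index pairs (i, j) with i < j to target distances.
    """
    numerator = 0.0
    denominator = 0.0
    for (i, j), target in targets.items():
        d = math.dist(coords[i], coords[j])  # Euclidean distance in the configuration
        numerator += (d - target) ** 2
        denominator += d ** 2
    return math.sqrt(numerator / denominator)


# Three points on a line (a one-dimensional configuration), for illustration only.
line = [[0.0], [1.0], [3.0]]
perfect = {(0, 1): 1.0, (0, 2): 3.0, (1, 2): 2.0}
imperfect = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 2.0}

print(stress_1(line, perfect))
print(round(stress_1(line, imperfect), 3))
```

A stress near zero in one dimension means that adding dimensions cannot improve the fit much, which is the pattern Figure 6 reports for the pictorial scenes.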

It is interesting to note that one-dimensional configurations characterize the psychological space for these images quite well, with the exception of the “Business” image, which is better characterized in two dimensions. This analysis indicates that the space representing the salient psychological relationships among these stimuli is largely one-dimensional. It should be noted, however, that these dimensions do not necessarily have any obvious physical correlate. Thus neither the MDS nor the dual scaling indicates that there should be a single measurable parameter that observers look for when judging image quality. The need for additional dimensions in the business graphic scenes suggests that observers judge quality based on different criteria for pictorial scenes and for scenes that convey other types of information.

[Figure 6: plot of Final Stress (y-axis, 0 to 0.4) against Number of Dimensions (x-axis); series: Candle, Building, Arch, Flower, Business, Target.]

Figure 6. Stress of the MDS configurations. The four scenes with pictorial information need only a single dimension to minimize stress, while the business graphics need two and three dimensions.

4. MODELING IMAGE QUALITY

While the psychophysical experiments and analysis described above are interesting in their own right, they do not provide total insight into the actual criteria observers use to scale image quality. This information is especially important for imaging system designers, as they might actually want to know what they can do to improve image quality. Since this experiment was not designed to measure single perceptual dimensions that affect quality, this analysis was done without knowledge of what was different among the printed images. Essentially all printers were given the identical input, and we are modeling the black box that created the output. By measuring various parameters from the output images themselves, as well as from test targets printed concurrently with the images, we can attempt to determine what factors were used to scale the image quality.

The first series of measurements can be considered printer system measurements. These measurements are performed on test targets and can be considered generic attributes. Examples used in this analysis are tone reproduction, absolute black level, system MTF, print grain, print noise, and solid line fill. These types of measurements are described in detail in Grice and Allebach [7]. Most of these measurements were performed using an Epson Perfection 2450 flatbed scanner. All prints were scanned in black and white at 2400 dpi (optical). The tone reproduction scale was measured using a Macbeth 4500 spectrophotometer with a 1 cm aperture, specular excluded, and a UV filter to avoid paper fluorescence as much as possible.

Several measurements should be described in slightly more detail. The “MTF” of the printer system was calculated using the slant-edge technique described by Burns [8]. This MTF actually represents the spatial frequency response of the printer, paper, and scanner cascaded together, though since the same scanner and scanning resolution were used for all prints, the scanner contribution is the same for all printers. Graininess was calculated using both the Wiener spectrum techniques described by Grice and Allebach [7] and the RMS of the scans of solid patches. These solid patches were first filtered using the modified S-CIELAB luminance filter for the experimental viewing condition [9]. The average grain measurement was defined as the average of the RMS values of the modified S-CIELAB filtered uniform patches (light gray and dark gray). Including the MTF area, that is, the area of the spatial frequency response of the printers, slightly improves the predictions. Banding artifacts were measured by examining the RMS of the filtered solid patches in both the x and y directions, with a large difference between the two being attributed to banding. The tone reproduction error was defined as the mean squared error between the measured L* data and a straight-line fit to the data, that is, the departure from a linear tone scale.
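The patch-based measurements described above can be sketched in a few lines. This is one possible reading of the text rather than the authors' implementation. The patches are assumed to be already filtered with the luminance CSF (the filter itself is not reproduced here). The directional RMS is taken from the row-averaged and column-averaged profiles, the banding measure is taken as the absolute difference between the two directions, and all function names are hypothetical.

```python
import math


def rms(values):
    """Root-mean-square deviation of a sequence from its own mean."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def patch_rms(patch):
    """RMS fluctuation over all pixels of a (pre-filtered) uniform patch."""
    return rms([v for row in patch for v in row])


def average_grain(light_patch, dark_patch):
    """Mean of the RMS values of the light gray and dark gray patches."""
    return 0.5 * (patch_rms(light_patch) + patch_rms(dark_patch))


def directional_rms(patch):
    """RMS of the column-averaged (x) and row-averaged (y) profiles."""
    rows, cols = len(patch), len(patch[0])
    profile_y = [sum(row) / cols for row in patch]
    profile_x = [sum(patch[r][c] for r in range(rows)) / rows for c in range(cols)]
    return rms(profile_x), rms(profile_y)


def banding_index(patch):
    """Large when fluctuations are much stronger in one direction than the other."""
    rms_x, rms_y = directional_rms(patch)
    return abs(rms_x - rms_y)


def tone_reproduction_error(input_levels, measured_lstar):
    """Mean squared error of L* about its least-squares straight-line fit."""
    n = len(input_levels)
    mx = sum(input_levels) / n
    my = sum(measured_lstar) / n
    sxx = sum((x - mx) ** 2 for x in input_levels)
    sxy = sum((x - mx) * (y - my) for x, y in zip(input_levels, measured_lstar))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(input_levels, measured_lstar)]
    return sum(r * r for r in residuals) / n
```

For a patch with horizontal bands, every column has the same mean while the row means alternate, so the x-direction RMS is zero and the banding index equals the y-direction RMS. Isotropic grain would give similar values in both directions and an index near zero.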

A stepwise regression was then performed on the experimental interval scales from the paired comparison against the printer measurements. This was performed for the average of all the scenes, and independently for each scene. The top four significant measurements were then linearly combined and used to predict the experimental scale of image quality. This analysis was repeated with the top three, top two, and single most significant measurements as well. The scales for all scenes, as well as the average scale, could be fit with four measurements. This should not come as a surprise, as there are only six data points in each scale. What is more interesting is that the scales can be fit with fewer measurements. The average scale of the six scenes can be predicted very well using two measurements (MTF area and Average Grain) and can also be fit well using only a single measurement (Average Grain). This is illustrated in Figure 7.
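The first step of a forward stepwise regression, picking the single measurement that best explains the scale, can be sketched as follows. The function names are hypothetical and the sketch covers only the single-predictor case. The printer measurement values themselves are not listed in the paper, so none are supplied here.

```python
def r_squared(x, y):
    """Coefficient of determination for the least-squares line y = a*x + b."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def best_single_predictor(measurements, scale):
    """Return the name of the measurement with the highest R^2, plus all scores.

    measurements: dict mapping a measurement name to one value per printer.
    scale:        experimental interval scale, one value per printer.
    """
    scores = {name: r_squared(values, scale) for name, values in measurements.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Called with a dictionary of per-printer measurements and one row of Table I, `best_single_predictor` returns the name that would enter the model first. With only six printers per scale, a model with four predictors and an intercept leaves a single residual degree of freedom, which is why a good four-measurement fit is not informative and the fits with one or two measurements are the interesting ones.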

[Figure 7: plot titled “Average Scene”. x-axis: Experimental Scale; y-axis: Model Fit; series: MTF + Grain and Grain, each with a linear fit. Fit equations shown: y = 1.0864x - 0.0877 (R^2 = 0.9638) and y = 0.7327x + 0.3168 (R^2 = 0.875).]

Figure 7. Average experiment scale estimation using two (MTF+Grain) and one (Grain) measurements.

The paired comparison results were almost completely predicted by a linear combination of two image-independent measurements. Perhaps more surprising is that a single measurement was able to predict these psychophysical results almost as well. The only discrepancy is between Printers A and C, which show a reversal in rank order. Figure 3 shows that the difference between these two printers is not statistically significant, indicating that a single measurement is able to predict the average experimental rank ordering within the error of the experiment.

While a single measurement of grain is capable of predicting the average scene data, it is of interest to determine whether that is the case for any individual scene. At least two measurements must be used to predict the individual scales for all but the Candle scene, for which the grain measurement alone was deemed significant. Figure 8 shows the model fit for two of the scenes, using two and three measurements respectively. The measurements needed to accurately predict the scales are listed in Table III.

Some interesting results can be seen. For the flower image and the test-target image, line fill was considered important along with the grain measurement. Both of these scenes contain many fine lines that were generated from vector-based EPS files. The business graphic was best fit with a combination of grain and noise in the dark areas. The building and arch images were both modeled with some form of banding metric; the building was also fit with MTF area, and the arch with the tone reproduction error. This suggests that different criteria were used for the different scenes, as suggested by the dual scaling analysis, yet the images themselves are related to each other along a single dimension, as suggested by the MDS analysis.

[Figure 8: two plots of Model Fit (y-axis) against Experimental Scale (x-axis). Business Scene: series Grain+RMS Black with linear fit y = 0.9329x + 0.0894 (R^2 = 0.9392). Candle Scene: series MTF+Noise+yGrain with linear fit y = 0.9942x + 0.0072 (R^2 = 0.9995).]

Figure 8. Model fits for individual scenes. The business graphic required two measurements to predict scale, while the candle scene required three.

Table III. Measurements necessary to fit experimental data for individual scenes.

Image       Measurements Needed
CANDLE      MTF, Noise, Grain in Y Direction
BUILDING    Tone Reproduction, Grain, Noise
ARCH        MTF, Average Grain, Banding
FLOWER      Grain, Line Fill
BUSINESS    Average Grain, RMS Black
TARGET      Grain, Line Fill, Stroke Width

5. IMAGE QUALITY: A MATTER OF DIFFERENCE?

It has been shown that the experimental quality scales can be modeled on a “system” level, via the measurement of various parameters such as average grain (which in turn is actually a system parameter coupled with a model of the human visual system). Another interesting question is whether the image quality scale can be modeled by simply looking at the differences in the images themselves. The paired comparison paradigm creates an interval scale comparing any image against all of the other images. One possible way to analyze the data is to examine the difference between each image and a standard image. All of the prints in this experiment are binary halftone images printed from a continuous-tone “original.” It is of interest to see if the quality scales can be predicted by examining the differences between each print and the original digital image file.

These differences should not be computed on a strict pixel-by-pixel basis, as that does not consider the human observer who ultimately views the images. The images should first be filtered according to the behavior of the visual system, as described by Zhang and Wandell [10] and Johnson and Fairchild [9,11]. Before calculating the image differences, the images were linearized using the scanner characterization. The linearized images were then filtered using only the luminance CSF filter described by Johnson and Fairchild [9]. A pixel-by-pixel difference was then calculated between the filtered original and the filtered prints. Care was taken to align the images as well as possible prior to this calculation. The pixel differences were calculated after applying a compressive response function similar to that of CIELAB (a 1/3 power) to approximate differences in perceived lightness. The mean and standard deviation of the output error image can then be compared with the image quality scale generated from the paired comparison experiment. Figure 9 shows this comparison for the average image difference and quality scale.
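The final steps of this calculation, the compressive response and the summary statistics of the error image, can be sketched as below. This is a simplified illustration under stated assumptions: the two images are taken to be already linearized, CSF-filtered, registered, and scaled to the range 0 to 1, and the CSF filtering itself is not reproduced. The sketch reports the mean and the RMS of the difference image, which are the two series named in Figure 9; the function names are hypothetical.

```python
import math


def compress(value):
    """Cube-root compressive response, similar in form to CIELAB lightness."""
    return value ** (1.0 / 3.0)


def difference_stats(original, printed):
    """Mean and RMS of the per-pixel difference between two equally sized images.

    Both images are lists of rows of linear values in the range 0 to 1.
    """
    diffs = [
        abs(compress(o) - compress(p))
        for row_o, row_p in zip(original, printed)
        for o, p in zip(row_o, row_p)
    ]
    mean = sum(diffs) / len(diffs)
    rms = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mean, rms
```

Because the difference is taken pixel by pixel, any residual misregistration between the scan and the original contributes directly to both statistics.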

[Figure 9: plot titled “Image Difference Calculation”. x-axis: Experimental Quality Scale; y-axis: Image Difference; series: Mean Difference and RMS Difference, each with a linear fit. Fit equations shown: y = -0.0118x + 0.0926 (R^2 = 0.3488) and y = -0.0098x + 0.1395 (R^2 = 0.4303).]

Figure 9. Comparison of filtered image difference between prints and original and the experimental quality scale.

While a slight trend is suggested, it seems that the average image difference does not correlate well with the experimental quality scales. This is also the case when comparing the individual scenes. This suggests that the quality of the prints is not directly related to the difference between the print and the original digital file used to make the print. One possible reason is that the observers never viewed the original image, and therefore could not use it as a benchmark for judging quality. It might be more appropriate to compare the differences between each image and the image judged to be the best for that scene. The resulting scales for the best and worst predictions are shown in Figure 10.

[Figure 10: two plots of Difference From Best (y-axis, series: Mean with linear fit) against the experimental quality scale (x-axis). Candle: y = -0.095x + 0.1751 (R^2 = 0.6716). Arch: y = -0.0113x + 0.106 (R^2 = 0.0188).]

Figure 10. Comparison of image differences between each image and the “best” for each scene.

The candle scene shows a trend towards decreasing quality as the difference from the “best” increases, though there is an outlying point. The arch scene actually shows a decrease in quality as the difference increases, followed by an increase. This suggests that a simple image difference does not account for the differences in quality of the images. It is important to note that some of the calculated differences were a result of image misregistration, though care was taken to minimize these differences. It is somewhat surprising that the calculated differences between the images were not better correlated with the experimental data, while a small number of printer measurements were. Perhaps this was because the calculations were performed using a simple pixel-by-pixel difference. It may be necessary to alter the luminance filter (CSF) for use with these types of printed images. The use of more complete image difference models might better fit the experimental results [11,12].

6. CONCLUSIONS

Two psychophysical experiments were performed to scale the image quality of six black-and-white electrophotographic printers. A total of six identical digital scenes were printed using the default printer settings, and the resulting prints were used in a paired comparison experiment and a triad similarity-dissimilarity experiment. Thurstone’s law was used to generate interval scales of quality. Dual scaling analysis was performed on the paired comparison data. This analysis indicated that there was a single perceptual dimension that accounted for most of the variability of the image quality judgments. This supports the underlying assumptions behind Thurstone’s law. Multidimensional scaling (MDS) was performed on the similarity-dissimilarity data. This analysis revealed that a single psychological dimension described the pictorial scenes, while the business graphics required two dimensions.

The image quality scales were then modeled using image-independent “system” variables. It was determined that the average quality scale could be modeled well with a single measurement of “average grain.” This measurement was calculated by filtering uniform gray patches using the modified S-CIELAB luminance filters, and taking the RMS of the filtered patches. The individual scenes required more measurements to model, though grain was important for most of them. Finally, the quality scales were modeled using a simple filtered image difference. This technique did not prove to be capable of predicting the quality scales.

7. REFERENCES

1. P.G. Engeldrum, “Extending Image Quality Models,” Proc. IS&T PICS Conference, 65-69 (2002).

2. B.W. Keelan, Handbook of Image Quality: Characterization and Prediction, Marcel Dekker, New York, NY (2002).

3. E.D. Montag and H. Kasahara, “Multidimensional Analysis Reveals Importance of Color for Image Quality,” IS&T/SID 9th Color Imaging Conference, Scottsdale, AZ, 17-21 (2001).

4. L.L. Thurstone, “Psychophysical Analysis,” Am. J. Psych. 38, 368-389 (1927).

5. S. Nishisato, Elements of Dual Scaling, Lawrence Erlbaum Assoc., Hillsdale, NJ (1994).

6. S. Schiffman, Introduction to Multidimensional Scaling, Academic Press, NY (1981).

7. J. Grice and J.P. Allebach, “The Print Quality Toolkit: An Integrated Print Quality Assessment Tool,” Journal of Imaging Science and Technology 43, 187-199 (1999).

8. P. Burns, “Slant edge MTF for digital camera and scanner analysis,” Proc. IS&T PICS Conference, Portland, OR, 135-138 (2000).

9. G.M. Johnson and M.D. Fairchild, “A top-down description of S-CIELAB and CIEDE2000,” Color Research and Application 28, 425-435 (2003).

10. X.M. Zhang and B.A. Wandell, “A spatial extension to CIELAB for digital color image reproduction,” Proceedings of the SID Symposiums (1996).

11. G.M. Johnson and M.D. Fairchild, “Darwinism of color image difference models,” IS&T/SID 9th Color Imaging Conference, Scottsdale, AZ, 108-112 (2001).

12. M.D. Fairchild and G.M. Johnson, “Meet iCAM: A next-generation color appearance model,” IS&T/SID 10th Color Imaging Conference, Scottsdale, AZ, 33-38 (2002).
