PROBABILISTIC APPROACHES TO IMAGE REGISTRATION AND DENOISING
By
AJIT RAJWADE
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2010
2-1 Comparison between different methods of density estimation w.r.t. nature of domain, bias, speed, and geometric nature of density contributions
2-2 Timing values for computation of joint PDFs and L1 norm of difference between PDF computed by sampling and that computed using iso-contours; number of bins is 128 × 128, size of images 122 × 146
3-1 Average and std. dev. of error in degrees (absolute difference between true and estimated angle of rotation) for MI using Parzen windows
3-2 Average value and variance of parameters θ, s and t predicted by various methods (32 and 64 bins, noise σ = 0.2); ground truth: θ = 30, s = t = −0.3
3-3 Average value and variance of parameters θ, s and t predicted by various methods (32 and 64 bins, noise σ = 1); ground truth: θ = 30, s = t = −0.3
3-4 Average error (absolute diff.) and variance in measuring angle of rotation using MI, NMI calculated with different methods, noise σ = 0.05
3-5 Average error (absolute diff.) and variance in measuring angle of rotation using MI, NMI calculated with different methods, noise σ = 0.2
3-6 Average error (absolute diff.) and variance in measuring angle of rotation using MI, NMI calculated with different methods, noise σ = 1
3-7 Three image case: angles of rotation using MMI, MNMI calculated with the iso-contour method and simple histograms, for noise variance σ = 0.05, 0.1, 1 (ground truth 20 and 30 degrees)
3-8 Error (average, std. dev.) validated over 10 trials with LengthProb and histograms for 128 bins; R refers to the intensity range of the image
4-1 MSE for filtered images using our method and using mean shift with Gaussian kernels
4-2 MSE for filtered images using our method, using mean shift with Gaussian kernels and using mean shift with Epanechnikov kernels
7-1 Avg., max. and median error on synthetic patches from Figure 7-4 with MAP and MMSE estimators for DCT bases
7-2 Avg., max. and median error on synthetic patches from Figure 7-4 with MAP and MMSE estimators for the SVD basis of the clean synthetic patch
7-3 PSNR values for noise level σ = 5 on the benchmark dataset
7-4 SSIM values for noise level σ = 5 on the benchmark dataset
7-5 PSNR values for noise level σ = 10 on the benchmark dataset
7-6 SSIM values for noise level σ = 10 on the benchmark dataset
7-7 PSNR values for noise level σ = 15 on the benchmark dataset
7-8 SSIM values for noise level σ = 15 on the benchmark dataset
7-9 PSNR values for noise level σ = 20 on the benchmark dataset
7-10 SSIM values for noise level σ = 20 on the benchmark dataset
2-1 p(α) ∝ area between level curves at α and α + ∆α (i.e. region with red dots)
2-2 (A) Intersection of level curves of I1 and I2: p(α1, α2) ∝ area of dark black regions. (B) Parallelogram approximation: PDF contribution = area(ABCD)
2-3 (A) Area of parallelogram increases as angle between level curves decreases (left to right). Level curves of I1 and I2 are shown in red and blue lines respectively. (B) Joint probability contribution in the case of three images
2-4 A retinogram [1] and its rotated negative
2-5 Following left to right and top to bottom, joint densities of the retinogram images computed by histograms (using 16, 32, 64, 128 bins) and by our area-based method (using 16, 32, 64 and 128 bins)
2-6 Marginal densities of the retinogram image computed by histograms [from (A) to (D)] and our area-based method [from (E) to (H)] using 16, 32, 64 and 128 bins (row-wise order)
2-7 Probability contribution and geometry of isocontour pairs
2-8 Splitting a voxel (A) into 12 tetrahedra, two on each of the six faces of the voxel; and (B) into 24 tetrahedra, four on each of the six faces of the voxel
2-9 Counting level curve intersections within a given half-pixel
2-10 Biased estimates in 3D: (A) Segment of intersection of planar iso-surfaces from the two images, (B) Point of intersection of planar iso-surfaces from the three images (each in a different color)
2-12 Plots of the difference between the joint PDF (of the images in subfigure [A]) computed by the area-based method and by histogramming with Ns sub-pixel samples versus log Ns using (B) L1 norm, (C) L2 norm, and (D) JSD
3-1 Graphs showing the average error and error standard deviation with MI as the criterion for 16, 32, 64, 128 bins with noise σ ∈ {0.05, 0.2, 1}
3-2 MI with 32 and 128 bins for a noise level of 0.05, 0.2 and 1
5-2 Plot of projected normal and von Mises densities
6-1 Mandrill image: (A) with no noise, (B) with noise of σ = 10, (C) with noise of σ = 20; the noise is hardly visible in the textured fur region (viewed best when zoomed in the PDF file)
7-1 Global SVD filtering on the Barbara image
7-2 Patch-based SVD filtering on the Barbara image
7-8 Threshold functions for coefficients of (A) the sixth and (B) the seventh patch from Figure 7-4 when projected onto SVD bases of patches from the database
7-22 For σ = 20, denoised Barbara image with NL-SVD (A) [PSNR = 30.96] and DCT (C) [PSNR = 29.92]. For the same noise level, denoised boat image with NL-SVD (B) [PSNR = 30.24] and DCT (D) [PSNR = 29.95]
7-23 (A) Checkerboard image, (B) noisy version of the image with σ = 20, (C) denoised with NL-SVD (PSNR = 34) and (D) DCT (PSNR = 27). Zoom in for better view
7-24 Absolute difference between true Barbara image and denoised image produced by (A) NL-SVD, (B) BM3D1, (C) BM3D2. All three algorithms were run on the image with noise σ = 20
7-25 A zoomed view of Barbara's face for (A) the original image, (B) NL-SVD and (C) BM3D2. Note the shock artifacts on Barbara's face produced by BM3D2
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
PROBABILISTIC APPROACHES TO IMAGE REGISTRATION AND DENOISING
We can now adopt a change of variables from the spatial coordinates (x, y) to u(x, y) and I(x, y), where u and I are the directions parallel and perpendicular to the level curve of intensity α, respectively. Observe that I points in the direction of the image gradient, i.e. the direction of maximum intensity change. Noting this fact, we now obtain the following:
\[
p(\alpha) = \frac{1}{A} \int_{I(x,y)=\alpha} \begin{vmatrix} \partial x/\partial I & \partial y/\partial I \\ \partial x/\partial u & \partial y/\partial u \end{vmatrix} \, du. \tag{2–27}
\]
Note that in Eq. (2–27), dα and dI have “canceled” each other out, as they both
stand for intensity change. After performing a change of variables and some algebraic
manipulations (see Appendix A for the complete derivation), we get the following
expression for the marginal density
\[
p(\alpha) = \frac{1}{A} \int_{I(x,y)=\alpha} \frac{du}{\sqrt{I_x^2 + I_y^2}}. \tag{2–28}
\]
From the above expression, one can make some important observations. Each point on a given level curve contributes to the density at that intensity a measure that is inversely proportional to the magnitude of the gradient at that point. In other words, in regions of high intensity gradient, the area between two level curves at nearby intensity levels is small compared to that in regions of lower image gradient (see Figure 2-1). When the gradient value at a point is zero (owing to the existence of a peak, a valley, a saddle point or a flat region), the contribution to the density at that point tends to infinity. (The practical repercussions of this situation are discussed later in this chapter.) Lastly, the density at an intensity level can be estimated by traversing the
level curve(s) at that intensity and integrating the reciprocal of the gradient magnitude.
One can obtain an estimate of the density at several intensity levels (at intensity spacing
of h from each other) across the entire intensity range of the image.
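To make this concrete, the following sketch estimates the marginal density on a sampled image by tracing iso-contours and integrating the reciprocal gradient magnitude along them, in the spirit of Eq. (2–28). It is a minimal illustration rather than our actual implementation: it assumes scikit-image is available, substitutes a nearest-pixel gradient lookup for a true interpolant, and the function name is ours.

```python
import numpy as np
from skimage import measure

def marginal_density_isocontours(img, alphas, eps=1e-8):
    """Estimate p(alpha) by integrating du / |grad I| along each iso-contour
    at the requested intensity levels (cf. Eq. 2-28)."""
    gy, gx = np.gradient(img.astype(float))          # gradients along rows, cols
    gmag = np.hypot(gx, gy)
    p = np.zeros(len(alphas))
    for i, a in enumerate(alphas):
        for contour in measure.find_contours(img, a):    # (row, col) polylines
            du = np.hypot(*np.diff(contour, axis=0).T)   # segment lengths
            mid = 0.5 * (contour[:-1] + contour[1:])     # segment midpoints
            r = np.clip(mid[:, 0].round().astype(int), 0, img.shape[0] - 1)
            c = np.clip(mid[:, 1].round().astype(int), 0, img.shape[1] - 1)
            p[i] += np.sum(du / np.maximum(gmag[r, c], eps))  # guard flat spots
    return p / max(p.sum(), eps)    # normalize over the sampled levels
```

The eps guard crudely sidesteps the divergence at zero gradient noted above; Section 2.2.5 deals with this issue in a principled way.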
2.2.2 Related Work
A similar density estimator has also been developed by another group of researchers
[12], completely independently of this work. Their density estimator is motivated
exclusively by random variable transformations and does not incorporate the notion
of level sets. Furthermore, apart from differences in the derivation of the results, there
are differences in implementation. Moreover, the applications they have targeted are mainly image segmentation, particularly in the biomedical domain [13]. Similar notions
of densities obtained from random variable transformations have been mentioned in [14]
in the context of histogram preserving continuous transformations, with applications to
studying different projections of 3D models. However, in their actual implementation,
only digital samples are used, and there is no notion of any joint statistics. The density
estimator presented in this thesis was specifically developed in the context of an image
registration application (more about this in Chapter 3), and has been extended for
various special cases such as images defined in 3D, two or more images in 2D, and biased density estimators in 2D as well as 3D (as will be seen in subsequent
sections of this chapter).
2.2.3 Other Methods for Derivation
There exist at least two other methods of deriving the expression above, which are
discussed below.
1. Using Dirac-delta functions: The Dirac-delta function (with its domain being the real line) is defined as
\[
\delta(x) = \begin{cases} +\infty & \text{if } x = 0 \\ 0 & \text{if } x \neq 0, \end{cases} \tag{2–29}
\]
in such a way that
\[
\int_{-\infty}^{+\infty} \delta(x)\,dx = 1. \tag{2–30}
\]
The delta function has analogous definitions in higher dimensions. It is a well-known property of the delta function (in any dimension) that
\[
\int_{-\infty}^{+\infty} f(\vec{x})\,\delta(I(\vec{x}))\,d\vec{x} = \int_{I^{-1}(0)} \frac{f(\vec{x})}{|\nabla I(\vec{x})|}\,du. \tag{2–31}
\]
Setting f(\vec{x}) to the uniform location density 1/A and considering that I(\vec{x}) is the image function, it is easy to see that
\[
p(I(\vec{x}) = \alpha) = \frac{1}{A}\int \delta(I(\vec{x}) - \alpha)\,d\vec{x} = \frac{1}{A}\int_{I^{-1}(\alpha)} \frac{du}{|\nabla I(\vec{x})|}. \tag{2–32}
\]
2. An intuitive geometric approach: Again consider the 2D gray-scale image intensity to be a continuous, scalar-valued function of the spatial variables, represented as z = I(x, y). Assuming locations are uniformly distributed, the cumulative distribution at a certain intensity level α can be written as follows:
\[
\Pr(z < \alpha) = \frac{1}{A} \iint_{z < \alpha} dx\,dy. \tag{2–33}
\]
Now, the probability density at α is the derivative of the cumulative distribution. This is equal to the difference in the areas enclosed within two level curves that are separated by an intensity difference of ∆α (or equivalently, the area enclosed between two level curves of intensity α and α + ∆α), per unit difference, as ∆α → 0 (see Figure 2-1). At every location (x, y) along the level curve at α, the perpendicular distance (in terms of spatial coordinates) to the level curve at α + ∆α is given as ∆α/g(x, y), where g(x, y) stands for the magnitude of the intensity gradient at (x, y). Hence the total area enclosed between the two level curves can be calculated by integrating this distance all along the contour at α. Denoting the tangent to the level curve as u, and taking the limit as ∆α → 0, we obtain the same expression.
2.2.4 Estimating the Joint Density
Consider two images represented as continuous scalar-valued functions w1 = I1(x, y) and w2 = I2(x, y), whose overlap area is A. As before, assume a location random variable Z = (X, Y) with a uniform distribution over the (overlap) field of view. Further, assume two new random variables W1 and W2 which are transformations of the random variable Z, with the transformations given by the gray-scale image intensity functions W1 = I1(X, Y) and W2 = I2(X, Y). Let the set of all regions whose
intensity in I1 is less than or equal to α1 and whose intensity in I2 is less than or equal
to α2 be denoted by L. The cumulative distribution Pr(W1 ≤ α1,W2 ≤ α2) at intensity
values (α1,α2) is equal to the ratio of the total area of L to the total overlap area A. The
probability density p(α1,α2) in this case is the second partial derivative of the cumulative
distribution w.r.t. α1 and α2. Consider a pair of level curves from I1 having intensity
values α1 and α1 + ∆α1, and another pair from I2 having intensity α2 and α2 + ∆α2. Let
us denote the region enclosed between the level curves of I1 at α1 and α1 + ∆α1 as Q1
and the region enclosed between the level curves of I2 at α2 and α2 + ∆α2 as Q2. Then
p(α1,α2) can geometrically be interpreted as the area of Q1 ∩ Q2, divided by ∆α1∆α2,
in the limit as ∆α1 and ∆α2 tend to zero. The regions Q1, Q2 and also Q1 ∩ Q2 (dark black region) are shown in Figure 2-2(A). Using a technique very similar to that shown
in Eqs. (2–25)-(2–27), we obtain the expression for the joint cumulative distribution as
follows:
\[
\Pr(W_1 \leq \alpha_1, W_2 \leq \alpha_2) = \frac{1}{A} \iint_L dx\,dy. \tag{2–34}
\]
By doing a change of variables, we arrive at the following formula:
\[
\Pr(W_1 \leq \alpha_1, W_2 \leq \alpha_2) = \frac{1}{A} \iint_L \begin{vmatrix} \partial x/\partial u_1 & \partial y/\partial u_1 \\ \partial x/\partial u_2 & \partial y/\partial u_2 \end{vmatrix} \, du_1\,du_2. \tag{2–35}
\]
Here u1 and u2 represent directions along the corresponding level curves of the two
images I1 and I2. Taking the second partial derivative with respect to α1 and α2, we get
the expression for the joint density:
\[
p(\alpha_1, \alpha_2) = \frac{1}{A} \frac{\partial^2}{\partial \alpha_1 \partial \alpha_2} \iint_L \begin{vmatrix} \partial x/\partial u_1 & \partial y/\partial u_1 \\ \partial x/\partial u_2 & \partial y/\partial u_2 \end{vmatrix} \, du_1\,du_2. \tag{2–36}
\]
It is important to note here again that the joint density in (2–36) may not exist, because the cumulative distribution may not be differentiable. Geometrically, this occurs if (a) both
the images have locally constant intensity, (b) if only one image has locally constant
intensity, or (c) if the level sets of the two images are locally parallel. In case (a), we
have area-measures and in the other two cases, we have curve-measures. These cases
are described in detail in the following section, but for the moment, we shall ignore these
degeneracies.
To obtain a complete expression for the PDF in terms of gradients, it is most intuitive to follow purely geometric reasoning. One can observe that the joint probability density p(α1, α2) is the sum total of “contributions” at every intersection between the level curves of I1 at α1 and those of I2 at α2. Each contribution is the area of the parallelogram ABCD [see Figure 2-2(B)] at the level curve intersection, as the intensity differences ∆α1 and ∆α2 shrink to zero. (We consider a parallelogram here because we are approximating the level curves locally as straight lines.) Let the coordinates of the point B be (x, y) and the magnitudes of the gradients of I1 and I2 at this point be g1(x, y) and g2(x, y). Also, let θ(x, y) be the acute angle between the gradients of the two images at B. Observe that the intensity difference between the two level curves of I1 is ∆α1. Then, using the definition of the gradient, the perpendicular distance between the two level curves of I1 is ∆α1/g1(x, y). Looking at triangle CDE (wherein CE is perpendicular to the level curves), we can deduce the length of CD (or equivalently that of AB). Similarly, we can also find the length of CB. The two expressions are given by:
\[
|AB| = \frac{\Delta\alpha_1}{g_1(x, y)\,\sin\theta(x, y)}, \qquad |CB| = \frac{\Delta\alpha_2}{g_2(x, y)\,\sin\theta(x, y)}. \tag{2–37}
\]
Now, the area of the parallelogram is equal to
\[
|AB|\,|CB|\,\sin\theta(x, y) = \frac{\Delta\alpha_1\,\Delta\alpha_2}{g_1(x, y)\,g_2(x, y)\,\sin\theta(x, y)}. \tag{2–38}
\]
With this, we finally obtain the following expression for the joint density:
\[
p(\alpha_1, \alpha_2) = \frac{1}{A} \sum_{C} \frac{1}{g_1(x, y)\,g_2(x, y)\,\sin\theta(x, y)}, \tag{2–39}
\]
where the set C represents the (countable) locus of all points where I1(x , y) = α1
and I2(x , y) = α2. It is easy to show through algebraic manipulations that Eqs. (2–36)
and (2–39) are equivalent formulations of the joint probability density p(α1,α2). These
results could also have been derived purely by manipulation of Jacobians (as done
while deriving marginal densities), and the derivation for the marginals could also have
proceeded following geometric intuitions.
The formula derived above tallies beautifully with intuition in the following ways.
Firstly, the area of the parallelogram ABCD (i.e. the joint density contribution) in regions
of high gradient [in either or both image(s)] is smaller as compared to that in the case of
regions with lower gradients. Secondly, the area of parallelogram ABCD (i.e. the joint
density contribution) is the least when the gradients of the two images are orthogonal
and maximum when they are parallel or coincident [see Figure 2-3(A)]. In fact, the
joint density tends to infinity in the case where either (or both) gradient(s) is (are)
zero, or when the two gradients align, so that sin θ is zero. The repercussions of this
phenomenon are discussed in the following section.
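In code, the contribution of a single level-curve intersection reduces to the reciprocal of the cross-product magnitude of the two gradients, since g1 g2 sin θ = |g1 × g2|. A minimal sketch (the function name is ours):

```python
import numpy as np

def joint_contribution(g1, g2):
    """Contribution of one level-curve intersection to p(a1, a2), Eq. (2-39):
    1 / (g1 g2 sin(theta)) = 1 / |g1 x g2| for 2D gradient vectors g1, g2."""
    cross = abs(g1[0] * g2[1] - g1[1] * g2[0])       # = |g1| |g2| sin(theta)
    return np.inf if cross < 1e-12 else 1.0 / cross  # diverges if flat/aligned
```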
2.2.5 From Densities to Distributions
In the two preceding sub-sections, we observed the divergence of the marginal
density in regions of zero gradient, or of the joint density in regions where either (or
both) image gradient(s) is (are) zero, or when the gradients locally align. The gradient
goes to zero in regions of the image that are flat in terms of intensity, and also at peaks,
valleys and saddle points on the image surface. We can ignore the latter three cases
as they are a finite number of points within a continuum. The probability contribution
at a particular intensity in a flat region is proportional to the area of that flat region.
Some ad hoc approaches could involve simply “weeding out” the flat regions altogether,
but that would require the choice of sensitive thresholds. The key thing is to notice
that in these regions, the density does not exist but the probability distribution does.
So, we can switch entirely to probability distributions everywhere by introducing a
non-zero lower bound on the “values” of ∆α1 and ∆α2. Effectively, this means that
we always look at parallelograms representing the intersection between pairs of level
curves from the two images, separated by non-zero intensity difference, denoted
as, say, h. Since these parallelograms have finite areas, we have circumvented the
situation of choosing thresholds to prevent the values from becoming unbounded, and the probability at (α1, α2), denoted p(α1, α2), is obtained from the areas of such parallelograms. We term this area-based method of density estimation AreaProb. Later on in this thesis, we shall show that the switch to distributions is principled and does not reduce our technique to standard histogramming in any manner whatsoever.
The notion of an image as a continuous entity is one of the pillars of our approach. We adopt a locally linear formulation here for the sake of simplicity, though our technical contributions are in no way tied to any specific interpolant.
For each image grid point, we estimate the intensity values at its four neighbors within
a horizontal or vertical distance of 0.5 pixels. We then divide each square defined by
these neighbors into a pair of triangles. The intensities within each triangle can be
represented as a planar patch, which is given by the equation z1 = A1x + B1y + C1 in
I1. Iso-intensity lines at levels α1 and α1 + h within this triangle are represented by the
equations A1x +B1y +C1 = α1 and A1x +B1y +C1 = α1+ h (likewise for the iso-intensity
lines of I2 at intensities α2 and α2 + h, within a triangle of corresponding location). The
contribution from this triangle to the joint probability at (α1,α2), i.e. p(α1,α2) is the
area bounded by the two pairs of parallel lines, clipped against the body of the triangle
itself, as shown in Figure 2-7. In the case that the corresponding gradients from the two
images are parallel (or coincident), they enclose an infinite area between them, which
when clipped against the body of the triangle, yields a closed polygon of finite area, as
shown in Figure 2-7. When both the gradients are zero (which can be considered to be a
special case of gradients being parallel), the probability contribution is equal to the area
of the entire triangle. In the case where the gradient of only one of the images is zero,
the contribution is equal to the area enclosed between the parallel iso-intensity lines
of the other image, clipped against the body of the triangle (see Figure 2-7). Observe
that though we have to treat pathological regions specially (despite having switched to
distributions), we now do not need to select thresholds, nor do we need to deal with a
mixture of densities and distributions. The other major advantage is added robustness to
noise, as we are now working with probabilities instead of their derivatives, i.e. densities.
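As an illustration of this computation, the sketch below clips the two intensity bands against one half-pixel triangle via polygon intersection. It is a sketch under stated assumptions, not our actual implementation: it relies on the shapely library, approximates each (conceptually infinite) band by a large bounding strip whose half-length `bound` must exceed the triangle's extent, and all function names are ours. The flat-patch cases discussed above would be handled separately.

```python
import numpy as np
from shapely.geometry import Polygon

def band_polygon(A, B, C, lo, hi, bound=10.0):
    """The strip  lo <= A*x + B*y + C <= hi  as a large finite polygon.
    Assumes (A, B) != (0, 0), i.e. the patch is not flat."""
    g = np.hypot(A, B)                     # gradient magnitude of the patch
    n = np.array([A, B]) / g               # unit normal to the level lines
    t = np.array([-n[1], n[0]])            # direction along the level lines
    p_lo, p_hi = n * (lo - C) / g, n * (hi - C) / g  # points on the two lines
    return Polygon([p_lo + bound * t, p_hi + bound * t,
                    p_hi - bound * t, p_lo - bound * t])

def triangle_contribution(tri, patch1, patch2, a1, a2, h):
    """Contribution of one half-pixel triangle (list of 3 vertices) to
    p(a1, a2): the band of patch1 = (A1, B1, C1) between a1 and a1 + h,
    intersected with the band of patch2, clipped against the triangle."""
    region = Polygon(tri)
    region = region.intersection(band_polygon(*patch1, a1, a1 + h))
    region = region.intersection(band_polygon(*patch2, a2, a2 + h))
    return region.area
```

Summing these contributions over all half-pixel triangles and all level pairs, and normalizing by the overlap area A, yields the AreaProb table.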
The issue that now arises is how the value of h may be chosen. It should be noted that although there is no “optimal” h, our density estimate conveys more and more information as the value of h is reduced (in complete contrast to standard histogramming). In Figure 2-5, we have shown plots of our joint density estimate and compared it to standard histograms for P equal to 16, 32, 64 and 128 bins in each image (i.e. 32², 64², etc. bins in the joint), which illustrate our point clearly. We found
that the standard histograms had a far greater number of empty bins than our density
estimator, for the same number of intensity levels. The corresponding marginal discrete
distributions for the original retinogram image [1] for 16, 32, 64 and 128 bins are shown
in Figure 2-6.
2.2.6 Joint Density between Multiple Images in 2D
For the simultaneous registration of multiple (d > 2) images, the use of a single
d-dimensional joint probability has been advocated in previous literature [15], [16]. Our
joint probability derivation can be easily extended to the case of d > 2 images by using
similar geometric intuition to obtain the polygonal area between d intersecting pairs of
level curves [see Figure 2-3(B) for the case of d = 3 images]. Note here that the d-dimensional joint distribution lies essentially in a 2D subspace, as we are dealing with 2D images. A naïve implementation of such a scheme has a complexity of O(NP^d),
where P is the number of intensity levels chosen for each image and N is the size of
each image. Interestingly, however, this exponential cost can be side-stepped by first
computing the at most (d(d−1)/2)P² points of intersection between pairs of level curves
from all d images with one another, for every pixel. Secondly, a graph can be created,
each of whose nodes is an intersection point. Nodes are linked by edges labeled with
the image number (say k th image) if they lie along the same iso-contour of that image. In
most cases, each node of the graph will have a degree of four (and in the unlikely case
where level curves from all images are concurrent, the maximal degree of a node will
be 2d). Now, this is clearly a planar graph with V = (d(d−1)/2)P² vertices and, since a typical node has degree four, E = 2V edges. Hence, by Euler's formula, the number of (convex polygonal) faces is F = E − V + 2 = d(d−1)P² − (d(d−1)/2)P² + 2 = O(P²d²), which
is quadratic in the number of images. The areas of the polygonal faces are the contributions
to the joint probability distribution. In a practical implementation, there is no requirement
to even create the planar graph. Instead, we can implement a simple incremental
face-splitting algorithm ([17], section 8.3). In such an implementation, we create a list of
faces F which is updated incrementally. To start with, F consists of just the triangular
face constituting the three vertices of a chosen half-pixel in the image. Next, we consider
a single level-line l at a time and split into two any face in F that l intersects. This
procedure is repeated for all level lines (separated by a discrete intensity spacing) of all
the d images. The final output is a listing of all polygonal faces F created by incremental
splitting which can be created in just O(FPd) time. The storage requirement can be
made polynomial by observing that for d images, the number of unique intensity tuples
will be at most FN in the worst case (as opposed to P^d). Hence all intensity tuples can
be efficiently stored and indexed using a hash table.
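A sketch of one face-splitting step, leaning on shapely (whose shapely.ops.split splits a polygon by a line); the helper name is ours, and each level line is assumed to have been converted into a straight LineString spanning the half-pixel triangle:

```python
from shapely.geometry import Polygon, LineString
from shapely.ops import split

def insert_level_line(faces, line):
    """One step of incremental face splitting: split every face that the
    level line crosses into two; faces it misses pass through unchanged."""
    out = []
    for face in faces:
        out.extend(split(face, line).geoms)
    return out

# Start from one half-pixel triangle and insert the level lines of all d
# images one at a time (a stand-in line is shown here):
faces = [Polygon([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])]
for line in [LineString([(-0.1, 0.5), (1.1, 0.5)])]:
    faces = insert_level_line(faces, line)
# The area of each resulting face is a contribution to the d-image joint
# distribution, indexed by the intensity tuple of that face.
```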
2.2.7 Extensions to 3D
When estimating the probability density from 3D images, the choice of an optimal
smoothing parameter is a less critical issue, as a much larger number of samples
are available. However, at a theoretical level this still remains a problem, which would
worsen in the multiple image case. In 3D, the marginal probability can be interpreted as
the total volume sandwiched between two iso-surfaces at neighboring intensity levels.
The formula for the marginal density p(α) of a 3D image w = I(x, y, z) is given as follows:
\[
p(\alpha) = \frac{1}{V} \frac{d}{d\alpha} \iiint_{I(x,y,z) \leq \alpha} dx\,dy\,dz. \tag{2–40}
\]
Here V is the volume of the image I(x, y, z). We can now adopt a change of variables from the spatial coordinates x, y and z to u1(x, y, z), u2(x, y, z) and I(x, y, z), where I is the direction perpendicular to the level surface (i.e. parallel to the gradient) and u1 and u2 are mutually perpendicular directions parallel to the level surface. Noting this fact, we now obtain the following:
\[
p(\alpha) = \frac{1}{V} \iint_{I(x,y,z)=\alpha} \begin{vmatrix} \partial x/\partial I & \partial y/\partial I & \partial z/\partial I \\ \partial x/\partial u_1 & \partial y/\partial u_1 & \partial z/\partial u_1 \\ \partial x/\partial u_2 & \partial y/\partial u_2 & \partial z/\partial u_2 \end{vmatrix} \, du_1\,du_2. \tag{2–41}
\]
Upon a series of algebraic manipulations just as before, we are left with the following
expression for p(α):
\[
p(\alpha) = \frac{1}{V} \iint_{I(x,y,z)=\alpha} \frac{du_1\,du_2}{\sqrt{\left(\frac{\partial I}{\partial x}\right)^2 + \left(\frac{\partial I}{\partial y}\right)^2 + \left(\frac{\partial I}{\partial z}\right)^2}}. \tag{2–42}
\]
For the joint density case, consider two 3D images represented as w1 = I1(x , y , z)
and w2 = I2(x , y , z), whose overlap volume (the field of view) is V . The cumulative
distribution Pr(W1 ≤ α1,W2 ≤ α2) at intensity values (α1,α2) is equal to the ratio of
the total volume of all regions whose intensity in the first image is less than or equal
to α1 and whose intensity in the second image is less than or equal to α2, to the total
image volume. The probability density p(α1,α2) is again the second partial derivative
of the cumulative distribution. Consider two regions R1 and R2, where R1 is the region
trapped between level surfaces of the first image at intensities α1 and α1 + ∆α1, and R2
is defined analogously for the second image. The density is proportional to the volume
of the intersection of R1 and R2 divided by ∆α1 and ∆α2 when the latter two tend to zero.
It can be shown through some geometric manipulations that the area of the base of the parallelepiped formed by the iso-surfaces is
\[
\frac{\Delta\alpha_1\,\Delta\alpha_2}{|\vec{g}_1 \times \vec{g}_2|} = \frac{\Delta\alpha_1\,\Delta\alpha_2}{g_1\,g_2\,|\sin\theta|},
\]
where g1 and g2 are the gradient vectors of the two images, and θ is the angle between them. Let h be a vector which points in the direction of the height of the parallelepiped (parallel to the base normal, i.e. g1 × g2), and dh be an infinitesimal step in that direction. Then the probability density is given as follows:
\[
p(\alpha_1, \alpha_2) = \frac{1}{V} \frac{\partial^2}{\partial \alpha_1 \partial \alpha_2} \iiint_{V_s} dx\,dy\,dz
= \frac{1}{V} \frac{\partial^2}{\partial \alpha_1 \partial \alpha_2} \iiint_{V_s} \frac{du_1\,du_2\,dh}{|\vec{g}_1 \times \vec{g}_2|}
= \frac{1}{V} \int_{C} \frac{dh}{|\vec{g}_1 \times \vec{g}_2|}. \tag{2–43}
\]
In Eq. (2–43), u1 and u2 are directions parallel to the iso-surfaces of the two images, and h is their cross-product direction (parallel to the line of intersection of the individual planes), while C is the 3D space curve containing the points where I1 and I2 have values α1 and α2 respectively, and Vs is defined as the set {(x, y, z) : I1(x, y, z) ≤ α1, I2(x, y, z) ≤ α2}.
2.2.8 Implementation Details for the 3D case
The density formulation for the 3D case suffers from the same problem of
divergence to infinity, as in the 2D case. Similar techniques can be employed, this
time using level surfaces that are separated by finite intensity gaps. To trace the level
surfaces, each cube-shaped voxel in the 3D image can be divided into 12 tetrahedra.
The apex of each tetrahedron is located at the center of the voxel and the base is
formed by dividing one of the six square faces of the cube by one of the diagonals of
that face [see Figure 2-8(A)]. Within each triangular face of each such tetrahedron, the
intensity can be assumed to be a linear function of location. Note that the intensities
in different faces of one and the same tetrahedron can thus be expressed by different
functions, all of them linear. Hence the iso-surfaces at different intensity levels within a
single tetrahedron are non-intersecting but not necessarily parallel. These level surfaces
at any intensity within a single tetrahedron turn out to be either triangles or quadrilaterals
in 3D. This interpolation scheme does have some bias in the choice of the diagonals
that divide the individual square faces. A scheme that uses 24 tetrahedra with the apex
at the center of the voxel, and four tetrahedra based on every single face, has no bias
of this kind [see Figure 2-8(B)]. However, we still used the former (and faster) scheme
as it is simpler and does not noticeably affect the results. Level surfaces are again
traced at a finite number of intensity values, separated by equal intensity intervals. The
marginal density contributions are obtained as the volumes of convex polyhedra trapped
in between consecutive level surfaces clipped against the body of individual tetrahedra.
The joint distribution contribution from each voxel is obtained by finding the volume of
the convex polyhedron resulting from the intersection of corresponding convex polyhedra
from the two images, clipped against the tetrahedra inside the voxel. We refer to this
scheme of finding joint densities as VolumeProb.
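For concreteness, the sketch below enumerates the 12 tetrahedra of a unit voxel under the first (faster) scheme: apex at the voxel centre, base equal to half of a square face split along a fixed diagonal. The function name and the choice of diagonal are ours.

```python
import numpy as np
from itertools import product

def voxel_to_12_tetrahedra():
    """Vertex lists (apex first) of the 12 tetrahedra of the unit voxel:
    two per face, obtained by splitting each face along one diagonal."""
    centre = np.full(3, 0.5)
    corners = np.array(list(product((0.0, 1.0), repeat=3)))
    tets = []
    for axis in range(3):
        for side in (0.0, 1.0):
            idx = [i for i, c in enumerate(corners) if c[axis] == side]
            a, b, c, d = idx[0], idx[1], idx[3], idx[2]  # cycle around face
            tets.append([centre, corners[a], corners[b], corners[c]])
            tets.append([centre, corners[a], corners[c], corners[d]])
    return tets   # 6 faces x 2 halves = 12 tetrahedra
```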
2.2.9 Joint Densities by Counting Points and Measuring Lengths
For the specific case of registration of two images in 2D, we present another
method of density estimation. This method, which was presented by us earlier in [10],
is a biased estimator that does not assume a uniform distribution on location. In this
technique, the total number of co-occurrences of intensities α1 and α2 from the two
images respectively, is obtained by counting the total number of intersections of the
corresponding level curves. Each half-pixel can be examined to see whether level
curves of the two images at intensities α1 and α2 can intersect within the half-pixel.
This process is repeated for different (discrete) values from the two images (α1 and α2),
separated by equal intervals and selected a priori (see Figure 2-9). The co-occurrence
counts are then normalized so as to yield a joint probability mass function (PMF). We
denote this method as 2DPointProb. The marginals are obtained by summing up the
joint PMF along the respective directions. This method, too, avoids the histogramming
binning problem as one has the liberty to choose as many level curves as desired.
However, it is a biased density estimator because more points are picked from regions
with high image gradient. This is because more level curves (at equi-spaced intensity
levels) are packed together in such areas. It can also be regarded as a weighted version
of the joint density estimator presented in the previous sub-section, with each point
weighted by the gradient magnitudes of the two images at that point as well as the sine
of the angle between them. Thus the joint PMF by this method is given as
\[
p(\alpha_1, \alpha_2) = \frac{\partial^2}{\partial \alpha_1 \partial \alpha_2} \frac{1}{K} \iint_{D} g_1(x, y)\,g_2(x, y)\,\sin\theta(x, y)\,dx\,dy, \tag{2–44}
\]
where D denotes the regions where I1(x, y) ≤ α1, I2(x, y) ≤ α2, and K is a normalization constant. This simplifies to the following:
\[
p(\alpha_1, \alpha_2) = \frac{1}{K} \sum_{C} 1. \tag{2–45}
\]
Hence, we have p(α1, α2) = |C|/K, where C is the (countable) set of points where I1(x, y) = α1 and I2(x, y) = α2. The marginal (biased) density estimates can be regarded as lengths of the individual iso-contours. With this notion in mind, the marginal density estimates are seen to have a close relation with the total variation of the image, which is given by TV = ∬ |∇I(x, y)| dx dy [18]. Performing the same change of variables (from (x, y) to (u, I)) as in Eqs. (2–27) and (2–28) gives TV = ∫ (∫_{I=α} du) dα, whose inner integral is precisely the length of the iso-contours at any given intensity level. In 3D, we consider the segments of intersection of two iso-surfaces and calculate their lengths, which become the PMF contributions. We refer to this as LengthProb [see Figure 2-10(A)].
Both 2DPointProb and LengthProb, however, require us to ignore those regions in which
level sets do not exist because the intensity function is flat, or those regions where level
sets from the two images are parallel. The case of flat regions in one or both images can
be fixed to some extent by slight blurring of the image. The case of aligned gradients
is trickier, especially if the two images are in complete registration. However, in the
multi-modality case or if the images are noisy/blurred, perfect registration is a rare
occurrence, and hence perfect alignment of level surfaces will rarely occur.
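A sketch of the 2DPointProb vote for one half-pixel triangle: solve the 2 × 2 linear system formed by the two planar patches and test whether the intersection point lies inside the triangle. The helper names are ours, and the degenerate (parallel or flat) cases discussed above simply cast no vote:

```python
import numpy as np

def point_in_triangle(p, tri):
    """Barycentric sign test for a point against a triangle (3 vertex pairs)."""
    (x, y), (x1, y1), (x2, y2), (x3, y3) = p, *tri
    d = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    l1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / d
    l2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / d
    return l1 >= 0 and l2 >= 0 and (1 - l1 - l2) >= 0

def vote(patch1, patch2, tri, a1, a2):
    """One co-occurrence vote for (a1, a2): do the level lines I1 = a1 and
    I2 = a2 of the planar patches z = A x + B y + C meet inside the triangle?"""
    (A1, B1, C1), (A2, B2, C2) = patch1, patch2
    M = np.array([[A1, B1], [A2, B2]])
    if abs(np.linalg.det(M)) < 1e-12:     # parallel or flat patches: no vote
        return False
    xy = np.linalg.solve(M, np.array([a1 - C1, a2 - C2]))
    return point_in_triangle(xy, tri)
```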
To summarize, in both these techniques, location is treated as a random variable with a distribution that is not uniform, but instead peaked at (biased towards) locations where specific features of the image itself (such as gradients) have large magnitudes, or where gradient vectors from the two images are closer to being perpendicular than parallel. Such a bias towards high gradients is principled, as these
both these density estimators work quite well on affine registration, and that Length-
Prob is more than 10 times faster than VolumeProb. This is because the computation
of segments of intersection of planar iso-surfaces is much faster than computing
polyhedron intersections. Joint PMF plots for histograms and LengthProb for 128 bins
and 256 bins are shown in Figure 2-11.
There exists one more major difference between AreaProb and VolumeProb on
one hand, and LengthProb or 2DPointProb on the other. The former two can be easily
extended to compute joint density between multiple images (needed for co-registration
of multiple images using measures such as modified mutual information (MMI) [15]). All
that is required is the intersection of multiple convex polyhedra in 3D or multiple convex
polygons in 2D (see Section 2.2.6). However, 2DPointProb is strictly applicable to the
case of the joint PMF between exactly two images in 2D, as the problem of intersection
of three or more level curves at specific (discrete) intensity levels is over-constrained.
In 3D, LengthProb also deals with strictly two images only, but one can extend the
LengthProb scheme to also compute the joint PMF between exactly three images. This
can be done by making use of the fact that three planar iso-surfaces intersect in a point
(excepting degenerate cases) [see Figure 2-10(b)]. The joint PMFs between the three
images are then computed by counting point intersections. We name this method 3DPointProb. The differences between the aforementioned methods (AreaProb, 2DPointProb, VolumeProb, LengthProb and 3DPointProb) are summarized in Table 2-1
for quick reference. It should be noted that 2DPointProb, LengthProb and 3DPointProb
Figure 2-1. p(α) ∝ area between level curves at α and α+∆α (i.e. region with red dots)
compute PMFs, whereas AreaProb and VolumeProb compute cumulative measures
over finite intervals.
2.3 Experimental Results: Area-Based PDFs Versus Histograms with Several Sub-Pixel Samples
The histogram estimate will no doubt approach the true PDF as
the number of samples Ns (drawn from sub-pixel locations) tends to infinity. However,
we wish to point out that our method implicitly and efficiently considers every point as
a sample, thereby constructing the PDF directly, i.e. the accuracy of what we calculate
with the area-based method will always be an upper bound on the accuracy yielded by
any sample-based approach, under the assumption that the true interpolant is known to
us. We show here an anecdotal example for the same, in which the number of histogram
samples Ns is varied from 5000 to 2 × 10⁹. The L1 and L2 norms of the difference between the joint PDF of two 90 × 109 images (down-sampled MR-T1 and MR-T2 slices
obtained from Brainweb [19]) as computed by our method and that obtained by the
histogram method, as well as the Jensen-Shannon divergence (JSD) between the two
joint PDFs, are plotted in the figures below versus logNs (see Figure 2-12). The number
of bins used was 128× 128 (i.e. h = 128). Visually, it was observed that the joint density
surfaces begin to appear ever more similar as Ns increases. The timing values for the
joint PDF computation are shown in Table 2-2, clearly showing the greater efficiency of
our method.
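For reference, the baseline being compared against here, a joint histogram built from Ns bilinearly interpolated sub-pixel samples, can be sketched as follows (a minimal version assuming SciPy; the function name is ours):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def sampled_joint_pdf(img1, img2, n_samples, bins=128, seed=0):
    """Joint PMF from n_samples random sub-pixel locations, using bilinear
    interpolation (order=1) of both images at the sampled points."""
    rng = np.random.default_rng(seed)
    h, w = img1.shape
    rows = rng.uniform(0, h - 1, n_samples)
    cols = rng.uniform(0, w - 1, n_samples)
    v1 = map_coordinates(img1.astype(float), [rows, cols], order=1)
    v2 = map_coordinates(img2.astype(float), [rows, cols], order=1)
    hist, _, _ = np.histogram2d(v1, v2, bins=bins)
    return hist / hist.sum()
```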
Figure 2-2. (A) Intersection of level curves of I1 and I2: p(α1, α2) ∝ area of dark black regions. (B) Parallelogram approximation: PDF contribution = area(ABCD)
Figure 2-3. (A) Area of parallelogram increases as angle between level curves decreases (left to right). Level curves of I1 and I2 are shown in red and blue lines respectively. (B) Joint probability contribution in the case of three images
Table 2-1. Comparison between different methods of density estimation w.r.t. nature of domain, bias, speed, and geometric nature of density contributions

Method        2D/3D   Density Contr.   Bias   No. of images
AreaProb      2D      Area             No     Any
VolumeProb    3D      Volume           No     Any
LengthProb    3D      Length           Yes    2 only
2DPointProb   2D      Point count      Yes    2 only
3DPointProb   3D      Point count      Yes    3 only
Figure 2-4. A retinogram [1] and its rotated negative
Figure 2-5. Following left to right and top to bottom, joint densities of the retinogramimages computed by histograms (using 16, 32, 64, 128 bins) and by ourarea-based method (using 16, 32, 64 and 128 bins)
Figure 2-6. Marginal densities of the retinogram image computed by histograms [from(A) to (D)] and our area-based method [from (E) to (H)] using 16, 32, 64 and128 bins (row-wise order)
Table 2-2. Timing values for computation of joint PDFs and L1 norm of difference between PDF computed by sampling and that computed using iso-contours; number of bins is 128 × 128, size of images 122 × 146

Method         Time (secs.)   Diff. with iso-contour PDF
Iso-contours   5.1            0
Figure 2-7. Left: Probability contribution equal to area of parallelogram between levelcurves clipped against the triangle, i.e. half-pixel. Middle: Case of parallelgradients. Right: Case when the gradient of one image is zero (blue levellines) and that of the other is non-zero (red level lines). In each case,probability contribution equals area of the dark black region
Figure 2-8. Splitting a voxel (A) into 12 tetrahedra, two on each of the six faces of thevoxel; and (B) into 24 tetrahedra, four on each of the six faces of the voxel
Figure 2-9. Counting level curve intersections within a given half-pixel
Figure 2-10. Biased estimates in 3D: (A) Segment of intersection of planar iso-surfacesfrom the two images, (B) Point of intersection of planar iso-surfaces fromthe three images (each in a different color)
Figure 2-12. Plots of the difference between the joint PDF (of the images in subfigure[A]) computed by the area-based method and by histogramming with Nssub-pixel samples versus logNs using (B) L1 norm, (C) L2 norm, and (D)JSD
CHAPTER 3
APPLICATION TO IMAGE REGISTRATION
3.1 Entropy Estimators in Image Registration
Information theoretic tools have for a long time been established as the de facto
technique for image registration, especially in the domains of medical imaging [20] and
remote sensing [21] which deal with a large number of modalities. The ground-breaking
work for this was done by Viola and Wells [22], and Maes et al. [23] in their widely cited papers. A detailed survey of subsequent research on information theoretic techniques
in medical image registration is presented in the works of Pluim et al. [20] and Maes
et al. [24]. A required component of all information theoretic techniques in image
registration is a good estimator of the joint entropies of the images being registered.
Most techniques employ plug-in entropy estimators, wherein the joint and marginal
probability densities of the intensity values in the images are first estimated and these
quantities are then used to obtain the entropy. There also exist recent methods which
define a new form of entropy using cumulative distributions instead of probability
densities (see [25], [26] and [27]). Furthermore, there also exist techniques which
directly estimate the entropy, without estimating the probability density or distribution as
an intermediate step [28]. Below, we present a bird’s eye view of these techniques and
their limitations. Subsequently, we introduce our method and bring out its salient merits.
The plug-in entropy estimators rely upon techniques for density estimation as a
key first step. The most popular density estimators are the simple image histogram and
the Parzen window. The latter have been widely employed as a differentiable density
estimator for image registration in [22]. The problems associated with these estimators were noted above. In our experiments on recovering a known rotation, the noise standard deviations used were 0.01, 0.05, 0.1, 0.2, 0.5, 1 and 2. All these values are chosen for an intensity
range between 0 and 1. To create the probability distributions, we chose bin counts of
16, 32, 64 and 128. For each combination of bin-count and noise, a brute-force search
was performed so as to optimally align the synthetically rotated noisy image with the
original one, as determined by finding the maximum of MI or NMI between the two
images. Six different techniques were used for MI estimation: (1) simple histograms with
bilinear interpolation for image warping (referred to as “Simple Hist”), (2) our proposed
method using iso-contours (referred to as “Iso-contours”), (3) histogramming with
partial volume interpolation (referred to as “PVI”) (4) histogramming with cubic spline
interpolation (referred to as “Cubic”), (5) the method 2DPointProb proposed in [10], and
(6) simple histogramming with 10⁶ samples taken from sub-pixel locations uniformly
randomly, followed by the usual binning (referred to as “Hist Samples”). These experiments
were repeated for 30 noise trials at each noise standard deviation. For each method, the
mean and the variance of the error (absolute difference between the predicted alignment
and the ground truth alignment) was measured (Figure 3-1). The same experiments
were also performed using a Parzen-window based density estimator using a Gaussian
kernel and σ = 5 (referred to as “Parzen”) over 30 trials. In each trial, 10,000 samples
were chosen. Out of these, 5000 were chosen as centers for the Gaussian kernel and
the rest were used for the sake of entropy computation. The error mean and variance
was recorded (see Table 3-1).
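For clarity, the plug-in MI computation and the brute-force rotation search used in these experiments can be sketched as below. This is a schematic, not the exact experimental code: scipy.ndimage.rotate with bilinear interpolation stands in for the image warp, np.histogram2d stands in for whichever density estimator is being tested, and the function names are ours.

```python
import numpy as np
from scipy.ndimage import rotate

def mutual_information(joint):
    """Plug-in MI, H(X) + H(Y) - H(X, Y), from an (unnormalized) joint table."""
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    ent = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return ent(px) + ent(py) - ent(pxy)

def best_rotation(fixed, moving, angles, bins=64):
    """Brute-force search: the angle maximizing MI between the fixed image
    and the rotated moving image."""
    def mi_at(a):
        warped = rotate(moving, a, reshape=False, order=1)   # bilinear warp
        joint, _, _ = np.histogram2d(fixed.ravel(), warped.ravel(), bins=bins)
        return mutual_information(joint)
    return max(angles, key=mi_at)
```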
The adjoining error plots (Figure 3-1) show results for all these methods and all bin counts, for noise levels of 0.05, 0.2 and 1. The accompanying MI trajectories (for all methods except histogramming with multiple sub-pixel samples) for bin counts of 32 and 128 and noise levels of 0.05, 0.2 and 1.00 are shown as well, for the sake of comparison, for one arbitrarily chosen noise trial (Figure 3-2). From these figures,
one can appreciate the superior resistance to noise shown by both our methods, even
at very high noise levels, as evidenced both by the shape of the MI and NMI trajectories,
as well as the height of the peaks in these trajectories. Amongst the other methods,
we noticed that PVI is more stable than simple histogramming with either bilinear or
cubic-spline based image warping. In general, the other methods perform better when
the number of histogram bins is small, but even there our method yields a smoother
MI curve. However, as expected, noise does significantly lower the peak in the MI as
well as NMI trajectories in the case of all methods including ours, due to the increase
in joint entropy. Though histogramming with 10⁶ sub-pixel samples performs well (as
seen in Figure 3-1), our method efficiently and directly (rather than asymptotically)
approaches the true PDF and hence the true MI value, under the assumption that we
have access to the true interpolant. Parzen windows with the chosen σ value of 5 gave
good performance, comparable to our technique, but we wish to re-emphasize that the
choice of the parameter was arbitrary and the computation time was much greater for
Parzen windows.
All the aforementioned techniques were also tested on affine image registration
(except for histogramming with multiple sub-pixel samples and Parzen windowing,
which were found to be too slow). For the same image as in the previous experiment,
an affine-warped version was created using the parameters θ = 30, t = −0.3, s = −0.3 and φ = 0. During our experiments, we performed a brute force search on
the three-dimensional parameter space so as to find the transformation that optimally
aligned the second image with the first one. The exact parameterization for the affine
transformation is given in [40]. Results were collected for a total of 20 noise trials
and the average predicted parameters were recorded as well as the variance of
the predictions. For a low noise level of 0.01 or 0.05, we observed that all methods
performed well for a quantization up to 64 bins. With 128 bins, all methods except
the two we have proposed broke down, i.e. yielded a false optimum of θ around 38,
and s and t around 0.4. For higher noise levels, all methods except ours broke down
at a quantization of just 64 bins. The 2DPointProb technique retained its robustness
until a noise level of 1, whereas the area-based technique still produced an optimum
of θ = 28, s = -0.3, t = -0.4 (which is very close to the ideal value). The area-based
technique broke down only at a very high noise level of 1.5 or 2. The average and
standard deviation of the estimate of the parameters θ, s and t, for 32 and 64 bins, for all
five methods and for noise levels 0.2 and 1.00 are presented in Tables 3-2 and 3-3. We
also performed two-sided Kolmogorov-Smirnov tests [41] for statistical significance on
the absolute errors (between the true and estimated affine transformation parameters)
yielded by standard histogramming and the isocontour method, both for 64 bins
and a noise of variance 1. We found that the difference in the error values for MI, as
computed using standard histogramming and our iso-contour technique, was statistically
significant, as ascertained at a level of 0.01.
We also performed experiments on determining the angle of rotation using larger
images with varying levels of noise (σ = 0.05, 0.2, 1). The same Brainweb images,
as mentioned before, were used, except that their original size of 183 × 219 was
retained. For a bin count of up to 128, most methods performed quite well (using a
brute-force search) even under high noise. However with a large bin count (256 bins),
the noise resistance of our method stood out. The results of this experiment with
different methods and under varying noise are presented in Tables 3-4, 3-5 and 3-6.
3.3.2 Registration of Multiple Images in 2D
The images used were pre-registered MR-PD, MR-T1 and MR-T2 slices (from
Brainweb) of size 90 × 109. The latter two were rotated by θ1 = 20 and θ2 = 30 degrees
respectively (see Figure 3-3). For different noise levels and intensity quantizations,
a set of experiments was performed to optimally align the latter two images with the
former using modified mutual information (MMI) and its normalized version (MNMI) as
criteria. These criteria were calculated using our area-based method as well as simple
histogramming with bilinear interpolation. The range of angles was from 1 to 40 in
steps of 1. The estimated values of θ1 and θ2 are presented in Table 3-7.
3.3.3 Registration of Volume Datasets
Experiments were performed on sub-volumes of size 41 × 41 × 41 from MR-PD
and MR-T2 datasets from the Brainweb simulator [19]. The MR-PD portion was warped
by 20 degrees about the Y as well as the Z axis. A brute-force search (from 5 to 35 degrees in steps of 1, with a joint PMF of 64 × 64 bins) was performed so as to optimally register the
MR-T2 volume with the pre-warped MR-PD volume. The PMF was computed both
using LengthProb as well as using simple histogramming, and used to compute the
MI/NMI just as before. The computed values were also plotted against the two angles as
indicated in the top row of Figure 3-4. As the plots indicate, both the techniques yielded
the MI peak at the correct point in the (θY, θZ) plane, i.e. at (20, 20). When the same
experiments were run using VolumeProb, we observed that the joint PMF computation
for the same intensity quantization was more than ten times slower. Similar experiments
were performed for registration of three volume datasets in 3D, namely 41 × 41 × 41
sub-volumes of MR-PD, MR-T1 and MR-T2 datasets from Brainweb. The three datasets
were warped through −2, −21 and −30 degrees around the X axis. A brute force search was
performed so as to optimally register the latter two datasets with the former using MMI
as the registration criterion. Joint PMFs of size 64 × 64 × 64 were computed and these
were used to compute the MMI between the three images. The MMI peak occurred
when the second dataset was warped through θ2 = 19 and the third was warped
through θ3 = 28, which is the correct optimum. The plots of the MI values calculated by simple histogramming and by 3DPointProb versus the two angles are shown in Figure 3-4 (bottom row).
The next experiment was designed to check the effect of zero mean Gaussian
noise on the accuracy of affine registration of the same datasets used in the first
experiment, using histogramming and LengthProb. Additive Gaussian noise of variance
σ² was added to the MR-PD volume. Then, the MR-PD volume was warped by a 4 × 4 affine transformation matrix (expressed in homogeneous coordinate notation) given as A = S H Rz Ry Rx T, where Rz, Ry and Rx represent rotation matrices about the Z, Y and X axes respectively, H is a shear matrix and S represents a diagonal scaling matrix whose diagonal elements are given by 2^sx, 2^sy and 2^sz. (A translation matrix T is included as well. For more information on this parameterization, please see [42].)
The MR-T1 volume was then registered with the MR-PD volume using a coordinate
descent on all parameters. The actual transformation parameters were chosen to be 7 degrees for all angles of rotation and shearing, and 0.04 for sx, sy and sz. For a smaller number
of bins (32), it was observed that both the methods gave good results under low noise
and histogramming occasionally performed better. Table 3-8 shows the performance
of histograms and LengthProb for 128 bins, over 10 different noise trials. Summarily,
we observed that our method produced superior noise resistance as compared to
histogramming when the number of bins was larger. To evaluate the performance on real
data, we chose volumes from the Visible Human Dataset (Male). We took sub-volumes of the MR-PD and MR-T1 volumes of size 101 × 101 × 41 (slices 1110 to 1151). The two volumes were almost in complete registration, so we warped the former using an affine transformation matrix with 5 degrees for all angles of rotation and shearing, and a value of 0.04 for sx, sy and sz, resulting in a matrix with sum of absolute values 3.6686. A
coordinate descent algorithm for 12 parameters was executed on mutual information
calculated using LengthProb so as to register the MR-T1 dataset with the MR-PD
dataset, producing a registration error of 0.319 (see Figure 3-5).
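For reference, one common realization of the 4 × 4 homogeneous parameterization A = S H Rz Ry Rx T described above is sketched below (the exact parameterization we used is given in [42]; the helper names, the upper-triangular shear convention and the rotation sign conventions are ours):

```python
import numpy as np

def rot4(axis, deg):
    """4 x 4 homogeneous rotation about the X (0), Y (1) or Z (2) axis
    (sign conventions vary across parameterizations)."""
    t = np.deg2rad(deg)
    R = np.eye(4)
    i, j = [(1, 2), (0, 2), (0, 1)][axis]
    R[i, i] = R[j, j] = np.cos(t)
    R[i, j], R[j, i] = -np.sin(t), np.sin(t)
    return R

def affine_4x4(angles, shears, scales, trans):
    """A = S @ H @ Rz @ Ry @ Rx @ T in homogeneous coordinates."""
    Rx, Ry, Rz = (rot4(k, a) for k, a in enumerate(angles))
    H = np.eye(4)
    H[0, 1], H[0, 2], H[1, 2] = shears               # shear terms
    S = np.diag([2.0 ** s for s in scales] + [1.0])  # diagonal 2^s scaling
    T = np.eye(4)
    T[:3, 3] = trans
    return S @ H @ Rz @ Ry @ Rx @ T
```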
3.4 Discussion
Thus far in this chapter and the previous one, we have presented a new density
estimator which is essentially geometric in nature, using continuous image representations
and treating the probability density as area sandwiched between iso-contours at
intensity levels that are infinitesimally apart. We extended the idea to the case of joint
density between two images, both in 2D and 3D, as well as the case of multiple images
in 2D. Empirically, we showed superior noise resistance on registration experiments
involving rotations and affine transformations. Furthermore, we also suggested a
faster, biased alternative based on counting pixel intersections which performs well,
and extended the method to handle volume datasets. The relationship between our
techniques and histogramming with multiple sub-pixel samples was also discussed.
A few clarifications are in order:
1. Comparison to histogramming on an up-sampled image: If an image is up-sampled several times and histogramming is performed on it, there will be more samples for the histogram estimate. At a theoretical level, though, there is still the issue of not being able to relate the number of bins to the available number of samples. Furthermore, it is recommended that the rate of increase in the number of bins be less than the square root of the number of samples for computing the joint density between two images [16], [43]. If there are d images in all, the number of bins ought to be less than N^(1/d), where N is the total number of pixels, or samples to be taken [16], [43]. Consider that this criterion suggested that N samples were enough for a joint density between two images with χ bins, so that χ < N^(1/2). Suppose that we now wished to compute a joint density with χ bins for d images of the same size; this requires more than χ^d ≈ N^(d/2) samples, i.e. the images would have to be up-sampled by a factor of at least N^((d−2)/2), which is exponential in the number of images. Our simple area-based method clearly avoids this problem.
2. Choice of interpolant: We chose a (piece-wise) linear interpolant for the sake of simplicity, though in principle any other interpolant could be used. It is true that we are making an assumption on the continuity of the intensity function, which may be violated in natural images. However, given a good enough resolution of the input image, interpolation across a discontinuity will have a negligible impact on the density, as those discontinuities are essentially a measure zero set. One could even incorporate an edge-preserving interpolant [44] by running an anisotropic diffusion to detect the discontinuities and then taking care not to interpolate across the two sides of an edge.
3. Non-differentiability: The PDF estimates of our method are not differentiable, which can pose a problem for non-rigid registration applications. Differentiability could be achieved by fitting (say) a spline to the obtained probability tables. However, this again requires smoothing the density estimate in a manner that is not tied to the image geometry, and hence goes against the philosophy of our approach. For practical or empirical reasons, however, there is no reason why one should not experiment with this. Moreover, we currently do not have a closed-form expression for our density estimate. Expressing the marginal and joint densities solely in terms of the parameters of the chosen image interpolant is a challenging problem.
Table 3-1. Average and std. dev. of error in degrees (absolute difference between true and estimated angle of rotation) for MI using Parzen windows
Figure 3-1. Graphs showing the average error A (i.e., absolute difference between the estimated and the true angle of rotation) and error standard deviation S with MI as the criterion for 16, 32, 64, 128 bins (row-wise) with a noise of 0.05 [from (A) to (D)], with a noise of 0.2 [from (E) to (H)] and with a noise of 1 [from (I) to (L)]. Inside each sub-figure, error-bars are plotted for six different methods, in the following order: Simple Histogramming, Iso-contours, PVI, Cubic, 2DPointProb, Histogramming with 10^6 samples. Error-bars show the values of A − S, A, A + S. If S is small, only the value of A is shown.
Figure 3-2. First two: MI for 32, 128 bins with noise level of 0.05; third and fourth: with a noise level of 0.2; fifth and sixth: with a noise level of 1.0. In all plots, dark blue: iso-contours, cyan: 2DPointProb, black: cubic, red: simple histogramming, green: PVI. (Note: These plots should be viewed in color.)
Figure 3-3. MR slices of the brain: (A) MR-PD slice, (B) MR-T1 slice rotated by 20 degrees, (C) MR-T2 slice rotated by 30 degrees
Figure 3-4. MI computed using (A) histogramming and (B) LengthProb (plotted versus θ_Y and θ_Z); MMI computed using (C) histogramming and (D) 3DPointProb (plotted versus θ_2 and θ_3)
Figure 3-5. TOP ROW: original PD image (left), warped T1 image (middle), image overlap before registration (right). MIDDLE ROW: PD image warped using the predicted matrix (left), warped T1 image (middle), image overlap after registration (right). BOTTOM ROW: PD image warped using the ideal matrix (left), warped T1 image (middle), image overlap after registration in the ideal case (right)
Table 3-2. Average value and variance of parameters θ, s and t predicted by various methods (32 and 64 bins, noise σ = 0.2); Ground truth: θ = 30, s = t = −0.3
Table 3-3. Average value and variance of parameters θ, s and t predicted by various methods (32 and 64 bins, noise σ = 1); Ground truth: θ = 30, s = t = −0.3
Table 3-7. Three image case: angles of rotation using MMI, MNMI calculated with the iso-contour method and simple histograms, for noise variance σ = 0.05, 0.1, 1 (Ground truth 20 and 30)
Table 3-8. Error (average, std. dev.) validated over 10 trials with LengthProb and histograms for 128 bins; R refers to the intensity range of the image
To the best of our knowledge, ours is the first piece of work on denoising which explicitly incorporates the relationship between image level curves and uses local interpolation between pixel values in order to perform filtering. Future work will involve a
more detailed investigation into the relationship between our work and that in [58], by
computing the areas of the contributing regions with explicit treatment of the image
I (x , y) as a surface embedded in 3D. Secondly, we also plan to develop topologically
inspired criteria to automate the choice of the spatial neighborhood and the parameter
Wr for controlling the anisotropic smoothing.
It should be noted that the main aim of this chapter was to demonstrate the effect
of using interpolant information for denoising. Our contributions lie within the mean
shift framework, and therefore we have performed comparisons with other methods
that lie within this framework. For this reason, we have not performed experimental
comparisons with some leading local convolution approaches like [64] or [65].
Figure 4-1. Image contour maps in a neighborhood: (A) high and low gradient regions in a neighborhood around a pixel (dark dot); (B) a contour map of an RGB image in a neighborhood, where red, green and blue contours correspond to contours of the R, G, B channels respectively, and the tessellation induced by the above level-curve pairs contains 19 facets; (C) a tessellation induced by RGB level curve pairs and the square pixel grid
Table 4-1. MSE for filtered images using (M1) our method with W_s = W_r = 3, (M2) mean shift with Gaussian kernels with W_s = W_r = 3, and (M3) mean shift with Gaussian kernels with W_s = W_r = 5. MSE = mean-squared error in the corrupted image. Intensity scale is from 0 to 255.
Table 4-2. MSE for filtered images using (M1) our method with W_s = W_r = 6, (M2) mean shift with Gaussian kernels with W_s = W_r = 6, and (M3) mean shift with Epanechnikov kernels with W_s = W_r = 6. MSE = mean-squared error in the corrupted image. Intensity scale is from 0 to 255 for each channel.
Figure 4-2. For each image, top left: original image, top right: degraded image with zero mean Gaussian noise of std. dev. 0.003, bottom left: result obtained by our algorithm, bottom right: mean shift with Gaussian kernel. For both methods, W_s = W_r = 3; viewed best when zoomed in the pdf file
Figure 4-3. For each image, top left: original image, top right: degraded image with zero mean Gaussian noise of std. dev. 0.003, bottom left: result obtained by our algorithm, bottom right: mean shift with Gaussian kernel. For both methods, W_s = W_r = 3; viewed best when zoomed in the pdf file
Figure 4-4. Top left: original image, top right: degraded image with zero mean Gaussian noise of std. dev. 0.003, bottom left: result obtained by our algorithm, bottom right: mean shift with Gaussian kernel. For both methods, W_s = W_r = 3; viewed best when zoomed in the pdf file
Figure 4-5. (A), (C) and (E): Fingerprint image subjected to additive Gaussian noise of std. dev. σ = 5/255, 10/255 and 15/255 respectively. (B), (D) and (F): Denoised versions of (A), (C) and (E) respectively. Viewed best when zoomed in the pdf file (in color).
Figure 4-6. A plot of the performance of our algorithm on the benchmark dataset, averaged over all images from each noise model (additive Gaussian (AWGN), multiplicative Gaussian (MWGN) and Poisson) and over all five σ values, using MSE (top) and MSSIM (bottom) as the metric; viewed best when zoomed in the pdf file (in color)
Figure 4-7. For each image, top left: original image, top right: degraded image with zero mean Gaussian noise of std. dev. 0.003, bottom left: result obtained by our algorithm, bottom right: mean shift with Gaussian kernel; for both methods, W_s = W_r = 3; viewed best when zoomed in the pdf file
Figure 4-8. For each image, top left: original image, top right: degraded image with zero mean Gaussian noise of std. dev. 0.003, bottom left: result obtained by our algorithm, bottom right: mean shift with Gaussian kernel; for both methods, W_s = W_r = 3; viewed best when zoomed in the pdf file
Figure 4-9. An image and its corrupted version obtained by adding chromaticity noise (top left and top right respectively). Results obtained by filtering with our method (bottom left), and with Gaussian mean shift (bottom right); viewed best when zoomed in the pdf file (in color)
Figure 4-10. An image and its corrupted version obtained by adding chromaticity noise (top left and top right respectively). Results obtained by filtering with our method (bottom left), and with Gaussian mean shift (bottom right); viewed best when zoomed in the pdf file (in color)
Figure 4-11. First two images: frames from the corrupted sequence. Third and fourth: images filtered by our algorithm. Fifth and sixth images: a slice through the tenth row of the corrupted and filtered video sequences; images are numbered left to right, top to bottom
CHAPTER 5
A RELATED PROBLEM: DIRECTIONAL STATISTICS IN EUCLIDEAN SPACE
5.1 Introduction
When the samples do not reside in Euclidean space, conventional density
estimation techniques such as mixtures of Gaussians or kernel density estimation
(KDE) using Gaussian kernels cannot be applied directly. For the special case when the data reside on S^{n-1}, i.e. the unit sphere embedded in R^n, there exists extensive literature from
the field of directional statistics that is summarized in several exemplary books such as
[66]. Conventionally, for KDE or mixture model density estimation of unit vectors, the
Gaussian kernel has been replaced by von-Mises or von-Mises Fisher (voMF) kernels
for circular and spherical data respectively. These computational techniques have
been applied for solving numerous problems in computer vision, image processing,
medical imaging and computer graphics. Mixture modeling for directional data was
proposed originally by Kim et al. [67]. Banerjee et al. [68] also proposed a mixture
model for directional data and applied it for clustering problems. In medical imaging,
McGraw et al. [69] have modeled the displacement of water molecules in high angular
resolution diffusion images by means of a voMF mixture model. More recently, mixture
models of circular data have also been used for trajectory shape analysis in studying
object motion [70]. KDE of unit-vector data has been used in the context of smoothing
chromaticity vectors in color images [48]. Applications of such density estimators
in computer graphics include the work on approximation of the Torrance-Sparrow
Bidirectional Reflectance Functions (BRDF) as reported in [71], or the recent work in
[72] for approximating the distribution of surface normals. Eugeciouglu et al. [73] use a
kernel based on powers of cosines instead of a voMF in KDE, motivated by the superior
computational speed of the cosine estimator, and apply their technique for the analysis
of flow vectors in fluid mechanics.
The above techniques ignore the fact that the directional data are often obtained
as a transformation of the original measurements which are typically assumed to
reside in Euclidean space. Therefore the true probability density of the unit vector data
is related to that of the original data by means of a relationship dictated by random
variable transformations, a key concept in basic probability theory [74]. However,
a kernel density estimate or a mixture model estimate using (say) voMF kernels
ignores this very fundamental relationship. The technique proposed here exploits
exactly this relationship in the following way: (1) It performs density estimation in the
original space, and (2) It then transforms this density to the directional space using
random variable transformations. Thereby, it avoids the aforementioned inconsistency.
Secondly, conventional density estimation techniques for directional data also require
the solution of complicated nonlinear equations for key parameter updates such as
the covariance. This issue is completely circumvented by the presented technique.
A density estimator is built also for another directional quantity: hue in color images
(part of the HSI or hue-saturation-intensity color model), which is computed from a very
different transformation of the RGB color values obtained from a sensor (camera).
This chapter is organized as follows. Section 5.2 is a review on the choice of
kernel for density estimation for circular and spherical data. The drawbacks of these
approaches are enumerated and a new approach to density estimation for directional
data is introduced. This concept is extended for hue data in Section 5.3. A discussion is
presented in Section 5.4.
5.2 Theory
In this section, the theory of the new method is presented, starting with a review on
the choice of kernels for directional density estimation in contemporary vision literature.
5.2.1 Choice of Kernel
There exists a plethora of kernels used for estimating the density of unit vector
data, and the reasons for choosing one over the other require careful study. For KDE
of directional data, the voMF kernel is highly popular [69]. It has great computational
convenience because (1) it is symmetric, (2) it yields elegant closed-form formulae
for the Renyi entropy of a voMF mixture model, and for the distance between two
voMF distributions [69], and (3) the information-geometric properties of voMF mixtures
are simple [69]. Despite these algebraic properties, there are ambiguities [75] in the
oft-repeated [68] notion that the voMF is the ‘spherical analogue of the Gaussian’. The
voMF distribution does possess properties similar to a Gaussian such as those related
to maximum likelihood, and maximum differential entropy for fixed mean and variance,
besides symmetry. However, the voMF also differs from the Gaussian in the sense that
(1) the central limit theorem on the sphere does not involve the voMF but a uniform
distribution instead [75], (2) the voMF is not the solution to the isotropic heat equation
on the sphere [67] and (3) the convolution of two voMF distributions does not produce
exactly another voMF [66]. If we restrict ourselves to just the non-negative orthant of the
sphere (i.e. axial data), then the Bingham distribution also possesses many properties
similar to the Gaussian [75]. Another popular kernel for axial statistics is the Watson
distribution [76]. Some papers even consider a symmetrized version of the voMF kernel,
for instance [69]. However, the choice between Bingham, Watson and symmetrized
voMF kernels is unclear, and they will produce different density estimates for finite
sample sizes. Often, the motivation for choosing one over the other is computational
convenience, which is the chief reason behind the popularity of the voMF kernel.
5.2.2 Using Random Variable Transformation
The aforementioned density estimation techniques for directional data typically
assume that only the final unit vector data are available. However, very often in
computer vision applications, the original data are available as the output of a sensor.
These are then converted into unit vectors typically (though not always - see Section
5.3) by means of a projective transformation (unit normalization). The best instance
thereof is that color images are output by a camera usually in RGB format and
the intensity triple at each pixel is unit-normalized to produce chromaticity vectors.
Similarly, surface normals output by a 3D scanner are unit-normalized to produce the
corresponding unit vectors. KDE or mixture modeling techniques for spherical data are
applied thereafter.
The new approach to density estimation for directional data, which directly exploits the fact that the unit vectors are a transformation of the original data, is now presented. Consider the original data to be a random variable X with a probability density function (PDF) p(X). Let Y = f(X) be a known function of X. Then the PDF of Y is given by

p(Y = y) = \int_{f^{-1}(y)} \frac{p(X = x)}{|f'(x)|}\, dx.    (5–1)

Here f^{-1}(y) represents the set of all those values x such that f(x) = y. This is known as a random variable transformation in density estimation [74], and is a very fundamental concept in probability theory.
This principle for estimating the density of unit vectors is presented as follows. Let the original random variable in R^2 be \vec{W} having density p(\vec{W}), and let \vec{V} = \vec{W}/|\vec{W}| = g(\vec{W}) be its directional component. Clearly, \vec{V} is defined on S^1. Let \vec{w} = (x, y) be a sample of \vec{W} and \vec{v} = \vec{w}/|\vec{w}| be the directional component of \vec{w}. Let the polar coordinate representation of \vec{w} be (r, \theta). Now, the joint density of (r, \theta) is given by

p(r, \theta) = \frac{p(x, y)}{\left|\frac{\partial(r,\theta)}{\partial(x,y)}\right|} = r\, p(x, y).    (5–2)

By integrating out the radius, we have the density of \theta, i.e. the density of the unit vector \vec{v} = \vec{w}/|\vec{w}|, as follows:

p(\vec{V} = \vec{v}) = \int_{r=0}^{\infty} p(r, \theta)\, dr = \int_{r=0}^{\infty} r\, p(x, y)\, dr.    (5–3)
If \vec{w} is a sample from an isotropic Gaussian distribution of variance \sigma^2 and centered at (0, 0), then it follows that

p(\vec{V} = \vec{v}) = \frac{1}{2\pi\sigma^2} \int_{r=0}^{\infty} r\, e^{-\frac{r^2}{2\sigma^2}}\, dr = \frac{1}{2\pi}.    (5–4)
If \vec{w} is a sample from an isotropic Gaussian distribution of variance \sigma^2 and centered at (x_0, y_0), then it follows that

p(\vec{V} = \vec{v}) = \frac{1}{2\pi\sigma^2} \int_{r=0}^{\infty} r\, e^{-(r^2 + r_0^2 - 2 r r_0 \cos(\theta - \theta_0))/(2\sigma^2)}\, dr    (5–5)

where (r_0, \theta_0) is a polar coordinate representation for (x_0, y_0). Upon simplification, we have:

p(\vec{v}) = \frac{1}{2\pi\sigma^2}\left[\sigma^2 \exp\left(-\frac{r_0^2}{2\sigma^2}\right) + \sigma\sqrt{\frac{\pi}{2}}\, r_0 \cos(\theta - \theta_0)\left(1 + \operatorname{erf}\left(\frac{r_0 \cos(\theta - \theta_0)}{\sigma\sqrt{2}}\right)\right) \exp\left(-\frac{r_0^2 \sin^2(\theta - \theta_0)}{2\sigma^2}\right)\right].    (5–6)
As seen from the previous equations, a random variable transformation of a vector-valued
Gaussian random variable followed by marginalization over the magnitude component
does not yield a von-Mises distribution. In fact, a von-Mises is obtained by conditioning the value of r on some constant (typically r = 1), as opposed to integrating over r (see pages 107-108 of [5], and [77]), and therefore represents a conditional and not a marginal density. The density in Equation 5–6 above is known in the statistics literature as that of a projected normal distribution [78] or angular Gaussian distribution [66]; however, to the best of this author's knowledge, it has not been introduced in the computer vision community so far. Furthermore, it has not been employed in a KDE or mixture modeling framework so far (see Section 5.2.3 and Section 5.2.4).
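The closed form in Equation 5–6 is easy to verify numerically. The following sketch, assuming numpy and scipy are available, compares it against a normalized histogram of the angles of samples drawn from the corresponding 2D Gaussian; the parameter values are arbitrary illustrations.

```python
import numpy as np
from scipy.special import erf

def projected_normal_pdf(theta, x0, y0, sigma):
    """Density of the angle of w ~ N((x0, y0), sigma^2 I): Equation 5-6."""
    r0 = np.hypot(x0, y0)
    theta0 = np.arctan2(y0, x0)
    t = r0 * np.cos(theta - theta0)          # r0 cos(theta - theta0)
    return (1.0 / (2 * np.pi * sigma**2)) * (
        sigma**2 * np.exp(-r0**2 / (2 * sigma**2))
        + sigma * np.sqrt(np.pi / 2) * t * (1 + erf(t / (sigma * np.sqrt(2))))
        * np.exp(-(r0**2 - t**2) / (2 * sigma**2)))

rng = np.random.default_rng(0)
x0, y0, sigma = 1.0, 0.5, 0.8
w = rng.normal([x0, y0], sigma, size=(200000, 2))
angles = np.arctan2(w[:, 1], w[:, 0])
hist, edges = np.histogram(angles, bins=90, range=(-np.pi, np.pi), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
# The discrepancy should shrink toward zero as the sample size grows.
print(np.max(np.abs(hist - projected_normal_pdf(centers, x0, y0, sigma))))
```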
5.2.3 Application to Kernel Density Estimation
Now, suppose ~w follows some unknown distribution. The density of ~w is conventionally
approximated by means of kernel methods acting on N samples of the random variable.
If a Gaussian kernel centered at each sample and having variance σ2 is used, then we
have:

p(\vec{w}) = \frac{1}{2\pi N \sigma^2} \sum_{i=1}^{N} \exp\left(-\frac{|\vec{w} - \vec{w}_i|^2}{2\sigma^2}\right).    (5–7)
The earlier procedure will yield the following estimate of the density of \vec{v}:

p_1(\vec{v}) = \int_{r=0}^{\infty} p(r, \theta)\, dr = \int_{r=0}^{\infty} \frac{r}{2N\pi\sigma^2} \sum_{i=1}^{N} e^{-(r^2 + r_i^2 - 2 r r_i \cos(\theta - \theta_i))/(2\sigma^2)}\, dr    (5–8)

where (r_i, \theta_i) is the standard polar coordinate representation for the sample point \vec{w}_i = (x_i, y_i). After evaluating the integral, we obtain the following expression:
p_1(\vec{v}) = \frac{1}{2\pi N \sigma^2} \sum_{i=1}^{N} \left[\sigma^2 \exp\left(-\frac{r_i^2}{2\sigma^2}\right) + \sigma\sqrt{\frac{\pi}{2}}\, r_i \cos(\theta - \theta_i)\left(1 + \operatorname{erf}\left(\frac{r_i \cos(\theta - \theta_i)}{\sigma\sqrt{2}}\right)\right) \exp\left(-\frac{r_i^2 \sin^2(\theta - \theta_i)}{2\sigma^2}\right)\right].    (5–9)
Let p_2(\vec{v}) be the estimate of the density of \theta using the popular von-Mises kernel with a concentration parameter \kappa. Then we have:

p_2(\vec{v}) = \frac{1}{2\pi I_0(\kappa) N} \sum_{i=1}^{N} e^{\kappa\, \vec{v}^T \vec{w}_i / |\vec{w}_i|}    (5–10)
where I_0(\kappa) is the modified Bessel function of order zero. It is easy to see that for finite sample sizes, p_1(\vec{v}) ≠ p_2(\vec{v}) in general, even if a suitable variable bandwidth kernel density estimate is used for p_2(\vec{v}). Equation 5–9 is clearly different from a superposition of von-Mises kernels, and can be considered as a directional density estimator for unit-vector data on S^1 obtained by a unit-normalization operation of original data in R^2, using a new kernel G:

p(\vec{v}) = \frac{1}{N} \sum_{i=1}^{N} G(\vec{v};\, \vec{w}_i, \sigma)    (5–11)
where G is defined as follows:

G(\vec{v};\, \vec{w}_i, \sigma) = \frac{1}{2\pi} \exp\left(-\frac{|\vec{w}_i|^2}{2\sigma^2}\right) + \frac{\vec{v}\cdot\vec{w}_i}{2\sqrt{2\pi}\,\sigma}\left(1 + \operatorname{erf}\left(\frac{\vec{v}\cdot\vec{w}_i}{\sigma\sqrt{2}}\right)\right) \exp\left(-\frac{|\vec{w}_i|^2 - (\vec{v}\cdot\vec{w}_i)^2}{2\sigma^2}\right).    (5–12)
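A minimal sketch of this estimator follows, assuming numpy and scipy; G_kernel implements Equation 5–12 and directional_kde implements Equation 5–11. Note that the estimator consumes the raw Euclidean samples, never their unit-normalized versions.

```python
import numpy as np
from scipy.special import erf

def G_kernel(theta, w, sigma):
    """Projected-normal kernel of Equation 5-12; theta: array of query
    angles on S^1, w: one 2D sample, sigma: Euclidean-space bandwidth."""
    v = np.stack([np.cos(theta), np.sin(theta)], axis=-1)   # points on S^1
    t = v @ w                                               # v . w_i
    w2 = w @ w                                              # |w_i|^2
    return (np.exp(-w2 / (2 * sigma**2)) / (2 * np.pi)
            + t / (2 * np.sqrt(2 * np.pi) * sigma)
            * (1 + erf(t / (sigma * np.sqrt(2))))
            * np.exp(-(w2 - t**2) / (2 * sigma**2)))

def directional_kde(theta, samples, sigma):
    """Equation 5-11: average of G over the Euclidean samples."""
    return np.mean([G_kernel(theta, w, sigma) for w in samples], axis=0)
```

As a sanity check, for a single sample at the origin the kernel reduces to the uniform density 1/(2π), consistent with Equation 5–4.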
A similar PDF can be defined for unit vector data (denoted \vec{v}) on S^2, obtained by projective transformation of data \vec{w} = (x, y, z) residing in R^3 and belonging to an isotropic Gaussian distribution centered at \vec{w}_i = (x_i, y_i, z_i). This yields the following expression, where (x, y, z) = r\vec{v}:

p(\vec{v}) = \frac{1}{\sigma^3 (2\pi)^{1.5}} \int_{r=0}^{\infty} r^2\, e^{-\frac{(x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2}{2\sigma^2}}\, dr    (5–13)

p(\vec{v}) = \frac{e^{-\frac{|\vec{w}_i|^2 - (\vec{v}\cdot\vec{w}_i)^2}{2\sigma^2}}}{2\sigma^2 (2\pi)^{1.5}} \left(\sqrt{2\pi}\left[\operatorname{erf}\left(\frac{\vec{v}\cdot\vec{w}_i}{\sigma\sqrt{2}}\right) + 1\right]\left[(\vec{v}\cdot\vec{w}_i)^2 + \sigma^2\right] + 2\sigma\,(\vec{v}\cdot\vec{w}_i)\, e^{-\frac{(\vec{v}\cdot\vec{w}_i)^2}{2\sigma^2}}\right).    (5–14)
The key feature of the kernel density estimation approach in this section (and also the
pith of this chapter, in general) is that the model-fitting (selection of parameters such as
σ) can all be done in Euclidean space. The new kernels proposed in Equations 5–12
and 5–14 appear only in an emergent way out of the random variable transformation.
5.2.4 Mixture Models for Directional Data
Existing mixture modeling algorithms have difficulties associated with the choice
of the number of mixture components and local minima issues during model fitting.
Additionally, there are other practical difficulties involved in mixture modeling for the case
of directional data. Firstly, if von-Mises kernels [68] are used, the maximum-likelihood
estimate of the variance (or concentration parameter, often denoted as κ) is not
available in closed form and requires the solution to a non-linear equation involving
Bessel functions. In [68], the parameter κ is updated using various approximations for
the Bessel functions that are part of the normalization constant for voMF distributions,
followed by the addition of an empirically discovered bias that is a polynomial function of
the estimated mean vectors. The difficulties faced by a mixture of voMF distributions in
modeling data that are spread out anisotropically are mitigated by the use of a mixture of Watson kernels, as claimed in [76]. Nonetheless, iterative numerical procedures to estimate κ are still required, and the case where a full covariance matrix is to be estimated will be even more complicated. Moreover, the method in [76] also requires solving non-linear equations for the update of the centers of the individual components, over and above the κ values. Furthermore, the update of the mean vectors in both [68] and [76] involves vector addition followed by unit normalization, which is unstable if antipodal vectors are involved, as the norm of the resultant vector will be very small.
The approach based on the theory presented in the previous subsections
overcomes these difficulties by following a two-step procedure: (1) a mixture-model fit in
the original Euclidean space given a set of N samples, followed by (2) a transformation
of random variables. If a Gaussian mixture model is fit to the original data samples,
using M components, with priors p_k, centers (\mu_{x_k}, \mu_{y_k}) = (r_k \cos\theta_k, r_k \sin\theta_k) and variances \sigma_k^2, then a random variable transformation results in the following form of directional mixture model:

p(\vec{v}) = \int_{r=0}^{\infty} r \sum_{k=1}^{M} \frac{p_k}{2\pi\sigma_k^2}\, e^{-\frac{(x - \mu_{x_k})^2 + (y - \mu_{y_k})^2}{2\sigma_k^2}}\, dr = \sum_{k=1}^{M} p_k\, G(\vec{v};\, \vec{\mu}_k, \sigma_k),    (5–15)
where G was defined in Equation 5–12. Since the entire mixture-modeling procedure is
performed in the original space, the aforementioned difficulties in estimating the mean
and concentration parameters are automatically avoided.
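As an illustrative sketch, the two-step procedure can be realized with an off-the-shelf Gaussian mixture fit followed by the transformation of Equation 5–15, reusing the G_kernel sketch from above. Scikit-learn is assumed here purely for convenience, and the spherical covariance restriction mirrors the isotropic assumption of the derivation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def directional_mixture(theta, samples, M=3):
    """Fit an isotropic GMM in R^2, then read off the directional
    mixture of Equation 5-15 at the query angles theta."""
    gmm = GaussianMixture(n_components=M, covariance_type='spherical').fit(samples)
    sigmas = np.sqrt(gmm.covariances_)          # per-component std. deviations
    return sum(p * G_kernel(theta, mu, s)
               for p, mu, s in zip(gmm.weights_, gmm.means_, sigmas))
```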
If we continue to follow this line of reasoning, we can now achieve a fresh
perspective on mixtures of voMF distributions as well. As mentioned previously and as
clearly documented in [5], the voMF distribution is obtained from a Gaussian distribution
by conditioning the magnitude of the random variable to be some constant. If we fit a
Gaussian mixture model to the original data and expressed it in polar coordinates, we
102
are left with the following expression:
p(r , θ) =M∑k=1
pk2πσ2k
e− (r cos θ−rk cos θk )
2+(r sin θ−rk sin θk )2
2σ2k . (5–16)
By conditioning on r = 1, we have:
p(θ|r = 1) =M∑k=1
pk2πI0(
rkσ2k)erk cos (θ−θk )
σ2k . (5–17)
This procedure basically suggests again that the entire mixture modeling algorithm can be executed in Euclidean space, and that a mixture of voMF distributions can be obtained by conditioning the magnitude of the random variable to be 1 (or some other constant)¹. The polar coordinate transformation yields a formula for the concentration parameter \kappa_k of the k-th component, given as \kappa_k = r_k/\sigma_k^2. This procedure therefore suggests a viable alternative to fitting a mixture of voMF distributions when the original data are available (and not just the unit-vector data). Similar expressions can be derived for the case of data on S^2 derived from R^3 as well.
5.2.5 Properties of the Projected Normal Estimator

The projected normal distribution is symmetric and unimodal, just like the von-Mises distribution. Figure 5-1 shows the projected normal distribution corresponding to an original Gaussian distribution centered at \vec{\mu}_0 = (1, 0) having a standard deviation of \sigma_0 = 10, and a von-Mises distribution centered at (1, 0) with \kappa_0 = |\vec{\mu}_0|/\sigma_0^2 = 0.01. Similarly, plots of the projected normal distribution on S^2 for an original Gaussian distribution with \vec{\mu}_0 = (1, 0, 0) and standard deviation 10, and a voMF distribution with mean \vec{\mu}_0 = (1, 0, 0) and concentration \kappa_0 = |\vec{\mu}_0|/\sigma_0^2 are shown in Figure 5-2. As indicated by the plots, both distributions have a distinct peak at \theta \approx 54, \phi \approx 45, as expected.
¹ Note that the voMF distribution or a mixture of voMF distributions are conditional and not marginal distributions.
From Equations 5–9, 5–12 and 5–14, it can be seen that the density estimator does
not require the conversion of the original samples to unit vectors, but operates entirely in
the original space.
5.3 Estimation of the Probability Density of Hue
Directional data are usually obtained by the process of unit-normalization of
the original vector data measured by a sensor. However, this isn’t always the case.
For instance, color sensors typically output values in the RGB color format. These
values are then converted to other color systems such as HSI using transformations
of a different kind, presented below. The HSI color model is based on the notion of
separating a color into three quantities: the hue H (the basic color, such as red or green), the saturation S (the amount of white present in the color) and the intensity I (the amount of shading or black). The component hue (H) is an angular quantity. The rules for conversion from the RGB to the HSI color model are as follows [48]:

H = \cos^{-1}\left(\frac{0.5\,(2R - G - B)}{\sqrt{(R - G)^2 + (R - B)(G - B)}}\right)

S = 1 - \frac{3}{R + G + B}\min(R, G, B)

I = \frac{1}{3}(R + G + B).    (5–18)
The inverse transformation from HSI to RGB, for hue values 0 < H ≤ 2π/3, is:

B = I(1 - S),\quad R = I\left(1 + \frac{S\cos H}{\cos(\frac{\pi}{3} - H)}\right),\quad G = 3I - (R + B).    (5–19)

For hue values 2π/3 < H ≤ 4π/3, the formulae are given by:

H = H - \frac{2\pi}{3},\quad R = I(1 - S),\quad G = I\left(1 + \frac{S\cos H}{\cos(\frac{\pi}{3} - H)}\right),\quad B = 3I - (R + G),    (5–20)

and for hue values 4π/3 < H ≤ 2π,

H = H - \frac{4\pi}{3},\quad G = I(1 - S),\quad B = I\left(1 + \frac{S\cos H}{\cos(\frac{\pi}{3} - H)}\right),\quad R = 3I - (G + B).    (5–21)
If p(R, G, B) is the density of the RGB values, then, taking into account the fact that the RGB to HSI transformation is one-to-one and onto, the density of the HSI values is given as:

p(H, S, I) = \frac{p(R, G, B)}{\left|\frac{\partial(H,S,I)}{\partial(R,G,B)}\right|} = \left|\frac{\partial(R,G,B)}{\partial(H,S,I)}\right|\, p(R, G, B).    (5–22)

Now, for all hue values, we have:

\left|\frac{\partial(R,G,B)}{\partial(H,S,I)}\right| = \frac{2\sqrt{3}\sec^2 H}{(1 + \sqrt{3}\tan H)^2}\left[IS(1 - S) + I^2 S(S + 2)\right].    (5–23)
Supposing the RGB values were drawn from a Gaussian distribution centered at (R_i, G_i, B_i) having variance \sigma^2, then the distribution of HSI is given as:

p(H, S, I) = \left(\frac{2\sqrt{3}\sec^2 H}{(1 + \sqrt{3}\tan H)^2}\left[IS(1 - S) + I^2 S(S + 2)\right]\right)\left(\frac{1}{\sigma^3 (2\pi)^{1.5}}\, e^{-\frac{(R - R_i)^2 + (G - G_i)^2 + (B - B_i)^2}{2\sigma^2}}\right).    (5–24)

Further simplification gives

p(H, S, I) = \left(\frac{2\sqrt{3}\sec^2 H}{(1 + \sqrt{3}\tan H)^2}\left[IS(1 - S) + I^2 S(S + 2)\right]\right)\left(\frac{1}{\sigma^3 (2\pi)^{1.5}}\, e^{-\frac{(I + ISk - R_i)^2 + (I - IS - B_i)^2 + (I + IS - ISk - G_i)^2}{2\sigma^2}}\right)    (5–25)
where k = \frac{2}{1 + \sqrt{3}\tan H}. To find the marginal density of hue, we integrate over the values of I and S (both lying in the interval [0, 1]), giving us:

p(H) = \int_{I=0}^{1} \int_{S=0}^{1} p(H, S, I)\, dS\, dI.    (5–26)
Unlike in the preceding section, this formula is not available in closed form. However, it is easy to approximate numerically, as it just involves a 2D definite integral over a bounded range of values (of S and I).
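A possible numerical sketch of Equation 5–26 is given below (assuming numpy); it evaluates the integrand of Equation 5–25 on a uniform grid over the unit (S, I) square and averages. For brevity, only the inverse map of the first hue sector (0 < H ≤ 2π/3) is implemented, so the sketch is valid only for hue values in that sector, and the function name and grid resolution are illustrative choices.

```python
import numpy as np

def hue_marginal(H, Ri, Gi, Bi, sigma, n=200):
    """Approximate Equation 5-26 on an n x n grid over (S, I) in (0, 1]^2."""
    S, I = np.meshgrid(np.linspace(1e-3, 1.0, n), np.linspace(1e-3, 1.0, n))
    k = 2.0 / (1.0 + np.sqrt(3.0) * np.tan(H))
    jac = (2.0 * np.sqrt(3.0) / np.cos(H)**2
           / (1.0 + np.sqrt(3.0) * np.tan(H))**2
           * (I * S * (1.0 - S) + I**2 * S * (S + 2.0)))      # Equation 5-23
    R = I + I * S * k                                          # inverse map of the
    B = I - I * S                                              # first hue sector
    G = I + I * S - I * S * k
    gauss = (np.exp(-((R - Ri)**2 + (G - Gi)**2 + (B - Bi)**2)
                    / (2.0 * sigma**2)) / (sigma**3 * (2.0 * np.pi)**1.5))
    # Mean of the integrand over the unit square approximates the integral.
    return np.mean(jac * gauss)
```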
Instead of marginalizing, if we condition S and I to take the value 1, then the conditional density of H is obtained as follows:

p(H\,|\,S = I = 1) = \left(\frac{6\sqrt{3}\sec^2 H}{\sigma^3 (2\pi)^{1.5} (1 + \sqrt{3}\tan H)^2}\right) e^{-\frac{(1 + k - R_i)^2 + B_i^2 + (2 - k - G_i)^2}{2\sigma^2}}.    (5–27)

Notice that Equation 5–27 is analogous to Equation 5–17 in the sense that both are conditional densities (obtained by conditioning other variables to have constant values). On the other hand, Equation 5–26 is analogous to Equation 5–9 in the sense that both are marginal densities (obtained by integrating out other variables).
We would like to draw the reader's attention to the fact that both these approaches are radically different from that proposed in [79]. The latter approach performs density estimation of the hue by first converting the RGB samples to hue values. Then, it centers a kernel with a different bandwidth around each hue sample. The value of the bandwidth for the i-th sample H_i is determined by the partial derivatives \partial H_i/\partial R, \partial H_i/\partial G and \partial H_i/\partial B, which indicate the sensitivity of H_i w.r.t. the original RGB values. As hue is a non-linear function of RGB, the sensitivity of the hue values varies with the RGB values of the samples obtained from the sensor. For instance, hue is highly unstable at RGB values that are close to the achromatic axis R = G = B.
5.4 Discussion
Most techniques that estimate the PDF of directional data assume that only the directional data are available. Here, we instead exploited the fact that the original Euclidean measurements are often available, and derived a new approach for density estimation of directional data by first estimating the density in the original space, followed by a random variable transformation. To the best of our knowledge, this is the only circular/spherical density estimator in the computer vision literature that is consistent, in the sense of random variable transformations, with the estimate of the density of the original data from which the directional data are derived. Secondly,
this method circumvents issues involved in solving complicated non-linear equations
that arise in maximum likelihood estimates for the parameters of conventional density
estimators, as it operates in the original space, and therefore uses the much simpler
mixture-modeling or KDE techniques that are popular for Euclidean data. The theory
for this estimator is built for unit-normal vectors as well as quantities such as hue in
color imaging. Though this work deals strictly with directional data, the underlying
philosophy of this approach is easily extensible to data residing on other kinds of
manifolds. Therefore it has the potential of posing as a viable alternative to existing
kernel density estimators that require the usage of non-trivial mathematical techniques
(such as computation of geodesic distances between samples on a given manifold) in
order to be tuned to data that reside on non-Euclidean manifolds [80].
The approach presented in this chapter also raises the following question. Consider a random variable f whose estimated PDF (say using a kernel method) from samples f_1, f_2, ..., f_n is given by

p_f(f = \alpha) = \frac{1}{n} \sum_{i=1}^{n} K_f(\alpha - f_i; \sigma_f).    (5–28)

Now consider a transformation T of f, yielding the transformed random variable g = T(f). One method could be to apply a kernel density method directly to the transformed samples g_1 = T(f_1), g_2 = T(f_2), ..., g_n = T(f_n), yielding the density estimate

\tilde{p}_g(g = \beta) = \frac{1}{n} \sum_{i=1}^{n} K_g(\beta - g_i; \sigma_g)    (5–29)

where \beta = T(\alpha). Alternatively, one could apply a random variable transformation to p_f(f = \alpha) to yield

p_g(g = \beta) = \int_{\gamma = T^{-1}(\beta)} \frac{p_f(f = \gamma)}{|T'(\gamma)|}\, d\gamma.    (5–30)

The relationship between \tilde{p}_g(\cdot) and p_g(\cdot) will depend upon the choice of kernels K_f(\cdot) and K_g(\cdot) and the parameters \sigma_f and \sigma_g, and requires further investigation. Note that the PDF estimator for image intensities from Chapter 2 follows the approach in Equation 5–30, as it is an explicit random variable transformation from location to intensity, whereas all the sample-based methods reviewed in Chapter 2 follow the former approach in Equation 5–29.
Consider yet another scenario where the technique from Chapter 2 was used
to estimate the density of the intensity values in an image I (x , y). Now let J(x , y) =
T(I(x, y)) be a transformation of the image I. There are two ways to arrive at the PDF of J(x, y): one estimate (denoted p_1(·)) is obtained by interpolating the values of I and then applying the random variable transformation; the other estimate (denoted p_2(·)) is obtained by first computing the J values at the discrete locations and then interpolating those values to yield another estimate of the density of J. In this case, the relationship between the two estimates would depend on the specific interpolants employed.
where I (x , y) was an RGB image, and J(x , y) was the image of chromaticity vectors.
Consider that the interpolant used for I (x , y) was such that the directions of the subpixel
RGB values were spherical linear functions of the spatial coordinates, whereas the
magnitudes were linear functions of the spatial coordinates. Consider also that the
interpolant used for J(x , y) was spherical linear in nature. It can be seen easily that the
estimates p1(.) and p2(.) using these rules would be equal.
Figure 5-1. A projected normal distribution (\vec{\mu}_0 = (1, 0), \sigma_0 = 10) and a von-Mises distribution (\vec{\mu}_0 = (1, 0), \kappa_0 = |\vec{\mu}_0|/\sigma_0^2 = 0.01)
Figure 5-2. Plots of (A) a projected normal density (\vec{\mu}_0 = (1, 0, 0), \sigma_0 = 10), (B) a voMF density (\vec{\mu}_0 = (1, 0, 0), \kappa_0 = |\vec{\mu}_0|/\sigma_0^2 = 0.01), and (C) the L1 norm of the difference between the two densities
CHAPTER 6
IMAGE DENOISING: A LITERATURE REVIEW
6.1 Introduction
In this chapter, we give a detailed review of contemporary literature on image
denoising. We make an attempt to cover as many diverse approaches as possible,
though a complete overview is beyond the scope of the thesis, given the sheer
magnitude of existing research on this topic. To the best of our knowledge, there
exist very few surveys on image denoising. The review in [2] focuses on mathematical
characteristics of the residual images (defined as the difference between the given
noisy and the denoised image) for different types of image filters ranging from
partial differential equations to wavelet based methods. A summary of recent trends
in denoising was presented by Donoho and Weissman at the IEEE International
Symposium on Information Theory (ISIT) in 2007 [81]. This tutorial focused on wavelet
and other transform based methods, some learning based methods and non-local
methods. In the present review, we discuss and critique methods based on partial
differential equations, local convolution and regression, transform domain methods
using wavelets and the discrete cosine transform (DCT), non-local approaches,
methods based on analysis of the properties of residuals and methods that use
various machine learning tools. The aforementioned categories constitute the bulk
of modern image denoising literature. The focus of the survey is on gray-scale image
denoising, though we make occasional references to papers on color image denoising.
Throughout this chapter and in subsequent chapters, we consider noise to be a random
signal independent of the original signal that it corrupts. Apart from a descriptive
survey of the contemporary techniques as such, we also cover some common issues
concerning almost all contemporary denoising techniques: methods for validation of filter
performance and methods for automated parameter selection.
6.2 Partial Differential Equations
The isotropic heat equation was used for image smoothing in [82]. It is known
that executing this partial differential equation (PDE) on the image is equivalent to
convolution with a Gaussian kernel, where the kernel parameter (often denoted
by σ) is related to the time step and number of iterations of the PDE. However,
isotropic smoothing blurs away significant image features such as edges along with
the noise, and hence is not used in contemporary denoising algorithms. Instead, in most
contemporary diffusion methods, the diffusion process is directed by edge information
in the form of a diffusivity function which prevents blurring across edges and allows
diffusion along them [44]. The chosen diffusivity function is actually a monotonically
decreasing function of the gradient magnitude. The equation for the PDE can be written
as follows:

\frac{\partial I}{\partial t} = \operatorname{div}\left(g(|\nabla I|)\, \nabla I\right)    (6–1)

where I : \Omega \to R is a gray-scale image defined on domain \Omega and g(|\nabla I|) is a diffusivity function, typically defined as

g(|\nabla I|; \lambda) = \frac{1}{1 + |\nabla I|^2/\lambda^2}.    (6–2)
Several different diffusivity functions have been proposed, for instance those by Perona
and Malik [44], Weickert [83] and Black et al. [84]. A regularized version of the above
equation has been proposed in [85]. Connections between robust statistics and
anisotropic diffusion (which show up in the choice of diffusivity function) have been
established in [84].
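For concreteness, a standard explicit four-neighbor discretization of Equations 6–1 and 6–2, in the style of Perona and Malik [44], might look as follows. The wrap-around boundary handling via np.roll and the step size dt are simplifications of this sketch rather than part of any particular published scheme.

```python
import numpy as np

def perona_malik(I, n_iter=20, lam=10.0, dt=0.2):
    """Explicit scheme for Equation 6-1 with the diffusivity of Equation 6-2;
    dt <= 0.25 is a standard stability requirement for this discretization."""
    I = I.astype(np.float64).copy()
    for _ in range(n_iter):
        # One-sided differences toward each of the four neighbors.
        dN = np.roll(I, -1, axis=0) - I
        dS = np.roll(I,  1, axis=0) - I
        dE = np.roll(I, -1, axis=1) - I
        dW = np.roll(I,  1, axis=1) - I
        g = lambda d: 1.0 / (1.0 + (d / lam)**2)    # Equation 6-2
        I += dt * (g(dN)*dN + g(dS)*dS + g(dE)*dE + g(dW)*dW)
    return I
```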
Some PDEs are obtained from the Euler-Lagrange equations corresponding to
energy functionals. One example is the image total variation, defined as

E(I) = \int_{\Omega} |\nabla I(x, y)|\, dx\, dy    (6–3)

giving rise to the PDE

\frac{\partial I}{\partial t} = \operatorname{div}\left(\frac{\nabla I}{|\nabla I|}\right).    (6–4)
It should be noted that the aforementioned techniques are based on the assumption
that natural images are piecewise constant, which is not necessarily a valid assumption.
They also require the choice of the parameter λ in the diffusivity. This parameter need
not be constant throughout the image. The number of iterations for which these PDEs
are executed is an important parameter critical for good performance. In the limit of
infinite iterations, constant or piecewise-constant images are produced. Some authors
remedy the stopping time selection issue by introducing a prior model in the energy
formulation, for example the following modification of the total variation model, starting
with an initial image I0
E(I ) =
∫Ω
|∇I (x , y)|dxdy + µ∫Ω
(I (x , y)− I0(x , y))2dxdy (6–5)
where µ is a parameter that trades data fidelity with regularity. The implicit assumption
in the term (I (x , y) − I0(x , y))2 is a Gaussian noise model. Assuming that the image
has been corrupted with zero mean Gaussian noise of known variance σ2n, a constrained
version of the objective function has been proposed in [86]:
minIE(I ) =
∫Ω
|∇I (x , y)|dxdy (6–6)
subject to ∫Ω
(I − I0)2dxdy = σ2n (6–7)∫Ω
I (x , y)dxdy =
∫Ω
I0(x , y)dxdy . (6–8)
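As an illustration, a gradient-descent sketch for the unconstrained functional of Equation 6–5 is given below; the small constant eps that regularizes |∇I| near zero is a standard numerical device and an assumption of this sketch, as are the parameter values.

```python
import numpy as np

def tv_denoise(I0, mu=0.05, n_iter=200, dt=0.2, eps=1e-3):
    """Gradient descent on Equation 6-5: the curvature term of Equation 6-4
    plus the data-fidelity force 2*mu*(I - I0)."""
    I = I0.astype(np.float64).copy()
    for _ in range(n_iter):
        Ix = 0.5 * (np.roll(I, -1, axis=1) - np.roll(I, 1, axis=1))
        Iy = 0.5 * (np.roll(I, -1, axis=0) - np.roll(I, 1, axis=0))
        mag = np.sqrt(Ix**2 + Iy**2 + eps**2)
        # div(grad I / |grad I|) via central differences on the unit gradient.
        div = (0.5 * (np.roll(Ix/mag, -1, axis=1) - np.roll(Ix/mag, 1, axis=1))
             + 0.5 * (np.roll(Iy/mag, -1, axis=0) - np.roll(Iy/mag, 1, axis=0)))
        I += dt * (div - 2.0 * mu * (I - I0))
    return I
```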
For different noise models, such as Poisson or impulse noise, different priors can be
used [87]. A highly comprehensive review of several such PDE-based approaches can
be found in exemplary books such as [83] and [53], to name a few. Recently, some
authors have also introduced the concept of diffusion with complex numbers, which
brings about denoising in conjunction with edge enhancement [88], [89]. The latter
technique performs the complex diffusion by treating the image I : Ω → R as a graph of
the form (x , y , I (x , y)), a framework for diffusion developed in [58].
Some researchers have developed PDEs based on a piecewise linear assumption
on natural images, examples being [90] and [91]. These turn out to be fourth order
PDEs and their energy functions penalize deviation in the intensity gradient as opposed
to deviation in intensity, and preserve fine shading better. However in some cases such
as [90], speckle artifacts have been observed which need to be retroactively remedied
using median filters [90]. Another class of approaches consists of independently filtering
the gradients in the x and y directions, and then using some prior assumption on the
image geometry to reconstruct the image intensity from the smoothed gradient values
[92].
6.3 Spatially Varying Convolution and Regression
A rich class of techniques for image filtering involve the so-called spatially varying
convolutions. In these methods, an image is convolved with a pointwise varying mask
which is derived from the local geometry extracted from the signal. A closely related
idea is the modeling of the local geometry of an image (signal) by means of a low-order
polynomial function. The signal is approximated locally by a pointwise-varying weighted
polynomial fit. The coefficients of the polynomial are computed by a least-squares
regression, and these are then used to compute the value of the (filtered) signal at
a central point. For instance, the signal could be modeled as follows, restricted to a neighborhood \Omega around a point x_0:

I(x) = a_0 + \sum_{i=1}^{m} a_i (x - x_0)^i    (6–9)

where a_i (0 \leq i \leq m) are the coefficients of the polynomial. These coefficients are obtained by least squares fitting, and the filtered signal value is given by \hat{I}(x_0) = a_0.
This procedure is not guaranteed to preserve edges, as it allows even disparate intensity values to affect the polynomial fit. Instead, in practice, the signal is modeled as follows:

I(x) = a_0 + \sum_{i=1}^{m} a_i\, w(x - x_0, I(x) - I(x_0); h_s, h_v)\, (x - x_0)^i.    (6–10)
Here w(x − x0, I (x) − I (x0); hs , hv) is a weighting scheme which is basically a
non-increasing function of the difference between spatial locations, i.e. x − x0, and
the difference between the signal values at those locations, i.e. I (x) − I (x0). The
function w is parameterized by hs and hv which act as spatial and intensity smoothing
parameters respectively. The fitting procedure is now, of course, a weighted least
squares regression. These ideas trace back to the Savitzky-Golay filter [93], [94] and
are the subject of beautiful books such as [95]. Two-dimensional versions of these ideas
have been recently used in modified forms for image filtering applications in [65] and
[55]. In [65], the parameter hs is replaced by a matrix, which is selected in a manner
dictated by local image edge geometry and no penalty is applied on intensity deviation.
On the other hand, in the latter case [55], the weights for regression are affected solely
by intensity difference. The popular bilateral filtering technique [49], [96] is again based
on a weighted linear combination of intensities, with weights driven by both location and
intensity differences. In fact, the kernel regression approach in [65] has been framed as
a higher-order generalization of the bilateral filter. If the polynomial order is restricted to
one and the weights are applied only to intensity differences, one gets the so-called kernel density based filter [48], also called the anisotropic neighborhood filter [55]. A version where the weights are obtained from intensity gradient magnitudes has been presented in [47] and is called the adaptive filter. An extension to the anisotropic
neighborhood filter using interpolation between noisy image intensity values (and the
induced isocontour map) has been recently presented by us in [11] and in Chapter
4. In all these techniques, a crucial parameter is the size and also the shape of the
neighborhood for local signal modeling. An important contribution toward solving this
problem is a data-driven approach presented in [97], which derives a multi-directional
star-shaped neighborhood (of largest possible size) around each image pixel.
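A minimal 1D sketch of the weighted local polynomial fit of Equation 6–10 follows; the Gaussian form of the weighting scheme w and all parameter values are illustrative assumptions of this sketch.

```python
import numpy as np

def local_poly_filter(signal, m=2, W=5, h_s=3.0, h_v=10.0):
    """Weighted local polynomial fit of order m in a window of half-width W
    around each sample (Equation 6-10, 1D case); returns a_0 at each point."""
    n = len(signal)
    out = np.empty(n)
    for i in range(n):
        idx = np.arange(max(0, i - W), min(n, i + W + 1))
        dx = (idx - i).astype(np.float64)
        dv = signal[idx] - signal[i]
        # Spatial and intensity penalties (the weighting scheme w).
        w = np.exp(-dx**2 / (2 * h_s**2)) * np.exp(-dv**2 / (2 * h_v**2))
        X = np.vander(dx, m + 1, increasing=True)   # columns: 1, dx, dx^2, ...
        sw = np.sqrt(w)
        coeffs = np.linalg.lstsq(sw[:, None] * X, sw * signal[idx], rcond=None)[0]
        out[i] = coeffs[0]                           # fitted a_0 = filtered value
    return out
```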
The mean-shift procedure, a clustering technique proposed in [98], and applied to
filtering (and segmentation) in [51], can be considered as a generalization of bilateral
filtering, where the window for local signal modeling is allowed to grow dynamically.
This growth is directed by an ascent on a local joint density function of spatial as well
as intensity values. It should be noted that both bilateral filtering and mean shift are
related to the Beltrami flow PDE developed in [58]. These relationships have been
explored in [99]. The connections between nonlinear diffusion PDEs over small periods
of time and spatially varying convolutions have been shown in [100]. In [45], the authors
present so-called trace-based PDEs for smoothing of color images and prove that the
corresponding diffusion is exactly equivalent to convolutions with oriented Gaussians,
where the orientation is dictated by local image geometry or edge direction.
Thus, spatially varying convolutions for filtering have a rich history. The most recent
contribution in this area is the one presented in [64] and [101]. This framework is based
upon the Jian-Vemuri continuous mixture model from the field of diffusion-weighted
magnetic resonance imaging (DW-MRI) [102]. In [64], complicated local image
geometries such as edges as well as X, Y or T junctions are modeled using a Gabor
filter bank at different orientations. The collection of Gabor-filter responses is expressed
as a discrete mixture of a continuous mixture of Gaussians (with Wishart mixing
densities) or a discrete mixture of a continuous mixture of Watson distributions (with
Bingham mixing densities) to respectively yield two different types of kernels for local
geometry-preserving convolutions. The number of components of the discrete mixture
is given by an appropriate sampling of the 2D orientation space and the weights of the
discrete mixture are solved by local regularized least squares fitting. The novelty of this
technique is (1) the automatic setting of weights for geometry-preserving smoothing,
and (2) the ability to preserve features such as image corners and junctions (which are
ignored by the other convolution-based methods mentioned before). While techniques
such as curvature-preserving PDEs [103] attempt preservation of such geometries,
their behavior at X, Y or T junctions (where curvature is not defined) may need further
exploration.
The mean shift procedure or other local convolution filters can also be applied to
the image gradients to better facilitate the preservation of shading. An extensive survey
of various applications with different types of filtering operations on image gradients,
followed by image reconstruction using a projection onto the nearest integrable surface
[104] or by solving the Poisson equation, has been presented in [105] in a short course
at the International Conference on Computer Vision, 2007, and in papers such as [106].
6.4 Transform-Domain Denoising
Transform-domain denoising approaches typically work at the level of small image
patches. In these approaches, the image patch is projected onto a chosen orthonormal
basis (such as a wavelet basis or the DCT basis) to yield a set of coefficients. It is
well-known that the coefficients in the transform domain are highly compressible in the
sense that the vast majority of these coefficients are very close to zero. In the literature,
this property is referred to as ‘sparsity’, though in a strict sense, sparsity would require
most coefficients to be equal to zero. In the rest of the thesis, we shall stick to this
usage of the word ‘sparsity’ even though we imply compressibility. It is known that the
coefficients in the wavelet or DCT transform domain are decorrelated from one another
[107]. It should be noted that the smaller coefficients usually correspond to the higher
frequency components of the signal which are often dominated by noise. To perform
denoising, the smaller coefficients are modified (typically, those coefficients whose
magnitude is below some λ are set to zero, in a process termed ‘hard thresholding’), and
the patch is reconstructed by inversion of the transform. This procedure is repeated for
every patch. If the patches are chosen to be non-overlapping, one can observe seam
artifacts at the patch boundaries. Furthermore, the thresholding of the coefficients is
also known to produce ringing artifacts around image edges or salient features. Artifacts
of both types can be remedied by performing the aforementioned three steps in a sliding
window fashion from pixel to pixel. This yields an overcomplete transform as each
pixel now acquires multiple hypotheses from overlapping patches. These hypotheses
are aggregated (typically by simple averaging) together to yield a final estimate. This
process of averaging of multiple hypotheses has been reported to consistently yield
superior results [108], [109], and is termed ‘translation invariant denoising’, or ‘cycle
spinning’ [108].
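To make this pipeline concrete, here is a sketch of sliding-window DCT denoising with hard thresholding at 3σ (the threshold discussed in Section 6.4.2) and simple averaging of the overlapping estimates. It assumes numpy and scipy, and is written for clarity rather than speed.

```python
import numpy as np
from scipy.fft import dctn, idctn

def sliding_dct_denoise(noisy, sigma, patch=8):
    """Denoise every overlapping patch in the DCT domain and average
    the reconstructions ('cycle spinning')."""
    H, W = noisy.shape
    acc = np.zeros((H, W))
    cnt = np.zeros((H, W))
    for i in range(H - patch + 1):
        for j in range(W - patch + 1):
            block = noisy[i:i+patch, j:j+patch].astype(np.float64)
            c = dctn(block, norm='ortho')
            dc = c[0, 0]
            c[np.abs(c) < 3.0 * sigma] = 0.0   # hard thresholding
            c[0, 0] = dc                        # never threshold the DC term
            acc[i:i+patch, j:j+patch] += idctn(c, norm='ortho')
            cnt[i:i+patch, j:j+patch] += 1.0
    return acc / cnt
```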
The performance of transform-based techniques is affected by the following
parameters: the choice of basis, the choice of a thresholding mechanism, a method
for aggregation of overlapping estimates and the patch size. We discuss these points
below.
6.4.1 Choice of Basis
Somewhat surprisingly, it has been observed that the sliding window DCT
outperforms most wavelet bases [109]. However, given a library of orthonormal bases,
the choice of the best one (from the point of view of denoising) from amongst these, is
largely an open problem in signal processing. In many existing approaches, the image
patch (of size n1 × n2) is represented as a matrix and the bases for representation are
obtained from the outer product of the bases that represent the rows with the bases
that represent the columns [109], [108]. This is called a separable representation. In
other cases, the image patch is represented as a 1D vector of size n1n2 using a basis
of size n1n2 × n1n2. In the separable case, it has been observed that the transform may
be biased towards images whose salient features are aligned with the Cartesian axes.
If the local image geometry deviates from these axes, the transform may not be able
to represent them compactly enough. This has been remedied by using non-separable
bases such as the steerable wavelet [110], or the curvelet transform [111], which are
designed by taking image geometry into account.
6.4.2 Choice of Thresholding Scheme and Parameters

The most common thresholding method is hard thresholding, given as follows:

T(c; \lambda) = \begin{cases} c & \text{if } |c| \geq \lambda \\ 0 & \text{if } |c| < \lambda. \end{cases}    (6–11)

Another popular method, known as soft thresholding, not only nullifies coefficients smaller than the threshold but also reduces the value of coefficients that are larger than the threshold. Mathematically, soft thresholding is expressed as follows:

T(c; \lambda) = \begin{cases} c - \lambda & \text{if } |c| > \lambda,\ c > 0 \\ c + \lambda & \text{if } |c| > \lambda,\ c < 0 \\ 0 & \text{if } |c| \leq \lambda. \end{cases}    (6–12)
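In code, the two rules of Equations 6–11 and 6–12 reduce to one-liners (a numpy sketch):

```python
import numpy as np

def hard_threshold(c, lam):
    """Equation 6-11: keep a coefficient only if its magnitude reaches lam."""
    return np.where(np.abs(c) >= lam, c, 0.0)

def soft_threshold(c, lam):
    """Equation 6-12: shrink every surviving coefficient toward zero by lam."""
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)
```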
There exist several other thresholding schemes (or rather, schemes for modification of
transform coefficients). These methods can be interpreted as the result of minimizing
different types of risk functions. For example, the hard thresholding scheme (sometimes
termed as the best subset selection problem) is the result of minimizing the hard
threshold penalty, soft thresholding has an interpretation in terms of minimizing
the L1 penalty, whereas minimization of the smoothly clipped absolute deviation
(SCAD) leads to a thresholding scheme that lies intermediate between hard and soft
thresholding (see Figures 1 and 2 and Section (2.1) of [112]). Almost all these methods
of thresholding lead to monotonic functions of the coefficient magnitude. Despite the
several sophisticated thresholding functions available, the best denoising results that
have been reported using wavelet transforms are the ones with hard thresholding,
with a translation invariant approach [62]. The choice of the parameter λ has been
studied in detail in the community. For instance, in [113], the authors prove that under a hard thresholding scheme, the choice \lambda = \sigma_n \sqrt{2 \log N} is optimal from a statistical risk standpoint, under zero mean Gaussian noise of standard deviation \sigma_n, where
N is the size (i.e. number of pixels) of the image/image patch (see Theorem (4) and
Equation (31) of [113]). In the experiments to be presented in Chapter 7, we have
observed empirically that the threshold λ = 3σ produces excellent denoising results for a
Gaussian noise model with 8×8 patches, which approximately tallies with the result from
[113]. This is in tune with an empirically observed fact that the coefficients of a Gaussian
random matrix of standard deviation σ when projected on an orthonormal basis are less
than 3σ with a very high probability.
6.4.3 Method for Aggregation of Overlapping Estimates
The most common approach for aggregation is a simple averaging of (or a median
operation on) all the hypotheses generated for the pixel.
6.4.4 Choice of Patch Size
The patch size choice presents the classical bias-variance tradeoff. Very small
patches allow preservation of finer details of the image but may overfit (undersmooth),
whereas larger patch sizes perform better in smoothing larger homogeneous regions
but may oversmooth some subtle details. Very little work exists on optimal patch
size selection. In fact, the patch size need not be constant throughout the image
and can vary as per local geometry. Some papers such as [114] propose the use of
multi-scale approaches by combining estimates at different scales. However the optimal
combination of such estimates is still a problem, much like that of optimal aggregation of
overlapping estimates. We present a correlation-coefficient criterion for the automated
selection of a single global patch size in Chapter 7.
A common criticism of transform-domain thresholding techniques (especially hard
thresholding) is their inability to distinguish high frequency information from noise. Some
authors try to remedy this by observing that there exist dependencies among the transform coefficients at the same spatial location but at different scales [115], or at adjacent spatial locations [116]. These dependencies are exploited by using multivariate
thresholding methods. For instance in [115], bivariate shrinkage rules are developed,
which exploit the interdependency between coefficients at two adjacent scales leading
to superior image denoising performance. Another popular wavelet-based denoising
technique which exploits interdependency of the coefficients is the BLS-GSM (Bayesian
least squares for Gaussian scale mixtures) developed in [117]. This method assumes
that the distribution of a neighborhood of wavelet coefficients (defined as coefficients at
adjacent scales, orientations or locations) can be modeled as a Gaussian scale mixture
(a positive hidden variable multiplied by a Gaussian random variable). Assuming a
suitable prior on this hidden random variable, and given a set of wavelet coefficients
from a noisy image, one can form an estimate of the true wavelet coefficient given its
neighbors using a Bayesian least squares method.
It should be noted that estimates using coefficient thresholding schemes are shown
to be maximum a posteriori (MAP) estimates of the true signal coefficients given those
of the degraded signal, by making suitable assumptions on the statistics of wavelet
coefficients of clean natural images [116], [118]. Typically, the generalized Gaussian
family yields an excellent prior for the densities of natural image wavelet coefficients
[119]. This prior can be written as follows:

p(z; \sigma_p, p) \propto e^{-\left|\frac{z}{\sigma_p}\right|^p}.    (6–13)
A Gaussian prior (p = 2) is known to yield the empirical Wiener estimate for the
coefficients of the true image, a Laplacian prior (p = 1) corresponds to the soft
thresholding scheme and the hard thresholding scheme is approximated by smaller
values of p [116]. Doubting the validity of these priors for every natural image in
question, the authors of [120] learn a minimum mean square error (MMSE) estimator
for the true wavelet coefficients given the corresponding noisy coefficients. For this
purpose, they build a training set of patches from clean natural images and their
degraded version (assuming a fixed noise model). Following this, they solve a simple
regression problem to optimally perturb coefficients of the degraded patches so as
to yield values close to those of the corresponding clean patches. For overcomplete
representations, the authors of [120] report that the regression procedure produces
non-monotonic thresholding functions, a deviation from all earlier thresholding schemes
driven by image priors.
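For concreteness, the three shrinkage rules associated with this prior family can be written as follows (a sketch assuming numpy; lam denotes the threshold, sigma_p the signal standard deviation and sigma_n the noise standard deviation):

```python
import numpy as np

def hard_threshold(c, lam):
    """Keep coefficients above the threshold, zero out the rest (small p)."""
    return np.where(np.abs(c) > lam, c, 0.0)

def soft_threshold(c, lam):
    """Shrink magnitudes by lam: MAP estimate under a Laplacian prior (p = 1)."""
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

def wiener_shrink(c, sigma_p, sigma_n):
    """Empirical Wiener attenuation: MAP estimate under a Gaussian prior (p = 2)."""
    return c * sigma_p ** 2 / (sigma_p ** 2 + sigma_n ** 2)
```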
6.5 Non-local Techniques
These techniques, which were popularized by the recent ‘non-local means (NL-Means)’ algorithm published in [2] and [121], exploit the fact that natural images
(and especially textures) often contain several patches that are very similar to each
other (as measured in the L2 sense, for instance). NL-Means obtains a denoised image
by minimizing a penalty term on the average weighted distance between an image
patch and all other patches in the image, where the weights are dependent on the
squared difference between the intensity values in the patches. This is expressed below
mathematically:
Î = argmin_I E(I)   (6–14)

E(I) = −(1/β) ∑_{x_i,y_i} log [ ∑_{x_j,y_j} exp( −β ‖I_patch^{(0−)}(x_i, y_i) − I_patch^{(0−)}(x_j, y_j)‖² ) ]   (6–15)
where I_patch^{(0−)}(x_i, y_i) is a patch centered at pixel (x_i, y_i) of the image I, excluding the central pixel. Taking the derivative of E(I) with respect to any pixel value I(x_i, y_i) and setting it to zero yields the following update equation:
I(x_i, y_i) = [ ∑_{x_j,y_j} w_j I(x_j, y_j) ] / [ ∑_{x_j,y_j} w_j ]   (6–16)

w_j = exp( −β ‖I_patch^{(0−)}(x_i, y_i) − I_patch^{(0−)}(x_j, y_j)‖² ).   (6–17)
It can be observed from the previous equation that NL-Means is a pixel-based algorithm
and indeed can be interpreted as a spatially varying convolution in which the convolution
mask is derived using non-local image similarity.
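A brute-force sketch of one such update follows (assuming numpy; β and the patch size are free parameters, the central pixel is retained inside each patch for simplicity, and the all-pairs weight matrix restricts this sketch to small images):

```python
import numpy as np

def nl_means_step(img, p=7, beta=0.02):
    """One NL-Means update (Eqns 6-16, 6-17) over all patch pairs.
    Practical codes restrict comparisons to a search window."""
    h = p // 2
    pad = np.pad(img, h, mode='reflect')
    patches = np.stack([pad[i:i + p, j:j + p].ravel()
                        for i in range(img.shape[0])
                        for j in range(img.shape[1])])
    sq = (patches ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * patches @ patches.T, 0.0)
    w = np.exp(-beta * d2)                    # weights of Eqn 6-17
    out = (w @ img.ravel()) / w.sum(axis=1)   # update of Eqn 6-16
    return out.reshape(img.shape)
```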
Usually, just one update step yields good results [2]. However, for higher noise
levels, the algorithm can certainly be iterated several times [122]. The implicit assumption
in NL-Means is that patches that are similar in a noisy image will also be similar in the
original image, as noise is i.i.d. The essential principle of image self-similarity underlying
NL-Means is the same as the one used in fractal image coding methods [123]. The
NL-Means algorithm can also be interpreted as a minimizer of the conditional entropy
of a central pixel value given the intensity values in its neighborhood [124], [125], and
hence it is rooted in similar principles as the famous Efros-Leung algorithm for texture
synthesis [126]. The conditional entropy is estimated from the conditional density
which is obtained using only the noisy image in [124] or an external patch database in
[125]. Other denoising algorithms that exploit image self-similarity include [127] or the
long-range correlation method proposed in [128] and [129]. A variational formulation
for the NL-Means technique is presented in [130]. The concept of non-local similarity
is typically used only in the context of translations, but can also be extended to handle changes in rotation, scale or affine transformations, as well as changes in illumination.
Such models have been studied in [131], [132].
Critique: The performance of the NL-Means algorithm will be affected in those
regions of an image which do not have similar patches elsewhere in the image. The
performance of the technique is also dependent on the parameter β and the patch size.
Indeed, for large values of β, for large patch sizes, or if the algorithm is iterated several times, the residuals produced may contain discernible image features (see Figure 9 of [122]).
It is easy to interpret one iteration of NL-Means as the product of a row-stochastic matrix A_1 of size N × N with the noisy image (represented as a column vector). Here N is the number of pixels. The entries of A_1 are given by the weights from Eqn. 6–17.
If NL-Means is executed iteratively, the weight matrix will change. Let us denote the
weight-matrix at the i th iteration by Ai . Therefore after multiple iterations, the resulting
image is obtained from the product of the matrix Aπ with the original vectorized image,
where Aπ is given by
A_π = ∏_{i=1}^{n} A_i.   (6–18)
It has been proved recently [133] that the limiting product of any sequence of row-stochastic
matrices yields a matrix with all rows identical to one another. When such a matrix is
multiplied with the image vector, it invariably produces a flat image. This theorem is
mentioned in Appendix B. This proves that the limit of the NL-Means algorithm is a flat
image.
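A tiny numerical illustration of this limit (assuming numpy; random row-stochastic matrices stand in for the NL-Means weight matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
img = rng.uniform(0, 255, N)           # a tiny "image" as a vector
A_pi = np.eye(N)
for _ in range(500):
    A = rng.uniform(size=(N, N))
    A /= A.sum(axis=1, keepdims=True)  # make each row sum to 1
    A_pi = A @ A_pi                    # running product, as in Eqn 6-18
print(np.round(A_pi @ img, 6))         # all entries (nearly) identical: a flat image
```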
The aforementioned non-local formulation has led to the development of the BM3D
(block matching in three dimensions) method [134] which is considered the current
state of the art in image denoising with excellent performance shown on a variety of
images. This method operates at the patch level and for each reference patch in the
image, it collects a group of similar patches. In the particular implementation in [134],
similarity is defined in terms of the Euclidean distance between pre-filtered patches.
These similar patches are then stacked together to form a 3D array. The entire 3D array
is projected onto a 3D transform basis, where coefficients below a selected threshold
value are set to zero. The filtered patches are then reconstructed by inversion of the
transform. This process is repeated over the entire image in a sliding window fashion.
At each step, all patches in the group are filtered and the multiple hypotheses generated
for a pixel are averaged. The authors term the collective filtering of a group of patches ‘collaborative filtering’ and claim that the group of patches exhibits greater sparsity collectively than each individual patch in the group, citing that as the reason for the state of the art performance of the BM3D method. In the specific implementation in [134],
the 3D transform is implemented in the following way. First, the individual patches are
filtered by projection onto a 2D transform basis (in this case the 2D DCT basis) followed
by hard thresholding of the coefficients. Once all these patches in the group are filtered
individually, each pixel stack (consisting of the corresponding pixels from all the patches)
is again filtered by means of a 1D Haar transform. The multiple hypotheses appearing at
any pixel are averaged to produce the final smoothed image.
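A condensed sketch of this grouping-and-thresholding step is shown below (assuming numpy/scipy; note that [134] uses a 2D DCT per patch followed by a 1D Haar transform across the stack, whereas this sketch uses a separable 3D DCT for brevity):

```python
import numpy as np
from scipy.fft import dctn, idctn

def collaborative_filter(group, sigma, lam=3.0):
    """Jointly filter a stack of K similar p x p patches (group: K x p x p)
    by hard thresholding in a 3D transform domain."""
    coef = dctn(group, norm='ortho')          # 3D transform of the whole stack
    coef[np.abs(coef) < lam * sigma] = 0.0    # hard threshold small coefficients
    return idctn(coef, norm='ortho')          # filtered patches, same shape
```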
The denoising results using the BM3D method are truly outstanding. However
the method is complex with several tunable parameters such as patch size, transform
thresholds, similarity measures, etc. Therefore, it may not be very easy to isolate
the exact effect of each component on the denoising performance. Furthermore, the
stacking together of similar patches to form a 3D array imposes a signal structure in the
third dimension. In fact, one would expect the ordering of the individual patches in the
3D array to affect the filter performance.
Another very competitive (albeit computationally expensive) approach for image
denoising, which makes use of non-local similarity, is the total least squares regression
method introduced in [135]. In this method, for each reference patch in the noisy image,
a group of similar patches is created. The reference patch is then expressed as a linear
combination of the similar patches and the coefficients of this linear combination are
obtained using total least squares regression. As compared to a simple least squares
regression, the total least squares regression accounts for the fact that the noise exists
in the reference patch as well as the other patches in the group. The computational
complexity is cubic in the number of patches in the group, which is a drawback of this
approach.
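A sketch of the core regression step (assuming numpy; the classical SVD solution of the total least squares problem is used, and the function name is illustrative):

```python
import numpy as np

def tls_combination(ref, neighbors):
    """Express the vectorized reference patch `ref` as a total least squares
    combination of the columns of `neighbors` (shape n^2 x K), acknowledging
    noise on both sides. Classical SVD solution on the augmented matrix."""
    A = np.column_stack([neighbors, ref])
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    v = Vt[-1]                   # right singular vector of the smallest singular value
    coeffs = -v[:-1] / v[-1]     # assumes v[-1] != 0 (the generic case)
    return neighbors @ coeffs    # denoised estimate of the reference patch
```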
6.6 Use of Residuals in Image Denoising
There exists some research which tries to make use of the properties of the residual
to drive or constrain the image filtering process. Under the assumption of a noise model,
the overall idea is to drive the denoising technique in such a way that the residual
possesses the same characteristics as the noise model.
6.6.1 Constraints on Moments of the Residual
One of the earliest among these is an approach from [86] which assumes a
Gaussian (i.i.d.) noise model of known σ and tries to impose constraints on the statistics
of the residual (mean and variance) in each iteration of the filtering process. Starting
from a noisy image I0, their algorithm tries to find a smoothed image I (both defined
on a domain Ω) that minimizes the energy functional given in Equation 6–6. The corresponding Euler-Lagrange equation has a Lagrange multiplier which is computed by gradient projection, taking care to ensure that the constraints are not violated [86]. A similar
approach has also been independently proposed in [136].
6.6.2 Adding Back Portions of the Residual
In traditional denoising, the filtering algorithm is run (for some K iterations) to
produce a smoothed image, and the residual is ignored. In the approach by Tadmor,
Nezzar and Vese (called ‘TNV’) [137], a smoothed image J1 is obtained from a noisy
image J0 by minimizing an energy functional containing two terms: the total variation
of J1, and the mean square difference between J0 and J1 integrated over the domain
(data fidelity term). This constitutes the first step of the algorithm. The residual J_0 − J_1,
however, is not discarded. Instead the same filtering algorithm is now again run on the
residual, in a second step. This decomposes J0 − J1 into the sum of a smoothed image
J2 and another residual J0 − J1 − J2. J2 is added back to the denoised output of the
first step, i.e. to J1. This procedure is repeated some K times, yielding a final ‘denoised
image’ J1 + · · ·+ JK . As K → ∞, the authors of [137] prove that the original noisy image
is obtained again. In practice, an upper bound is imposed on K as a free parameter. A
similar algorithm has also been developed by Osher et al. [138] with a modified data
fidelity term.
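The hierarchical structure of this scheme is easy to state in code (a sketch assuming numpy/scipy; a Gaussian filter stands in for the TV-regularized minimization actually used in [137]):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def tnv_addback(J0, K=4):
    """J0 -> J1 + J2 + ... + JK in the spirit of TNV [137]; each step smooths
    the current residual and adds the result back to the running estimate."""
    denoised = np.zeros_like(J0, dtype=float)
    residual = J0.astype(float).copy()
    for _ in range(K):
        Jk = gaussian_filter(residual, sigma=2.0)  # stand-in for the TV step
        denoised += Jk
        residual -= Jk
    return denoised, residual
```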
Critique: For both techniques, K is a crucial free parameter. Also when the
smoothed residual is added back at every step, some noise may also get added to
the signal. In experimental results published in a comprehensive survey of image
denoising algorithms [2], the residuals obtained on the Lena image using methods from
[137] and [138] are not totally devoid of image features.
6.6.3 Use of Hypothesis Tests
This approach proposed in [139] assumes that the exact noise model is known
a priori and that the underlying image is piecewise flat. To filter a noisy image, the
algorithm tries to approximate it locally (i.e. in the neighborhood of some radius W around each point) by a constant value in such a way that the residual satisfies the noise hypothesis. In fact, it chooses the maximum value of W for which the local distribution
of the residual in a small neighborhood around any image pixel is close to the assumed
noise distribution. Here, ‘closeness’ is defined using one of the canonical hypothesis
tests. This procedure is repeated at every point in the image domain. It should be
noted that this problem is difficult in two or more dimensions, whereas in 1D it can be
solved easily using a dynamic programming method like a segmented least squares
approach [140]. Another related paper [141] presents an algorithm that is similar to the
TNV approach described earlier, with one important change: the ‘denoised’ residual is
added back only at those points (x , y) such that the residual in a neighborhood around
(x , y) violates the hypothesis that it consists of a set of samples from the assumed
noise distribution. Experimental results are demonstrated with a very simple isotropic Gaussian smoothing algorithm, showing a decrease in the amount of image features visible in the residual.
6.6.4 Residuals in Joint Restoration of Multiple Images
The authors of [142] observe that when multiple images of an object are acquired, the noise affecting the individual images is often independent across the images, even if the noise model is not independent of the underlying signal.
denoising framework that enforces the individual residuals for each of the images to be
independent of one another. The particular independence measure chosen is the sum
of pairwise mutual information values. An iterative optimization procedure is proposed.
Critique: As argued in Section 6.6.1, mere statistical constraints do not guarantee
‘noiseness’ of the residual, especially if more complicated image models are to be
considered. A bigger problem is that merely satisfying the properties of the residual is
not guaranteed to lead to adequate restoration of the image geometry. In fact, a direct
enforcement of noise-like properties of the residual can lead to serious undersmoothing.
The properties of the residuals can, however, be used for automatically finding individual smoothing parameters, as will be discussed in Chapter 8.
6.7 Denoising Techniques using Machine Learning
In the transform domain methods discussed in Section 6.4, a fixed transform basis
is chosen for signal representation. There exist several papers which attempt to tune
the transform basis based on the statistics of image features or patches. For instance
in [118], the authors use noise-free training data to learn independent components
of the training vectors. The learned ICA basis is then used to denoise noisy image
patches using a maximum likelihood model, leading to a soft shrinkage operation. In
this particular case, the learned basis is orthonormal. However, there has been recent
interest in learning overcomplete bases (also called dictionaries), where the number
of vectors in the dictionary exceeds their dimension. This has largely been pioneered
by works such as [143], [144], [145]. These approaches are of interest because the
inherent redundancy of the vectors in the dictionary leads to more compact (sparser)
representation of natural signals. In fact, these papers specially tune the dictionaries in
such a way that natural image patches possess sparse representations when projected
onto the dictionary.
In the more recent literature, the KSVD algorithm [146], [147] has gained popularity
in the image denoising community. In this technique, starting from overlapping patches
from a noisy image, an overcomplete dictionary as well as sparse representations of
the patches in that dictionary are learned in an alternating minimization framework.
The algorithm has produced excellent results on denoising [146]. The name KSVD
stems from the fact that the K columns of the dictionary are updated one at a time
using a singular value decomposition (SVD) operation. A multi-scale variant of this
algorithm (known as MS-KSVD) learns dictionaries to represent patches at two or more
scales leading to further redundancy [114]. This algorithm has yielded state of the art
performance, on par with the BM3D algorithm [134] described in the previous section
[114]. However KSVD and MS-KSVD both require an expensive iterated optimization
procedure for which no convergence proof has been established so far. The alternating
minimization framework is subject to local minima [114] and requires parameters such
as level of sparsity. Some of these parameters are chosen to be a direct function of
the noise variance. However, in successive iterations of the optimization, the image is partially smoothed, which changes the effective noise variance and therefore affects the quality of subsequent parameter updates (see sections 3 and 4 of [146]).
In the KSVD approach, a single overcomplete dictionary is learned for the entire
image. As opposed to this, the authors of [148] perform a clustering step on the patches
from the noisy image and then represent the patches from each cluster separately
using principal components analysis (PCA). In practice, the clustering step (K-means) is
performed on coarsely pre-filtered patches and the learned PCA bases are necessarily
of lower rank. The denoised intensity values are produced by means of a kernel
regression framework from [65]. The entire procedure is iterated for better performance.
The authors call this the KLLD (K locally learned dictionaries) approach [148]. The
idea of using a union of different orthonormal (PCA) bases for each cluster (as opposed
to a single complex basis for the union of all clusters) is interesting. However, the
method has free parameters such as the pre-filtering procedure, the clustering algorithm
and the number of clusters.
It should be noted that both KSVD and KLLD use non-local patch similarity in
learning the bases. Hence, they can also be classified among the non-local approaches described in Section 6.5. The sparse dictionary based methods have gained popularity
not only in image denoising but also other restoration problems such as super-resolution
[149].
6.8 Common Problems with Contemporary Denoising Techniques
There are common issues concerning most contemporary denoising techniques
which we briefly review in this section.
6.8.1 Validation of Denoising Algorithms
There is no clear consensus on the methods for validation of the performance of
denoising algorithms. Given two denoising algorithms and their outputs on a noisy input image (of size N pixels), the primary requirement is a valid quality measure for comparing their relative performance. The quality measure gauges the proximity between a denoised image A and the true image B (i.e. the clean image, devoid of any degradation). The most common quality measure is the mean squared error (MSE), defined as follows
MSE(A, B) = (1/N) ∑_{i=1}^{N} (A_i − B_i)²   (6–19)

and the peak signal to noise ratio (PSNR), which is computed from the MSE as

PSNR(A, B) = 10 log_10 ( 255² / MSE(A, B) ).   (6–20)
The lower the MSE (or higher the PSNR), the better the performance of the denoising
algorithm.
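Both measures are straightforward to compute (a sketch assuming numpy and 8-bit images):

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two equal-sized images (Eqn 6-19)."""
    return np.mean((a.astype(float) - b.astype(float)) ** 2)

def psnr(a, b):
    """Peak signal to noise ratio in dB for 8-bit images (Eqn 6-20)."""
    return 10.0 * np.log10(255.0 ** 2 / mse(a, b))
```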
Now, the ideal quality measure should be in tune with what we as humans perceive
to be a ‘better’ image. This is a tricky issue as it is affected by several factors including
the type of display system. While the MSE is a very intuitive measure (and indeed is
also a metric), it is not necessarily in tune with perceptual quality because it weighs
errors in every pixel equally. However, the human eye is hardly sensitive to minor errors
in high-frequency textured regions such as the fur of the mandrill in Figure 6-1, or
carpet texture. Therefore, even if an image contains perturbations in its high-frequency textured portions and consequently has high MSE, it may still be regarded
as a good quality image from a perceptual point of view. Several such limitations of the
MSE/PSNR have been documented in [150] with numerous examples.
Furthermore, the authors of [150], [151] propose a new quality measure termed the structural similarity index (SSIM), which measures the similarity between the
corresponding patches of images A and B. The similarity is measured in terms of
the proximity of the mean values of the patches, their variances and also a structural
similarity in terms of the correlation coefficient. Given two patches A(i) and B(i) from
images A and B respectively, this is represented as follows:
SSIM(A(i), B(i)) = [2 μ_a μ_b / (μ_a² + μ_b²)] × [2 σ_a σ_b / (σ_a² + σ_b²)] × [σ_ab / (σ_a σ_b)]   (6–21)

= [2 μ_a μ_b / (μ_a² + μ_b²)] × [2 σ_ab / (σ_a² + σ_b²)]   (6–22)

where μ_a and μ_b are the mean values of patches A(i) and B(i) respectively, σ_a and σ_b are their respective standard deviations, and σ_ab is the covariance between the patches. For
comparison between the complete images, the measure is defined as
SSIM(A, B) = (1/N_P) ∑_i SSIM(A(i), B(i))   (6–23)
where N_P is the number of non-overlapping patches. In practice, the statistics from all patch locations are not weighted equally, but with a symmetric Gaussian window of some chosen (small) standard deviation [151]. The SSIM is known to correlate well with the human visual system [151]; however, it is unstable when any of the terms in the denominators approaches zero, and requires appropriate scale selection. While there exists a multi-scale equivalent [63] (denoted as MSSIM) which combines SSIM values at different image scales, the choice of scale for measurement of the statistics is still an open issue. In the experimental results reported in Chapter 7, we perform validation using PSNR as well as SSIM with the default parameter settings used by the authors of [63] (for instance, a window size of 11 × 11).
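A bare-bones sketch of the patch-wise SSIM computation follows (assuming numpy; the eps term is a stand-in for the small stabilizing constants used in the reference implementation, and uniform rather than Gaussian weighting is used for brevity):

```python
import numpy as np

def ssim_patch(a, b, eps=1e-8):
    """SSIM of two patches per Eqn 6-22."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return (2 * mu_a * mu_b / (mu_a ** 2 + mu_b ** 2 + eps)) * \
           (2 * cov / (va + vb + eps))

def ssim_image(A, B, p=11):
    """Average patch-wise SSIM over non-overlapping p x p patches (Eqn 6-23)."""
    vals = [ssim_patch(A[i:i + p, j:j + p], B[i:i + p, j:j + p])
            for i in range(0, A.shape[0] - p + 1, p)
            for j in range(0, A.shape[1] - p + 1, p)]
    return float(np.mean(vals))
```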
6.8.2 Automated Filter Parameter Selection
While research on image denoising has been very extensive, the literature on
automated methods for selecting appropriate filter parameters is not very large. Most
techniques select the best parameter retrospectively, by optimizing a full-reference image quality measure. We defer further discussion of this topic to Chapter 8.
Figure 6-1. Mandrill image: (A) with no noise, (B) with noise of σ = 10, (C) with noise of σ = 20; the noise is hardly visible in the textured fur region (best viewed zoomed in the PDF file).
CHAPTER 7
BUILDING UPON THE SINGULAR VALUE DECOMPOSITION FOR IMAGE DENOISING
7.1 Introduction
This chapter describes two new algorithms for gray-scale image denoising. Our
methods are largely based upon the classical technique of singular value decomposition
(SVD), a popular concept in linear algebra. The SVD was first applied for image
filtering and compression applications in [152] and [153]. On a stand-alone basis, its
performance on filtering leaves much to be desired. However, during this thesis, we have explored several ideas which build upon the SVD, leading to simple and elegant techniques with excellent performance. Many of the intermediate ideas that were explored failed to produce good results in terms of denoising performance. While the
vast majority of contemporary research literature focuses only on positive results, we
choose to adopt a different philosophy. We shall present negative results (and wherever
possible, analyze the reasons for the negative results) in addition to the positive ones
that are on par or better than the state of the art. We hope that this will provide readers
of this thesis with better insight and open up ideas for future research.
We observe that a principled denoising technique can be motivated by the following
considerations. What constitutes a good model for the images being dealt with? What
is known about the noise model? What properties distinguish a clean image from
one containing pure noise? We make the following assumptions in the theoretical
description and experimental results. We assume a gray-scale image in the intensity
range [0, 255] defined on a discrete rectangular domain Ω. In the techniques that we
investigate, we exploit different well-known properties of natural images. We assume
a zero mean i.i.d. (independent and identically distributed) Gaussian noise model of a
fixed standard deviation σ, as the common degradation process. We do not cover the
case of signal-dependent noise or those without precise probabilistic characteristics
(such as noise induced by lossy compression algorithms) in this thesis.
7.2 Matrix SVD
The matrix SVD is a popular technique in linear algebra with a wide variety of
applications in signal processing, such as filtering, compression and least-squares
regression to name a few. Given a matrix A of size m × n defined on the field of real
numbers, there always exists a factorization of the following form [154]
A = U S V^T   (7–1)

where U is an m × m orthonormal matrix, S is an m × n diagonal matrix of non-negative ‘singular’ values and V is an n × n orthonormal matrix. Conventionally, the entries of S
are arranged in descending order of magnitude. Moreover, if the singular values happen to be distinct, the matrix SVD is unique modulo simultaneous sign changes on corresponding columns of U and V. The columns of V (called the right singular vectors) are the eigenvectors of A^T A, whereas the columns of U (called the left singular vectors) are the eigenvectors of A A^T, and the singular values in S turn out to be the square roots of the eigenvalues of A^T A (equivalently A A^T). A geometric interpretation of the SVD (assuming real vector
spaces) is presented in [154]. The matrix SVD has beautiful mathematical properties
such as providing a principled method for the nearest orthonormal matrix, and the best
lower-rank approximation to a matrix (both in the sense of the Frobenius norm) [154].
7.3 SVD for Image Denoising
It is well-known that the singular values of natural images follow an exponential
decay rule [155]. This property also holds true for Fourier coefficient magnitudes. In
fact, the SVD bases have a frequency interpretation. The smaller singular values of the
image correspond to higher frequencies and the large values correspond to the lower
frequency components. This property of the SVD has been used both in denoising [152]
as well as in compression [153].
Now, consider a noisy image A (a degraded version of an underlying clean
image Ac ) affected by additive Gaussian noise of standard deviation σ. Filtering is
accomplished by computing the decomposition A = U S V^T and then nullifying the smaller singular values of A, which effectively discards higher frequency components (which are
known to correspond mostly to noise) [152]. An example of this procedure is illustrated
in Figure 7-1, where all singular values smaller than some k th singular value were set to
zero. It is clearly seen that low rank truncation (i.e. if the index k is chosen to be small)
produces blurry images and increasing the rank adds in image details but introduces
more and more noise. Taking this sub-par performance into account, this decomposition
is instead performed at the level of image patches. Indeed, small patches capture local
information which can be compactly represented with small-sized bases. The SVD is
computed in a sliding window fashion and filtered versions of overlapping patches are
averaged in order to produce a final filtered image. The averaging is useful for removing
seam artifacts at patch boundaries and also brings in multiple hypotheses. These results
are shown in Figure 7-2 for different settings: (1) rank 1 and rank 2 truncation of each patch, (2) nullification of patch singular values below a fixed threshold of σ√(2 log N) (where N is the number of image pixels), and (3) truncation of singular values in such a way that the residual at each patch has a standard deviation of σ (i.e. a standard deviation equal to that of the noise).
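Setting (2) above is compact enough to sketch in full (assuming numpy; overlapping hypotheses are averaged as discussed in Section 6.4.3):

```python
import numpy as np

def patch_svd_denoise(img, sigma, p=8):
    """Sliding-window patch SVD with hard thresholding of singular values
    at sigma * sqrt(2 log N), N being the number of image pixels."""
    lam = sigma * np.sqrt(2.0 * np.log(img.size))
    acc = np.zeros(img.shape)
    cnt = np.zeros(img.shape)
    for i in range(img.shape[0] - p + 1):
        for j in range(img.shape[1] - p + 1):
            U, s, Vt = np.linalg.svd(img[i:i + p, j:j + p], full_matrices=False)
            s[s < lam] = 0.0                       # nullify small singular values
            acc[i:i + p, j:j + p] += (U * s) @ Vt  # filtered patch hypothesis
            cnt[i:i + p, j:j + p] += 1.0
    return acc / cnt
```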
7.4 Oracle Denoiser with the SVD
Despite the improvement in results with the patch-based method (as seen upon
comparing Figures 7-2 and 7-1), the filtering performance is still far from desirable.
The main reason for this is that the singular vectors are unable to adequately separate
signal from noise. There are two key observations we make here. Firstly, let Q and Q_n be corresponding patches from a clean image and its noisy version respectively. Given the decomposition Q_n = U_n S_n V_n^T, the projection of Q (the true patch) onto the bases (U_n, V_n) is given as S_Q = U_n^T Q V_n. This matrix S_Q is non-diagonal and hence contains more non-zero elements than S_n, i.e. S_Q is ‘denser’ than S_n. Despite this, if we could somehow change the entries in S_n to match those in S_Q, we
would now have a perfect denoising technique. Nevertheless, SVD-based filtering
techniques emphasize low-rank truncation or other methods of increasing the sparsity
of the matrix of singular values. We shall dwell more on this point in Section 7.5 in
the context of filtering with SVD bases as well as universal bases such as the DCT.
The second important observation is that the additive noise doesn’t just affect the
singular values of the patch but the singular vectors (which are the eigenvectors of the
row-row and column-column correlation matrices of the patch) as well. Bearing this in
mind, it is strange that SVD-based denoising techniques do not seek to manipulate the
orthonormal bases and instead focus only on changing the singular values. We now
perform the following experiment which starts with a noisy image and assumes that
the true singular vectors of the clean patch underlying every noisy patch in the image
are known or provided to us by an oracle. The denoising technique now proceeds as
follows:
1. Let the SVD of a patch Q^(i) from a clean image be Q^(i) = U S V^T. Project the noisy patch Q_n^(i) onto these bases to produce a matrix S_Q^(i) = U^T Q_n^(i) V.

2. Set to zero all elements of S_Q^(i) whose magnitude satisfies |S_Q^(i)| < λσ.

3. Produce the denoised version of Q_n^(i) by inverting the projection.

4. Repeat the above procedure in sliding window fashion and average all the hypotheses at every pixel to yield a denoised image.
We term this method the ‘oracle denoiser’. Given 8 × 8 patches, we choose the threshold λ = 3 for the following reasons. Firstly, zero mean Gaussian random variables with standard deviation σ (i.e. belonging to N(0, σ)) have magnitude less than 3σ with high probability, and projections of matrices of Gaussian random variables onto orthonormal bases also obey this rule (experimentally, this probability was observed to be very close to 1). Secondly, λ = 3 comes close to the ideal threshold of λ = √(2 log n²) from [113] for patches of size n × n.
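The per-patch step of the oracle denoiser amounts to a few lines (a sketch assuming numpy; in the actual experiment the clean patch is supplied by the oracle):

```python
import numpy as np

def oracle_denoise_patch(noisy_patch, clean_patch, sigma, lam=3.0):
    """Project the noisy patch onto the SVD bases of the clean patch
    (steps 1-3 above) and hard threshold at lam * sigma."""
    U, _, Vt = np.linalg.svd(clean_patch)
    S = U.T @ noisy_patch @ Vt.T           # projection onto the oracle bases
    S[np.abs(S) < lam * sigma] = 0.0       # hard threshold
    return U @ S @ Vt                      # invert the projection
```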
Sample experimental results with the above technique are shown in Figure 7-3 for two noise levels: σ = 20 and σ = 40. The resulting PSNR values of this ideal denoiser far
exceed the state of the art methods such as BM3D [134]. Clearly, this experiment is
not possible in practice, however it serves as a benchmark, drives home an important
deficiency of contemporary SVD filtering approaches, and chalks out a path for us to
explore: manipulating the SVD bases of a noisy patch, or somehow using bases that are
‘better’ than the SVD bases of the noisy patch, may be the key to improving denoising
performance.
7.5 SVD, DCT and Minimum Mean Squared Error Estimators
As has been described in Section 6.4, nullification of the smaller coefficient values
from the projection of a noisy patch onto a basis is actually a MAP estimator of the
coefficients of the true patch. The MAP estimator is driven by sparsity-promoting image
priors which hold for image ensembles but not necessarily for every individual image.
We therefore explore minimum mean square error (MMSE) estimators for estimation of
the true projection coefficients.
7.5.1 MMSE Estimators with DCT
The idea of using MMSE estimators is inspired by the work in [120]. However, there
is one major difference between the approach from [120] and the one we present here.
In [120], the authors learn a generic rule to optimally perturb the DCT coefficients of
an ensemble of noisy image patches so as to reduce the mean squared error with the
DCT coefficients of their corresponding underlying clean patches. A different rule is
learned for each DCT coefficient (the number of coefficients is equal to the patch-size)
or for each sub-band, though all the rules are common across patches. However, we
have observed experimentally that the optimal rules for patches of different geometric
structures differ significantly from one another (see Figure 7-5). Therefore, we move
away from the notion of a single set of rules for the entire ensemble and instead learn
a different set of rules for each training patch. We make the definition of the word ‘rule’
more precise in the following. Consider the i-th patch from a database PD of N patches. We shall denote the patch as I_i. Let its size be n × n and let its k-th DCT coefficient be I_i^(k), where 1 ≤ k ≤ n². Let us denote J_ij as the j-th noisy instance of patch I_i, where 1 ≤ j ≤ M, and let J_ij^(k) be its k-th DCT coefficient. Then for each 1 ≤ k ≤ n² and for each patch 1 ≤ i ≤ N, we seek the perturbation ε_ki such that

ε_ki = argmin_ε ∑_{j=1}^{M} ( J_ij^(k) + ε − I_i^(k) )².   (7–2)
Unfortunately, the values of corresponding DCT coefficients belonging to multiple noisy
instances of a patch show considerable variance, which prevents the learning of any
meaningful perturbation rule. To alleviate this problem, we quantize the values of each
DCT coefficient into a fixed number of bins, say B. Thus for the k-th coefficient of the i-th patch, we no longer learn a single scalar value, but a set of B perturbation values ε_kib, one for each bin. This can be expressed mathematically as

ε_kib = argmin_ε ∑_{j=1}^{M} δ_b(J_ij^(k)) ( J_ij^(k) + ε − I_i^(k) )²   (7–3)

where

δ_b(J_ij^(k)) = 1 if ⌊ B (J_ij^(k) − m_ik^(2)) / (m_ik^(1) − m_ik^(2)) ⌋ = b, and 0 otherwise.   (7–4)
In the above equation, we define the following terms:

m_ik^(1) = max_j J_ij^(k)   (7–5)

m_ik^(2) = min_j J_ij^(k).   (7–6)
Note that the quantization of the coefficients is motivated by the fact that the perturbation
of the coefficients owing to corruption by Gaussian noise shows some regularity, since
random variables from N (0,σ) lie within a bounded interval [−3σ, +3σ] with very high
probability.
Now, given a noisy image (which does not appear in the database PD), we divide
it into patches. For each noisy patch P, we search for its nearest neighbor from the
training patch database PD. Let the index of this nearest neighbor be s. We now apply
the corresponding rules already learned, i.e. the perturbations ε_ksb (1 ≤ k ≤ n²), to denoise the patch P. As per this rule, the k-th coefficient of P, denoted by P^(k), is changed to

P̂^(k) = P^(k) + ε_ksb   (7–7)

where b is the bin for which δ_b(P^(k)) = 1.
It is quite possible that the value of a particular DCT coefficient P^(k) falls outside the range [m_sk^(2), m_sk^(1)]. In such cases we follow the heuristic of applying the perturbation from the bin that lies closest to the value P^(k). From Equation 7–3, we also
see an implicit assumption that the perturbation values are constant within any bin.
While more sophisticated perturbation functions (say, linear within any bin) are possible,
we stick to piecewise constant functions for simplicity.
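A sketch of the per-coefficient learning step of Equation 7–3 (assuming numpy; clean_coef is I_i^(k) and noisy_coefs holds the M values J_ij^(k)):

```python
import numpy as np

def learn_perturbations(clean_coef, noisy_coefs, B=20):
    """Learn B piecewise constant MMSE perturbations for one coefficient of
    one patch. Returns the bin range [lo, hi] and the per-bin shifts."""
    lo, hi = noisy_coefs.min(), noisy_coefs.max()   # m_ik^(2), m_ik^(1)
    eps = np.zeros(B)
    if hi == lo:                                     # degenerate spread
        return lo, hi, eps
    bins = np.clip(((noisy_coefs - lo) * B / (hi - lo)).astype(int), 0, B - 1)
    for b in range(B):
        sel = noisy_coefs[bins == b]
        if sel.size:
            eps[b] = np.mean(clean_coef - sel)       # minimizes sum (J + e - I)^2
    return lo, hi, eps
```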
7.5.2 MMSE Estimators with SVD
We have previously motivated the fact that using better SVD bases can help in
improving denoising results. Suppose that for each patch Ii in the patch database PD,
we compute its SVD as I_i = U_i S_i V_i^T. We conjecture that the bases (U_i, V_i) can serve as effective denoising filters. Again, let J_ij be the j-th noisy instance of patch I_i (1 ≤ j ≤ M). The projection of J_ij onto (U_i, V_i) is S_ij = U_i^T J_ij V_i. We seek to learn rules ε_kib for the coefficients of S_ij quantized into B bins, just as in Equation 7–3. Now, given a patch P from a noisy image, let its nearest neighbor from the database be patch I_s. We then project P onto (U_s, V_s), giving us the matrix S_s = U_s^T P V_s, and we modify the coefficients in S_s using the perturbation rules ε_ksb (1 ≤ k ≤ n², 1 ≤ b ≤ B) already learned for I_s. The perturbation is carried out in the same way as in Equation 7–7.
7.5.3 Results with MMSE Estimators Using DCT
7.5.3.1 Synthetic patches
We first experiment with a set of 15 synthetically generated patches (all of size
8 × 8) of different geometric structures. We generated 500 noise instances of each
patch from N (0, 20). A quantization of 20 bins was used for every DCT coefficient.
The synthetic patches are shown in Figure 7-4. The statistics of the mean squared
errors between the true and reconstructed patches (namely the average, maximum
and median reconstruction errors, all measured across the different noise instances)
are shown in Table 7-1 for two methods: the MAP estimator which sets to zero all DCT
coefficients whose absolute value is below 3σ (to which we shall henceforth refer to as
the MAP estimator), and the MMSE estimator described previously. Clearly, the MMSE
errors are consistently lower. For some patches (such as the X-shaped patch in Figure
7-4), we obtained perturbation functions that were not strictly monotonic, as can be seen
in Figure 7-8.
7.5.3.2 Real images and a large patch database
Next, we built a corpus of 12000 patches of size 8 × 8 taken from the first five
images of the Berkeley database [61], all converted to gray-scale. The size of each
image was about 320 × 480. We generated 500 noise instances of each patch from
N (0, 20). The perturbation values were learned as indicated in Equation 7–3 for a
quantization of 30 bins per coefficient. During training, we again consistently observed
lower reconstruction errors for the MMSE estimator than the MAP estimator. Next,
given a noisy image, we divided it into non-overlapping patches and denoised each
patch as per the perturbation functions learned for the nearest neighbor (in the corpus)
corresponding to each patch. The reconstruction results with this MMSE method as
well as the MAP estimator are shown in Figures 7-6 and 7-7. A quick glance reveals
that reconstruction with the MAP estimator exhibits considerably more ringing artifacts
than the MMSE estimator. But owing to the non-overlapping nature of the patches, both
the MMSE and MAP reconstructions show patch seam artifacts. These seam artifacts
can be eliminated by denoising overlapping patches and then averaging the results as
shown in Figures 7-6 and 7-7. Surprisingly, we obtain lower PSNR values for the MMSE
method with overlap than for MAP with overlap. We ascribe this drop in performance of
the MMSE estimator to two factors: errors in the results of the nearest neighbor search
for noisy adjacent patches (the accuracy of which will be affected by noise), and much
more importantly, errors due to the limited patch representation in the database. Indeed,
the nearest neighbor from the database may not be close enough to produce an MMSE
estimator that produces a reconstruction close enough to the true underlying patch.
7.5.4 Results with MMSE Estimators Using SVD
We now explore what happens if similar experiments are performed on SVD bases
(which are properties of individual patches) rather than on universal bases.
7.5.4.1 Synthetic patches
We ran the experiment on the same 15 synthetic patches as in Section 7.5.3.1,
with 500 noisy instances of each patch drawn from N (0, 20). A quantization of 20
bins was used for every SVD coefficient. The synthetic patches are shown in Figure
7-4. The statistics of the mean squared errors between the true and reconstructed
patches (namely the average, maximum and median reconstruction errors, all measured
across the different noise instances) are shown in Table 7-2 for two methods: the MAP
estimator, and the MMSE estimator described previously. The MMSE errors are again
consistently lower than the MAP errors. For some patches (such as the X-shaped patch
in Figure 7-4), we again obtained perturbation functions that were not strictly monotonic,
as can be seen in Figure 7-4. Notice that the errors with MMSE estimators on SVD are
much lower than those with DCT (compare Tables 7-1 and 7-2), the reason being that
in this experiment, we have access to the SVD bases of the true underlying patches
(whereas the DCT was a universal basis).
7.5.4.2 Real images and a large patch database
We used the same corpus of patches generated in Section 7.5.3.2. The SVD
bases were computed for all 12000 patches. Perturbation rules were learned to change
the values of the projection matrix to optimize average MSE across noise instances
and these rules were stored. Next, patches from a given noisy image (again, different
from any of the training images) were projected onto the SVD bases of the nearest
neighbor in the corpus. The coefficients were manipulated with the MAP rule as well
as the learned MMSE rules to produce two separate outputs. To our surprise, the
performance of the MMSE estimator was very poor. The MAP estimator with SVD
performed reasonably well but not as well as the one applied on DCT bases. These
results are shown in Figure 7-9 on the Barbara image which was subjected to noise
from N (0, 20) (starting PSNR 21.5). The PSNR values with MMSE on SVD, MAP on
SVD and the oracle estimator were 25.2, 28.85 and 36.6 respectively. Based on this,
we draw the following conclusions. The MMSE errors were very low during training but
high during testing. This clearly indicates an overfitting problem when dealing with SVD
bases, which was much more severe than while dealing with DCT bases. Consider that
we are given an arbitrary training database, and an arbitrarily chosen noisy image for
testing. It is highly unlikely that we could find an exact match in the database for every
image patch. The rules that were learned on the noisy instances of the exact same
patch do not seem to apply very well to other ‘similar’ patches.
However, we wish to emphasize that there is still merit in the idea of attempting to
manipulate the SVD bases. This is evidenced by the improvement in the performance of
the MAP estimator applied on projections onto the SVD bases of the nearest neighbor
from the database, over that of the same estimator applied to the SVD bases of the
noisy patch itself.
7.6 Filtering of SVD Bases
We have observed that the SVD bases of adjacent patches (i.e. patches with their
top-left corners at adjacent pixels) from clean natural images tend to exhibit greater
similarity than those from noisy versions of those images. The similarity is quantified in
terms of the angles between unit vectors from corresponding columns of the U matrices
(or those of the V matrices) of the adjacent patches. This observation is clearly a
property of natural image patches (and not a mere consequence of the fact that we
computed SVD bases of matrices that had several rows or columns in common). With
this in mind, we explored the effect of smoothing the U and V bases of adjacent patches
from the image using some averaging techniques. There are three ways this could be
done:
1. Smooth (say by some sort of averaging scheme) the corresponding columns from the U matrices of adjacent patches, and the corresponding columns from the V matrices of adjacent patches.

2. Smooth (say by some sort of averaging scheme) the outer products U_i V_j^T (1 ≤ i ≤ n, 1 ≤ j ≤ n), i.e. outer products of the corresponding columns from the U and V bases computed from adjacent patches.

3. Run a diffusion PDE defined specifically for orthonormal matrices on the U bases and also on the V bases (independently).
There are mathematical complications that arise in the first method: the averaging really
ought to be done by respecting the geometry of the space of orthonormal matrices.
However, the orthonormal matrices with determinant +1 are disjoint from those with
determinant -1. This is problematic from the point of view of computing intrinsic
averages. Furthermore, independent averaging of the U and V matrices ignores the
inherent coupling between them (as given a patch P, they are eigenvectors of PTP and
PPT respectively). Taking averages of outer products of corresponding columns from U
and V helps bring in this dependence. However, it still ignores the dependence between
the different outer products themselves.
Ignoring the above mathematical issues, we computed Euclidean averages. As the resultant matrices were no longer orthonormal, we re-orthonormalized them using a QR decomposition. While computing averages of outer products (in method 2), there are considerable complications in forcing the averaged outer product to lie in the space of matrices of the form v_1 v_2^T where ‖v_1‖ = ‖v_2‖ = 1; these were likewise ignored in our experiments.
We performed image denoising experiments by first smoothing the bases computed
from 8 × 8 patches using either of the three techniques, projecting the patches onto
the bases, applying the MAP rule on the coefficients of the projection matrix and
reconstructing the patch by inverting the transform. In case of the diffusion PDE defined
for orthonormal matrices, we used the following isotropic heat equation defined in [156]
for a matrix U ∈ SO(p):

dU^k/dt = −L^k + ∑_{i=1}^{p} (L^i · U^k) U^i   (7–8)

where

L^k = U^k_xx + U^k_yy   (7–9)

and U^k stands for the k-th column of U.
Note that coupling between the U and V matrices can be imposed indirectly by
introduction of a data fidelity constraint on the patch P in addition to the smoothness
term on the U and V matrices, and then executing alternating PDEs (Euler-Lagrange
equations) on U and V . However, experimental results on averaging of the SVD bases
were in general not satisfactory. Similar experiments were repeated with nonlocal
averaging of similar U and V matrices from different regions of the image, and there was
no improvement in the results. We conjecture that the smoothness of U and V bases
from adjacent patches may not be a strong enough property of natural images.
7.7 Nonlocal SVD with Ensembles of Similar Patches
We now present an algorithm for image denoising using a non-local extension of the
SVD. We call this algorithm non-local SVD or NL-SVD.
We know that the SVD of a matrix P ∈ R^{m×n} is given as P = U S V^T, where the columns of U consist of the eigenvectors of the matrix

C_r = P P^T   (7–10)

whose element in the i-th row and j-th column is

C_r,ij = ∑_k P_ik P_jk = ⟨P_i, P_j⟩   (7–11)

where P_i and P_j stand for the i-th and j-th rows of P respectively. Similarly, the columns of V consist of the eigenvectors of the matrix

C_c = P^T P   (7–12)

whose element in the i-th row and j-th column is

C_c,ij = ∑_k P_ki P_kj = ⟨P_i^T, P_j^T⟩   (7–13)

where P_i^T and P_j^T stand for the i-th and j-th columns of P respectively. Note that C_r and
Cc are the row-row and column-column correlation matrices of P respectively. We also
know that the SVD gives us the optimal low-rank decomposition of P. In other words, the
optimal solution to

E(P̂) = ‖P − P̂‖²   (7–14)

subject to the constraint

rank(P̂) = k,  k < m, k < n   (7–15)

is given by

P̂ = U_k Ŝ V_k^T   (7–16)

where U_k and V_k are the first k columns of U and V respectively and Ŝ contains the k largest singular values of S. This is often called the Eckart-Young theorem [154].
Given the inadequate performance of the local patch SVD, we continue our search
for ‘better’ bases to represent each patch. With this in mind, we now explore what would
happen if we were to consider a non-local generalization of the SVD. Given a patch P
from the noisy image, we look for other patches in the image that are ‘similar’ to P. We
will give a precise definition of similarity later in Section 7.7.1. Let us consider that there
are K such similar patches (including P) which we label as Pi where 1 ≤ i ≤ K .
Next, we ask the following question: what single pair of orthonormal matrices Uk and Vk
will provide the best rank-k approximation to all the patches Pi? In other words, what
(Uk ,Vk) minimizes the following energy?
E(U_k, {S_i}, V_k) = ∑_{i=1}^{K} ‖P_i − U_k S_i V_k^T‖²   (7–17)

where

U_k^T U_k = I   (7–18)

V_k^T V_k = I   (7–19)

∀i, S_i ∈ R^{k×k}.   (7–20)
The solution to this problem is given by an iterative minimization (starting from random
initial conditions) presented in [157]. Note that the matrices Si in this case are not
diagonal. Note also that the pair (U_k, V_k) does not correspond to the individual SVD bases but to a basis pair that is common to all the chosen patches. Related work in
[155] presents an alternating minimization framework with the additional (heuristically
driven) constraint that all the matrices Si are diagonal. This constraint is imposed at
every step of the alternating minimization framework. An approximate solution to the
energy function in Equation 7–17 is presented in [158]. This solution, which is called the 2D-SVD, can be computed in closed form and obviates the need for expensive iterative optimizations. The 2D-SVD for the patch collection {P_i} is given as follows.
Consider the row-row and column-column correlation matrices
C_r = ∑_{i=1}^{K} P_i P_i^T   (7–21)

C_c = ∑_{i=1}^{K} P_i^T P_i.   (7–22)
Then Uk contains the first k eigenvectors of Cr corresponding to the k largest eigenvalues
of Cr , and Vk contains the first k eigenvectors of Cc corresponding to the k largest
eigenvalues of Cc . The precise error bounds for the approximate solution w.r.t. the true
global solution are derived in [158].
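The closed form is a few lines of code (a sketch assuming numpy; patches is the stack of similar patches):

```python
import numpy as np

def svd_2d(patches, k):
    """Closed-form 2D-SVD of [158]: leading eigenvectors of the summed
    row-row and column-column correlation matrices (Eqns 7-21, 7-22)."""
    Cr = sum(P @ P.T for P in patches)
    Cc = sum(P.T @ P for P in patches)
    _, Ur = np.linalg.eigh(Cr)          # eigh returns ascending eigenvalues
    _, Vc = np.linalg.eigh(Cc)
    return Ur[:, ::-1][:, :k], Vc[:, ::-1][:, :k]   # k leading eigenvectors
```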
We use this non-local SVD framework in a denoising algorithm and we shall show
later that this produces results competitive with the state of the art. We start off by
dividing the given noisy image into patches. For each ‘reference’ patch, we collect
patches similar to it and obtain the common basis for them using the non-local SVD
method. However, this leaves open the problem of deciding on the best rank k for the
bases, which need not be constant across patches of different geometric structure.
We obviate the need for selection of this parameter by following a different approach.
We compute the full-rank orthonormal bases U and V , i.e. we choose k = n for n × n
patches. Now the given noisy patch P is projected onto the pair (U,V ) producing the
matrix S^(P) = U^T P V. Essentially, we can write the entries of P as

P_ij = ∑_{k,l} S^(P)_kl U_ki V_lj   (7–23)

which is equivalent to a linear combination of outer products of the form U_k V_l^T (1 ≤ k ≤
n, 1 ≤ l ≤ n). We conjecture that this formulation has an interpretation in terms of 2D
spatial frequencies wherein the smaller coefficient values correspond to higher values
of at least one of the frequencies. Therefore, we choose to nullify the coefficients with
smaller values (as decided by a threshold). Given such a ‘filtered’ projection matrix S^(P),
we reconstruct the patch. This operation is repeated on overlapping patches in a sliding
window fashion and the overlapping hypotheses are aggregated by averaging leading to
a final filtered image. Crucial to the performance of this filter is the choice of a notion of
patch similarity and also the choice of thresholds for removing smaller coefficients. We
discuss these choices below.
7.7.1 Choice of Patch Similarity Measure
Given a reference patch P ref in a noisy image, we can compute its K nearest
neighbors from the image, but this requires a choice of K which may not be the same
across different image patches. Hence, we resort to a distance threshold τ_d and select
all patches Pi such that the total squared difference between P ref and Pi is below τd .
Note that we have throughout assumed a fixed and known noise model - N (0,σ). If we
were to assume that P ref and Pi were different noisy versions of the same underlying
patch, we observe that the following random variable has a χ2 density with z = n2
degrees of freedom:
x = ∑_{k=1}^{n²} (P^ref_k − P_ik)² / (2σ²).   (7–24)
The cumulative distribution of a χ² random variable with z degrees of freedom is given by the expression

F(x; z) = γ(x/2, z/2)   (7–25)

where γ(x, a) stands for the (regularized) incomplete gamma function defined as follows:

γ(x, a) = (1/Γ(a)) ∫_0^x e^{−t} t^{a−1} dt   (7–26)

with Γ(a) being the Gamma function defined as

Γ(a) = ∫_0^∞ e^{−t} t^{a−1} dt.   (7–27)
We observe that if z ≥ 3, for any x ≥ 3z , we have F (x ; z) ≥ 0.99. Therefore for a
patch-size of n × n and under the given σ, we choose the following threshold for the total
squared difference between the patches:
τ_d = 6σ²n².   (7–28)
Thus if two patches are noisy versions of the same clean patch, this threshold will
pick them with a very high probability. But the converse is not true, and therefore we may
end up collecting patch pairs that satisfy the threshold but are quite different structurally.
To eliminate such ‘false positives’, we observe that if P ref and Pi are noisy versions of
the same patch, the values in P^ref − P_i belong to N(0, √2 σ). This motivates us to use a
hypothesis test, in this particular case the one-sided Kolmogorov-Smirnov (K-S) test. To
avoid having to choose a fixed significance level, we use the p-values output by the K-S
tests as a weighting factor in the computation of the correlation matrices. Therefore we
rewrite them as follows:
C_r = ∑_{i=1}^{K} p_KS(P^ref, P_i) P_i P_i^T   (7–29)

C_c = ∑_{i=1}^{K} p_KS(P^ref, P_i) P_i^T P_i   (7–30)
with p_KS(P^ref, P_i) being the p-value of the K-S test checking how well the values in P^ref − P_i conform to N(0, √2 σ). This gives us a robust version of the 2D-SVD.
There is a difference between our approach and robust versions of PCA, such as the
L1-norm (robust) PCA in [159]. We do not need to choose an arbitrary robust norm,
but use a weighting function directed by a hypothesis test instead. This is akin to
computation of fuzzy covariance matrices in fuzzy robust PCA [160].
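A sketch of the resulting similarity weight (assuming numpy/scipy; the relaxed gate τ_d = 3σ²n² described below is used, and scipy's two-sided K-S test stands in for the one-sided variant):

```python
import numpy as np
from scipy.stats import kstest

def similarity_weight(ref, cand, sigma):
    """0 if the candidate patch fails the chi-square distance gate; else the
    K-S p-value of (ref - cand) against N(0, sqrt(2) * sigma), as used in
    the weighted correlation matrices of Eqns 7-29 and 7-30."""
    diff = (ref - cand).ravel()
    if np.sum(diff ** 2) > 3.0 * sigma ** 2 * diff.size:
        return 0.0
    return kstest(diff, 'norm', args=(0.0, np.sqrt(2.0) * sigma)).pvalue
```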
In practice, we observed that the threshold τ_d = 6σ²n² was too conservative. That is, most patches P_i which differed from the reference patch by more than 3σ²n² yielded p-values p_KS(P^ref, P_i) that were very close to zero. Hence we used the less conservative bound τ_d = 3σ²n² in our experiments. This also led to some improvement in computational speed. We implemented a variant of our method in which only the
threshold τd was used for patch selection, and the hypothesis test was entirely ignored.
Surprisingly, we did not experience any significant drop in performance on our datasets
if the hypothesis test was neglected. Nonetheless in all reported results, we still used
the hypothesis test because it is a principled way of mitigating the effect of false
positives. An example of the phenomenon of false positives is illustrated in Figure
7-10. The two images in Figure 7-10 are structurally very different (containing graylevels of 10 and 40), and yet the MSE between their noisy versions (σ = 20) is only 4075, which falls below the threshold of 3σ² = 4800. However, the K-S test yields a p-value very close to 0, thereby providing a better indication of structural dissimilarity.
It should be further noted that even the bound τ_d ≤ 3σ²n² is quite conservative. It can be refined using the fact that the χ² density can be approximated as N(n², n√2) if n² is large. This result follows from the central limit theorem and holds good for n² ≥ 64. This gives us the following refined bound:

τ_d = 2(n² + √2 × 2.326 n)σ² = 2(n² + 3.29n)σ²   (7–31)

from the inverse cumulative of N(n², √2 n) at 0.99.
7.7.2 Choice of Threshold for Truncation of Transform Coefficients
As our noise model is N(0, σ), we observe that the corresponding random variables in n × n patches have magnitude less than σ√(2 log n²) with very high probability, as do the entries of the corresponding projection matrices (onto orthonormal bases/basis-pairs). Hence we assume that coefficients smaller than this threshold have
been produced due to noise. This threshold happens to be the universally optimal
threshold for wavelet denoising with hard thresholding [113] (also see Section 6.4), and
holds specifically for i.i.d. Gaussian noise for any given orthonormal basis. While hard
thresholding may lead to elimination of some useful high-frequency information, this loss
is compensated through the redundancy from overlapping patches [108].
7.7.3 Outline of NL-SVD Algorithm
The NL-SVD algorithm is outlined below; a condensed code sketch follows the list.

1. Divide the image into overlapping patches.

2. For each patch P^ref (called the ‘reference patch’), find patches P_i from the image that are similar to it in the sense explained in Section 7.7.1.

3. Compute the weighted row-row and column-column correlation matrices C_r and C_c from the patches P_i as per Equations 7–29 and 7–30.

4. Find the eigenvectors of C_r to give the orthonormal matrix U and those of C_c to give the orthonormal matrix V.

5. Project P^ref onto (U, V) to give S^ref = U^T P^ref V.

6. Set all small entries of S^ref to zero, as discussed in Section 7.7.2.

7. Reconstruct P^ref as U S^ref V^T and accumulate the pixel values at the appropriate locations in the image.

8. Repeat the above steps for all image patches.

9. Aggregate all the hypotheses and average them to produce the final filtered image.
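A condensed sketch of these nine steps appears below (assuming numpy; for brevity the K-S weighting of Equations 7–29/7–30 is dropped in favor of the plain distance gate, so this is a simplified variant of the reported algorithm):

```python
import numpy as np

def nlsvd_denoise(img, sigma, p=8, radius=20):
    """Simplified NL-SVD: un-weighted correlation matrices, hard threshold
    at sigma * sqrt(2 log p^2), averaging of overlapping hypotheses."""
    lam = sigma * np.sqrt(2.0 * np.log(p * p))
    tau = 3.0 * sigma ** 2 * p * p            # relaxed distance gate
    H, W = img.shape
    acc = np.zeros(img.shape)
    cnt = np.zeros(img.shape)
    for i in range(H - p + 1):
        for j in range(W - p + 1):
            ref = img[i:i + p, j:j + p]
            Cr = np.zeros((p, p)); Cc = np.zeros((p, p))
            for r in range(max(0, i - radius), min(H - p, i + radius) + 1):
                for c in range(max(0, j - radius), min(W - p, j + radius) + 1):
                    P = img[r:r + p, c:c + p]
                    if np.sum((ref - P) ** 2) <= tau:   # similar enough
                        Cr += P @ P.T
                        Cc += P.T @ P
            U = np.linalg.eigh(Cr)[1]         # full bases: eigenvalue order is irrelevant
            V = np.linalg.eigh(Cc)[1]
            S = U.T @ ref @ V                 # project the reference patch
            S[np.abs(S) < lam] = 0.0          # hard threshold
            acc[i:i + p, j:j + p] += U @ S @ V.T
            cnt[i:i + p, j:j + p] += 1.0
    return acc / cnt
```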
7.7.4 Averaging of Hypotheses
Note that the procedure for averaging of hypotheses produced for a patch is
common to contemporary patch-based algorithms not only for image denoising
applications [134], [146], [108] (where it is called ‘translation invariant denoising’),
but also for several other applications such as texture synthesis [161]. We have
experimented with other aggregation procedures such as finding the median of all
available hypothesis values, re-filtering of pixel values or learning weights for weighted
linear combinations. Despite being more computationally expensive, none of these
procedures improved the performance beyond simple averaging.
7.7.5 Visualizing the Learned Bases
We now present two examples of the bases learned to show the effect of the
structure of the patch and to visualize the corresponding bases. The first example (in
Figure 7-11) is a patch of size 8× 8 containing oriented texture from the Barbara image.
The patches similar to it (as measured in the noisy version of that image) are shown
alongside, as are the learned bases. The bases that we visualize are actually the 64 outer products of the form U_i V_j^T (1 ≤ i, j ≤ 8). We present a second example, which contains
a high-frequency fur texture from the mandrill image, in Figure 7-12. In Figure 7-13, we
show outer-products of 8 × 8 DCT bases for comparison with those in Figure 7-11 and
Figure 7-12.
7.7.6 Relationship with Fourier Bases
It is a well-known result that the principal components of natural image patches (in this case, just rows or columns from image patches) are the Fourier bases [107]. Furthermore, this property is a consequence of the translation invariance of the covariance between natural images. In fact, it is proved in [162] (see Section 5.8.2) that under the assumption of translation invariance, the eigenvectors of the covariance matrix of natural image patches turn out to be sinusoidal functions of different frequencies. This fact can be observed experimentally by computing the principal components of a large ensemble of patches of fixed size: the results are very close to DCT bases (the real components of the Fourier bases). We computed the row-row and column-column covariance matrices of 8 × 8 patches sampled at every 4 pixels from all 300 images of the Berkeley database [61] converted to gray-scale (a total of 2.88 × 10⁶ patches). The eigenvectors of these matrices were very similar to DCT bases, as measured by the angles between corresponding basis vectors: 0.2, 4, 4, 6.8, 5.6, 6, 4 and 3 degrees.
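A sketch of this measurement follows (illustrative only; `patch_rows` is a hypothetical array of rows extracted from patches, and eigenvectors are matched to DCT vectors in decreasing-eigenvalue order, an assumption that holds when lower frequencies carry more variance, as in natural images):

```python
import numpy as np

def dct_basis(n):
    # Orthonormal DCT-II matrix; row k is the k-th basis vector.
    j, k = np.meshgrid(np.arange(n), np.arange(n))
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * j + 1) * k / (2 * n))
    D[0, :] /= np.sqrt(2.0)
    return D

def angles_to_dct(patch_rows):
    """Angles (degrees) between covariance eigenvectors and DCT vectors.

    `patch_rows` is a (num_rows, n) array of rows (or columns)
    extracted from image patches.
    """
    C = patch_rows.T @ patch_rows / len(patch_rows)   # row-row covariance
    _, U = np.linalg.eigh(C)
    U = U[:, ::-1]                                    # decreasing eigenvalue
    D = dct_basis(patch_rows.shape[1])
    cosines = np.abs(np.sum(U * D.T, axis=0))         # |<u_k, d_k>|
    return np.degrees(np.arccos(np.clip(cosines, 0.0, 1.0)))
```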
For NL-SVD, the consequence of this result is as follows: if, for every reference patch, the correlation matrices were computed from several patches without attention to similarity, we would obtain a filter very similar to the sliding-window DCT filter (modulo asymptotics, and barring the difference due to robust PCA).
7.8 Experimental Results
We now describe our experimental results. For our noise model (i.e. additive and i.i.d. N(0, σ)), we pick σ ∈ {5, 10, 15, 20, 25, 30, 35}. We perform experiments on Lansel's benchmark dataset [62] consisting of 13 commonly used images, all of size 512 × 512. We pit NL-SVD against the following: NL-Means [2], KSVD [146], our implementation of a 3D-DCT algorithm (see Section 7.8.5), BM3D [134], and the oracle denoiser from Section 7.4. For comparison at each noise level, we use PSNR values as well as SSIM values at patch size 11 × 11 (as per the implementation in [63]). (For the definition of SSIM, refer to Section 6.8.1.) All these metrics were measured by first writing the images to a file in one of the standard image formats (usually pgm) and then reading them back into memory. Though this introduces minor quantization effects (and usually reduces the PSNR/SSIM values slightly for all methods), we follow this approach as it represents realistic digital storage of images.
In the case of BM3D and NL-Means, we used the software provided by the authors online. For KSVD, we used the results already reported by the authors on the denoising benchmark [62]. These results were available only for noise levels up to and including σ = 25. For BM3D, we report results from both stages of their algorithm: the intermediate stage, as well as the final stage which performs empirical Wiener filtering on the output of the earlier stage. We refer to these stages as 'BM3D1' and 'BM3D2' respectively. To each of the above algorithms, the noise σ is specified as input (which is useful for optimal parameter selection in the provided software). For NL-SVD, we used 8 × 8 patches in all experiments and a search window radius of 20 around each point. The search window radius is not a free parameter, as it affects only computational efficiency and not accuracy; in fact, larger search windows did not improve the results in our experiments. Apart from the patch size, there are no other free parameters in our technique, which is also true of all other patch-based algorithms in the field. Later, in Section 7.9, we present a criterion for patch-size selection by measuring the correlation coefficient between patches from the residual image (i.e. the difference between the noisy and denoised images). For NL-Means, we used 9 × 9 patches throughout, with a search window radius of 20. For the BM3D implementation, we used the default settings of all the various parameters as obtained from the authors' software (their selected patch size is again 8 × 8). The results for KSVD have been reported by the authors themselves, and hence we assume that the optimal parameter settings were already used for generating those results.
7.8.1 Discussion of Results
From the PSNR results presented in Tables 7-3, 7-5, 7-7, 7-9, 7-12, 7-14 and 7-16, and the corresponding SSIM results in Tables 7-4, 7-6, 7-8, 7-10, 7-13, 7-15 and 7-17, we make several observations. NL-SVD is consistently superior to NL-Means in terms of PSNR and SSIM. These tables also contain results of the HOSVD algorithm, our second technique, which we present later in Section 7.10. In all the tables at the end of the chapter, we have used numbers to refer to image names to save space. The numbers and the corresponding names are as follows: 13 - airplane, 12 -

For a thorough introduction to multi-linear algebra and the HOSVD, we refer the reader to [165]. An interesting application of the HOSVD to face recognition is presented in [166].
7.10.2 Application of HOSVD for Denoising
We now describe how the HOSVD is applied for joint denoising of multiple image patches. For each reference patch in the noisy image, all patches similar to it are collected and represented as a 3D array Z ∈ R^(p×p×K), where the patches have size p × p and K is the number of similar patches in the ensemble (note that K is spatially varying). A patch P is said to be similar to the reference patch if ‖P − P_ref‖² ≤ τ_d, where τ_d is defined earlier in Section 7.7.1. The HOSVD of Z is given as follows:

Z = S ×₁ U(1) ×₂ U(2) ×₃ U(3)   (7–36)

where the orthonormal matrices U(1) ∈ R^(p×p), U(2) ∈ R^(p×p) and U(3) ∈ R^(K×K) can be computed from the SVDs of the unfoldings Z(1), Z(2) and Z(3) respectively. The exact equations are as follows:

Z(1) = U(1) · S(1) · (U(2) ⊗ U(3))ᵀ   (7–37)
Z(2) = U(2) · S(2) · (U(3) ⊗ U(1))ᵀ   (7–38)
Z(3) = U(3) · S(3) · (U(1) ⊗ U(2))ᵀ.   (7–39)

However, the complexity of the SVD computation for a K × K matrix is O(K³). To prevent the computations from getting unwieldy, we put an upper cap on the number of allowed similar patches, i.e. we impose the constraint K ≤ 30. The patches from Z are then projected onto the HOSVD transform. The threshold for the transform coefficients is picked to be σ√(2 log(p²K)), again as per the rule from [113]. The stack Z is then reconstructed after inverting the transform, thereby filtering all the individual patches. Note that unlike NL-SVD (see Section 7.7.3), we filter all the individual patches in the ensemble and not just the reference patch. This affords additional smoothing of all the patches, which is needed because of the upper limit K ≤ 30 that has no counterpart in NL-SVD. Again, the reference patch is moved in a sliding window fashion and the hypotheses accumulating at each pixel are averaged to produce the final filtered image.
7.10.3 Outline of HOSVD Algorithm
The HOSVD denoising algorithm is outlined below; a short code sketch follows the list.

1. Divide the image into overlapping patches of size p × p.
2. For each patch P_ref (called the 'reference patch'), find patches P_i from the image that are similar to it in the sense explained in Section 7.7.1.
3. Stack the similar patches in a 3D array Z ∈ R^(p×p×K).
4. Compute the unfoldings Z(1), Z(2) and Z(3) and then compute their SVDs to yield the matrices U(1), U(2) and U(3) respectively.
5. Compute any one unfolding of the core tensor S, say S(1).
6. Set to zero all entries of S(1) that are smaller (in absolute value) than σ√(2 log(p²K)).
7. Reconstruct the entire stack using Equation 7–37, which filters every patch in the ensemble.
8. Repeat the above steps for all image patches.
9. Aggregate all the hypotheses and average them to produce the final filtered image.
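As referenced above, the following minimal Python sketch implements steps 3–7 for one patch stack (an illustration under simplifying assumptions, not the exact implementation; it computes the full core tensor instead of a single unfolding, which is equivalent up to reshaping):

```python
import numpy as np

def hosvd_filter_stack(Z, sigma):
    """Filter a stack of K similar p x p patches jointly.

    Z has shape (p, p, K). Returns the filtered stack.
    """
    p, _, K = Z.shape
    # Mode-k unfoldings; their left singular vectors give the HOSVD factors.
    U1 = np.linalg.svd(Z.reshape(p, -1), full_matrices=False)[0]
    U2 = np.linalg.svd(np.moveaxis(Z, 1, 0).reshape(p, -1),
                       full_matrices=False)[0]
    U3 = np.linalg.svd(np.moveaxis(Z, 2, 0).reshape(K, -1),
                       full_matrices=False)[0]
    # Core tensor S = Z x1 U1^T x2 U2^T x3 U3^T.
    S = np.einsum('ia,jb,kc,ijk->abc', U1, U2, U3, Z)
    thr = sigma * np.sqrt(2.0 * np.log(p * p * K))    # threshold from [113]
    S[np.abs(S) < thr] = 0.0                          # hard thresholding
    # Invert the transform, filtering every patch in the ensemble.
    return np.einsum('ia,jb,kc,abc->ijk', U1, U2, U3, S)
```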
We would like to emphasize that there are two key differences between our HOSVD algorithm and BM3D. Firstly, we learn a spatially varying basis, whereas BM3D uses universal bases (2D-DCT or biorthogonal wavelets depending upon the noise level, followed by a Haar basis in the third dimension). Secondly, as BM3D stacks together similar patches and performs a Haar transform in the third dimension, it implicitly treats the patches as a signal in the third dimension. Our HOSVD method, on the other hand, does not impose any such 'signalness' in the third dimension. In fact, scrambling the order of the patches in the third dimension produces the same values of the projection coefficients, up to the corresponding permutation. Indeed, neither HOSVD nor NL-SVD treats the patches as signals in any dimension, unlike bases such as the DCT. The SVD of a patch is itself invariant to row and column permutations. However, this is not a problem, because one is unlikely to encounter patches from real images that are row/column permutations of one another. On the other hand, the ordering of patches in the third dimension (the choice of which is a free parameter) may potentially alter the output of a denoising algorithm such as BM3D, whereas our method remains invariant to this change.
7.11 Experimental Results with HOSVD
The PSNR results for HOSVD are presented in Tables 7-3, 7-5, 7-7, 7-9, 7-12, 7-14 and 7-16. The corresponding SSIM results can be found in Tables 7-4, 7-6, 7-8, 7-10, 7-13, 7-15 and 7-17. From these tables, it can be observed that HOSVD is superior to KSVD, NL-Means, 3D-DCT and NL-SVD. Indeed, it is also superior to BM3D1 at higher noise levels (σ ≥ 20) on most images in terms of PSNR/SSIM values, though it lags slightly behind BM3D2. The average difference between the PSNR values produced by HOSVD and BM3D2 at noise levels 10, 20 and 30 is 0.346, 0.281 and 0.343 respectively (see Tables 7-5, 7-9 and 7-14).
A comparison between NL-SVD and HOSVD reveals that the latter outperforms the former on the weaker or finer edges and textures. We have observed that the images denoised by HOSVD sometimes tend to have a faint grainy appearance. The reason for this is that HOSVD smoothes an ensemble of patches by projection onto a common basis followed by truncation of the transform coefficients. We have observed experimentally that this tends to slightly under-smooth the patches, compared to patches that are smoothed individually as in techniques like NL-SVD. The under-smoothing is compensated for by the averaging operations and by the filtering of all patches in the stack. The faint grainy appearance can be mitigated by running a linear smoothing filter on the filtered output of HOSVD, such as a PCA in the third dimension of patch stacks (similar to the Wiener filter idea implemented in BM3D2, but with a learning component involving PCA on the stack of corresponding pixels from similar patches); in our opinion, this improves subjective visual quality. We leave a rigorous testing of this issue for future work.
7.12 Comparison of Time Complexity
We now present a time complexity analysis of all the competing algorithms. For this, assume that the number of image pixels is N and the average time to compute similar patches per reference patch is T_S. Let us assume that the average number of patches similar to the reference patch is K, and let the size of the patch be n × n. The time complexity of NL-SVD is then O([T_S + Kn³]N), because the eigendecomposition of an n × n matrix is O(n³) and multiplication of two n × n matrices is also an O(n³) operation. The BM3D implementation in [134] requires O(Kn³) time for the 2D transforms and O(K²n²) time for the 1D transforms, if the transforms are implemented using simple matrix multiplication. This leads to a total complexity of O([T_S + Kn³ + K²n²]N). If algorithms such as the fast Fourier transform are used, this complexity reduces to O([T_S + Kn² log n + n²K log K]N). If we assume that n is o(K) (i.e. that the average number of 'similar' patches is much greater than the patch width/height, a very reasonable assumption), then NL-SVD is in fact better in terms of time complexity than BM3D. The complexity of HOSVD is obtained as follows. Given a patch stack of size n × n × K, the size of two of its unfoldings is n × nK, the SVD of which consumes O(Kn³) time. The third unfolding has size K × n², the SVD of which consumes O(min(K²n², Kn⁴)) time. Hence the total complexity of the method is O([T_S + Kn³ + min(K²n², Kn⁴)]N).
Note again that NL-SVD and HOSVD follow the concept of matrix-based patch representations, in tune with the philosophy of [155], [152], [158], [167] and [168]. We could have represented each n × n patch as an n² × 1 vector and built a covariance matrix of size n² × n² to produce the spatially adaptive bases; in fact, such an approach was taken in [169]. However, the complexity of such a method is O([T_S + Kn⁴ + n⁶]N), which is greater than ours. The KSVD technique also follows a similar vector-based patch representation, and the K learned bases have size n² × 1 (with K ≫ n²). An important point to mention is that the SVD is a characteristic of a matrix/patch; there is no analog of the SVD for the vectorial representation of a patch.
Figure 7-1. Global SVD filtering on the Barbara image: (A) clean Barbara image, (B) noisy image with Gaussian noise of σ = 20 (PSNR = 22.11), (C) filtered image with rank 1 truncation (PSNR = 14.7), (D) filtered image with rank 10 truncation (PSNR = 20.17), (E) filtered image with rank 100 truncation (PSNR = 24.3), (F) filtered image with rank 200 truncation (PSNR = 23.03)
Figure 7-2. Patch-based SVD filtering on the Barbara image: (A) clean Barbara image, (B) noisy image with Gaussian noise of σ = 20 (PSNR = 22.11), (C) filtered image with rank 1 truncation in each patch (PSNR = 23.9), (D) filtered image with rank 2 truncation in each patch (PSNR = 25.05), (E) filtered image with nullification of singular values below 3σ in each patch (PSNR = 23.42), (F) filtered image with truncation of singular values in each patch so as to match the noise variance (PSNR = 25.8)
Figure 7-3. Oracle filter with SVD: (A) clean Barbara image, (B) noisy image with Gaussian noise of σ = 20 (PSNR = 22.11), (C) filtered image with nullification of all values in the projection matrix below 3σ in each patch (PSNR = 36.9), (D) noisy image with Gaussian noise of σ = 40 (PSNR = 22.11), (E) filtered image with nullification of all values in the projection matrix below 3σ in each patch (PSNR = 31.34)
Figure 7-4. Fifteen synthetic patches
Figure 7-5. Threshold functions for DCT coefficients of (A) the sixth and (B) the seventh patch from Figure 7-4
Figure 7-6. DCT filtering with MAP and MMSE methods: (A) clean Barbara image, (B) noisy image with Gaussian noise of σ = 20 (PSNR = 22.11), (C) filtered image with MMSE estimator on non-overlapping 8 × 8 patches (PSNR = 26.26), (D) filtered image with MAP estimator on non-overlapping 8 × 8 patches (PSNR = 26.19), (E) filtered image with MMSE estimator on overlapping 8 × 8 patches (PSNR = 28.03), (F) filtered image with MAP estimator on overlapping 8 × 8 patches (PSNR = 29.94)
Figure 7-7. DCT filtering with MAP and MMSE methods: (A) clean Barbara image, (B) noisy image with Gaussian noise of σ = 20 (PSNR = 22.11), (C) filtered image with MMSE estimator on non-overlapping 8 × 8 patches (PSNR = 27.12), (D) filtered image with MAP estimator on non-overlapping 8 × 8 patches (PSNR = 26.9), (E) filtered image with MMSE estimator on overlapping 8 × 8 patches (PSNR = 29.1), (F) filtered image with MAP estimator on overlapping 8 × 8 patches (PSNR = 29.94)
Figure 7-8. Threshold functions for coefficients of (A) the sixth and (B) the seventh patch from Figure 7-4 when projected onto SVD bases of patches from the database
Figure 7-9. SVD filtering with MAP and MMSE methods: (A) clean Barbara image, (B) noisy image with Gaussian noise of σ = 20 (PSNR = 22.11), (C) filtered image with MMSE estimator with SVD bases of patches from the database, on overlapping 8 × 8 patches (PSNR = 25.2), (D) filtered image with MAP estimator with SVD bases of patches from the database, on overlapping 8 × 8 patches (PSNR = 28.85), (E) filtered image with MAP estimator with SVD bases of the true patches, on overlapping 8 × 8 patches (PSNR = 36.6)
Figure 7-10. Motivation for robust PCA: though the patches are structurally different, the difference between the two noisy patches falls below the threshold of 3σ²n²
Figure 7-11. Barbara image: (A) reference patch, (B) patches similar to the reference patch (similarity measured on the noisy image, which is not shown here), (C) correlation matrices (top row) and learned bases
Figure 7-12. Mandrill image: (A) reference patch, (B) patches similar to the reference patch (similarity measured on the noisy image, which is not shown here), (C) correlation matrices (top row) and learned bases
Figure 7-13. DCT bases (8 × 8).
Figure 7-14. Barbara image: (A) clean image, (B) noisy version with σ = 20, PSNR = 22, (C) output of NL-SVD, (D) output of NL-Means, (E) output of BM3D1, (F) output of BM3D2, (G) output of HOSVD

Figure 7-16. Boat image: (A) clean image, (B) noisy version with σ = 20, PSNR = 22, (C) output of NL-SVD, (D) output of NL-Means, (E) output of BM3D1, (F) output of BM3D2, (G) output of HOSVD

Figure 7-18. Stream image: (A) clean image, (B) noisy version with σ = 20, PSNR = 22, (C) output of NL-SVD, (D) output of NL-Means, (E) output of BM3D1, (F) output of BM3D2, (G) output of HOSVD

Figure 7-20. Fingerprint image: (A) clean image, (B) noisy version with σ = 20, PSNR = 22, (C) output of NL-SVD, (D) output of NL-Means, (E) output of BM3D1, (F) output of BM3D2, (G) output of HOSVD
Figure 7-22. For σ = 20, denoised Barbara image with NL-SVD (A) [PSNR = 30.96] and DCT (C) [PSNR = 29.92]. For the same noise level, denoised boat image with NL-SVD (B) [PSNR = 30.24] and DCT (D) [PSNR = 29.95].
Figure 7-23. (A) Checkerboard image, (B) noisy version of the image with σ = 20, (C) denoised with NL-SVD (PSNR = 34) and (D) DCT (PSNR = 27). Zoom in for a better view.
Figure 7-24. Absolute difference between the true Barbara image and the denoised image produced by (A) NL-SVD, (B) BM3D1, (C) BM3D2. All three algorithms were run on the image with noise σ = 20.
Figure 7-25. A zoomed view of Barbara's face for (A) the original image, (B) NL-SVD and (C) BM3D2. Note the shock artifacts on Barbara's face produced by BM3D2.
Figure 7-26. Reconstructed images when Barbara (with noise σ = 20) is denoised with NL-SVD run on patch sizes (A) 4 × 4, (B) 6 × 6, (C) 8 × 8, (D) 10 × 10, (E) 12 × 12, (F) 14 × 14 and (G) 16 × 16.
Figure 7-27. Residual images when Barbara (with noise σ = 20) is denoised with NL-SVD run on patch sizes (A) 4 × 4, (B) 6 × 6, (C) 8 × 8, (D) 10 × 10, (E) 12 × 12, (F) 14 × 14 and (G) 16 × 16.
Table 7-1. Avg, max and median error on synthetic patches from Figure 7-4 with MAP and MMSE estimators for DCT bases
Figure 8-2. Images with Gaussian noise with σ_n² = 5 × 10⁻³ denoised by NL-Means. Parameter selected for optimal noiseness measures: (a) P, (c) K, (e) CC, (g) MI; and optimal quality measures: (i) L1, (k) SSIM, (m) MI between denoised image and residual. Corresponding residuals in (b), (d), (f), (h), (j), (l), (n); zoom into the PDF file for a better view
Figure 8-3. Images with Gaussian noise with σ_n² = 5 × 10⁻³ denoised by NL-Means. Parameter selected for optimal noiseness measures: (a) P, (c) K, (e) CC, (g) MI; and optimal quality measures: (i) L1, (k) SSIM, (m) MI between denoised image and residual. Corresponding residuals in (b), (d), (f), (h), (j), (l), (n); zoom into the PDF file for a better view
9.1 Conclusions

We have presented contributions to two major problems fundamental to image processing: probability density estimation and image denoising. The contributions to probability density estimation are as follows:
1. Development of a new PDF estimator for images which accounts for the fact that the image is not just a collection of samples, but a discrete version of an underlying continuous signal.

2. Extension of the above concept to joint PDFs of two or more images, defined on 2D or 3D domains.

3. Extension of the above concepts to develop three different biased density estimators that favor the higher-gradient regions or points of a single image (in 2D/3D), a pair of images (in 2D/3D) or a triple of images (in 3D).

4. Application of all the above PDF estimators to affine image registration.

5. Application of all unbiased PDF estimators to the filtering of grayscale and color images, chromaticity fields and grayscale video, in a mean-shift framework.

6. Development of density estimators for unit-vector data such as chromaticity and hue in color images, making explicit use of the fact that they are obtained as transformations of color measurements that can be assumed to lie in Euclidean space.

The contributions to image denoising are as follows:

1. We have developed a non-local image denoising algorithm (NL-SVD) after a series of experiments on the patch SVD. Our technique learns SVD bases for an ensemble of patches that are similar to a reference patch located at each pixel. These spatially adaptive bases are shown to produce excellent performance on image denoising, comparable to the state of the art.

2. Our method has parameters which are obtained in a principled manner from the noise model. The method is thus elegant and efficient, as it does not need any complicated optimization procedure.

3. We have extended the NL-SVD technique to perform joint filtering of image patches, leading to the HOSVD-based filtering technique that yields even better image quality values.

4. We have also presented a new statistical criterion for automated filter parameter selection, and used it to obtain the smoothing parameter in the NL-Means algorithm without reference to the true image.
9.2 Future Work
Future work on the probability density estimator has been outlined in Section 3.4. Here, we provide pointers to possible future extensions of our work in image denoising.

9.2.1 Trying to Reach the Oracle

The ultimate aim of several of the procedures reported in Chapter 7 was to obtain the SVD bases of the underlying clean patch. The bases obtained by NL-SVD and HOSVD yield excellent performance but still fall far behind the oracle denoiser. Is it possible to obtain the true bases, or bases that are very close to the true bases? Are there other bases that would yield equivalent performance? These questions remain open problems.
9.2.2 Blind and Non-blind Denoising
In many contemporary denoising algorithms [2], [146], [134], one assumes knowledge of the true noise variance, as this allows principled selection of various parameters. However, the noise variance is often not known in practice; this is called the 'blind denoising scenario'. In such cases, one can use knowledge about the sensor device to get an idea of the noise variance. However, environmental factors too can affect image quality, and in such cases one cannot rely on sensor properties alone. In practice, the noise variance can be estimated from the available noisy data. One of the most commonly used techniques for noise estimation computes the Haar wavelet transform of the image: the median absolute deviation of the HH sub-band (high-frequency components in both x and y directions) is considered to be a reasonable estimate of the noise level [181]. Three training-based methods are presented in [182]: two which make use of a Laplacian prior for natural images, and another which measures the noise variance from the variance of homogeneous regions in a noisy image. A statistical criterion for distinguishing between homogeneous regions and regions with edges/oriented texture is presented in [183]. Development of a robust noise variance estimator, and its use in conjunction with the denoising method presented in this thesis, is an interesting direction for future work. Furthermore, one can also side-step the problem of noise variance estimation as follows: our denoising algorithm can be run assuming several different values of the noise standard deviation σ. This affects the critical parameters for transform-domain thresholding and for the measurement of similarity between patches. After denoising, one can compute one of the noiseness measures discussed in the previous chapters and select the σ value that produced the 'noisiest' residual.
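A minimal sketch of such an estimator (assuming the median-absolute-deviation form of the Haar HH estimate; the factor 0.6745 is the Gaussian median-to-standard-deviation constant, and the single-level Haar transform is computed directly rather than via a wavelet library):

```python
import numpy as np

def estimate_noise_sigma(img):
    """Robust noise standard deviation estimate from the Haar HH band.

    A sketch of the estimator attributed to [181]:
    sigma ~ median(|HH|) / 0.6745.
    """
    img = img[:img.shape[0] // 2 * 2, :img.shape[1] // 2 * 2]  # even size
    # One level of the orthonormal Haar transform: the HH band is the
    # signed diagonal difference within each 2x2 block, divided by 2.
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    hh = (a - b - c + d) / 2.0     # variance equals the noise variance
    return np.median(np.abs(hh)) / 0.6745
```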
9.2.3 Challenging Denoising Scenarios
Our denoising algorithm has been tested thoroughly on (i.i.d. and additive) zero-mean Gaussian noise at different values of σ. Most contemporary algorithms from the literature have also been tested only on Gaussian noise. This model is known to hold true for thermal noise and also for film grain noise under some conditions [184]. However, there exist several other noise models, such as the negative exponential model which affects images acquired through synthetic aperture radar, Poisson noise which is a valid model for images acquired with cameras having low shutter speed or under poor illumination, and speckle noise in ultrasound [184]. The patch similarity measure, the relative behavior of the true signal and the noise instances in the transform domain, and the choice of norms or energy criteria to optimize for suitable denoising bases, are all affected by the assumed noise model. In the case of distributions like the Poisson, which are not really additive, characterizing the behavior of the noise instances in the transform domain poses a difficult problem. To complicate matters further, the noise affecting the image may be intensity dependent or drawn from noise distributions that are spatially varying; in fact, the Poisson model is one such example, and the noise induced by lossy compression algorithms is another. All these problems present rich avenues for future research. Ultimately, actual camera noise is the cumulative effect of several factors: shutter speed, ambient illumination, stability of the camera taking the picture, motion of the objects in the scene, the behavior of the electronic circuitry inside the camera, and the lossy compression algorithm used to store the images. A careful study of all these factors and the interplay between them is an important open problem in practical image processing.
APPENDIX A
DERIVATION OF MARGINAL DENSITY

In this appendix, we derive the expression for the marginal density of the intensity of a single 2D image. We begin with Eq. (2–27) derived in Section 2.2.1:

p(\alpha) = \frac{1}{A} \int_{I(x,y)=\alpha} \left| \det\begin{pmatrix} \partial x/\partial I & \partial y/\partial I \\ \partial x/\partial u & \partial y/\partial u \end{pmatrix} \right| du.   (A–1)

Consider the following two expressions, which appear while performing a change of variables and applying the chain rule:

\begin{pmatrix} dx & dy \end{pmatrix} = \begin{pmatrix} dI & du \end{pmatrix} \begin{pmatrix} \partial x/\partial I & \partial y/\partial I \\ \partial x/\partial u & \partial y/\partial u \end{pmatrix},   (A–2)

\begin{pmatrix} dI & du \end{pmatrix} = \begin{pmatrix} dx & dy \end{pmatrix} \begin{pmatrix} \partial I/\partial x & \partial u/\partial x \\ \partial I/\partial y & \partial u/\partial y \end{pmatrix} = \begin{pmatrix} dx & dy \end{pmatrix} \begin{pmatrix} I_x & u_x \\ I_y & u_y \end{pmatrix}.   (A–3)

Inverting the latter relation, we have

\begin{pmatrix} dx & dy \end{pmatrix} = \frac{1}{I_x u_y - I_y u_x} \begin{pmatrix} dI & du \end{pmatrix} \begin{pmatrix} u_y & -u_x \\ -I_y & I_x \end{pmatrix}.   (A–4)

Comparing the individual matrix coefficients, we obtain

\det\begin{pmatrix} \partial x/\partial I & \partial y/\partial I \\ \partial x/\partial u & \partial y/\partial u \end{pmatrix} = \frac{I_x u_y - u_x I_y}{(I_x u_y - I_y u_x)^2} = \frac{1}{I_x u_y - I_y u_x}.   (A–5)

Now, the unit vector \vec{u} is perpendicular to the gradient \nabla I, i.e. we have the following:

u_y = \frac{I_x}{\sqrt{I_x^2 + I_y^2}},   (A–6)

u_x = \frac{-I_y}{\sqrt{I_x^2 + I_y^2}}.   (A–7)

This finally gives us

\det\begin{pmatrix} \partial x/\partial I & \partial y/\partial I \\ \partial x/\partial u & \partial y/\partial u \end{pmatrix} = \frac{1}{\sqrt{I_x^2 + I_y^2}}.   (A–8)

Hence we arrive at the following expression for the marginal density:

p(\alpha) = \frac{1}{A} \int_{I(x,y)=\alpha} \frac{du}{\sqrt{I_x^2 + I_y^2}}.   (A–9)

This is the same expression as in Eq. (2–28).
APPENDIX B
THEOREM ON THE PRODUCT OF A CHAIN OF STOCHASTIC MATRICES

The specific theorem from [133] on the product of a chain of stochastic matrices is reproduced here for completeness:

Let Ω be an arbitrary set and, for each ω ∈ Ω, let

P^\omega = \begin{pmatrix} p^\omega_{11} & \cdots & p^\omega_{1N} \\ \vdots & \ddots & \vdots \\ p^\omega_{N1} & \cdots & p^\omega_{NN} \end{pmatrix}   (B–1)

be a row-stochastic matrix, i.e. a matrix with \sum_j p^\omega_{ij} = 1 and p^\omega_{ij} \ge 0 for all (i, j). Suppose all matrices P^\omega satisfy the condition that there exists a constant c > 0 such that \sum_j c^\omega_{j,\min} \ge c, where c^\omega_{j,\min} denotes the minimum value of the elements in the j-th column of P^\omega. Let ω = ω₁, ω₂, ... be an arbitrary sequence of elements from Ω. Then the limit M^\omega = \lim_{n\to\infty} P^{\omega_n} P^{\omega_{n-1}} \cdots P^{\omega_1} exists and is a matrix with identical rows, given as

M^\omega = \begin{pmatrix} \mu^\omega_1 & \cdots & \mu^\omega_N \\ \vdots & & \vdots \\ \mu^\omega_1 & \cdots & \mu^\omega_N \end{pmatrix}   (B–2)

for some probability vector (\mu^\omega_1, \mu^\omega_2, ..., \mu^\omega_N). Moreover, if M^\omega_n = P^{\omega_n} P^{\omega_{n-1}} \cdots P^{\omega_1} denotes the finite product, then for any i,

\frac{1}{2} \sum_{j=1}^{N} \left| M^\omega_n(i, j) - \mu^\omega_j \right| \le (1 - c)^n, \quad n \ge 0.   (B–3)

The convergence rate is thus upper bounded by (1 − c)^n.
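As a quick numerical illustration of the theorem (a sketch added here, not part of [133]), a product of strictly positive row-stochastic matrices rapidly develops identical rows:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_row_stochastic(N):
    # Strictly positive entries guarantee a positive column-minimum sum c.
    P = rng.random((N, N)) + 0.1
    return P / P.sum(axis=1, keepdims=True)

N, n_steps = 4, 50
M = np.eye(N)
for _ in range(n_steps):
    M = random_row_stochastic(N) @ M   # M_n = P_{w_n} ... P_{w_1}

# Spread between rows is essentially zero: the rows have converged
# to a common probability vector mu.
print(np.max(M.max(axis=0) - M.min(axis=0)))
```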
REFERENCES
[1] Tina Is No Acronym (TINA) Image Database, Available from http://www.tina-vision.net/ilib.php, 2008, University of Manchester and University of Sheffield, UK.
[2] A. Buades, B. Coll, and J.-M. Morel, “A review of image denoising algorithms, witha new one,” Multiscale modelling and simulation, vol. 4, no. 2, pp. 490–530, 2005.
[3] B. Silverman, Density Estimation for Statistics and Data Analysis. London, UK:Chapman and Hall, 1986.
[4] J. Simonoff, Smoothing Methods in Statistics. Berlin, Germany: Springer Verlag, 1996.
[5] C. Bishop, Pattern Recognition and Machine Learning. Springer Verlag, 2006.
[6] D. Herrick, G. Nason, and B. Silverman, “Some new methods for wavelet densityestimation,” Sankhya, vol. 63, pp. 394–411, 2001.
[7] A. Peter and A. Rangarajan, “Maximum likelihood wavelet density estimation withapplications to image and shape matching,” IEEE Trans. Image Process., vol. 17,no. 4, pp. 458–468, April 2008.
[8] D. Donoho, I. Johnstone, G. Kerkyacharian, and D. Picard, “Density estimation bywavelet thresholding,” Ann. Stat., vol. 24, pp. 508–539, 1996.
[9] A. Rajwade, A. Banerjee, and A. Rangarajan, “New method of probability densityestimation with application to mutual information based image registration,”in IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, 2006, pp.1769–1776.
[10] ——, "Continuous image representations avoid the histogram binning problem in mutual information based image registration," in IEEE Int. Symp. Biomedical Imaging, 2006, pp. 840–843.
[11] ——, “Probability density estimation using isocontours and isosurfaces:applications to information-theoretic image registration,” IEEE Trans. PatternAnal. Mach. Intell., vol. 31, no. 3, pp. 475–491, 2009.
[12] T. Kadir and M. Brady, “Estimating statistics in arbitrary regions of interest,” inBritish Mach. Vision Conf., 2005, pp. 589–598.
[13] N. Joshi and M. Brady, “Nonparametric mixture model based evolution of levelsets,” in Int. Conf. Computing: Theory and Applications (ICCTA), 2007, pp.618–622.
[14] E. Hadjidemetriou, M. Grossberg, and S. Nayar, “Histogram preserving imagetransformations,” Int. J. Comput. Vis., vol. 45, no. 1, pp. 5–23, 2001.
[15] J. Boes and C. Meyer, “Multi-variate mutual information for registration,” in Med.Image Comput. Computer-Assisted Intervention, ser. LNCS, vol. 1679. Springer,1999, pp. 606–612.
[16] J. Zhang and A. Rangarajan, “Multimodality image registration using an extensibleinformation metric,” in Inf. Process. Med. Img., ser. LNCS, vol. 3565. Springer,2005, pp. 725–737.
[17] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf, Computational Geometry: Algorithms and Applications. Berlin, Germany: Springer Verlag, 1997.
[18] L. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Physica D, vol. 60, pp. 259–268, 1992.
[19] D. L. Collins et al., “Design and construction of a realistic digital brain phantom,”IEEE Trans. Med. Imag., vol. 17, no. 3, pp. 463–468, 1998.
[20] J. Pluim, J. Maintz, and M. Viergever, “Mutual information based registration ofmedical images: A survey,” IEEE Trans. Med. Imag., vol. 22, no. 8, pp. 986–1004,2003.
[21] H. Chen, M. Arora, and P. Varshney, “Mutual information-based image registrationfor remote sensing data,” J. Remote Sensing, vol. 24, no. 18, pp. 3701–3706,2003.
[22] P. Viola and W. Wells, “Alignment by maximization of mutual information,” Int. J.Comput. Vis., vol. 24, no. 2, pp. 137–154, 1997.
[23] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens,“Multimodality image registration by maximization of mutual information,” IEEETrans. Med. Imag., vol. 16, no. 2, pp. 187–198, 1997.
[24] F. Maes, D. Vandermeulen, and P. Suetens, “Medical image registration usingmutual information,” Proc. IEEE, vol. 91, no. 10, pp. 1699–1722, 2003.
[25] M. Rao, Y. Chen, B. Vemuri, and F. Wang, “Cumulative residual entropy: A newmeasure of information,” IEEE Trans. Inf. Theory, vol. 50, no. 6, pp. 1220–1228,2004.
[26] F. Wang and B. Vemuri, “Non-rigid multi-modal image registration usingcross-cumulative residual entropy,” Int. J. Comput. Vis., vol. 74, no. 2, pp.201–215, 2007.
[27] F. Wang, B. Vemuri, M. Rao, and Y. Chen, “Cumulative residual entropy, a newmeasure of information and its application to image alignment,” in IEEE Int. Conf.Computer Vision, 2003, pp. 548–553.
[28] J. Beirlant, E. Dudewicz, L. Gyorfi, and E. C. van der Meulen, "Nonparametric entropy estimation: An overview," Int. J. Math. Stat. Sci., vol. 6, no. 1, pp. 17–39, June 1997.
[29] P. Viola, “Alignment by maximization of mutual information,” Ph.D. dissertation,Massachussets Institute of Technology, 1995.
[30] C. Yang, R. Duraiswami, N. Gumerov, and L. Davis, “Improved fast Gausstransform and efficient kernel density estimation,” in IEEE Int. Conf. ComputerVision, vol. 1, 2003, pp. 464–471.
[31] M. Leventon and W. Grimson, “Multi-modal volume registration using joint intensitydistributions,” in Med. Image Comput. Computer-Assisted Intervention, vol. 1496,1998, pp. 1057–1066.
[32] T. Downie and B. Silverman, “A wavelet mixture approach to the estimation ofimage deformation functions,” Sankhya Series B, vol. 63, pp. 181–198, 2001.
[33] B. Ma, A. Hero, J. Gorman, and O. Michel, “Image registration with minimumspanning tree algorithm,” in IEEE Int. Conf. Image Process., vol. 1, 2000, pp.481–484.
[34] J. Costa and A. Hero, “Entropic graphs for manifold learning,” in IEEE AsilomarConf. Sign., Sys. and Comp., vol. 1, 2003, pp. 316–320.
[35] M. Sabuncu and P. Ramadge, “Gradient based optimization of an EMST imageregistration function,” in IEEE Int. Conf. Acoust., Speech, Sig. Proc., vol. 2, 2005,pp. 253–256.
[36] N. Dowson, R. Bowden, and T. Kadir, “Image template matching using mutualinformation and NP-Windows,” in Int. Conf. Pattern Recognition, vol. 2, 2006, pp.1186–1191.
[37] B. Karacali, “Information theoretic deformable registration using local imageinformation,” Int. J. Comput. Vis., vol. 72, no. 3, pp. 219–237, 2007.
[38] P. Thevenaz and M. Unser, “Optimization of mutual information for multiresolutionimage registration,” IEEE Trans. Image Process., vol. 9, no. 12, pp. 2083–2099,2000.
[39] T. Cover and J. Thomas, Elements of Information Theory. New York, USA: WileyInterscience, 1991.
[40] J. Zhang and A. Rangarajan, “Affine image registration using a new informationmetric,” in IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, 2004, pp.848–855.
[41] W. Feller, “On the Kolmogorov-Smirnov limit theorems for empirical distributions,”The Annals of Mathematical Statistics, vol. 19, no. 2, pp. 177–189, 1948.
[42] R. Shekhar and V. Zagrodsky, “Mutual information-based rigid and nonrigidregistration of ultrasound volumes,” IEEE Trans. Med. Imag., vol. 21, no. 1, pp.9–22, 2002.
[43] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recogni-tion. Berlin, Germany: Springer Verlag, 1996.
[44] P. Perona and J. Malik, “Scale-space and edge detection using anisotropicdiffusion,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 7, pp. 629–639,1990.
[45] D. Tschumperle and R. Deriche, “Vector-valued image regularization with PDEs :A common framework for different applications,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 27, no. 4, pp. 506–517, 2005.
[46] B. Tang and G. Sapiro, “Color image enhancement via chromaticity diffusion,”IEEE Trans. Image Process., vol. 10, pp. 701–707, 1999.
[47] P. Saint-Marc, J. Chen, and G. Medioni, “Adaptive smoothing: a general tool forearly vision,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 6, pp. 514–520,1991.
[48] K. Plataniotis and A. Venetsanopoulos, Color image processing and applications.New York, USA: Springer Verlag, 2000.
[49] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in IEEEInt. Conf. Computer Vision, 1998, pp. 839–846.
[50] Y. Cheng, “Mean shift, mode seeking and clustering,” IEEE Trans. Pattern Anal.Mach. Intell., vol. 17, no. 8, pp. 790–799, 1995.
[51] D. Comaniciu and P. Meer, “Mean shift: a robust approach toward feature spaceanalysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, pp. 603–619,2002.
[52] A. Rajwade, A. Banerjee, and A. Rangarajan, “Image filtering driven by levelcurves,” in Int. Conf. Energy Min. Methods Computer Vision Pattern Recognition,2009, pp. 359–372.
[53] T. Chan and J. Shen, Image Processing and Analysis: Variational, PDE, wavelets,and stochastic methods. SIAM, 2005.
[54] D. Barash and D. Comaniciu, “A common framework for nonlinear diffusion,adaptive smoothing, bilateral filtering and mean shift,” Image Vis. Comput., vol. 22,pp. 73–81, 2004.
[55] A. Buades, B. Coll, and J.-M. Morel, “Neighborhood filters and PDEs,” NumerischeMathematik, vol. 105, no. 1, pp. 1–34, 2006.
[56] R. Subbarao and P. Meer, “Discontinuity preserving filtering over analyticmanifolds,” in IEEE Conf. Computer Vision and Pattern Recognition, 2007, pp.1–6.
[57] J. van de Weijer and R. van den Boomgaard, "Local mode filtering," in IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, 2001, pp. 428–436.
[58] N. Sochen, R. Kimmel, and R. Malladi, “A general framework for low level vision,”IEEE Trans. Image Process., vol. 7, pp. 310–318, 1998.
[59] D. Comaniciu, “An algorithm for data-driven bandwidth selection,” IEEE Trans.Pattern Anal. Mach. Intell., vol. 25, pp. 281–288, 2003.
[60] D. Comaniciu, V. Ramesh, and P. Meer, “The variable bandwidth mean shiftand data-driven scale selection,” in IEEE Int. Conf. Computer Vision, 2001, pp.438–445.
[61] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmentednatural images and its application to evaluating segmentation algorithms andmeasuring ecological statistics,” in IEEE Int. Conf. Computer Vision, vol. 2, 2001,pp. 416–423.
[62] S. Lansel, “About DenoiseLab,” Available from http://www.stanford.edu/∼slansel/DenoiseLab/documentation.htm, 2006.
[63] Z. Wang, E. Simoncelli, and A. Bovik, “Multi-scale structural similarity for imagequality assessment,” in IEEE Asilomar Conf. Signals, Sys. Comp., 2003, pp.1398–1402.
[64] O. Subakan, J. Bing, B. Vemuri, and E. Vallejos, “Feature preserving imagesmoothing using a continuous mixture of tensors,” in IEEE Int. Conf. ComputerVision, 2007, pp. 1–6.
[65] H. Takeda, S. Farsiu, and P. Milanfar, “Kernel regression for image processing andreconstruction,” IEEE Trans. Image Process., vol. 16, no. 2, pp. 349–366, 2007.
[66] K. Mardia and P. Jupp, Directional Statistics. Chichester, UK: Wiley Interscience,2000.
[67] P. Kim and J. Koo, “Directional mixture models and optimal estimation of themixing density,” Can. J. Stat., pp. 383–398, 1998.
[68] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, “Clustering on the unit hypersphereusing von Mises-Fisher distributions,” J. Mach. Learning Res., vol. 6, pp.1345–1382, 2005.
[69] T. McGraw, B. Vemuri, R. Yezierski, and T. Mareci, “Von Mises-Fisher mixturemodel of the diffusion ODF,” in IEEE Int. Symp. Biomedical Imaging, 2006, pp.65–68.
[70] A. Prati, S. Calderara, and R. Cucchiara, “Using circular statistics for trajectoryshape analysis,” in IEEE Conf. Computer Vision and Pattern Recognition, June2008, pp. 1–6.
[71] K. Hara, K. Nishino, and K. Ikeuchi, “Multiple light sources and reflectanceproperty estimation based on a mixture of spherical distributions,” IEEE Int. Conf.Computer Vision, vol. 2, pp. 1627–1634, Oct. 2005.
[72] C. Han, B. Sun, R. Ramamoorthi, and E. Grinspun, “Frequency domain normalmap filtering,” ACM Trans. Graphics, vol. 26, no. 3, pp. 28–37, 2007.
[73] O. Egecioglu and A. Srinivasan, "Efficient nonparametric density estimation on the sphere with applications in fluid mechanics," SIAM Journal on Scientific Computing, vol. 22, no. 1, pp. 152–176, 2000.
[74] A. Papoulis, Probability, Random Variables and Stochastic Processes. McGrawHill, 1984.
[75] H. Schaeben, “Normal orientation distributions,” Textures and Microstructures,vol. 19, pp. 197–202, 1992.
[76] A. Bijral, M. Breitenbach, and G. Grudic, “Mixture of Watson distributions: Agenerative model for hyperspherical embeddings,” in AI and Statistics, 2007, pp.1–8.
[77] T. Downs and A. L. Gould, “Some relationships between the normal and vonMises distributions,” Biometrika, vol. 54, no. 3, pp. 684–687, 1967.
[78] B. Presnell, S. Morrison, and R. Littell, “Projected multivariate linear models fordirectional data,” J. Am. Stat. Assoc., vol. 93, no. 443, pp. 1068–1077, 1998.
[79] T. Gevers and H. Stokman, “Robust histogram construction from color invariantsfor object recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, pp.113–118, 2003.
[80] B. Pelletier, “Kernel density estimation on Riemannian manifolds,” Stat. Prob.Letters, vol. 73, pp. 297–304, 2005.
[81] D. Donoho and T. Weissman, “Recent trends in denoising tutorial, ISIT 2007,”Available from http://www.stanford.edu/∼slansel/tutorial/summary.htm, 2007.
[82] B. M. ter Haar Romeny, Geometry-driven diffusion in computer vision. Utrecht,Netherlands: Kluwer, 1994.
[83] J. Weickert, Anisotropic Diffusion in Image Processing. Stuttgart, Germany:Teubner, 1998.
[84] M. Black, G. Sapiro, D. Marimont, and D. Heeger, “Robust anisotropic diffusion,”IEEE Trans. Image Process., vol. 7, no. 3, pp. 421–432, 1998.
[85] F. Catte, P. Lions, J. Morel, and T. Coll, “Image selective smoothing and edgedetection by nonlinear diffusion,” SIAM J. Numer. Anal., vol. 29, no. 1, pp.182–193, 1992.
[86] L. Rudin and S. Osher, "Total variation based image restoration with free local constraints," in IEEE Int. Conf. Image Process., 1994, pp. 31–35.
[87] T. Le, R. Chartrand, and T. Asaki, “A variational approach to reconstructingimages corrupted by Poisson noise,” J. Math. Imag. Vis., vol. 27, pp. 257–263,2007.
[88] G. Gilboa, N. Sochen, and Y. Zeevi, “Image enhancement and denoising bycomplex diffusion processes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26,no. 8, pp. 1020–1036, 2004.
[89] D. Seo and B. Vemuri, “Complex diffusion on scalar and vector valued imagegraphs,” in Int. Conf. Energy Min. Methods Comput. Vision Pattern Recognition,2009, pp. 98–111.
[90] Y. You and M. Kaveh, “Fourth order partial differential equations for noise removal,”IEEE Trans. Image Process., vol. 9, no. 10, pp. 1723–1730, 2000.
[91] M. Hajiaboli, “An anisotropic fourth-order partial differential equation for noiseremoval,” in Scale Space and Variational Methods in Computer Vision, 2009, pp.356–367.
[92] P. Mrazek, “Monotonicity enhancing nonlinear diffusion,” J. Visual Commun. ImageRepresentation, vol. 13, no. 1, pp. 313–323, 2000.
[93] A. Savitzky and M. Golay, “Smoothing and differentiation of data by simplified leastsquares procedures,” Anal. Chem., vol. 36, no. 8, pp. 1627–1639, 1964.
[94] W. Press, A. Teukolsky, W. Vetterling, and B. Flannery, Numerical recipes in C(2nd ed.): the art of scientific computing. New York, NY, USA: CambridgeUniversity Press, 1992.
[95] J. Fan and I. Gijbels, Local polynomial modeling and its application. London, UK:Chapman and Hill, 1996.
[96] S. M. Smith and J. M. Brady, “SUSAN - a new approach to low level imageprocessing,” Int. J. Comput. Vis., vol. 23, pp. 45–78, 1995.
[97] V. Katkovnik, A. Foi, K. Egiazarian, and J. Astola, “Directional varying scaleapproximations for anisotropic signal processing,” in Eur. Signal Process. Conf.,2004, pp. 1–6.
[98] K. Fukunaga and L. Hostetler, “The estimation of the gradient of a densityfunction, with applications in pattern recognition,” IEEE Trans. Inf. Theory, vol. 21,no. 1, pp. 32–40, 1975.
[99] M. Elad, “On the origin of the bilateral filter and ways to improve it,” IEEE Trans.Image Process., vol. 11, no. 10, pp. 1141–1151, 2002.
[100] N. Sochen, R. Kimmel, and A. Bruckstein, “Diffusions and confusions in signal andimage processing,” J. Math. Imag. Vis., vol. 14, pp. 195–209, 2001.
[101] O. Subakan, “Continuous mixture models for feature preserving smoothing andsegmentation,” Ph.D. dissertation, University of Florida, 2009.
[102] B. Jian, B. Vemuri, E. Ozarslan, P. Carney, and T. Mareci, “A novel tensordistribution model for the diffusion-weighted MR signal,” Neuroimage, vol. 37,no. 1, pp. 164–176, 2007.
[103] D. Tschumperle, “Fast anisotropic smoothing of multi-valued images usingcurvature-preserving PDEs,” Int. J. Comput. Vis., vol. 68, no. 1, pp. 65–82, 2006.
[104] R. Frankot and R. Chellappa, “A method for enforcing integrability in shape fromshading algorithms,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 10, no. 4, pp.439–451, 1988.
[105] A. Agrawal and R. Raskar, “Short course (ICCV 2007): Gradient domainmanipulation techniques in vision and graphics,” http://www.umiacs.umd.edu/∼aagrawal/ICCV2007Course/index.html, 2007.
[106] H. Wang, Y. Chen, T. Fang, J. Tyan, and N. Ahuja, “Gradient adaptive imagerestoration and enhancement,” in Int. Conf. Image Proc., 2006, pp. 2893–2896.
[107] P. Hancock, R. Baddeley, and L. Smith, “The principal components of naturalimages,” Network: Computation in Neural Systems, vol. 3, pp. 61–72, 1992.
[108] R. Coifman and D. Donoho, “Translation-invariant denoising,” Yale University, Tech.Rep., 1995.
[109] L. Yaroslavsky, K. Egiazarian, and J. Astola, “Transform domain image restorationmethods: review, comparison and interpretation,” in SPIE Proceedings Series,Nonlinear Processing and Pattern Analysis, 2001, pp. 1–15.
[110] W. Freeman and E. Adelson, “The design and use of steerable filters,” IEEE Trans.Pattern Anal. Mach. Intell., vol. 13, no. 9, pp. 891–906, 1991.
[111] J.-L. Starck, E. Candes, and D. Donoho, “The curvelet transform for imagedenoising,” IEEE Trans. Image Process., vol. 11, no. 6, pp. 670–684, 2002.
[112] J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and itsoracle properties,” J. Am. Stat. Assoc., vol. 96, no. 456, pp. 1348–1360, December2001.
[113] D. Donoho and I. Johnstone, “Ideal spatial adaptation by wavelet shrinkage,”Biometrika, vol. 81, pp. 425–455, 1993.
[114] J. Mairal, G. Sapiro, and M. Elad, “Learning multiscale sparse representations forimage and video restoration,” Multiscale Modeling and Simulation, vol. 7, no. 1, pp.214–241, 2008.
[115] L. Sendur and I. Selesnick, “Bivariate shrinkage functions for wavelet-baseddenoising exploiting interscale dependency,” IEEE Trans. Signal Process., vol. 50,no. 11, pp. 2744–2756, 2002.
[116] E. Simoncelli, “Bayesian denoising of visual images in the wavelet domain,” inLecture Notes in Statistics, vol. 141, 1999, pp. 291–308.
[117] J. Portilla, V. Strela, M. Wainwright, and E. Simoncelli, “Image denoising usingscale mixtures of Gaussians in the wavelet domain,” IEEE Trans. Image Process.,vol. 12, no. 11, pp. 1338–1351, 2003.
[118] A. Hyvarinen, P. Hoyer, and E. Oja, “Image denoising by sparse code shrinkage,”in Intelligent Signal Processing, 1999, pp. 1–6.
[119] J. Huang and D. Mumford, “Statistics of natural images and models,” IEEE Conf.Computer Vision and Pattern Recognition, vol. 1, pp. 1541–1548, 1999.
[120] Y. Hel-Or and D. Shaked, “A discriminative approach for wavelet denoising,” IEEETrans. Image Process., vol. 17, no. 4, pp. 443–457, 2008.
[121] A. Buades, B. Coll, and J.-M. Morel, “Nonlocal image and movie denoising,” Int. J.Comput. Vis., vol. 76, no. 2, pp. 123–139, 2008.
[122] T. Brox, O. Kleinschmidt, and D. Cremers, “Efficient nonlocal means for denoisingof textural patterns,” IEEE Trans. Image Process., vol. 17, no. 7, pp. 1083–1092,2008.
[123] M. Ghazel, G. Freeman, and E. Vrscay, “Fractal image denoising,” IEEE Trans.Image Process., vol. 12, no. 12, pp. 1560–1578, 2003.
[124] S. Awate and R. Whitaker, “Unsupervised, information-theoretic, adaptive imagefiltering for image restoration,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28,no. 3, pp. 364–376, 2006.
[125] K. Popat and R. Picard, “Cluster-based probability model and its application toimage and texture processing,” IEEE Trans. Image Process., vol. 6, no. 2, pp.268–284, 1997.
[126] A. Efros and T. Leung, “Texture synthesis by nonparametric sampling,” in IEEE Int.Conf. Computer Vision, 1999, pp. 1033–1038.
[127] J. D. Bonet, “Noise reduction through detection of signal redundancy,” MIT, AI Lab,Tech. Rep., 1997.
[128] D. Zhang and Z. Wang, “Restoration of impulse noise corrupted images usinglong-range correlation,” IEEE Signal Process. Letters, vol. 5, no. 1, pp. 4–6, 1998.
[129] ——, “Image information restoration based on long-range correlation,” IEEE Trans.Circuit Syst. Video Technol., vol. 12, no. 5, pp. 331–341, 2002.
[130] S. Kindermann, S. Osher, and P. Jones, “Deblurring and denoising of images bynonlocal functionals,” SIAM Interdisc. J., vol. 4, no. 4, pp. 1091–1115, 2005.
[131] M. Ebrahimi and E. Vrscay, “Self-similarity in imaging, 20 years after ‘fractalseverywhere’,” in Int. Workshop Local Non-Local Approx. Image Process., 2008,pp. 165–172.
[132] O. Kleinschmidt, T. Brox, and D. Cremers, “Nonlocal texture filtering with efficienttree structures and invariant patch similarity measures,” in Int. Workshop LocalNon-Local Approx. Image Process., 2008, pp. 1–8.
[133] O. Stenflo, “Perfect sampling from the limit of deterministic products of stochasticmatrices,” Electronic Commun. Prob., vol. 13, pp. 474–481, 2008.
[134] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse3-d transform-domain collaborative filtering,” IEEE Trans. Image Process., vol. 16,no. 8, pp. 2080–2095, 2007.
[135] K. Hirakawa and T. Parks, “Image denoising using total least squares,” IEEETrans. Image Process., vol. 15, no. 9, pp. 2730–2742, 2006.
[136] L. Dascal, M. Zibulevsky, and R. Kimmel, “Signal denoising by constraining theresidual to be statistically noise-similar,” Technion, Israel, Tech. Rep., 2008.
[137] E. Tadmor, S. Nezzar, and L. Vese, “A multiscale image representation usinghierarchical (BV,L2) decompositions,” Multiscale modelling and simulation, vol. 2,pp. 554–579, 2004.
[138] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin, “An iterative regularizationmethod for total variation-based image restoration,” Multiscale modelling andsimulation, vol. 4, pp. 460–489, 2005.
[139] J. Polzehl and V. Spokoiny, “Image denoising: a pointwise adaptive approach,”Ann. Stat., vol. 31, no. 1, pp. 30–57, 2003.
[140] J. Kleinberg and E. Tardos, Algorithm Design. Boston, USA: Addison-WesleyLongman, 2005.
[141] D. Brunet, E. Vrscay, and Z. Wang, “The use of residuals in image denoising,” inInt. Conf. Image Anal. Recognition, 2009, pp. 1–12.
[142] Y. Chen, H. Wang, T. Fang, and J. Tyan, “Mutual information regularized bayesianframework for multiple image restoration,” in IEEE Int. Conf. Computer Vision,2005, pp. 190–197.
[143] B. Olshausen and D. Field, "Emergence of simple-cell receptive-field properties by learning a sparse code for natural images," Nature, vol. 381, no. 6583, pp. 607–609, 1996.
[144] M. Lewicki, T. Sejnowski, and H. Hughes, “Learning overcompleterepresentations,” Neural Computation, vol. 12, pp. 337–365, 1998.
[145] M. Lewicki and B. Olshausen, “A probabilistic framework for the adaptation andcomparison of image codes,” J. Opt. Soc. Am., vol. 16, pp. 1587–1601, 1999.
[146] M. Elad and M. Aharon, “Image denoising via learned dictionaries and sparserepresentation,” in IEEE Conf. Computer Vision and Pattern Recognition, vol. 1,2006, pp. 17–22.
[147] M. Aharon, M. Elad, and A. Bruckstein, “The K-SVD: an algorithm for designing ofovercomplete dictionaries for sparse representation,” IEEE Trans. Signal Process.,vol. 54, no. 11, pp. 4311–4322, 2006.
[148] P. Chatterjee and P. Milanfar, “Clustering-based denoising with locally learned dictionaries,” IEEE Trans. Image Process., vol. 18, no. 7, pp. 1438–1451, 2009.
[149] J. Yang, J. Wright, T. Huang, and Y. Ma, “Image super-resolution as sparse representation of raw image patches,” in IEEE Conf. Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[150] Z. Wang and A. Bovik, “Mean squared error: Love it or leave it? A new look at signal fidelity measures,” IEEE Signal Process. Mag., vol. 26, no. 1, pp. 98–117, 2009.
[151] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004.
[152] H. Andrews and C. Patterson, “Singular value decompositions and digital image processing,” IEEE Trans. Acoust., Speech and Signal Process., vol. 24, no. 1, pp. 425–432, 1976.
[153] ——, “Singular value decomposition (SVD) image coding,” IEEE Trans. Commun., vol. 24, no. 4, pp. 425–432, 1976.
[154] L. Trefethen and D. Bau, Numerical Linear Algebra. SIAM: Society for Industrial and Applied Mathematics, 1997.
[155] A. Rangarajan, “Learning matrix space image representations,” in Int. Conf. Energy Min. Methods Computer Vision Pattern Recognition, 2001, pp. 153–168.
[156] D. Tschumperle and R. Deriche, “Orthonormal vector sets regularization with PDEs and applications,” Int. J. Comput. Vis., vol. 50, pp. 237–252, 2002.
[157] J. Ye, “Generalized low rank approximations of matrices,” Mach. Learning, vol. 61,no. 1, pp. 167–191, 2005.
[158] C. Ding and J. Ye, “Two-dimensional singular value decomposition (2DSVD) for 2D maps and images,” in SIAM Int. Conf. Data Mining, 2005, pp. 32–43.
[159] N. Kwak, “Principal component analysis based on L1-norm maximization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 9, pp. 1672–1680, 2008.
[160] G. Heo, P. Gader, and H. Frigui, “RKF-PCA: Robust kernel fuzzy PCA,” Neural Networks, vol. 22, no. 5-6, pp. 642–650, 2009.
[161] A. Efros and W. Freeman, “Image quilting for texture synthesis and transfer,” in SIGGRAPH: Annual Conf. Computer Graphics and Interactive Techniques, 2001, pp. 341–346.
[162] A. Hyvarinen, J. Hurri, and P. Hoyer, Natural Image Statistics: A Probabilistic Approach to Early Computational Vision. Heidelberg, Germany: Springer, 2009.
[163] A. Rosenfeld and A. Kak, Digital Picture Processing. Orlando, USA: Academic Press, 1982.
[164] L. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychometrika, vol. 31, no. 3, pp. 279–311, 1966.
[165] L. de Lathauwer, “Signal processing based on multilinear algebra,” Ph.D. dissertation, Katholieke Universiteit Leuven, Belgium, 1997.
[166] M. Vasilescu and D. Terzopoulos, “Multilinear analysis of image ensembles: TensorFaces,” in Int. Conf. Pattern Recognition, 2002, pp. 511–514.
[167] K. Gurumoorthy, A. Rajwade, A. Banerjee, and A. Rangarajan, “Beyond SVD: Sparse projections onto exemplar orthonormal bases for compact image representation,” in Int. Conf. Pattern Recognition, 2008, pp. 1–4.
[168] ——, “A method for compact image representation using sparse matrix and tensor projections onto exemplar orthonormal bases,” IEEE Trans. Image Process., vol. 19, no. 2, pp. 322–334, 2010.
[169] D. Muresan and T. Parks, “Adaptive principal components and image denoising,” in IEEE Int. Conf. Image Process., 2003, pp. 101–104.
[170] A. Rajwade, A. Rangarajan, and A. Banerjee, “Automated filter parameter selection using measures of noiseness,” in Can. Conf. Comput. Robot Vision, 2010, pp. 86–93.
[171] J. Weickert, “Coherence-enhancing diffusion filtering,” Int. J. Comput. Vis., vol. 31, no. 3, pp. 111–127, 1999.
[172] P. Mrazek and M. Navara, “Selection of optimal stopping time for nonlinear diffusion filtering,” Int. J. Comput. Vis., vol. 52, no. 2-3, pp. 189–203, 2003.
[173] J.-F. Aujol, G. Gilboa, T. Chan, and S. Osher, “Structure-texture image decomposition–modeling, algorithms and parameter selection,” Int. J. Comput. Vis., vol. 67, no. 1, pp. 111–136, 2006.
[174] G. Gilboa, N. Sochen, and Y. Zeevi, “Estimation of optimal PDE-based denoising in the SNR sense,” IEEE Trans. Image Process., vol. 15, no. 8, pp. 2269–2280, 2006.
[175] I. Vanhamel, C. Mihai, H. Sahli, A. Katartzis, and I. Pratikakis, “Scale selection for compact scale-space representation of vector-valued images,” Int. J. Comput. Vis., vol. 84, no. 2, pp. 194–204, 2009.
[176] D. Donoho and I. Johnstone, “Adapting to unknown smoothness via wavelet shrinkage,” J. Am. Stat. Assoc., vol. 90, no. 432, pp. 1200–1224, 1995.
[177] A. Dvoretzky, J. Kiefer, and J. Wolfowitz, “Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator,” Ann. Math. Stat., vol. 27, no. 3, pp. 642–669, 1956.
[178] Wikipedia, “Dvoretzky-Kiefer-Wolfowitz inequality,” Available from http://en.wikipedia.org/wiki/Dvoretzky-Kiefer-Wolfowitz_inequality, 2010.
[179] J. Kiefer, “K-sample analogues of the Kolmogorov-Smirnov and Cramer-von Mises tests,” Ann. Math. Stat., vol. 30, no. 2, pp. 420–447, 1959.
[180] F. Anscombe, “The transformation of Poisson, binomial and negative-binomial data,” Biometrika, vol. 35, pp. 246–254, 1948.
[181] D. Donoho, “Denoising by soft thresholding,” IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 613–627, 1995.
[182] A. D. Stefano, P. White, and W. Collis, “Training methods for image noise level estimation on wavelet components,” EURASIP J. Appl. Signal Process., vol. 2004, no. 16, pp. 2400–2407, 2004.
[183] X. Zhu and P. Milanfar, “A no-reference image content metric and its application to denoising,” in IEEE Int. Conf. Image Process., 2010, pp. 1–4.
[184] C. Boncelet, “Image noise models,” in Handbook of Image and Video Processing, A. Bovik, Ed. New York, USA: Academic Press, 2005, pp. 397–410.