A Micro-Mirror Array based System for

Compressive Sensing of Hyperspectral Data

Yehuda Pfeffer, Dept. of Electrical Engineering, Technion - Israel Institute of Technology

Michael Zibulevsky, Dept. of Computer Science, Technion - Israel Institute of Technology

January 18, 2010

Abstract

We introduce a system for hyperspectral imaging which includes a micro-mirror array that projects subsets of image pixels onto a prism (or diffraction grating), followed by a CCD-type sensor. This system allows generalized sampling schemes known as Compressed Sensing (CS). We acquire only a fraction of the samples that would be required to obtain the full-resolution signal (a hyperspectral cube in our case), and recover the underlying signal by means of non-linear optimization. We use prior knowledge of the signal's sparsity in some fixed dictionary, as well as its limited total variation. In the practical setting developed here, the feasible sampling is not ideal for CS due to hardware limitations, and the sensed signal does not necessarily meet the strict sparsity demands of CS theory. We therefore introduce additional measurements of the full-resolution image using a small number of filters, similar to RGB. As a result, we obtain a feasible system for hyperspectral imaging that enables faster acquisition than traditional sampling systems.

1 Introduction

Acquisition of hyper-spectral (HS) data takes time. The data cube is not acquired all at once - it is captured either line by line, or by sequentially sampling the Fourier domain (see subsubsection 3.2.1 for details). The long acquisition time limits the usability of HS imaging - moving objects, for example, cannot be captured properly by an HS camera, in contrast to color imaging. There is also the issue of the large amount of data involved in HS imaging when that data needs to be transmitted. Compression algorithms solve the data handling problem successfully, but the lengthy acquisition time cannot be addressed by compression algorithms such as JPEG [41], since JPEG needs all the data in order to perform the compression. Acquisition time can therefore be shortened in two ways. First, by reducing the acquisition time of each line; this will result in a lower SNR if the illumination intensity


Technion - Computer Science Department - Technical Report CS-2010-01 - 2010


will not increase accordingly, which is the case when we use natural light. The second way of reducing acquisition time is to reduce the spatial and/or spectral resolution of the HS cube, which obviously results in undesirably lower-resolution data.

In this report we use ideas from the field of Compressive Sampling, also termed Compressed Sensing (CS) [1], [3]. By using CS we follow the second approach - sampling less data in order to achieve faster acquisition and compression. CS theory tells us that for many signals we can sub-sample the signal and still reconstruct the original signal with good accuracy. The signal has to be either sparse itself, or have a sparse representation in some fixed dictionary; we refer to both cases as a sparse signal. The compressed sampling must be done under certain conditions, and the reconstruction algorithm is based on nonlinear optimization; the details are given in subsection 2.2. Compressed sensing is useful when we have strong limitations on the amount of data that we can acquire. Examples include MRI [20], where the signal acquisition speed is limited by physical and physiological constraints; analog-to-digital converters [19], where technology and hardware costs limit the maximum rate; and medical imaging techniques involving ionizing radiation, such as CT and PET, in which we are interested in minimizing the amount of radiation.

In the Methods section we present the main theoretical results in CS, and then demonstrate their use on one-dimensional signals. We also experiment with CS for 2D image acquisition in subsection 2.3, both on synthetic data and on real-life images. The use of CS on images is not straightforward, and this subsection expands both the acquisition and the reconstruction schemes beyond the classic results. In the Compressed sensing of hyperspectral data section we develop a novel sampling system for hyper-spectral data that is based on compressive sensing. We simulate its use, and develop novel reconstruction schemes that take advantage of the special structure of the HS data cube.

2 Methods

2.1 Compressive Sensing - Introduction

Compressive Sensing (CS) [1], [3] is an emerging field based on the understanding that for certain types of signals it is possible and practical to acquire M ≪ N samples, where N is the signal length for discrete data, or the minimal number of samples required by the Nyquist theorem. Throughout this report the signal x ∈ R^N is sampled by the sampling (or sensing) matrix Φ ∈ R^{M×N}, obtaining the measurement vector y = Φx. The classic results [4], [5] are established for sparse or compressible signals - the sampled signal x must be either sparse or compressible w.r.t. a known dictionary. However, [10] gives a more general description of the idea behind CS: if the under-sampled signal x lies in a low-dimensional non-linear manifold, then it is possible to reconstruct x from the measurement vector y, assuming the sampling is done under certain limitations. If the non-linear manifold is the set of sparse signals (possibly in some fixed and known dictionary), then there are well-defined conditions on the sampling matrix [6], [14]. If, however, the non-linear manifold is the set of signals that have low Total Variation (TV, [13]), as in [18], then there are no theoretical results that justify the success of CS, although there are practical results. Other such restrictions, specific to the sampled signal, can be imposed on the reconstruction process. In this work we focus on efficient ways to sample and reconstruct hyperspectral (HS) data. The special structure of the HS data, along with a novel sampling scheme, allows us to achieve surprisingly good results on data that does not meet the theoretical sparsity demands of the current CS framework.

2.2 Compressive Sensing - Main results

Some definitions: Let x ∈ R^N be the signal to be sampled, let Φ ∈ R^{M×N} be the sampling matrix, and let y = Φx, y ∈ R^M, be the measurement vector. We consider the noise-free case just defined, as well as a noisy measurement vector y = Φx + n, where n ∈ R^M is the vector of measurement noise.
Sparsity and compressibility: an s-sparse signal x has no more than s entries different from 0. Formally, ‖x‖0 ≤ s, where ‖·‖0 is the ℓ0 pseudo-norm, which counts the non-zero entries of a vector. An s-compressible signal is a signal whose energy is mostly concentrated in no more than s entries. The sparse approximation to x, denoted xs, is the closest s-sparse vector to x under the ℓ2 norm; xs is obtained from x by setting to zero all the entries that are not among the s largest in magnitude.
Perfect reconstruction of sparse signals: given that x is s-sparse, it is possible to reconstruct x if Φ is such that every 2s columns of Φ are linearly independent. The reconstruction is done by solving:

min_x { ‖y − Φx‖2  s.t.  ‖x‖0 ≤ s }    (1)
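As a concrete illustration of (1), a brute-force solver that tries every s-sized support and keeps the least-squares fit with the smallest residual - a toy sketch with our own helper names (not from the report), feasible only for tiny N and s:

```python
import itertools

import numpy as np

def l0_exhaustive(y, Phi, s):
    """Brute-force solver for (1): enumerate all s-sized supports."""
    N = Phi.shape[1]
    best_x, best_res = None, np.inf
    for support in itertools.combinations(range(N), s):
        cols = Phi[:, list(support)]                 # M x s sub-matrix
        coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
        res = np.linalg.norm(y - cols @ coef)
        if res < best_res:
            best_res = res
            best_x = np.zeros(N)
            best_x[list(support)] = coef
    return best_x

# toy instance: a 2-sparse x in R^10, recovered from 6 Gaussian measurements
rng = np.random.default_rng(0)
Phi = rng.standard_normal((6, 10)) / np.sqrt(6)
x_true = np.zeros(10)
x_true[[2, 7]] = [1.5, -2.0]
x_hat = l0_exhaustive(Phi @ x_true, Phi, 2)
```

With M = 6 > 2s = 4 measurements, every 2s columns of the random Φ are linearly independent with probability one, so the true support attains zero residual and wins the search.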

Direct solution of (1) requires an exhaustive search over all (N choose s) s-sparse supports, which is not a practical reconstruction method - efficient schemes will be presented later. The solution of (1) is covered in the field of sparse approximations [14]. The perfect reconstruction property is easy enough to prove [7]: if there are two different solutions to (1), x1 ≠ x2, where both x1 and x2 are s-sparse, then x = x1 − x2 is 2s-sparse and Φx = 0. This means that there are 2s (or fewer) columns of Φ that are linearly dependent, in contradiction to the assumption. Any s ≤ M columns of a random measurement matrix are linearly independent with probability one - the probability that a sub-matrix has an exactly zero singular value is zero. However, ill-conditioned sub-matrices will not allow reconstruction in any practical setting. Also, solving (1) is not feasible for large problems. Practical reconstruction schemes, compressible (and not strictly sparse) signals and noisy measurements require stronger conditions on Φ. It was shown in [8], [14] that solving the


convex `1 optimization problem,

min_x { ‖x‖1  s.t.  y = Φx }    (2)

will recover x, given that x is sparse and that Φ obeys the Restricted Isometry Property, presented below, w.r.t. x's sparsity.
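Problem (2) can be solved in practice by recasting it as a linear program with x = u − v, u, v ≥ 0; a minimal sketch, assuming SciPy is available (the variable names are ours):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(y, Phi):
    """Solve (2) as an LP: min sum(u + v) s.t. Phi @ (u - v) = y, u, v >= 0."""
    M, N = Phi.shape
    c = np.ones(2 * N)                 # objective equals ||x||_1 at the optimum
    A_eq = np.hstack([Phi, -Phi])      # encodes Phi @ (u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * N))
    u, v = res.x[:N], res.x[N:]
    return u - v

# a 3-sparse x in R^100 recovered from 30 Gaussian measurements
rng = np.random.default_rng(1)
Phi = rng.standard_normal((30, 100)) / np.sqrt(30)
x_true = np.zeros(100)
x_true[[5, 40, 77]] = [2.0, -1.0, 3.0]
x_hat = basis_pursuit(Phi @ x_true, Phi)
```

At the LP optimum at most one of u_i, v_i is non-zero, so the objective is exactly ‖x‖1; this is the standard LP reformulation of Basis Pursuit.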

2.2.1 Dictionaries

Most interesting signals are not sparse or even compressible in the signal's own domain (time or space), but have a sparse representation in some fixed dictionary, for example wavelets [24], [39]. A dictionary is a set of K atoms, a_i ∈ R^N, i = 1, . . . , K. A dictionary can be an orthogonal basis, such as wavelets, the Fourier transform or the Discrete Cosine Transform (DCT) (in which case K = N), or an overcomplete dictionary when K > N, for example wavelet packets or undecimated wavelets. Overcomplete dictionaries have greater "sparsifying" abilities [14]. Let Ψ be the dictionary matrix with the dictionary atoms as its columns. In the case of a specific orthogonal transform, Ψ is the inverse transform matrix. Sparsity in a known dictionary can easily be incorporated into the CS scheme [21]: assume that the vector z holds the sparse coefficients of x's representation in the domain of Ψ, i.e. x = Ψz with z sparse. Then the measurements are obtained by y = ΦΨz. Define A = ΦΨ. A takes the place of Φ, and z takes the place of x, in (1) and (2). There is, however, a fundamental difference: the aforementioned properties of Φ must now apply to A. In the following sub-sections the requirements on Φ will be defined formally, and we shall see that while there are well-known results for random Φ, the addition of the dictionary Ψ can result in an A matrix with very different properties - especially when using non-orthogonal overcomplete dictionaries. In order to proceed, the notions of Restricted Isometry, Coherence and the Babel function will now be defined.

2.2.2 Restricted Isometry Property (RIP)

The requirements on Φ are formalized through the Restricted Isometry Property (RIP), which is a way to quantify the "orthogonality" of any sub-matrix of Φ.
Definition [7], [22]: We say that a matrix Φ satisfies the Restricted Isometry Property of order s ∈ N, s < N, if there exists an isometry constant 0 < δs < 1 such that

(1 − δs) ‖x‖2² ≤ ‖Φx‖2² ≤ (1 + δs) ‖x‖2²    (3)

holds for all s-sparse vectors x; δs is the smallest number that satisfies (3). Orthogonal matrices have δs = 0 for all s. δ2s < 1 allows reconstruction of any s-sparse signal x, as explained in the previous subsection. Stricter conditions on δs are required for practical recovery using (2). The isometry constant δs of a given matrix is hard to find, but there are known results for δs of random matrices.
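Definition (3) can be evaluated exactly for toy matrices by enumerating every s-column sub-matrix and checking its singular values; a sketch (our own helper, practical only for tiny sizes):

```python
import itertools

import numpy as np

def isometry_constant(Phi, s):
    """Exact delta_s by enumeration: worst deviation of the squared
    singular values of any s-column sub-matrix from 1."""
    N = Phi.shape[1]
    delta = 0.0
    for support in itertools.combinations(range(N), s):
        sv = np.linalg.svd(Phi[:, list(support)], compute_uv=False)
        delta = max(delta, sv.max() ** 2 - 1, 1 - sv.min() ** 2)
    return delta

# an orthogonal matrix has delta_s = 0; a random matrix has some delta_s > 0
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((12, 12)))
Phi = rng.standard_normal((60, 12)) / np.sqrt(60)
delta_Q = isometry_constant(Q, 2)
delta_Phi = isometry_constant(Phi, 2)
```

The orthogonal case confirms δs = 0; the random case is strictly positive, illustrating why only probabilistic statements are available for random ensembles.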


2.2.3 Restricted isometry - Example

Gaussian random matrix: let Φ be an M × N random matrix whose entries are i.i.d. Gaussian random variables with zero mean and variance 1/M:

Φ_{i,j} ∼ N(0, 1/M)

Bernoulli random matrix: let Φ be an M × N random matrix whose entries are i.i.d. symmetric Bernoulli random variables:

Φ_{i,j} = +1/√M w.p. 1/2,  −1/√M w.p. 1/2
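The two ensembles above can be drawn as follows (a small sketch; the 1/√M normalization gives E‖Φx‖² = ‖x‖², and each Bernoulli column has exactly unit norm):

```python
import numpy as np

def gaussian_matrix(M, N, rng):
    """i.i.d. N(0, 1/M) entries."""
    return rng.standard_normal((M, N)) / np.sqrt(M)

def bernoulli_matrix(M, N, rng):
    """i.i.d. entries equal to +-1/sqrt(M) with probability 1/2 each."""
    return rng.choice([-1.0, 1.0], size=(M, N)) / np.sqrt(M)

rng = np.random.default_rng(0)
Phi_g = gaussian_matrix(128, 512, rng)
Phi_b = bernoulli_matrix(128, 512, rng)
```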

Lemma 3.1 in [8] establishes the following. Let r = s/N, and let H be the entropy function:

H(q) = −q log q − (1 − q) log(1 − q),  0 < q < 1

and let

f(r) = √(N/M) · (√r + √(2H(r)))

Then for each ε > 0 the restricted isometry constant δs of a Gaussian matrix obeys:

P( δs > [1 + (1 + ε) f(r)]² − 1 ) ≤ 2e^{−εNH(r)/2}    (4)

The proof is based on deviation bounds for the extremal singular values of Gaussian random matrices.
A recent proof of the RIP based on the Johnson-Lindenstrauss lemma in [22] gives a different result. Following the notation of the last result, choose the smallest c1 such that s ≤ c1 M / log(N/s), and define

c2 = c0(δs/2) − c1 [ 1 + (1 + log(12/δs)) / log(N/s) ]

where we use

c0(ε) = ε²/4 + ε³/6

The RIP with isometry constant δs then holds with probability ≥ 1 − e^{−c2 M}. This result is valid for both the Gaussian and the Bernoulli ensembles, due to the fact that the proof is based on the concentration inequality [22], which holds in both cases (and in other cases as well).
Let us look at a numerical example: let N = 21000, M = 7000, s = 6. We also set δs = √2 − 1 - this specific value will become important later. The RIP holds w.p. 1 − 4.3 · 10⁻³ according to the first result, and w.p. 1 − 3.5 · 10⁻⁵ according to the second result (the higher one is closer to the truth, as both results are correct). While the second result gives a tighter bound on the probability of success, the first result may hold when the second one is meaningless; for example, with N = 18000, M = 6000, s = 6 we get from the first result that the RIP holds w.p. 1 − 0.044, and a meaningless number from the second result (a negative c2).
The compression ratio M/N in the example is 1/3, but for the RIP to (provably) hold at such a compression ratio, the dimensionality of the problem must be very large and the signal must be very sparse - here s = 6, so


x will have to be s/2 = 3 sparse for provable reconstruction by ℓ1 methods [7]. Most signals are not sparse enough to meet the demands above. However, numerical experiments displayed in subsubsection 2.2.7 demonstrate that CS reconstruction schemes give good results on practical synthetic and real-world problems. Two explanations for this are: 1) the proofs of the RIP are based on proving that (3) holds for a given combination of s columns of Φ, and then using the union bound over all possible combinations, which might be very loose; 2) there are many ((N choose s)) combinations, and even if (3) does not hold for some of them, the success probability is still very high. This means that even if the RIP does not hold in general, in most cases we still get good results.
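The numerical example above is easy to reproduce. The sketch below (plain arithmetic, using the report's constant c0(ε) = ε²/4 + ε³/6 and the smallest admissible c1) computes c2 for both parameter sets and confirms that the JL-based bound is meaningful for the first set and vacuous for the second:

```python
import numpy as np

def c2_bound(N, M, s, delta):
    """c2 from the JL-based RIP result, with the smallest admissible c1."""
    c1 = s * np.log(N / s) / M      # smallest c1 with s <= c1 * M / log(N/s)
    eps = delta / 2
    c0 = eps ** 2 / 4 + eps ** 3 / 6
    return c0 - c1 * (1 + (1 + np.log(12 / delta)) / np.log(N / s))

delta_s = np.sqrt(2) - 1
c2_a = c2_bound(21000, 7000, 6, delta_s)   # positive: success prob. ~ 1 - 3.5e-5
c2_b = c2_bound(18000, 6000, 6, delta_s)   # negative: the bound is vacuous
```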

2.2.4 Coherence and the Babel function

The coherence [11], [14] of a dictionary A ∈ R^{N×K} with unit-norm columns {a_j}, j = 1, . . . , K, is defined as

μ = max_{k≠l} |⟨a_k, a_l⟩| = max_{i≠j} |(A*A)_{i,j}|

The coherence of an orthogonal matrix is of course 0. For a general overcomplete dictionary, a lower bound on the coherence is

μ ≥ √( (K − N) / (N(K − 1)) )

A union of up to N + 1 concatenated orthonormal bases (multi-ONB) has coherence μ ≥ N^{−1/2}, and the equality is attainable.

While the coherence is easy to calculate, it is a blunt instrument - it only reflects the extremal correlation within the dictionary. A more delicate tool is the Babel function [14]:

μ1(s) = max_{Λ: |Λ|=s} max_{j∉Λ} Σ_{k∈Λ} |⟨a_j, a_k⟩|

and the following connection holds: μ1(s) ≤ sμ. When a dictionary has an analytic description, it may be possible to calculate its Babel function explicitly.
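Both quantities are direct computations on the Gram matrix A^T A; a small sketch (the helper names are ours):

```python
import numpy as np

def coherence(A):
    """mu: largest off-diagonal entry of |A^T A| for unit-norm columns."""
    G = np.abs(A.T @ A)
    np.fill_diagonal(G, 0.0)
    return G.max()

def babel(A, s):
    """mu_1(s): for each atom j the optimal Lambda collects the s largest
    correlations of j with other atoms; then maximize over j."""
    G = np.abs(A.T @ A)
    np.fill_diagonal(G, 0.0)
    G_desc = -np.sort(-G, axis=0)        # each column sorted descending
    return G_desc[:s, :].sum(axis=0).max()

rng = np.random.default_rng(4)
A = rng.standard_normal((20, 40))
A /= np.linalg.norm(A, axis=0)           # unit-norm atoms
```

By construction μ1(1) = μ, and μ1(s) ≤ sμ since each of the s summed correlations is at most μ.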

We now have the mathematical tools needed to discuss the reconstruc-tion algorithms under the CS scheme.

2.2.5 Reconstruction - greedy algorithms

The solution of (1) can be approximated by sub-optimal greedy algorithms, most notably Matching Pursuit (MP) and Orthogonal Matching Pursuit (OMP) [11], [14]. A more recent reconstruction algorithm dedicated to CS is Tropp's CoSaMP [?]. Tropp shows in [14] that OMP will solve (1) correctly whenever x is exactly s-sparse and

μ1(s − 1) + μ1(s) < 1


In fact, Basis Pursuit (BP, (2)) will also give the correct solution to (1) under the same condition (again, [14]). In terms of coherence, OMP (and BP) is correct whenever

s ≤ (1/2)(μ^{−1} + 1)

When x is not sparse, the sparse approximation can be recovered if μ1(s) < 1/2:

‖x − x_s^{OMP}‖ ≤ √(1 + C) ‖x − xs‖

where x_s^{OMP} is the result of the OMP algorithm after s steps, and

C < s(1 − μ1(s)) / (1 − 2μ1(s))²

CoSaMP algorithm: CoSaMP [?] is a greedy algorithm that uses the RIP of the matrix Φ: while Φ*Φ is not equal to the identity matrix, it is not far from it (the diagonal elements are 1's, and the off-diagonal elements are smaller in magnitude than the coherence μ). Thus x̂ = Φ*Φx, which is obtained from the measurements by x̂ = Φ*y, is a reasonable first approximation. Similarly to OMP, CoSaMP iteratively improves the reconstruction, working on the residual at each step. CoSaMP requires δ4s ≤ 0.1, or δ2s ≤ 0.025, to provably achieve accurate reconstruction.
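For reference, the OMP iteration discussed above fits in a few lines - a textbook sketch with our own names (assuming s ≥ 1 and unit-norm or equal-norm columns), not the report's exact code:

```python
import numpy as np

def omp(y, A, s):
    """Orthogonal Matching Pursuit: greedily add the atom most correlated
    with the residual, then least-squares re-fit on the chosen support."""
    residual, support = y.copy(), []
    for _ in range(s):
        j = int(np.argmax(np.abs(A.T @ residual)))   # best matching atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

# demo: a 3-sparse signal sampled by 100 normalized Gaussian atoms in R^128
rng = np.random.default_rng(5)
A = rng.standard_normal((100, 128))
A /= np.linalg.norm(A, axis=0)
x_true = np.zeros(128)
x_true[[10, 50, 120]] = [4.0, -3.0, 3.5]
x_hat = omp(A @ x_true, A, 3)
```

After each least-squares re-fit the residual is orthogonal to the selected atoms, so no atom is selected twice; for an orthogonal A the correlations equal the coefficients and recovery is exact in s steps.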

2.2.6 Recovery by `1 minimization

Two recent results [7] bound the reconstruction error obtained by solving an ℓ1 optimization problem for 1) the noiseless case, and 2) the noisy case. These results also unify the treatment of exactly sparse signals and compressible signals.
Noiseless recovery: assume that the isometry constant of Φ obeys δ2s < √2 − 1, and let x* be the solution to (2). Then x* obeys:

‖x* − x‖1 ≤ C0 ‖x − xs‖1    (5)

and

‖x* − x‖2 ≤ C0 s^{−1/2} ‖x − xs‖1    (6)

with C0 given explicitly below. (6) means that for exactly sparse signals, exact recovery is guaranteed. This is a dramatic improvement over previous results [5].
Noisy recovery: in the noisy setting we have y = Φx + n, where n is the noise vector and ‖n‖2 ≤ ε. (2) is replaced by:

min_x { ‖x‖1  s.t.  ‖y − Φx‖2 ≤ ε }    (7)

Let x* be the solution of (7). If δ2s < √2 − 1 and ‖n‖2 ≤ ε, then x* obeys:

‖x* − x‖2 ≤ C0 s^{−1/2} ‖x − xs‖1 + C1 ε    (8)

The constants C0 and C1: let

α ≡ 2√(1 + δ2s) / (1 − δ2s),    ρ ≡ √2 δ2s / (1 − δ2s)

Then

C0 = 2(1 + ρ) / (1 − ρ),    C1 = 2α / (1 − ρ)
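The behaviour of these constants is easy to tabulate; a quick sketch, assuming the expressions above (e.g. δ2s = 0.2 gives C0 ≈ 4.2 and C1 ≈ 8.5, and both blow up as δ2s approaches √2 − 1):

```python
import numpy as np

def l1_constants(delta_2s):
    """C0 and C1 of bounds (5)-(8) as functions of the isometry constant."""
    assert delta_2s < np.sqrt(2) - 1, "bounds need delta_2s < sqrt(2) - 1"
    alpha = 2 * np.sqrt(1 + delta_2s) / (1 - delta_2s)
    rho = np.sqrt(2) * delta_2s / (1 - delta_2s)
    return 2 * (1 + rho) / (1 - rho), 2 * alpha / (1 - rho)

C0, C1 = l1_constants(0.2)
```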


2.2.7 Examples - synthetic sparse signals

Let us consider the following example: N = 512, M = 128. The signal x ∈ R^N is s-sparse, with the locations of its non-zero elements chosen uniformly at random. The values of the non-zero elements are drawn from the Laplacian distribution, with their absolute values constrained to be larger than 1. The purpose of the experiments is to demonstrate and test different reconstruction methods and different sampling schemes; each combination is tested on signals with varying sparsity and noise levels.

Different reconstruction schemes The first experiment tests the performance of the reconstruction methods - OMP, CoSaMP and convex relaxation (ℓ1). Following [5] and the simulations in [34], the sampling matrix Φ ∈ R^{M×N} is a random Gaussian matrix whose rows have been orthogonalized. Different levels of noise, as well as different levels of sparsity (s ∈ {5, 10, 15, 20, 25, 30, 35, 40, 50}), are tested. The results of the ℓ1 reconstruction are displayed in figure 1a, and the results of the OMP reconstruction in figure 1b. The average run time of a single ℓ1 reconstruction is 2 seconds, while the run time of OMP is 0.05 seconds. The qualities of the ℓ1 and OMP reconstructions are comparable, with OMP somewhat better than ℓ1. This is a surprising result, as ℓ1 is considered better than OMP. The reason for the apparent superiority of OMP over ℓ1 might be the specific setting of the problem, or incomplete convergence of the ℓ1 optimization. Either way, the methods seem at least comparable, and OMP is about 40 times faster, does not require parameter calibration and is very simple to implement. The next experiments will therefore be based on OMP. Still, when the time comes to solve the hyperspectral reconstruction problem, the ℓ1 method will be tested again. We learn from the graphs that with OMP reconstruction, a signal having up to 25 non-zero entries is reconstructed perfectly when the noise STD is 0.05. This is much better than the theoretical bounds.
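One trial of this experiment, as we understand the setup (the helper names and the inlined OMP routine are our own sketch, not the report's code):

```python
import numpy as np

def omp(y, A, s):
    """Plain OMP: greedy atom selection plus least-squares re-fit."""
    residual, support = y.copy(), []
    for _ in range(s):
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

def trial(N=512, M=128, s=10, noise_std=0.05, seed=0):
    """Success = OMP identifies the exact support of the s-sparse signal."""
    rng = np.random.default_rng(seed)
    x = np.zeros(N)
    idx = rng.choice(N, size=s, replace=False)
    vals = rng.laplace(size=s)                            # Laplacian values,
    x[idx] = np.sign(vals) * np.maximum(np.abs(vals), 1.0)  # magnitude >= 1
    G = rng.standard_normal((M, N))
    Phi = np.linalg.qr(G.T)[0].T                          # orthogonalized rows
    y = Phi @ x + noise_std * rng.standard_normal(M)
    x_hat = omp(y, Phi, s)
    return set(np.flatnonzero(x_hat)) == set(idx)
```

Averaging `trial` over many seeds, sparsity levels and noise STDs reproduces success-rate curves of the kind shown in figure 1.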

Different sampling matrices Here we test the influence of the sampling scheme on the performance of the reconstruction. We test random Gaussian, orthogonalized Gaussian, random ±1 and structured ±1 matrices. We also test 0-1 versions of the ±1 matrices. All sampling matrix rows have unit norm. The signals are realized as in the previous subsection, the noise STD is 0.025, and OMP is used for reconstruction. The reconstruction performance is displayed in figure 2. The performance of the 0-1 versions of the ±1 matrices is not displayed because they were consistently worse than their equivalent ±1 matrices. That phenomenon is due to the multiplication of the matrix with a sparse vector - the inner product between a row with half of its entries set to zero and a random sparse vector is likely to be small, resulting in a smaller measurement vector y, which makes the noise more prominent.

By inspection of figure 2, we come to the conclusion that structured sampling matrices perform better than random matrices, both in the Gaussian case and in the binary case. This is consistent in a way with [10], although we use a different structuring. Gaussian and binary


[Plot: success rate vs ‖x‖0 for panels a (ℓ1, L=20) and b (OMP, L=20), at noise STD ∈ {0, 0.005, 0.01, 0.025, 0.05, 0.075, 0.1}]

Figure 1: CS of synthetic sparse data: ℓ1 (a) and OMP (b) reconstructions. Each line stands for a different noise level. The x-axis indicates the sparsity of the signals, and the y-axis the probability of success of the CS reconstruction. Success is defined as identifying the correct s non-zero entries of the signal.


[Plot: success rate vs ‖x‖0, OMP reconstruction with L=50, for Φ ∈ {Gaussian structured, Gaussian, Hadamard, Rademacher}]

Figure 2: CS of synthetic sparse data, OMP reconstruction. Each line stands for a different sampling matrix. The x-axis indicates the sparsity of the signals, and the y-axis the probability of success of the CS reconstruction. Success is defined as identifying the correct s non-zero entries of the signal.


matrices have similar performance. It is important to note that when dictionaries are involved, the results may change.

2.2.8 Overcomplete dictionaries and CS

When we want to use CS for signals that are sparse in some dictionary (possibly overcomplete), we would like to know what happens to the RIP of the composed matrix A = ΦΨ. [21] gives an answer that covers both the orthogonal basis and the overcomplete dictionary cases. By analyzing the way random matrices operate on vectors according to the concentration inequality, theorem 2.2 in [21] establishes the following. Let Φ ∈ R^{M×N} be a sampling matrix satisfying the concentration inequality, and let Ψ ∈ R^{N×K} be a dictionary with restricted isometry constant δs(Ψ). Assume that

M ≥ C δ^{−2} ( s log(K/s) + log(2e(1 + 12/δ)) + t )

for some 0 < δ < 1 and t > 0. Then with probability 1 − e^{−t} the composed matrix A = ΦΨ has restricted isometry constant

δs(ΦΨ) ≤ δs(Ψ) + δ (1 + δs(Ψ))

where C ≤ 23.143 for both the Gaussian and Bernoulli sampling matrices. When Ψ is an orthogonal matrix we have δs(Ψ) = 0, and we recover the familiar result from [22]. In terms of the coherence and Babel function, Lemma 2.3 in [21] gives the following connection for any dictionary:

δs ≤ μ1(s − 1) ≤ (s − 1)μ

2.2.9 Examples - CS with different dictionaries

We repeat the experiments from subsubsection 2.2.7, with signals that are sparse in different dictionaries or bases. Denote by Φ the sampling matrix, and by Ψ the inverse of the sparsifying transform, or the dictionary whose columns are the unit-norm atoms. Since the RIP of the composite matrix A = ΦΨ determines the probability of successful CS reconstruction, we cannot assume that a sampling matrix that worked well in the experiments on sparse signals will perform well here. Again we use OMP and set the noise STD to 0.025. The signals x are realized as follows: the coefficient vector z is realized exactly as x was in subsubsection 2.2.7, and x = Ψz. In the first stage Ψ is orthogonal (or bi-orthogonal); later we also examine overcomplete dictionaries.

Orthogonal bases The first set of dictionaries tested were orthogonal bases - db1 (Haar), db2 and Bi-Orthogonal 9-7 wavelets, and the DCT basis. The results are displayed in figure 3, in subplots a, b, c and d respectively.

Indeed, the results are different when using dictionaries - in the DCT and Haar WT cases, the Hadamard-based sampling matrices show very weak performance. This means that if the signals are sparse under either the DCT or Haar transforms, a Hadamard-based sampling matrix cannot be used. The last result is not surprising, as we already know that the sampling matrix and the dictionary should be as incoherent as possible, and random rows from the Hadamard matrix are similar to columns from


[Plot: success rate vs ‖x‖0, OMP with L=50, panels a (WT_db1), b (WT_db2), c (WT_bi79), d (DCT); each panel compares Φ ∈ {Gaussian structured, Gaussian, Hadamard, Rademacher}]

Figure 3: CS of synthetic sparse data in different bases: db1 (Haar) wavelets (a), db2 wavelets (b), Bi-Orthogonal 9-7 wavelets (c), the DCT basis (d); OMP reconstruction. Each line stands for a different sampling matrix. The x-axis indicates the sparsity of the signals, and the y-axis the probability of success of the CS reconstruction. Success is defined as identifying the correct s non-zero entries of the signal.


[Bar plot: coherence of the composite ΦΨ for Φ ∈ {Gaussian structured, Gaussian, Hadamard, Rademacher} and Ψ ∈ {WT db1 (Haar), WT db2, Bi-Orth 9-7, DCT, Identity}]

Figure 4: Coherence of the composite matrix ΦΨ for different sampling matrices and different bases. Each group of bars stands for a specific sampling matrix; see legend.


the inverse Haar transformation matrix in their 0-1 structure, and to columns from the inverse DCT transformation matrix in their oscillatory structure. A detailed comparison of the coherence of ΦΨ can be found in figure 4. When Ψ is the identity matrix, the coherence of Φ itself is calculated. The combinations of the different bases with the Hadamard-based sampling matrix exhibit high coherence, although the coherence of the Hadamard sampling matrices themselves is the lowest. The other sampling matrices - Gaussian and Rademacher - exhibit the same coherence with or without the bases. The general conclusion from the last experiment is that it is better to use a randomized matrix rather than a structured one. This is because a random matrix is not likely to be similar to a dictionary, as dictionaries and bases are usually structured. Moreover, while the sampling of the signal is done once and the sampling matrix cannot be changed afterwards, the dictionary used for reconstruction does not have to be chosen in advance. It is possible to try several dictionaries for reconstruction until the result is satisfying. Therefore, it is advantageous to use a sampling matrix that will allow the use of as many dictionaries as possible, and this is why random sampling matrices are best. Of course, if we have a system designed for a specific type of signal - a signal that is sparse w.r.t. a known dictionary - we should use a sampling matrix that performs best with that basis, see [10].

Overcomplete dictionaries Now we use two overcomplete dictionaries: the first is a combination of two orthogonal transforms - db2 wavelets and DCT. The second is dual-tree complex wavelets [35]. The results are displayed in figure 5, and the coherence is displayed in figure 6. The reconstruction results are considerably worse than when using orthogonal bases for the same sparsity. The coherence graph also shows that the ΦΨ matrices are more coherent than before. Still, the results for the WT db2 + DCT dictionary are reasonable, and one should keep in mind that in many cases a signal does not have a sparse representation in an orthogonal basis, but does have a sparse representation in an overcomplete dictionary [23]. Hence, overcomplete dictionaries will be used further in this work.

2.3 Imaging and compressed sensing

Before we describe the main part of this work, compressive sensing of hyperspectral data, let us examine a simpler problem: CS of regular (grey-scale) images. In some cases it makes sense to use CS for imaging - for example, when an imager that works in some special wavelength is very expensive or doesn't exist. See for example [18]. The sensing part is implemented using a Micro-Mirror Array (MMA). The reconstruction requires the ability to calculate Φx in an efficient way - even for a moderate-resolution image of 256 × 256, Φ is a 16,384 × 65,536 matrix - too big to use or store. Two ways to reduce the sampling complexity are: first, to use randomly chosen coefficients of a Fourier transform of the signal [3], which naturally extends to any linear transformation. The second way is to use sparse sampling matrices [16]. Those are rather strict limitations - while the mirror array can realize a general sampling


[Figure 5 plots: success rate vs. ‖x‖0, two panels: a. OMP, Dictionary: WT db2 + DCT, L=50; b. OMP, Dictionary: Dual-Tree, L=50. Legend: Phi: Gaussian / Rademacher.]

Figure 5: CS of synthetic sparse data, different dictionaries: db2 wavelets and DCT (a), and dual-tree complex wavelets (b). OMP reconstruction. Each line stands for a different sampling matrix. The x-axis indicates the sparsity of the signals, and the y-axis indicates the probability of success in CS reconstruction. Success is defined as identifying the correct s non-zero entries of the signal.


[Figure 6 bar chart: coherence, grouped by sampling matrix: Gaussian (1), Rademacher (2). Bars per dictionary: WT db2 + DCT, Dual Tree WT, Identity (I).]

Figure 6: Coherence of the composite matrix ΦΨ, for different sampling matrices and different dictionaries. Each group of bars stands for a specific sampling matrix, see legends.


matrix, it allows a {0, 1} sampling matrix more naturally. The important case of a {0, 1} sampling matrix doesn't fit into most of the common linear transforms, and we saw in 2.2.7 that the use of a sub-sampled Hadamard transform sampling matrix results in worse reconstruction performance, compared to a random ±1 matrix. In this subsection we present a novel sampling scheme for 2D data, a scheme that is more general than using fast implemented linear transforms, but nearly as efficient in many cases.

2.3.1 Separable sampling matrix

Let X ∈ R^(√N×√N) be our N-element image, and let x ∈ R^N be the column stacking of X. A compressed sampling of x by a sampling matrix Φ ∈ R^(M×N) with M = N/α (α is the compression ratio) requires N²/α multiplications. Let Φ1 and Φ2 be such that Φ̄ = Φ1 ⊗ Φ2, and let Φ consist of a subset IM of M rows from Φ̄. Let Y = Φ1XΦ2, and let ȳ be the column stacking of Y; then ȳ = Φ̄x, and the compressed samples are y = ȳ_IM. The complexity of calculating Y = Φ1XΦ2: since Φ1, Φ2 ∈ R^(√N×√N), it takes √N · √N · √N = N^1.5 multiplications to compute Φ1X, and 2N^1.5 multiplications overall - a net saving of a factor of √N/(2α): if α = 4, then in the 256 × 256 example above the separable scheme works 32 times faster. It also solves the storage problem.

The separable sampling scheme does allow efficient sampling of images compared to direct matrix-vector multiplication, but since not every sampling matrix is separable, the question of the performance loss due to the separable sampling scheme arises. Even if we do not limit ourselves to ±1 matrices, the question remains. i.i.d. Gaussian sampling matrices are considered good for CS, but using separable sampling with two i.i.d. Gaussian matrices is equivalent to sampling with the Kronecker product of the two matrices, which is no longer i.i.d. or Gaussian. An interesting case is the {0, 1} binary matrix - a Kronecker product of two binary matrices has large areas of zeros, which results in data loss and inability to reconstruct the image. The question of choosing matrices for separable sampling (both ±1 and general) is left for future work.
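As an illustration of the separable scheme (a NumPy sketch, not the Matlab code used for the experiments in this report), the following snippet checks the separable identity numerically. It relies on the standard column-stacking identity vec(A X Bᵀ) = (B ⊗ A) vec(X), so the transpose placement is our convention and may differ slightly from the notation above.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 16                  # the image is n x n, so N = n*n
N = n * n
M = N // 4              # compression ratio alpha = 4

# Rademacher +/-1 factor matrices and a test image.
Phi1 = rng.choice([-1.0, 1.0], size=(n, n))
Phi2 = rng.choice([-1.0, 1.0], size=(n, n))
X = rng.standard_normal((n, n))

# Direct sampling: multiply the column-stacked image by the full Kronecker
# product (we form it here only to verify equality; it is too big in practice).
x = X.flatten(order="F")                    # column stacking of X
y_full = np.kron(Phi1, Phi2) @ x

# Separable sampling: two small matrix products (~2*N^1.5 multiplications).
Y = Phi2 @ X @ Phi1.T
y_sep = Y.flatten(order="F")

assert np.allclose(y_full, y_sep)

# Compressed measurements: keep a random subset I_M of M entries.
IM = rng.choice(N, size=M, replace=False)
y = y_sep[IM]
```

The assertion confirms that the two small products reproduce the full Kronecker-product measurements exactly, so subsetting M entries of vec(Y) yields the compressed samples.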

2.3.2 Separable Rademacher sampling matrix

We still need to show a way to use a separable sampling matrix with the MMA sampling device, in the special case when only binary {0, 1} sampling is allowed. The Kronecker product of two binary matrices is problematic, but a {−1, 1} matrix can be used instead in the following way: the Kronecker product of two {−1, 1} matrices is also a {−1, 1} matrix. First, we realize the entries of Φ1, Φ2 from the Rademacher distribution - ±1 w.p. 0.5 each. Φ is set by choosing M random rows (the set IM) from Φ1 ⊗ Φ2. Denote ΦB = 0.5 (Φ + 1), where 1 is a matrix or vector of ones, according to the context. ΦB is a binary matrix, and thus can be used as the sampling matrix by the MMA. The MMA samples the image to obtain yB = ΦBx. y is obtained from yB in the following way: y = 2yB − 1x. Note that 1x is just the summation of all the elements in x, so y is obtained by using M compressed samples and an additional


single sample that is the sum of x’s elements, which is obtained from theMMA. A similar solution has been independently developed by [17].
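The {0, 1}-to-±1 trick above is easy to verify numerically. A minimal NumPy sketch (illustrative only; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 256, 64

Phi = rng.choice([-1.0, 1.0], size=(M, N))   # Rademacher +/-1 sampling matrix
PhiB = 0.5 * (Phi + 1.0)                     # {0,1} mask the MMA can realize
x = rng.random(N)                            # vectorized image (non-negative)

yB = PhiB @ x          # M binary-mask measurements
s = x.sum()            # one extra measurement: all mirrors on (1x)
y = 2.0 * yB - s       # recovers the +/-1 measurements y = Phi x

assert np.allclose(y, Phi @ x)
```

Since Φ = 2ΦB − 1, linearity gives Φx = 2ΦBx − 1x, which is exactly what the last line computes.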

2.3.3 Examples

As in subsubsection 2.2.7, we now turn to test the CS scheme on images. First, we will test synthetic images - sparse in the spatial domain first, and then sparse under orthogonal and overcomplete transformations. In the second part of the experiments we will use real images. There are two main differences between CS of 1D signals and CS of images. The first is the size of the signal, which is rather large in the case of images. The large signal size dictates the use of a separable sampling scheme (or any other scheme that does not involve direct Φx multiplication). The second difference is that, roughly speaking, images are not very sparse, and do contain a significant amount of energy in the smaller coefficients of the chosen transform. While this may happen in any type of signal, in images it is almost always an inherent problem. Later on we will see how regularization techniques help in CS reconstruction under those conditions. The main source of reconstruction error is expected to be the difference between the sparse approximation and the image itself. In the synthetic 2D images we will add noise to x instead of Φx in order to simulate this difference. Noise in the acquisition process will be ignored for now.

`1 vs. OMP, noise The signal X ∈ R^(64×64) represents a small 64 × 64 image. The notation x is reserved for the column stacking of X. We now have N = 4096, and again we use a 1/4 compression rate: M = 1024. Sampling is done by calculating Y = Φ^T XΦ, and taking M random elements from Y as y. Φ is a 64 × 64 random ±1 matrix. In this first experiment we change the sparsity of the images and the noise level. We use `1 minimization and OMP. Noise: the noise in the images case will not be added to Φx, but will be the non-sparse part of the image. In the simulation, we take the zero coefficients and set them to random values realized from the Laplacian distribution, such that the energy of the noise is a pre-determined fraction of the signal's energy. Thus, we use SNR as the noise level parameter. The results of the first experiment are displayed in figure 7.
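This noise model can be sketched as follows. The unit-variance Gaussian draw for the significant coefficients is our assumption (the report only specifies the Laplacian distribution for the "noise" coefficients and the SNR-based scaling):

```python
import numpy as np

rng = np.random.default_rng(2)
N, s, snr_db = 4096, 150, 20.0

# An s-sparse coefficient vector.
z = np.zeros(N)
support = rng.choice(N, size=s, replace=False)
z[support] = rng.standard_normal(s)

# Fill the zero coefficients with Laplacian noise, scaled so the noise
# energy is a pre-determined fraction of the signal energy (given SNR).
noise = rng.laplace(size=N)
noise[support] = 0.0
target_energy = np.sum(z**2) / 10.0**(snr_db / 10.0)
noise *= np.sqrt(target_energy / np.sum(noise**2))
z_noisy = z + noise

# The realized SNR matches the requested one by construction.
snr = 10.0 * np.log10(np.sum(z**2) / np.sum((z_noisy - z)**2))
assert abs(snr - snr_db) < 1e-6
```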

As in the 1D case, OMP performs better than `1, although the run times are now comparable when ‖x‖0 ≈ 200. We learn from figure 7 that OMP will reconstruct a 4096-pixel image from 1024 measurements when it has 150 significant coefficients and the energy in the remaining coefficients is 1% of the signal energy (20db).

CS of images with dictionaries Images are rarely sparse in the spatial domain, so we should check how well we can reconstruct images that are sparse in different dictionaries. The treatment of orthogonal and overcomplete dictionaries is unified in this paragraph. There are many specialized transforms for images, such as curvelets [36] and contourlets [38], that represent images more sparsely. In this experiment we test orthogonal wavelets (db2) and the overcomplete dual-basis dictionary


[Figure 7 plots: success rate vs. ‖x‖0, two panels: a. reconstruction by L1, L=10; b. reconstruction by OMP, L=10. Legend: SNR = Inf, 23, 20, 17, 14, 11 db.]

Figure 7: CS of synthetic sparse images, `1 reconstruction (a) and OMP reconstruction (b). Each line stands for a different noise level (SNR). The x-axis indicates the sparsity of the signals, and the y-axis indicates the probability of success in CS reconstruction. Success is defined as identifying the correct s non-zero entries of the signal.


[Figure 8 plot: success rate vs. ‖x‖0, OMP reconstruction, different Ψ, SNR=20. Legend: Psi: None, L=20; Psi: Wavelets, L=20; Psi: OC1, L=20.]

Figure 8: CS of synthetic sparse images with dictionaries, OMP reconstruction. Each line stands for a different dictionary. The x-axis indicates the sparsity of the signals, and the y-axis indicates the probability of success in CS reconstruction. Success is defined as identifying the correct s non-zero entries of the signal.

DCT+WT(db2). The noise level is 20db. The results are displayed in figure 8. In addition, CS reconstruction was done on a 128 × 128 image with the db2 wavelets dictionary. The results are displayed in figure 9 and will be used when we come to deal with real-world images in subsubsection 2.3.4.

The use of orthogonal bases decreases performance slightly, while the use of an overcomplete dictionary obliges the image to be more sparse. However, we use overcomplete dictionaries for this reason exactly - to have very sparse representations. Run time considerations: the linear transformations add to the complexity of the reconstruction process, making it much slower. In the 1D case, the linear transformation was implemented as a matrix-vector multiplication, and embedded into the sampling matrix: A = ΦΨ, y = Ax. In the 2D case we do not have Φ explicitly. If the transformation could be implemented in a separable manner, we could have combined the transformation and the sampling. For the chosen transformations this is not the case - a single level of the wavelet decomposition is separable, for example, but not all the levels together. There is always the option to combine the first-level decomposition with the sampling (and the adjoints), and then take the appropriate part of the image and continue with the transformation. We will not do this here, to keep things simple.


[Figure 9 plot: success rate vs. ‖x‖0, OMP reconstruction, 128x128 pixels, SNR=20. Legend: Psi: Wavelets, L=5.]

Figure 9: CS of a synthetic sparse 128 × 128 image, OMP reconstruction with the wavelet dictionary. The x-axis indicates the sparsity of the signals, and the y-axis indicates the probability of success in CS reconstruction. Success is defined as identifying the correct s non-zero entries of the signal.

2.3.4 Real world images

The time has come to try compressive sampling on real-world images. As was already mentioned, images are usually not exactly sparse but compressible - most of their energy is concentrated in a small number of wavelet coefficients. In addition, and contrary to the 2D experiments, the approximation coefficients tend to have more energy than the detail coefficients. Also, images tend to have low total variation [13]. Those properties will be used for the reconstruction and for developing a novel sampling scheme.

The images The chosen images are 128 × 128 grey-scale images. The images are displayed in figure 10.

Sparsity of images According to figure 9, a 600-sparse 128 × 128 image can be reconstructed when the energy of the residual image (Ir = I − Ia) is about 1% of the energy of the image (SNR=20db). The sparsity of the four test images will be tested by calculating a 600-sparse approximation under different transforms, and calculating the SNR. Only critically sampled transforms - different wavelet bases - were tested. The results are displayed in figure 11.

The conclusion from the graph is that we expect to be able to reconstruct the 600-sparse approximation of the "R", "moon" and "peppers" images, and that the "cameraman" image reconstruction may not work very well. The appropriate wavelet transforms are: R image: db1 (Haar), moon: bi-orthogonal 4.4, peppers: bi-orthogonal 4.4, cameraman:


Figure 10: The images that will be used in subsubsection 2.3.4: a. R256.bmp, b. moon.tif, c. peppers.png, d. cameraman.tif.


[Figure 11 bar chart: SNR of a 600-sparse approximation, grouped by image: R, moon, peppers, cameraman. Bars per basis: db1, db2, bior4.4, bior3.7.]

Figure 11: The ‘noise’ level of sparse approximations of the images. Wavelet bases.

bi-orthogonal 4.4.

Performance evaluation When it comes to images, the Peak Signal-to-Noise Ratio (PSNR) is an accepted measure of performance. The PSNR of a reconstructed image Î with regard to the reference image I is defined as:

PSNR(Î, I) = 10 log10 ( (max I)² / MSE(Î, I) )  (9)

where

MSE(Î, I) = (1/N) Σ_{i,j} (I_{i,j} − Î_{i,j})²  (10)

It is also accepted to take max I as 1 (or 255), regardless of the specific image I.
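With that convention, equations (9)-(10) translate directly into code (a hypothetical NumPy helper, not part of the report's toolchain):

```python
import numpy as np

def psnr(I_hat, I, peak=1.0):
    """PSNR of reconstruction I_hat against reference I, equations (9)-(10).

    By convention, `peak` is taken as 1.0 (or 255 for 8-bit images)
    regardless of the actual maximum of I.
    """
    mse = np.mean((I - I_hat) ** 2)
    return 10.0 * np.log10(peak**2 / mse)

# A constant error of 0.01 on a unit-range image gives MSE = 1e-4,
# i.e. PSNR = 10*log10(1/1e-4) = 40 db.
I = np.zeros((8, 8))
assert np.isclose(psnr(I + 0.01, I), 40.0)
```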

Trying compressive sensing A compressed sampling will be done on each image (1/4 compression), and the reconstruction results will be displayed in figures 12 - 15 together with the original image and the best sparse approximation. In addition, it is interesting to look at a much simpler compression scheme: under-sample the image uniformly, and restore the image by interpolation (by using Matlab's imresize function, for example). There is no hope of getting perfect reconstruction this way, but it will serve as a simple benchmark so the usefulness of compressive sensing can be assessed. Some details about the reconstruction process: OMP reconstruction was


[Figure 12 panels: a. R256.bmp; b. simple sub-sampling (SNR: 17.46, PSNR: 25.07); c. 600 coefficients approximation (SNR: 25.89, PSNR: 33.51); d. compressive sampling, OMP reconstruction (SNR: 27.28, PSNR: 34.89).]

Figure 12: ‘R’ letter image (a), sub-sampling and interpolation (b), optimal sparse approximation (c), and CS reconstruction (d).

used, although for relatively large signals such as 128 × 128 images, `1 minimization is faster than OMP. The results of the two methods are quite similar, and are displayed in figures 12 - 15.

At first glance, the results are disappointing: CS fails to perfectly reconstruct the images, or at least to reconstruct the optimal sparse approximation. The R letter image (figure 12) is the only example where CS reconstruction is successful and outperforms the naive uniform sub-sampling and interpolation in terms of SNR. The moon example result is also acceptable. Generally speaking, the uniform sub-sampling scheme results in pleasant-looking but blurred images, while the CS reconstruction is not blurred, but suffers from many artifacts. There are two reasons for the reconstruction failures: either the images are not sparse enough under the orthogonal (or bi-orthogonal) transforms that were chosen, or the distribution of the significant coefficients is not random, as it was in the synthetic simulations. The question now is what can be done? Is there any hope


[Figure 13 panels: a. moon.tif; b. simple sub-sampling (SNR: 22.8, PSNR: 32.07); c. 600 coefficients approximation (SNR: 25.54, PSNR: 34.8); d. compressive sampling, OMP reconstruction (SNR: 21.25, PSNR: 30.51).]

Figure 13: Moon image (a), sub-sampling and interpolation (b), optimal sparse approximation (c), and CS reconstruction (d).


[Figure 14 panels: a. peppers.png; b. simple sub-sampling (SNR: 30, PSNR: 36.08); c. 600 coefficients approximation (SNR: 26.11, PSNR: 32.19); d. compressive sampling, OMP reconstruction (SNR: 22.03, PSNR: 28.11).]

Figure 14: Peppers image (a), sub-sampling and interpolation (b), optimal sparse approximation (c), and CS reconstruction (d).


[Figure 15 panels: a. cameraman.tif; b. simple sub-sampling (SNR: 20.39, PSNR: 26.02); c. 600 coefficients approximation (SNR: 19.21, PSNR: 24.84); d. compressive sampling, OMP reconstruction (SNR: 14.72, PSNR: 20.35).]

Figure 15: Cameraman image (a), sub-sampling and interpolation (b), optimal sparse approximation (c), and CS reconstruction (d).


of applying compressive sampling to data that is not sparse enough? The answer is yes. For example, total variation was used as a regularization term by [18]. But before we discuss improving the reconstruction process, we are going to suggest a unique sampling scheme.

2.3.5 LP-HP compressive sensing

Many images are not sparse when using critically sampled transformations. Specifically, when it comes to wavelets, the approximation part of the multi-scale wavelet transform is usually dense - most of its coefficients are larger than zero. Of course, there are examples where a large part of the image is dark, such as the "moon" or the "R" images above, but those are the cases where CS worked reasonably well in subsubsection 2.3.4. As for the other images, their approximation is quite full. Earlier in this report, it was mentioned that if we knew which coefficients are the largest (in absolute value), we would have sampled them directly. We do not know that, and therefore use random sampling matrices for compressed sensing. However, it seems that we do know, at least for many images, that the approximation coefficients are expected to be large. We can thus divide the sampling into two parts: sample the approximation coefficients directly (Low-Pass, LP stage), and then sample "the rest of the image" in a compressed manner (High-Pass, HP stage). By "the rest of the image" we mean an image whose approximation part has been removed - such an image is hopefully sparser and easier to reconstruct. To be consistent, we will limit the overall number of samples to M. Some notations: denote ML as the number of low-pass samples, and MH

as the number of regular CS samples. We set M = ML + MH. Let x = Ψz, and let zL be a vector whose first ML coefficients are the same as those of z - the approximation part - and the rest are zero. Let zH = z − zL

be the detail coefficients. Let xL = ΨzL, and xH = ΨzH. We get that x = xL + xH. The sampling is done by two sampling matrices: ΦL and ΦH. We have yL = ΦLx, and yH = ΦHxH. xH is not given explicitly, so we take y = ΦHx, and yH = y − ΦHxL. Taking ΦL to be the first ML

rows of the inverse Ψ will result in yL being equal to the first ML elements of z. Thus, we have zL and therefore xL exactly, and yH can be obtained. We now solve the problem of reconstructing zH only, a problem that should be easier than reconstructing z, as zH is more sparse. The reconstruction error is due to the reconstruction of zH only, as we have zL. The proportion ML/M should be chosen - the extreme case ML/M = 1 is the naive approximation sampling, and ML/M = 0 is ordinary compressive sensing. Not every number can be chosen for ML/M - it depends on the structure of the transform. In a 2D wavelet decomposition, ML increases (or decreases) by a factor of 4.
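The LP-HP bookkeeping can be sketched with a generic orthonormal Ψ standing in for the wavelet synthesis matrix (an illustration only; a random orthonormal matrix replaces the actual transform used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(3)
N, ML, MH = 64, 16, 16            # M = ML + MH total samples

# Orthonormal synthesis matrix Psi (stand-in for an inverse wavelet transform).
Psi, _ = np.linalg.qr(rng.standard_normal((N, N)))

z = rng.standard_normal(N)
x = Psi @ z                        # the signal, x = Psi z

# LP stage: PhiL = first ML rows of Psi^{-1} (= Psi.T for orthonormal Psi),
# so yL equals the first ML coefficients of z exactly.
PhiL = Psi.T[:ML]
yL = PhiL @ x
assert np.allclose(yL, z[:ML])

# The approximation part is now known exactly.
zL = np.zeros(N)
zL[:ML] = yL
xL = Psi @ zL

# HP stage: ordinary CS measurements of x; subtracting the known LP part
# yields measurements of the (hopefully sparser) detail image xH = x - xL.
PhiH = rng.choice([-1.0, 1.0], size=(MH, N))
y = PhiH @ x
yH = y - PhiH @ xL
assert np.allclose(yH, PhiH @ (x - xL))
```

The two assertions mirror the two claims in the text: yL recovers zL exactly, and yH = y − ΦHxL is a valid set of CS measurements of xH alone.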

LP-HP CS - experiments Let us try the new LP-HP CS scheme on the data from subsubsection 2.3.4. Let ML/M = 1/4. The results are displayed in figure 16.

In the moon, peppers and cameraman images, the results are superior to those obtained by CS in subsubsection 2.3.4, especially the `1 reconstruction. In the R-letter image case, the result of the OMP reconstruction


[Figure 16 panels: R256.bmp: OMP reconstruction PSNR 34.96, L1 reconstruction PSNR 31.56; moon.tif: OMP PSNR 32.76, L1 PSNR 34.59; peppers.png: OMP PSNR 30.53, L1 PSNR 33.14; cameraman.tif: OMP PSNR 22.14, L1 PSNR 25.3.]

Figure 16: LP-HP compressive sensing. Left column - original 128 × 128 image. Middle column - OMP reconstruction. Right column - `1 reconstruction.


is the same as in 2.3.4. It seems that when the image is "full", it is much better to use the LP-HP scheme. When we expect to have a mostly empty image, we should use ordinary CS. Another incentive for using the LP-HP approach is when the LP part is very easy to obtain - for example, when using compressive sensing for very fast analog-to-digital conversion [19], sampling the signal at a lower rate is usually much easier, and thus the CS part will be used for "fine-tuning" of the digitally sampled signal. Another example, which is going to be explored in this work, is the use of an ordinary, off-the-shelf RGB camera as the LP component of a CS-based hyper-spectral imaging system.

2.3.6 CS regularization

After the sampling is done, we are free to use any technique for reconstruction. As images tend to have low total variation (TV) [13], we can add a TV term to the reconstruction process. The total variation of an image is defined as:

TV(I) = ∫∫ ‖( dI(x, y)/dx , dI(x, y)/dy )‖2 dy dx  (11)

in the continuum, and in the discrete case we use finite differences:

TV(I) = Σ_{i,j} ‖( I_{i+1,j} − I_{i,j} , I_{i,j+1} − I_{i,j} )‖2  (12)

The use of a TV term is equivalent to demanding sparsity under a finite-difference transform. This is accurate in the case of piecewise-constant functions, but in general the TV term does not relate directly to the sparsity of the image. In CS, the reconstruction errors are often artifacts rather than blur, and TV regularization should be effective in this case. The most natural way to incorporate the TV term is into the `1 minimization scheme - minimize both the `1 norm of the coefficients and the TV of the image, with the fidelity constraint ‖Az − y‖2 ≤ ε. We will need to switch from the QCQP solver to a general non-linear solver, though. Another way to incorporate a TV term in the CS reconstruction is by generalizing OMP - see subsection 2.4.
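Equation (12) translates directly into code; the boundary handling below (zero-padded forward differences) is our choice, as the text does not specify it:

```python
import numpy as np

def tv(I):
    """Discrete (isotropic) total variation of image I, per equation (12)."""
    dx = np.zeros_like(I)
    dx[:-1, :] = I[1:, :] - I[:-1, :]      # I_{i+1,j} - I_{i,j}
    dy = np.zeros_like(I)
    dy[:, :-1] = I[:, 1:] - I[:, :-1]      # I_{i,j+1} - I_{i,j}
    return np.sum(np.sqrt(dx**2 + dy**2))

# A piecewise-constant image has low TV, while shuffling the same pixels
# (same energy, same histogram) creates many edges and a much higher TV.
flat = np.zeros((32, 32))
flat[:, 16:] = 1.0
shuffled = np.random.default_rng(4).permutation(flat.ravel()).reshape(32, 32)
assert tv(flat) < tv(shuffled)
```

The comparison illustrates why TV is a useful prior: it separates piecewise-smooth images from artifact-ridden ones even when their pixel statistics are identical.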

CS with TV - optimization The optimization problem now becomes:

min_z ‖z‖1 + ω TV(Ψz)
s.t. ‖Az − y‖2 ≤ ε  (13)

We can no longer use the QCQP solver, and must consider other algorithms. [32] is an excellent survey of optimization methods for the `1 regularization of a general (twice differentiable) loss function. It also contains numerical experiments comparing different methods. Generally speaking, the most efficient methods for dealing with the non-differentiable `1 norm use a constrained optimization approach. In those methods, we replace the optimization variable x by x = x+ − x−, adding the constraints x+ ≥ 0, x− ≥ 0. The size of the new optimization problem is thus double that of the original one, which will be highly undesirable when


we will use compressive sensing for hyperspectral data acquisition, due to the very high dimensionality. Another group of methods, termed "subgradient" approaches, may also perform well when we have a good initial guess for the solution, in terms of its active set (non-zero coefficients). In CS we will usually not have such a guess unless the problem is very sparse (in which case we would guess the zero vector). The last approach is the differentiable approximations, which replace the `1 norm by a smooth surrogate function. The smooth surrogate function method does not require a good initial guess, and does not change the dimensionality of the problem. However, it might lead to slow convergence when the surrogate function is not smooth enough. This problem is solved by carefully adjusting the surrogate function's sharpness parameter during the optimization [32]. The slow convergence is caused because the Hessian matrix of the smooth approximation becomes ill-conditioned as the accuracy of the smooth approximation increases. Another way around that is presented in [31], where a method similar to the augmented Lagrangian [30] for unconstrained sum-max problems is developed, allowing an accurate solution with smooth surrogate functions. [31] may be naturally extended to the case of constrained optimization with sum-max terms, which is exactly (13), the `1 term being a special case of a sum-max term (see [31] for details). Following [28], the surrogate function for |z| we will use is:

hε(z) = |z| − ε log(1 + |z|/ε),  ε > 0  (14)

and its derivatives:

h′ε(z) = |z| sign(z) / (ε + |z|),  h″ε(z) = ε / (ε + |z|)²  (15)

(13) will be replaced by:

min_z Σ_i hε(z_i) + ω TV(Ψz)
s.t. ‖Az − y‖2 ≤ ε  (16)

(16) will be solved by the augmented Lagrangian method. The inner optimization problem will be solved by minFunc [33], which uses L-BFGS in this case.
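The surrogate (14) and its derivatives (15) are easy to implement and sanity-check numerically (a NumPy sketch, separate from the minFunc-based solver actually used):

```python
import numpy as np

def h(z, eps):
    """Smooth surrogate for |z|, equation (14)."""
    return np.abs(z) - eps * np.log1p(np.abs(z) / eps)

def h1(z, eps):
    """First derivative of the surrogate, equation (15)."""
    return np.abs(z) * np.sign(z) / (eps + np.abs(z))

def h2(z, eps):
    """Second derivative of the surrogate, equation (15)."""
    return eps / (eps + np.abs(z)) ** 2

z = np.linspace(-2.0, 2.0, 401)
# The surrogate is smooth at 0 and approaches |z| from below as eps -> 0.
assert np.all(h(z, 0.1) <= np.abs(z) + 1e-12)
assert np.max(np.abs(h(z, 1e-4) - np.abs(z))) < 2e-3
# Spot checks of the derivatives: h'(1) = 1/(eps+1), h''(0) = 1/eps.
assert np.isclose(h1(1.0, 0.1), 1.0 / 1.1)
assert np.isclose(h2(0.0, 0.1), 10.0)
```

The last check, h″ε(0) = 1/ε, also makes the ill-conditioning mentioned above concrete: as ε shrinks the curvature at the origin blows up.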

CS with TV - results The results of the regularized CS are displayed in figure 17. They are the best so far, better than the LP-HP CS. One should note, however, that the TV prior is not suitable for all images. Also, OMP and other established greedy methods cannot be used with the TV regularization.

2.3.7 LP-HP CS with TV

The next and final step (for images) is to combine the LP-HP sampling scheme and the total-variation regularization. The results are displayed in figure 18.

There is a significant improvement in comparison to the LP-HP only, and a smaller improvement (∼ 1 db) in comparison to TV regularization


[Figure 17 panels (CS, L1 + TV): R256.bmp: PSNR 53.27; moon.tif: PSNR 39.49; peppers.png: PSNR 35.01; cameraman.tif: PSNR 28.43.]

Figure 17: TV-regularized compressive sensing. Left column - original 128×128image. Right column - `1 + TV.


Figure 18: LP-HP + TV-regularized compressive sensing. Left column - original 128×128 image. Right column - LP-HP + TV (`1 reconstruction). Per-image PSNR: R256.bmp 52.13 db, moon.tif 40.56 db, peppers.png 37.28 db, cameraman.tif 29.35 db.


only, for three of the four images. In each of the four example images, LP-HP compressive sensing combined with total variation gives good results, both visually and in comparison to uniform sub-sampling - see figure 19.

2.3.8 CS of images - conclusion

A concluding graph of the different reconstruction methods on the different test images is displayed in figure 19. Throughout the last subsubsection we tested the CS scheme on synthetic and real-world images. We saw that while real-world images are often not sparse enough to allow the use of compressive sensing, by using total-variation regularization and a novel LP-HP CS scheme we were able to achieve acceptable reconstruction performance in all four examples. Moreover, in addition to the PSNR performance measure, the CS reconstruction is less blurry than the simple sub-sampling one. One should note that we used a separable sampling scheme with a ±1 sampling matrix. More general sampling schemes may give even better results. We used that sampling scheme because it suits the physical sampling scheme that will be used in section 3.

2.4 Generalized OMP - GOMP

In the previous subsection, total-variation regularization was found to be crucial to the success of CS of images. Alas, TV regularization does not integrate into the OMP algorithm - in OMP we do only quadratic minimization, and TV is not a quadratic function. Other greedy algorithms, such as ROMP [12] and CoSaMP [?], do not incorporate non-quadratic functions either. We are forced to use the `1 minimization framework. In this subsection we develop the Generalized OMP - OMP with a non-linear optimization process inside it, allowing the use of TV regularization in compressed sensing, for example. In OMP there are two minimization stages: the first is minimizing the residue over the already selected atoms; the second is finding the next atom that matches best with the residue. In GOMP, the first minimization will be a general non-linear minimization, allowing the incorporation of TV, for example. The 'matching' part can be done either by a simple scalar product as in OMP, or by minimizing a more complex problem that includes the non-linear elements, where the minimization is done along a single atom axis each time - similar to Parallel Coordinate Descent (PCD). Further investigation into this algorithm is left for future work.
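A minimal sketch of the GOMP skeleton, with the support-restricted minimization exposed as a pluggable solver. This is only our illustration of the idea, since the report leaves the algorithm's details to future work; with the default least-squares inner solver it reduces to classical OMP:

```python
import numpy as np

def gomp(A, y, n_atoms, inner_solver=None):
    """Generalized OMP sketch: standard greedy atom selection, with the
    coefficient update over the current support delegated to a pluggable
    (possibly non-quadratic) inner solver."""
    if inner_solver is None:
        # default: least squares on the support, i.e. classical OMP
        inner_solver = lambda A_s, y: np.linalg.lstsq(A_s, y, rcond=None)[0]
    support, r = [], y.copy()
    for _ in range(n_atoms):
        # matching step: atom most correlated with the current residual
        k = int(np.argmax(np.abs(A.T @ r)))
        if k not in support:
            support.append(k)
        x_s = inner_solver(A[:, support], y)   # generalized inner minimization
        r = y - A[:, support] @ x_s
    x = np.zeros(A.shape[1])
    x[support] = x_s
    return x
```

A TV-regularized variant would pass an inner_solver that minimizes the residual plus a TV term over the selected atoms, e.g. by a few L-BFGS steps.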


Figure 19: CS of images performance (PSNR [db]) summary - OMP/`1 reconstruction, LP-HP CS, TV regularization and LP-HP CS combined with TV regularization, over the R, moon, peppers and cameraman test images. Legend: OMP, `1, LP-HP (OMP), LP-HP (`1), TV, LP-HP + TV, sub-sample. The rightmost bar in every bar group is the simple 1/4 sub-sampling and interpolation performance.


3 Compressed sensing of hyperspectral data

3.1 Hyperspectral imaging

3.1.1 Background and notations

Hyperspectral (HS) imaging refers to imaging the electromagnetic (EM) wave reflectance/emittance/transmission properties of a scene or an object. A regular camera does just that over a specific range of the EM spectrum, with low spectral resolution - the famous RGB color base. In HS imaging, the spectral range may be wider (ultra-violet to deep IR), and the spectral resolution is much higher - from one hundred to over two thousand spectral bands. Thus, while the output of a regular camera is three Nx × Ny matrices, one for each of the R, G, B bands, the output of an HS imaging system is an Nx × Ny × Nλ data cube, where Nλ is the number of spectral bands. The RGB basis is very useful for visualization, as it matches the human visual system, so it can be used to capture and display a scene's appearance such that it will look similar to the real scene in the eyes of a human observer. There is, however, literally 'more to the picture than meets the eye' - inspection of a narrow band of the EM spectrum may reveal additional information on a scene. For example, synthetic colors have very different spectral signatures compared to natural colors. This fact is useful when searching for disguised man-made objects - an RGB aerial image may fail to recognize a green disguise-net as such because it does look like vegetation, while an HS imaging system will easily discern the artificial object due to its different spectral signature. See [43] for an algorithm for anomaly detection in HS aerials. Other uses include inspection of forest health, and applications as a diagnostic tool in medicine are being researched.

3.1.2 Acquisition methods

Imaging a grey-scale image is done by projecting the light coming from a scene onto an imager, which is usually a flat chip that detects EM energy with spatial resolution. The imager output is a 2D array of numbers. In color imaging we do the same, only this time the imager pixels are covered with red, green and blue filters, in some layout. Each pixel "sees" only one color, and we use spatial interpolation to obtain the R, G, and B planes. In acquiring hyperspectral data cubes, however, we obtain a three-dimensional cube, and cannot acquire it in a single shot. There are three approaches to HS imaging (below), but all of them are based on sequentially sensing 2D images and composing them into a single data cube. In this work we will focus on the diffraction method, but the other approaches are also described for completeness.

Filter-based methods This is the most intuitive method: sequentially imaging a scene through a changing narrow-band color filter. In each shot we obtain a single spectral band of the HS cube. One major


drawback of this method is that we lose photons in the process - the filters work by absorbing all photons except the ones within a specific wavelength band.

Interferometry An interferometer is a device that measures the interference pattern of a ray of light by splitting the light into two beams and creating a phase difference between them. The beams travel slightly different distances and are then combined, creating interference. By gradually changing the distance the beams travel and taking measurements, the inverse Fourier transform of the spectrum is obtained. One advantage of interferometry is the high spectral resolution that can be obtained.

Diffraction In diffraction-based methods, the light from a single row in the scene is projected onto a prism or diffraction grating and scattered into its different spectral components. The scattered light is projected onto an imager, creating a 2D array whose one dimension is spatial, and the other spectral (the λ axis). The scene is scanned line by line (for example by a rotating mirror), creating the HS data cube. This method is especially suitable for acquiring aerials from airplanes and satellites - the scanning is done naturally by the progress of the vessel.

In all three methods of HS data acquisition there is a trade-off between acquisition time and SNR: the faster we acquire each band (or line), the fewer photons we get, decreasing the SNR. The same problem exists in color imaging as well, but in HS imaging we have much lower energy per band, and many more bands. The slow acquisition process is problematic when we try to image a changing scene. Another difficulty is the huge amount of data that needs to be stored or transmitted. In this work we propose an HS sensing device based on compressed sensing. The compressed sampling is done using unique hardware, and the reconstruction process combines the `1 methods with novel techniques.

3.2 A Compressed sensing based system for HS data acquisition

3.2.1 Sensing mechanism

We use the diffraction method (subsubsection 3.1.2) combined with the Micro-Mirror Array (MMA) (see subsection 2.3) to enable compressed hyperspectral sensing. The system is based on the diffraction acquisition mechanism, but the projection of an image "line" is done by the MMA rather than by a moving mirror. The structure of the system is as follows: an optical system focuses the image on the MMA. The MMA mirrors are placed such that in the "on" position, the rays coming from the lens are reflected towards another optical system, which concentrates the incoming rays on a prism (or diffraction grating). The prism and the second optical system are laid out such that the horizontal resolution (x-axis) is maintained, but the vertical axis can be sampled one point at a time. The "line" that is projected by the MMA onto the prism is scattered into its spectral components, and the spectral image of the line is recorded by an imager. The system is portrayed in figure 20. The MMA allows us to do more


general sampling schemes than scanning one line at a time - we can scan more complex patterns, and combine different pixels within each column. This way we can have a practical sampling scheme for compressive sensing - by using linear combinations of pixels with arbitrary vertical location (y). The combination may be different for every column. This way we can have a sampling matrix that "mixes" the signal properly. The sampling matrix is still limited to mixing only inside the columns1 - we will see that this has a negative impact on the ability to correctly reconstruct the HS cube. Compressive sensing will be done by projecting randomly chosen pixels from each column on the prism, and repeating the process Ny/α times, where Ny is the number of pixels in the vertical (y) dimension, and α is the compression ratio. We sample 1/α of the samples we usually would in the direct sampling process. Hopefully, by using clever reconstruction schemes we will be able to reconstruct the image from the under-sampled data. There is a unique sampling matrix for each column x - Φx. Due to the structure of the optical system, it is not possible to use the separable sampling from subsubsection 2.3.1. Having one sampling matrix for each point on the x-axis, and having the same sampling matrices in all the bands, is equivalent to using sparse binary sampling images [16] that repeat themselves. The complexity and storage issues are thus resolved, but not without cost - see the end of subsubsection 3.3.4. Mathematically, we have:

yx,λ = Φx Fx,λ   (17)

where F is the data cube, Fx,λ is the x column in the λ band, and y is obtained by concatenating the yx,λ vectors for all x and λ.
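The sampling model (17) is straightforward to simulate. In the sketch below (our illustration; the number of pixels combined per measurement is an assumption - the report only requires sparse binary rows), each column x gets its own binary Φx, and the same Φx is reused for that column in every spectral band:

```python
import numpy as np

def make_column_samplers(nx, ny, alpha, pixels_per_row=4, seed=0):
    """One binary sampling matrix Phi_x per image column x. Each of the
    ny/alpha measurement rows sums `pixels_per_row` randomly chosen
    pixels of that column (pixels_per_row is illustrative)."""
    rng = np.random.default_rng(seed)
    m = ny // alpha
    phis = np.zeros((nx, m, ny))
    for x in range(nx):
        for row in range(m):
            idx = rng.choice(ny, size=pixels_per_row, replace=False)
            phis[x, row, idx] = 1.0
    return phis

def sample_cube(F, phis):
    """Apply y_{x,lam} = Phi_x F_{x,lam}; the same Phi_x is reused for
    every spectral band, as dictated by the optics."""
    nx, ny, nlam = F.shape
    y = np.empty((nx, phis.shape[1], nlam))
    for x in range(nx):
        y[x] = phis[x] @ F[x]          # (m, ny) @ (ny, nlam)
    return y
```

For a 64 × 64 × 120 cube and α = 4, this yields 16 mixed measurements per column per band instead of 64 direct samples.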

3.2.2 RGB image

In addition to the mixed hyperspectral data, in parallel to the main optical system there is a standard RGB imaging system. As we will see, the RGB (or RGB-IR) image contributes to the accuracy of the reconstruction. This is due to the fact that we lose the spatial structure in compressive sensing, and try to restore it in the reconstruction process. The RGB image retains the spatial structure, but has low spectral resolution. It is equivalent to the LP-HP CS approach from subsubsection 2.3.5.

3.2.3 Noise

The noise in practice is dominated by shot noise, in addition to dark noise and quantization noise. We will simulate simple white Gaussian noise (the dark noise). Noise is added both to the compressed sampling vector y and to the RGB-IR image IRGB. The noise levels are 40 db SNR for y, and 60 db SNR for IRGB.

1Either columns or rows - the x and y axes can be switched in the discussion. Moreover, one can imagine a system which does some of the mixing in the y dimension, and then in the x dimension. Such a system would require additional optics.


Figure 20: Hyperspectral imaging CS-based acquisition system.


3.3 CS of HS data - Reconstruction

3.3.1 Overview

In this subsection we will develop the reconstruction algorithm for the compressively sensed HS data step by step. First, we will introduce the data cube that we will work on and analyze its sparsity under different dictionaries. Second, we will apply reconstruction algorithms to the sampled data, where in each algorithm we will add terms and constraints that increase the accuracy of the reconstruction. The reconstruction algorithms we will use are:

• Simple `1 algorithm (plain CS)
• CS + Total Variation term
• LPHP CS + TV - RGB image
• LPHP CS + TV - column means

At the end of this section we will compare the results, analyze the sourcesof error and suggest further methods to improve the reconstruction.

3.3.2 The data

Unlike color images, which are readily available, hyperspectral data is harder to come by. The AVIRIS [42] project is an important source, but it seems that the aerials from AVIRIS tend to have simple spectral behavior, and were "too easy" to reconstruct - by interpolating the RGB image alone one could obtain a reasonable reconstruction. This is not good news though - if there is something small with a different spectral signature, it will be lost in such a simplified approach. We use data obtained from Surface Optics Company (SOC), which kindly allowed us to use two of their HS data cubes. We will mainly use a small part of the 'Leaf' data cube that contains a variety of materials in the scene - leaves, a wall, a plastic cup with some printing, and the SOC HS camera. See figure 21. The data cube size is 64 × 64 × 120 (120 spectral bands), with a spectral range of 412 - 908 nm. A close look at the individual bands of the 'Leaf' data cube reveals that they are rather noisy. For the experiments ahead it is preferred to have "clean" data. Therefore Total-Variation de-noising was applied to the data by solving xTV = argminx̂ {TVxy(x̂) + ω‖x̂ − x‖22}, with ω = 0.33 chosen by visual inspection. The overall PSNR (with the noise defined as x − xTV) of the chosen smoothed HS cube is 54 db. The TVxy operator is the sum of the 2D total-variation of all the bands.

3.3.3 Dictionaries and sparsity

As sparsity is a key issue in compressive sensing, we should find a dictionary in which the HS cube is sparse. The first natural choice is the 3-dimensional separable wavelet transform. The 3D separable WT is done by applying a 1D WT along each dimension, obtaining eight bands: an approximation band and seven detail bands. The transform is then applied to the approximation band, and so on.
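To illustrate the separability, here is a one-level 3D transform built from a 1D step applied along each axis in turn. We use the Haar wavelet rather than the report's bior4.4 purely to keep the sketch self-contained:

```python
import numpy as np

def haar_1d(a, axis):
    """One level of the orthonormal 1D Haar transform along `axis`
    (the axis length must be even)."""
    a = np.moveaxis(a, axis, 0)
    lo = (a[0::2] + a[1::2]) / np.sqrt(2)   # approximation coefficients
    hi = (a[0::2] - a[1::2]) / np.sqrt(2)   # detail coefficients
    return np.moveaxis(np.concatenate([lo, hi]), 0, axis)

def haar_3d(cube):
    """One level of the separable 3D transform: 1D Haar along x, then y,
    then lambda, yielding one approximation band and seven detail bands."""
    out = cube
    for axis in range(3):
        out = haar_1d(out, axis)
    return out
```

Since each 1D step is orthonormal, the 3D transform preserves energy, and a spectrally smooth, spatially smooth cube concentrates its energy in the single approximation band.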


Figure 21: RGB image of the cropped ‘Leaf’ data cube.

The spectral axis, λ, tends to be highly correlated at each spatial location. Also, many adjacent spatial locations have very similar spectral signatures. These two observations give us hope that the HS data cube will have a sparse representation, compared to the 2D case. Figure 22 displays the sparsity of the TV de-noised 'Leaf' data cube (xTV) under a 2D WT (spatial only), a 1D WT (spectral axis) and the 3D separable WT - the last result is much better.

The bi-orthogonal wavelets were chosen because they tend to have good performance for sparse representation. An exhaustive search over all the WT types was not done, because in real life we will not have the original data to allow us to choose the best transform. We will need to use good general transforms. (It is possible, though, to do several reconstructions using different dictionaries until the result is satisfactory.)

Observing figure 22, we come to the conclusion that the λ axis is not well represented by the wavelet basis, and it might be better to use an overcomplete dictionary in the λ axis in the reconstruction part.

One may also consider using non-separable overcomplete 3D transforms - 3D versions of Dual-tree wavelets [35], Curvelets [37] or Contourlets [38]. The complexity of calculating those non-separable transforms is much larger than that of separable transforms, which renders the reconstruction process infeasible on the hardware we used. However, by using different hardware and optimizing the code it may be possible to use those transforms. Furthermore, the SESOP [28] algorithm deals with the case of efficiently optimizing objective functions of the form f(Ax) + g(x), where A is a linear transformation that is computationally demanding compared


Figure 22: Approximation error (PSNR [db] vs. % coefficients) of the 'Leaf' data cube in different transformations - 2D (spatial domain, blue), 1D (spectral axis, green), and 3D (red). The sparsity of the HS cube itself is also presented (none). Legend: bior4.4-bior4.4 - none, none-none - bior4.4, bior4.4-bior4.4 - bior4.4, none-none - none.


to f(·) and g(·).

Overcomplete 2D dictionaries Overcomplete transformations, such as curvelets [36] and contourlets [38], are known to improve image sparsity. We will not use them for now, though, because of their computational complexity.

3.3.4 Reconstruction - plain CS

Now that we have a sparsifying basis to work with, we will try to reconstruct the sampled HS cube using the `1 scheme:

min {‖z‖1 s.t. ‖Az − y‖2 ≤ ε} (18)

where A = ΦΨ, Ψ is the inverse of the sparsifying transform, and Φ is the sampling operator. Here we will use unconstrained optimization, similar to BPDN (Basis Pursuit De-Noising):

min {‖z‖1 + ω‖Az − y‖22} (19)

By tuning ω, we can obtain results that are equivalent to the constrained setting. Note that (19) is the Lagrangian of (18), and thus by searching over ω we get the solution to (18).
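As a toy solver for (19), a plain ISTA (iterative shrinkage-thresholding) sketch is enough to see the mechanics; note this is our substitute for illustration, not the minFunc/L-BFGS machinery the report actually uses:

```python
import numpy as np

def soft(u, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def bpdn_ista(A, y, omega, n_iter=500):
    """ISTA for the unconstrained form (19):
       min_z ||z||_1 + omega * ||Az - y||_2^2.
    omega plays the role of the Lagrange multiplier of (18)."""
    L = 2.0 * omega * np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the smooth part
    step = 1.0 / L
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * omega * A.T @ (A @ z - y)
        z = soft(z - step * grad, step)
    return z
```

An outer loop (not shown) would bisect over omega until ‖Az − y‖2 matches the noise level ε, realizing the search described above.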

Reconstruction results evaluation There is a question of how to present and evaluate the results. While in 1D and 2D we can simply plot a graph or an image, the HS data cube is composed of many (120 in our case) images, or alternatively 64 × 64 signals. In order to compare reconstruction performance, we will use three different views of the error: the first is to display the overall PSNR, the second is a graph of the PSNR in each band, and the third is an image of the spectral-axis SNR at each spatial pixel. In addition, selected bands from the original HS cube and from the reconstruction will be displayed. Regarding PSNR: the peak signal value is taken to be the maximum element of the cube X, Xmax, and not the upper limit of the dynamic range, 2^16 − 1, because the HS acquisition device doesn't use the entire dynamic range. Let Xi,j,k be the pixel at the spatial location i, j in the k'th spectral band. The MSE between data cubes X and X̂ is defined as:

MSE(X, X̂) = (1/N) ∑i,j,k (Xi,j,k − X̂i,j,k)²   (20)

where N is the number of elements in X. The PSNR between two HS cubes is defined as:

PSNR(X, X̂) = 10 log10( X²max / MSE(X, X̂) )   (21)

When evaluating the PSNR of a single band, Xk, we will define PSNR1, which is the PSNR calculated with the whole cube's maximum, and PSNR2,


which is calculated with the specific band's maximum. Let Xkmax be the k'th band's maximal value. The MSE between the k'th bands of X and X̂ is:

MSE(Xk, X̂k) = (1/Nxy) ∑i,j (Xki,j − X̂ki,j)²   (22)

where Nxy is the number of spatial pixels in Xk. PSNR1 and PSNR2 are defined as:

PSNR1(Xk, X̂k) = 10 log10( X²max / MSE(Xk, X̂k) )   (23)

PSNR2(Xk, X̂k) = 10 log10( (Xkmax)² / MSE(Xk, X̂k) )   (24)

The separation into PSNR1 and PSNR2 is important when looking at bands with low intensity (UV and IR bands), where the PSNR1 level may be high but the PSNR2 level is low, indicating that after proper illumination scaling we will have a noisy image.
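These measures translate directly into code; a sketch assuming the cube is stored as an Nx × Ny × Nλ numpy array:

```python
import numpy as np

def psnr_cube(X, Xhat):
    """Overall PSNR (21): the peak is the maximum of the reference cube,
    with the MSE from (20)."""
    mse = np.mean((X - Xhat) ** 2)
    return 10 * np.log10(X.max() ** 2 / mse)

def psnr_band(X, Xhat, k):
    """Per-band PSNR1/PSNR2 from (23)-(24): PSNR1 uses the whole-cube
    peak, PSNR2 the band's own peak, exposing low-intensity bands."""
    mse = np.mean((X[:, :, k] - Xhat[:, :, k]) ** 2)
    psnr1 = 10 * np.log10(X.max() ** 2 / mse)
    psnr2 = 10 * np.log10(X[:, :, k].max() ** 2 / mse)
    return psnr1, psnr2
```

For a dim band (band maximum well below the cube maximum), PSNR1 exceeds PSNR2 by exactly 20 log10(Xmax / Xkmax) db.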

Optimization and convergence As mentioned above, we will use unconstrained optimization with weight ω on the sampling term as a substitute for the constrained optimization (18). There exists a specific value of ω such that the solution of the unconstrained problem (19) equals the solution of (18), as (19) is the Lagrangian of the constrained optimization problem (18), with ω the Lagrange multiplier (we discard ε as it does not affect the minimal point). Solving (19) is then a matter of tuning ω such that the sampling term has the desired value at the minimum, by bisection for example. Using the unconstrained setting was found to be much faster than solving (18) by the penalty method, probably because the quadratic constraint function becomes a fourth-power function after applying the quadratic penalty function. Even when the minimum of (18) is found, that doesn't guarantee a perfect (or even reasonable) reconstruction, as the original signal itself does not minimize the optimization problem. We can say that, in a way, the goal in compressed sensing is finding an optimization problem whose minimum is the desired signal. Hence, our target in solving a CS problem by convex optimization is to reach a signal all of whose terms (`1, total-variation and sampling, for example) are smaller than or equal to those of the original signal. When such a signal is reached, there is very little hope that further optimization will improve the reconstruction. If the result is not good enough, then the optimization problem should be refined.

Plain CS - results PSNR1 (23) and PSNR2 (24) in each band are displayed in figure 23, and the overall PSNR is 27.1 db. An example band is displayed in figure 24.

The results are far from satisfactory. This might be a surprise, given that in figure 22, for a 4% sparse approximation we have 40 db PSNR, meaning that the HS data cube is rather compressible. The problem is that Φ, the sampling matrix, is not a simple random matrix - it is mostly empty, and the mixing is done only within the columns. Moreover, for fixed x, the mixing is the same for all the bands. It is interesting to check


Figure 23: PSNR per band of the simple (`1) CS reconstruction. PSNR1 (23) in blue, PSNR2 (24) in red. Overall PSNR is 27.14 db.

Figure 24: Comparison between the original HS cube (a. band #40, 569 nm) and the `1-based reconstruction (b.) of band #40, 21.6 db PSNR. Very poor performance.


Figure 25: PSNR per band of simple (`1) CS reconstruction with separable 2D compressed sampling. PSNR1 (23) in blue, PSNR2 (24) in red. Overall PSNR is 31.76 db.

what the performance would be if we used the separable 2D sampling scheme, with different sampling matrices for the different bands. The results are displayed in figures 25 and 26, with 31.8 db PSNR. As expected, there is a significant improvement over the results of the columns-only mixing. They might get even better if the mixing were done between bands as well. In any case, we have to stay with the column (or row, see the footnote on page 38) mixing and identical Φ matrices for all the bands, because this is the sampling architecture dictated by the physical system.

3.3.5 Reconstruction - CS + TVxy

We have seen in subsubsection 2.3.4 that total-variation regularization improves the reconstruction. We add the TVxy term to the objective function, where TVxy is the sum of the TV of each band separately: TVxy = ∑λ TV2D(Xλ), where TV2D(·) is the image TV operator and Xλ is the image in the λ band. The optimization problem will now be:

minz {‖z‖1 + ω1 TVxy(Ψz) + ω2 ‖Az − y‖22}   (25)
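TVxy itself is a one-liner per band; a sketch using isotropic forward differences (an assumption - the report does not specify its TV discretization):

```python
import numpy as np

def tv2d(img):
    """Isotropic total variation of a single band (forward differences)."""
    dx = np.diff(img, axis=0)
    dy = np.diff(img, axis=1)
    return np.sum(np.sqrt(dx[:, :-1] ** 2 + dy[:-1, :] ** 2))

def tv_xy(cube):
    """TV_xy: sum of the 2D total variation over all spectral bands."""
    return sum(tv2d(cube[:, :, k]) for k in range(cube.shape[2]))
```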

Optimization and convergence While in (19) we had a single scalar, ω, to tune, in (25) we have two, which makes the tuning of the ω's much harder. Furthermore, we will have more terms later - for the RGB image, for example (see below). In order to avoid the need for a tedious search over the ω's, we use the Soft Shrinkage operator [26] as a middle way between constrained and unconstrained optimization. The idea is that the penalty function, φλ(t), equals the (shifted) identity function for


values higher than λ, and zero otherwise:

φλ(t) = { t − λ,  t ≥ λ;  0,  t < λ }   (26)

φλ(t) in (26) is not differentiable at t = λ, so a smoothed version must be used for optimization. Following [28] we use:

φλ,s(t) = (1/2) ( t − λ − s + √( (λ + s − t)² + 4st ) )   (27)

See plots of φλ,s(t) with different values of s in figure 27. In addition, all the terms and constraint functions are normalized by their expected target values. For the sampling term (constraint), the target value is the known noise level. The normalization constants will be denoted by λterm, and the λ in φλ,s(t) (27) will be set to 1, because we want the normalized constraints to be below 1 at the optimal solution. (25) will be replaced by:

minz { λ`1 ‖z‖1 + ω λTV TVxy(Ψz) + φ1,s( λsamp ‖Az − y‖22 ) }   (28)

Finding the correct values of λ`1 and λTV is not straightforward - the total-variation of the bands can be estimated from the RGB image (see below), and the `1 norm of the HS cube can be estimated based on past experiments. Refinement of the relation between λ`1 and λTV is done by adjusting ω after an initial optimization. Note that the only parameter that should be calibrated is ω, as the λ's are pre-determined and fixed. Let


Figure 27: Smooth approximations of Soft Shrinkage: φ1,s(t) in a., and the derivatives φ′1,s(t) in b., for s = 0.1, 0.05, 0.01 and 0.
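The curves in figure 27 are easy to reproduce numerically; the sketch below evaluates φλ,s and checks that s → 0 recovers the hard soft-shrinkage penalty (26):

```python
import numpy as np

def phi(t, lam=1.0, s=0.05):
    """Smooth soft-shrinkage penalty (27); s -> 0 recovers (26)."""
    return 0.5 * (t - lam - s + np.sqrt((lam + s - t) ** 2 + 4 * s * t))

def phi_exact(t, lam=1.0):
    """Hard version (26): zero below lam, shifted identity above."""
    return np.maximum(t - lam, 0.0)
```

At s = 0 the square root collapses to |λ − t| and φλ,0(t) = max(t − λ, 0) exactly; for small positive s the largest deviation occurs at the kink t = λ.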


Figure 28: PSNR per band of the `1 + TVxy CS reconstruction. PSNR1 (23) in blue, PSNR2 (24) in red. Overall PSNR is 35.5 db.

us write the objective function in (28) again in symbolic form, omitting the normalization constants:

f(z) = f`1(z) + fTVxy (z) + φ1,s (fsamp(z)) (29)

Now we look at the gradient of (29):

∇f(z) = ∇f`1(z) +∇fTVxy (z) + φ′1,s (fsamp(z))∇fsamp(z) (30)

As in the case of the penalty method [40], (30) is the gradient of the augmented Lagrangian of:

minz f`1(z) + fTVxy(z)   s.t.   fsamp(z) ≤ 1   (31)

As long as the Lagrange multiplier of the constraint fsamp(z) ≤ 1 is smaller than 1, φ′1,s(fsamp(z)) at the solution will equal the Lagrange multiplier. Since we used a smooth φ1,s, we don't get the solution to (31) exactly - the constraint will be fsamp(z) ≤ ε, with ε "around" 1, depending on the smoothness parameter s. If we want a more accurate solution, we can decrease s, where in the limit s → 0 we get the exact solution. However, it is enough to use s = 0.05 - we get 0.5 < ε < 1.5, which is enough here. Decreasing s makes the optimization problem much harder. The problem converges to values lower than those of the original signal in each term after fewer than 500 function evaluations.

CS + TVxy - results The results are displayed in figures 28 and 29. We can see an improvement over the `1-only reconstruction, although there is still room for improvement.


Figure 29: Comparison between the original HS cube (a. band #40, 569 nm) and the `1 + TVxy reconstruction (b.) of band #40, 28.9 db PSNR.

3.3.6 TVλ

Adding a TV term along the λ axis may contribute to the reconstruction, but the negative influence of total-variation regularization on the peaks in the spectra should be taken into consideration. It may be possible to use a variation of the TV regularization that preserves peaks in the signal by using an `p norm, with p < 1. This is left for future work.

CS + TVxy + TVλ - results The results are displayed in figures 30 and 31. There is a very small improvement over the CS + TVxy results. The reason is that the total variation along the λ axis of the CS + TVxy reconstruction is already low compared to the original signal's TVλ.

3.3.7 LPHP CS for HS

As in the 2D case, we would like to implement the principle of LPHP CS. The mirror-array system imposes limitations on the ability to sample approximations of the HS data cube. In the spatial plane, the column-only (or row-only) mixing allows approximation in one dimension only. The bands can be sub-sampled only in one direction at a time, which is much less efficient. Because of the relatively small spatial resolution in our experiments, we will use a limited LPHP approach by sampling only the column means. LPHP CS will also be implemented in the spectral axis indirectly - by using the RGB image, which can be viewed as a low spectral-resolution approximation of the HS cube. See subsubsections 3.3.8 and 3.3.9.



Figure 30: PSNR per band of the `1 + TVxy + TVλ CS reconstruction. PSNR1 (23) in blue, PSNR2 (24) in red. Overall PSNR is 35.6 db.

Figure 31: Comparison between the original HS cube (a) and the `1 + TVxy + TVλ reconstruction (b) of band #40 (569 nm), 29.2 db PSNR.


Figure 32: Imager R, G, B and IR pixel responses to the EM spectrum - Gaussian response curves centered at 620 nm (red), 530 nm (green), 460 nm (blue) and 800 nm (near-IR), over the 400-1000 nm wavelength range.

3.3.8 LPHP - RGB image

The first application of the LPHP principle uses the RGB image apparatus: in addition to the MMA part, there is an ordinary RGB camera parallel to the optical axis (see subsubsection 3.2.2). The RGB (or, preferably, RGB-IR) camera gives us a low spectral-resolution (but full spatial-resolution) approximation of the HS cube. This is somewhat different from the LPHP CS approach presented in 2.3.5, as the HP part is done along the y-axis only. Another view of the RGB image is that the MMA-based sampling of the HS data cube loses the spatial structure of the image while preserving the spectral data; the RGB image preserves the spatial structure, but loses the spectral resolution. The reason behind this specific approximation is that RGB, and even RGB-IR, imagers are readily available, small and cheap. Some details: the RGB-IR camera output is four Nx × Ny matrices, one for each of the R, G, B and IR bands. The color filters on the imager are positioned according to the Bayer layout:

R G

IR B

While RGB cameras often use interpolation in order to achieve maximal resolution, here we can assume that a 128 × 128 imager is used and no interpolation is done in the demosaic process - every square of R, G, B, IR pixels is saved as a single 4-color pixel.
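Under the no-interpolation assumption above, extracting the four planes is a simple sub-sampling of the raw frame. A minimal sketch (in Python/NumPy, for illustration only):

```python
import numpy as np

def bayer_to_planes(raw):
    # Split a raw frame with the  R  G / IR B  filter layout into four
    # quarter-resolution planes -- each 2x2 square becomes one 4-color
    # pixel, with no demosaic interpolation.
    return {
        "R":  raw[0::2, 0::2],
        "G":  raw[0::2, 1::2],
        "IR": raw[1::2, 0::2],
        "B":  raw[1::2, 1::2],
    }
```

For the 128 × 128 imager assumed in the text, each of the four planes is 64 × 64.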

In order to use the data further, we need to know the transformation between the spectral signature and the response of the different color pixels. Such data is part of any imager specification. For simplicity we assume Gaussian distributions around the R, G, B and near-IR parts of the spectrum; see figure 32. Let λc be a vector of the HS cube central


wavelength in each band, and λw the "width" of each band (the bands' central wavelengths may not be evenly distributed). Let w̄R be a vector of the mean response of the red pixels around the corresponding wavelengths: w̄R(i) is the mean response of the red pixels to a λw(i)-wide range of wavelengths around λc(i). Define the weight vector wR by:

wR(i) = λw(i) w̄R(i) / ∑i λw(i)     (32)

The red plane of the image will thus be calculated by IR(X)(x, y) = ∑i wR(i) X(x, y, i). Let wG(i), wB(i), wIR(i) be the weights corresponding to the green, blue and IR pixels, and we have:

IR(X)(x, y) = ∑i wR(i) X(x, y, i)
IG(X)(x, y) = ∑i wG(i) X(x, y, i)
IB(X)(x, y) = ∑i wB(i) X(x, y, i)
IIR(X)(x, y) = ∑i wIR(i) X(x, y, i)     (33)
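Equations (32) and (33) can be sketched as follows, with Gaussian responses standing in for the imager specification, as in the simplifying assumption above (the width `sigma` is our illustrative choice, not a value from the report):

```python
import numpy as np

def band_weights(lambda_c, lambda_w, center, sigma):
    # Eq. (32): w(i) = lambda_w(i) * wbar(i) / sum_i lambda_w(i),
    # with the mean band response wbar(i) modeled as a Gaussian
    # around the color channel's center wavelength.
    wbar = np.exp(-0.5 * ((lambda_c - center) / sigma) ** 2)
    return lambda_w * wbar / np.sum(lambda_w)

def color_plane(X, w):
    # Eq. (33): I(X)(x, y) = sum_i w(i) X(x, y, i) for one color channel.
    # X is the (Nx, Ny, Nbands) hyperspectral cube.
    return np.tensordot(X, w, axes=([2], [0]))
```

The same two functions serve all four channels; only the center wavelength (620, 530, 460 or 800 nm per figure 32) changes.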

Denote the colors set color = {R, G, B, IR}. The RGB term (or constraint) will be:

fRGB(z) = ∑k ‖Icolork (z) − Icolork‖²₂     (34)

Where Icolork without an argument is the sampled color plane, Icolork (X) is a color plane calculated from the HS data cube X, and Icolork (z) is the color plane calculated from the coefficients vector z by first transforming it into X. The optimization problem will be:

f(z) = f`1(z) + fTVxy (z) + φ1,s (fsamp(z)) + φ1,s (fRGB(z)) (35)

Again, fRGB(z) will be normalized by its expected value - the noise level.
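The term (34) can be sketched as follows: each color plane is computed from the current cube estimate via (33) and compared against the sampled plane (normalization by the expected noise level is omitted here):

```python
import numpy as np

def f_rgb(X_est, measured, weights):
    # Eq. (34): sum over the color set {R, G, B, IR} of the squared error
    # between the color planes computed from the current cube estimate
    # (via eq. (33)) and the planes actually sampled by the RGB-IR camera.
    total = 0.0
    for c, w in weights.items():
        plane = np.tensordot(X_est, w, axes=([2], [0]))
        total += np.sum((plane - measured[c]) ** 2)
    return total
```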

CS + TVxy + RGB image - results The results of the RGB LPHP CS (plus TVxy) are displayed in figures 33 and 34. Again, we can see an improvement - PSNR2 is above 30 db in all the bands, and above 35 db in most. The overall PSNR is 40.7 db. There is a new phenomenon of fluctuating performance between the bands - from 43 db PSNR2 down to 31 db. The fluctuations seem to be related to the spectral positions of the RGB-IR sensors. Later on, we will try to mitigate this phenomenon by tighter control over the errors in different bands separately.

3.3.9 LPHP - Column means

As the sampling of the approximation part of the wavelet transform by the mirror array is infeasible in the current configuration, we will only sample column means, as a limited version of the LPHP CS of subsubsection 2.3.5. First, one of the rows (the first) in each of the sampling matrices is changed to be all ones; thus, we have the column means. We can use them in several ways: firstly, just change the sampling and trust the sampling term to make the column means close enough to their



Figure 33: PSNR per band of the `1 + TVxy + IRGB CS reconstruction. PSNR1 (23) in blue, PSNR2 (24) in red. Overall PSNR is 40.7 db.

Figure 34: Comparison between the original HS cube (a) and the `1 + TVxy + IRGB reconstruction (b) of band #40 (569 nm), 36.8 db PSNR.


measured values. The second approach is to address the column means specifically, forcing their error to be small regardless of the overall sampling error. The first approach was used, and the error of the column means was small enough, so there was no need for refinement. The reconstruction results, however, did not improve, and were slightly worse (by 0.2 db). This doesn't mean that the column means may not improve the performance on different data, and we should bear in mind that for larger data cubes we are likely to use a more detailed LP approximation.
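The all-ones first row can be sketched as follows (a toy random 0/1 mirror pattern of our own, standing in for the actual sampling matrices described earlier in the report):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 32                     # m mixtures measured per column of n pixels
Phi = rng.integers(0, 2, size=(m, n)).astype(float)  # random mirror pattern
Phi[0, :] = 1.0                  # first row all ones: measures the column sum

col = rng.standard_normal(n)     # one image column
y = Phi @ col                    # the mirror-array measurements of the column
col_mean = y[0] / n              # the column mean falls out of sample 0
```

The remaining rows of Phi are unchanged, so the column mean comes at no extra measurement cost beyond repurposing one row.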

3.3.10 Non-negativity

Non-negativity is an accepted and straightforward constraint in image restoration problems, as the resulting image should be non-negative. This is also true for HS cubes, and a non-negativity term can be added to the objective function. However, the results we have seen so far are non-negative anyway, so adding such a constraint would only make the optimization process longer (every objective function evaluation would take longer). If negative values appear in future results, a non-negativity term will be added. We would use the following penalty for negativity:

fnon−neg(x) = ‖x−‖²₂,   x−(i) = 0 for x(i) ≥ 0,   x−(i) = x(i) for x(i) < 0     (36)
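A sketch of a quadratic one-sided penalty of this form, together with its gradient (which an objective-function evaluation would need):

```python
import numpy as np

def f_nonneg(x):
    # Quadratic one-sided penalty: sum of squared negative entries of x,
    # zero wherever x is already non-negative.
    return np.sum(np.minimum(x, 0.0) ** 2)

def f_nonneg_grad(x):
    # Gradient of the penalty: 2x on the negative entries, 0 elsewhere.
    return 2.0 * np.minimum(x, 0.0)
```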

3.4 Summary

We have seen that HS data can be reconstructed from a four-fold under-sampling with reasonable performance. Total-variation regularization in the image plane has played a major role in enabling the reconstruction, as has the low spectral-resolution approximation by an RGB-IR camera. The PSNR2 of the bands of the three reconstructions is displayed in figure 35. Still, there is room for improvement - we would like to have all the bands near 40 db PSNR2 or above. The future-work subsubsection below outlines our research plans for the near future.

3.4.1 Future work

In order to improve the reconstruction results of compressed sensing of HS data, we will use the following novel approaches:

1) Tight error control While the use of "overall error" terms (such as ‖Ax − y‖²₂) is common in image restoration and de-noising, in the compressed sensing setting the error often tends to concentrate in specific areas, degrading the reconstruction appearance and PSNR. This is apparent in figure 35, where some bands have low PSNR2 while others have high PSNR2. In the future we intend to try to make the errors more uniform.

2) Spectral axis sparsity Compressed sensing is based (although not entirely) on the sparsity of the signal. Looking at figure 22, we see that a critically-sampled 1D wavelet (bior4.4) transform of the spectral axis does


Figure 35: Comparison of PSNR2 per band between the different reconstruction schemes - `1 (overall PSNR 33.5 db), TVxy-regularized `1 (35.5 db), and `1 + TVxy + IRGB regularization (40.7 db).

not really result in a sparse representation. The 3D transform is naturally better, but finding a 1D transform that gives a sparser representation of the spectral axis is bound to improve the sparsity of the 3D transform. We will focus on the spectral axis for two reasons: first, inspecting the spectral signals reveals that they are rather simple and similar to each other, which also invites us to use bootstrap dictionary learning. Second, linear transformations are implemented by matrix-vector multiplications, and thus the complexity of the transformation increases only linearly with the size of the dictionary - as long as it is either a 1D transformation (for the spectral axis) or a separable 2D transformation.

3) Spectral axis dictionary learning As mentioned above, the spectra in a single HS cube are similar to each other, and a tailored dictionary is expected to perform well in this case. [29] proposes a method of training a sampling matrix and dictionary for a specific class of images. If, however, prior knowledge of the data is not assumed, we may still be able to construct a useful dictionary based on the sampled data itself. We call this approach the bootstrap method for dictionary construction. We plan to use PCA, ICA [27] and K-SVD [25] for the dictionary construction. Preliminary experiments show promising results.
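The simplest of these options, PCA, can be sketched as follows, assuming example spectra are available, e.g. extracted from a preliminary reconstruction of the cube (our illustrative assumption):

```python
import numpy as np

def pca_spectral_dictionary(spectra, k):
    # Bootstrap a 1D spectral dictionary: remove the mean spectrum and
    # keep the top-k principal directions of the example spectra.
    # spectra: (Nbands, Nexamples); returns an (Nbands, k) dictionary
    # with orthonormal columns.
    centered = spectra - spectra.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :k]
```

Applying this dictionary along the spectral axis is a single matrix-vector product per pixel, in line with the linear-complexity remark above.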

4) Spectral axis regularization As already hinted in subsubsection 3.3.6, we plan to explore methods for regularizing the spectral axis that preserve spikes. We will test two variations on the total-variation


algorithm: one that uses the `p norm with p < 1, and one that is an iteratively re-weighted total variation.
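The two variations can be sketched as follows; the specific forms (the eps smoothing term, the 1/(|d| + eps) weights) are our illustrative choices, not settled design decisions:

```python
import numpy as np

def tv_lp(x, p=0.5, eps=1e-8):
    # l_p "total variation" along the spectral axis with p < 1: for the
    # same l_1 variation, a few large jumps (a sharp peak) cost less than
    # many small ones. eps smooths the non-differentiability at zero.
    return np.sum((np.abs(np.diff(x)) + eps) ** p)

def tv_reweights(x, eps=1e-3):
    # IRL1-style weights for re-weighted TV: jumps that were large in the
    # previous iterate get small weights, so established peaks are
    # penalized less on the next pass.
    return 1.0 / (np.abs(np.diff(x)) + eps)
```

Both variants aim at the same goal: keeping the peak-preserving property while still suppressing spurious spectral oscillations.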

Acknowledgment: This research was supported in part by the European Community's FP7-FET program, SMALL project, under grant agreement no. 225913.

References

[1] D. Donoho, Compressed Sensing, IEEE Transactions on Information Theory, 2006

[2] X. Huo, D. Donoho, Uncertainty Principles and Ideal Atomic Decomposition, IEEE Transactions on Information Theory, 47(7):2845-2862, 2001

[3] E. Candes, Compressive Sampling, Proceedings of the International Congress of Mathematicians, 3:1433-1452, 2006

[4] E. Candes, J. Romberg, T. Tao, Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information, IEEE Transactions on Information Theory, 52(2):489-509, 2006

[5] E. Candes, J. Romberg, T. Tao, Stable Signal Recovery from Incomplete and Inaccurate Measurements, Communications on Pure and Applied Mathematics, 59:1207-1223, 2006

[6] E. Candes, T. Tao, Near Optimal Signal Recovery From Random Projections: Universal Encoding Strategies?, IEEE Transactions on Information Theory, 52:5406-5425, 2006

[7] E. Candes, The Restricted Isometry Property and Its Implications for Compressed Sensing, Comptes Rendus Mathematique, 346(9-10):589-592, 2008

[8] E. Candes, T. Tao, Decoding by Linear Programming, IEEE Transactions on Information Theory, 51(12):4203-4215, 2005

[9] E. Candes, J. Romberg, Sparsity and Incoherence in Compressive Sampling, Inverse Problems, 23:969-985, 2007

[10] Y. Weiss, H. S. Chang, W. T. Freeman, Learning Compressed Sensing, Snowbird Learning Workshop, Allerton, CA, 2007

[11] S. Mallat, Z. Zhang, Matching Pursuits with Time-Frequency Dictionaries, IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993

[12] D. Needell, R. Vershynin, Signal Recovery from Incomplete and Inaccurate Measurements via Regularized Orthogonal Matching Pursuit, preprint, 2007


[13] L. Rudin, S. Osher, E. Fatemi, Nonlinear Total Variation Based Noise Removal Algorithms, Physica D, 60:259-268, 1992

[14] J. Tropp, Greed Is Good: Algorithmic Results for Sparse Approximation, IEEE Transactions on Information Theory, 50(10):2231-2242, 2004

[15] D. Needell, J. Tropp, CoSaMP: Iterative Signal Recovery from Incomplete and Inaccurate Samples, Applied and Computational Harmonic Analysis, 26(3):301-321, 2009

[16] S. Sarvotham, D. Baron, R. Baraniuk, Measurements vs. Bits: Compressed Sensing Meets Information Theory, Proc. 44th Ann. Allerton Conference on Communication, Control, and Computing, Monticello, IL, Sep 2006

[17] Y. Rivenson, A. Stern, Compressed Imaging with a Separable Sensing Operator, IEEE Signal Processing Letters, 16(6):449-452, 2009

[18] W. L. Chan, K. Charan, D. Takhar, K. F. Kelly, R. G. Baraniuk, D. M. Mittleman, A Single-Pixel Terahertz Imaging System Based on Compressed Sensing, Applied Physics Letters, 93, 121105, 2008

[19] J. Laska, S. Kirolos, Y. Massoud, R. Baraniuk, A. Gilbert, M. Iwen, M. Strauss, Random Sampling for Analog-to-Information Conversion of Wideband Signals, IEEE Dallas Circuits and Systems Workshop (DCAS), Dallas, Texas, Oct. 2006

[20] M. Lustig, D. Donoho, J. Pauly, Sparse MRI: The Application of Compressed Sensing for Rapid MR Imaging, Magnetic Resonance in Medicine, 58(6):1182-1195, 2007

[21] H. Rauhut, K. Schnass, P. Vandergheynst, Compressed Sensing and Redundant Dictionaries, IEEE Transactions on Information Theory, 54(5):2210-2219, 2008

[22] R. Baraniuk, M. Davenport, R. DeVore, M. Wakin, A Simple Proof of the Restricted Isometry Property for Random Matrices, Constructive Approximation, 28(3):253-263, 2008

[23] M. Elad, A. Bruckstein, A Generalized Uncertainty Principle and Sparse Representation in Pairs of Bases, IEEE Transactions on Information Theory, 49:2558-2567, 2002

[24] M. Aharon, M. Elad, A. Bruckstein, On the Uniqueness of Overcomplete Dictionaries, and a Practical Way to Retrieve Them, Linear Algebra and Its Applications, 416(1):48-67, 2006

[25] M. Aharon, M. Elad, A. Bruckstein, Y. Katz, K-SVD: An Algorithm for Designing of Overcomplete Dictionaries for Sparse Representation, IEEE Transactions on Signal Processing, 54(11):4311-4322, 2006

[26] M. Elad, M. Zibulevsky, Iterated Shrinkage Algorithms in Signal and Image Processing, preprint

[27] A. Bell, T. Sejnowski, The Independent Components of Natural Scenes Are Edge Filters, Vision Research, 37(23):3327-3338, 1997


[28] M. Elad, B. Matalon, M. Zibulevsky, Coordinate and Subspace Optimization Methods for Linear Least Squares with Non-Quadratic Regularization, Applied and Computational Harmonic Analysis, 23(3):346-367, 2007

[29] J. Duarte-Carvajalino, G. Sapiro, Learning to Sense Sparse Signals: Simultaneous Sensing Matrix and Sparsifying Dictionary Optimization, IEEE Transactions on Image Processing, 18(7):1395-1408, 2009

[30] A. Ben-Tal, M. Zibulevsky, Penalty/Barrier Multiplier Methods for Convex Programming Problems, SIAM Journal on Optimization, 7(2):347-366, 1997

[31] M. Zibulevsky, Smoothing Method of Multipliers for Sum-Max Problems, Technical report, Department of Electrical Engineering, Technion, Haifa, Israel, 2001. http://ie.technion.ac.il/~mcib/

[32] M. Schmidt, G. Fung, R. Rosales, Optimization Methods for `1-Regularization, unpublished. http://www.cs.ubc.ca/cgi-bin/tr/2009/TR-2009-19.pdf

[33] Mark Schmidt's website, http://people.cs.ubc.ca/~schmidtm/Software/minFunc.html

[34] `1-magic, on E. Candes's website: http://www.acm.caltech.edu/l1magic/

[35] A. Abdelnour, I. Selesnick, Symmetric Nearly Shift-Invariant Tight Frame Wavelets, IEEE Transactions on Signal Processing, 53(1):231-239, 2005

[36] J. Starck, E. Candes, D. Donoho, The Curvelet Transform for Image Denoising, IEEE Transactions on Image Processing, 11(6):670-684, 2002

[37] L. Ying, L. Demanet, E. Candes, 3D Discrete Curvelet Transform, Proc. SPIE Wavelets XI, San Diego, 591413, 2005

[38] Y. Lu, M. Do, A New Contourlet Transform with Sharp Frequency Localization, Proc. IEEE Int. Conf. on Image Processing, 1629-1632, 2006

[39] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, Second edition, 1998

[40] D. Bertsekas, Nonlinear Programming, Athena Scientific, 1995

[41] R. Gonzalez, R. Woods, Digital Image Processing, Prentice Hall, 2002

[42] AVIRIS project website: http://aviris.jpl.nasa.gov/

[43] O. Kuybeda, D. Malah, M. Barzohar, Hyperspectral Channel Reduction for Local Anomaly Detection, 17th European Signal Processing Conference (EUSIPCO 2009)
