
An Efficient Algorithm for Overcomplete Sparsifying Transform Learning with Signal Denoising

Beiping Hou, Zhihui Zhu, Gang Li, and Aihua Yu

Abstract—This paper deals with the problem of overcomplete transform learning. An alternating minimization-based procedure is proposed for solving the formulated sparsifying transform learning problem. A closed-form solution is derived for the minimization involved in the transform update stage. Compared with existing ones, our proposed algorithm significantly reduces the computational complexity. Experiments and simulations are carried out with synthetic data and real images to demonstrate the superiority of the proposed approach in terms of the averaged representation and denoising errors, the percentage of successful and meaningful recovery of the analysis dictionary, and, more significantly, the computational efficiency.

Index Terms—Sparse representations; analysis dictionary learning; low-dimensional signal model.

I. INTRODUCTION

Signal representations play the role of cornerstones in signal processing. They have evolved from the Fourier transform, which dates back to the late 19th century, to the wavelet expansions of the 1980s, and then to the sparse representations that have been one of the hot spots in the signal processing community for the last two decades. Instead of the traditional expansions in terms of bases and frames, sparse redundant representations look for the best approximation of a signal vector by a linear combination of a few atoms from an over-complete set of well-designed vectors [1]. This topic is closely related to sparsifying dictionary learning [2], [3] and compressed sensing (CS) [4]-[6], an emerging area of research in which the sparsity of the signals to be recovered is a prerequisite.

Let $v \in \Re^{N\times 1}$. The $l_p$-norm of $v$ is defined as
$$\|v\|_p = \Big(\sum_{k=1}^{N}|v(k)|^p\Big)^{1/p},\quad p \ge 1 \qquad (1)$$
For convenience, $\|v\|_0$ is used to denote the number of non-zero elements in $v$, although the $l_0$ "norm" is not a true norm since it does not satisfy homogeneity.

B.P. Hou and A.H. Yu are with the School of Automation and Electrical Engineering, Zhejiang University of Science and Technology, Hangzhou, 310023, Zhejiang, P.R. China.

Z.H. Zhu is with the Department of Electrical Engineering and Computer Science at the Colorado School of Mines, 1500 Illinois St., Golden, CO 80401, USA.

G. Li (contact author, e-mail: [email protected]) is with the School of Automation and Electrical Engineering, Zhejiang University of Science and Technology, and the College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, Zhejiang, P.R. China.

A. Signal models and dictionary learning

Let $y = x + \xi \in \Re^{N\times 1}$ be the signal under consideration, where $x$ is the clean signal and $\xi$ is an additive noise or modelling error. There are basically two models used for the clean signal $x$. The first one is called the synthesis model, which assumes that the clean signal $x$ is given by
$$x = \sum_{k=1}^{J} s(k)\psi_k = \Psi s \qquad (2)$$
where $\Psi \in \Re^{N\times J}$ is usually called a synthesis dictionary$^1$ and is said to be over-complete if $N < J$, and $s \in \Re^{J\times 1}$ is the coefficient vector of $x$ in $\Psi$. The signal $s$ is said to be $\kappa$-sparse in $\Psi$ if $\|s\|_0 \le \kappa$.

The procedure for finding $s$ with $x$ and $\Psi$ given is referred to as analysis. Clearly,
$$s = \Omega x \qquad (3)$$
is a solution to this problem as long as $\Psi\Omega = I_N$, where $I_N$ is the identity matrix of dimension $N$. It should be pointed out that (2) and (3) are equivalent when $\Psi$ is square and non-singular. When $\Psi$ is overcomplete, such a claim does not hold as (2) is not a one-to-one mapping between $x$ and $s$. Equation (3) is referred to as the analysis model [16]-[21], and $\Omega \in \Re^{J\times N}$ is called the analysis dictionary or analysis operator. When $J > N$, which is assumed throughout this paper, the dictionary is said to be redundant or over-complete.

Let $\{y_n\}$ be a set of signal samples belonging to a certain class of signals. For a given sparsity level, say $\kappa$, sparsifying synthesis dictionary learning is to find a $\Psi$ such that $y_n = \Psi s_n + \xi_n,\ \forall n$ with $\|s_n\|_0 \le \kappa$. Methods for learning the dictionary $\Psi$ from the signal samples $\{y_n\}$ have been one of the hottest topics in signal processing, among which the method of optimal directions (MOD) [7] and the K-singular value decomposition (K-SVD) based algorithms [8], [9] are considered state-of-the-art in this area. The basic idea behind these methods is the alternating minimization strategy [10]-[12], in which the dictionary $\Psi$ and the sparse coefficients $s$ are updated alternately using orthogonal matching pursuit (OMP) techniques [13]-[15].

Compared with synthesis model-based sparsifying dictionary learning, its analysis model-based counterpart has received less attention.

$^1$Throughout this paper, $q_k$ and $q_{ij}$ denote the $k$th column vector and the $(i,j)$th entry of a matrix $Q$, respectively.


The analysis dictionary learning problem can be stated as
$$\{\Omega, X\} = \arg\min_{\tilde\Omega, \tilde X}\|Y - \tilde X\|_F^2,\quad \text{s.t. } \|\tilde\Omega\tilde x_m\|_0 \le J - l \triangleq \kappa,\ \forall m \qquad (4)$$
where $l$, denoting the number of zero elements in $s$, is the given co-sparsity, and $\tilde x_m$ denotes the $m$th column of $\tilde X$.$^2$ As $\tilde X$ is the clean signal, (4) also performs signal denoising.

The first attempt to address such a problem was given in [16], where it is suggested to incrementally learn the dictionary $\Omega$ one row at a time. The performance of the proposed algorithm, however, depends heavily on a randomized initialization strategy. A different approach to the dictionary learning was proposed in [17]-[19]. As proposed in [19], (4) is investigated with the following formulation
$$\min_{\tilde\Omega, \tilde X}\|\tilde\Omega\tilde X\|_1 \quad \text{s.t. } \tilde\Omega \in \mathcal{C},\ \|Y - \tilde X\|_F \le \epsilon_0$$
where $\mathcal{C}$ is a specified set within which the optimal operator is searched and $\epsilon_0 \ge 0$ is a parameter corresponding to the noise level. Such a problem is alternatively addressed using a regularized version in [19].

Another signal model, which is very close to the analysis model, is the transform model introduced in [22]. Such a model is characterized by
$$Wy = s + \epsilon \qquad (5)$$
where $W \in \Re^{J\times N}$ is called a transform, $s$ is sparse with $\|s\|_0 \le \kappa$, while $\epsilon$ is the representation/modeling error in the transform domain and is assumed to be very small. It is interesting to note that for an analysis signal $y = x + \xi$ with $\Omega x = s$ sparse, one has $\Omega y = s + \Omega\xi$, which is always of the form specified in (5), while if $y$ belongs to the transform model-based class in $\Omega$, which means there exists a pair $(s, e)$ with $s$ sparse such that $\Omega y = s + e$, the system $\Omega\,[\,x\ \ \xi\,] = [\,s\ \ e\,]$ may have no solution for $(x, \xi)$ at all, given that the number of equations is larger than the number of unknowns due to the assumption $N < J$.

When $W$ is available, the signal $y$ is transformed to a $\kappa$-sparse vector $s$:
$$s = \arg\min_{\tilde s}\|\tilde s - Wy\|_2^2,\quad \text{s.t. } \|\tilde s\|_0 \le \kappa$$
The solution can be obtained simply by thresholding the product $Wy$ and retaining only the $\kappa$ largest coefficients (magnitude-wise) of this product. The estimate $\hat y$ of the original signal $y$ can be found from $s$ and $W$ by minimizing $\|s - W\tilde y\|_2^2$ with respect to $\tilde y$. This yields
$$\hat y = (W^TW)^{-1}W^Ts$$
as long as $W$ is of full rank (under the assumption that $J \ge N$), where $^T$ denotes the transpose operator.
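To make these two operations concrete, the following Python sketch (an illustration under the stated assumptions, not the authors' code; the transform `W`, signal `y`, and sparsity `kappa` are assumed given) keeps the $\kappa$ largest-magnitude entries of $Wy$ and then recovers the signal with the least-squares formula above.

```python
import numpy as np

def sparse_code(W, y, kappa):
    """Keep only the kappa largest-magnitude entries of W @ y."""
    s = W @ y
    # zero out the (J - kappa) smallest magnitudes
    idx = np.argsort(np.abs(s))[:-kappa]
    s[idx] = 0.0
    return s

def recover_signal(W, s):
    """Least-squares estimate y_hat = (W^T W)^{-1} W^T s (W assumed full rank)."""
    return np.linalg.solve(W.T @ W, W.T @ s)

# toy usage with random data (illustrative only)
J, N, kappa = 50, 25, 21
W = np.random.randn(J, N)
y = np.random.randn(N)
s = sparse_code(W, y, kappa)
y_hat = recover_signal(W, s)
```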

The sparsifying transform learning problem is formulated as
$$\{W, S\} = \arg\min_{\tilde W, \tilde S}\|\tilde S - \tilde W Y\|_F^2,\quad \text{s.t. } \|\tilde s_m\|_0 \le \kappa,\ \forall m \qquad (6)$$

$^2$Throughout this paper, we use a tilde (say $\tilde Q$) to denote a (matrix/vector) variable, while $Q$ is used for the optimal/true one in a problem.

This problem is not well defined and suffers from the so-called scale ambiguity. It is observed that a trivial solution of the above problem is $\tilde S = 0$ and $\tilde W = 0$. One approach to prevent such solutions is to add penalty functions to the objective. In [22], the penalty $\nu\|\tilde W\|_F^2 - \log\det(\tilde W)$ was included in the cost function for learning a complete transform of dimension $N\times N$. The term $\log\det(\tilde W)$ is used to enforce the full rank of $\tilde W$, and the term $\|\tilde W\|_F^2$ is applied to avoid the scaling issue. See below:
$$\mathcal{L}_{RB}(\tilde W, \tilde S) \triangleq \|\tilde S - \tilde W Y\|_F^2 + \lambda[\nu\|\tilde W\|_F^2 - \log\det(\tilde W)] \qquad (7)$$
where $\lambda$ and $\nu$ are two positive constants, while $\log\det(\cdot)$ denotes the natural logarithm of the determinant of a matrix. The optimal transform design was then reformulated as
$$\{W, S\} = \arg\min_{\tilde W, \tilde S}\mathcal{L}_{RB}(\tilde W, \tilde S),\quad \text{s.t. } \|\tilde s_m\|_0 \le \kappa,\ \forall m \qquad (8)$$
subject to $\det(\tilde W) > 0$. Such a problem was solved in [22], [23].

B. Contributions

In this paper, we consider solving (6) with a given $(Y, \kappa)$. An efficient algorithm is proposed to solve the optimal sparsifying transform design problem. It should be pointed out that our proposed approach is more general than those in [22], [23], as it deals with overcomplete transform learning, and more efficient than [24], as the transform is updated using a closed-form solution rather than one obtained with a gradient-based numerical procedure, which will be elaborated further in the next section. Roughly speaking, (4) has been attacked using alternating optimization, in which, with $\tilde X$ obtained, $\tilde\Omega$ is updated subject to $\|\tilde\Omega\tilde x_m\|_0 \le J - l \triangleq \kappa,\ \forall m$. This can be cast as
$$\Omega = \arg\min_{\tilde\Omega, \tilde S}\|\tilde S - \tilde\Omega X\|_F^2,\quad \text{s.t. } \|\tilde s_m\|_0 \le \kappa,\ \forall m \qquad (9)$$

As seen, this problem is exactly the same as (6) with $Y$ replaced by $X$. So, the efficiency of the analysis K-SVD proposed in [21] is expected to be improved by using our proposed algorithm for the analysis dictionary update. We will demonstrate the potential of the proposed algorithm in a series of experiments using both synthetic and real data and compare it with the approach in [21]. As will be seen, our approach yields a performance comparable to that of the latter in terms of denoising and recovery of the reference dictionary and, more importantly, a much improved computational efficiency.

The outline of this paper is as follows. In Section II, after reviewing some related work, we formulate the problem to be investigated in the context of transform learning and show how it is related to analysis dictionary learning. Section III is devoted to solving the optimal transform design. An efficient algorithm is derived for that purpose, in which a closed-form solution is obtained for updating the transform. The performance of this algorithm and its application to signal denoising are discussed in that section.


In order to demonstrate the superiority of the proposed algorithm, experiments are carried out in Section IV, which show the promising performance of the proposed approach. Some concluding remarks are given in Section V.

II. PRELIMINARIES AND PROBLEM FORMULATION

A. Related works

The first attempt to address analysis dictionary learning was given in [16], where it is suggested to incrementally learn the dictionary $\Omega$ one row at a time. The performance of the proposed algorithm, however, depends heavily on a randomized initialization strategy. A different approach to the dictionary learning was proposed in [17]-[19], where the goal of sparsifying the representations $\tilde\Omega X$ is formulated using an $l_1$-based penalty function rather than the $l_0$-based one. To avoid trivial solutions, the dictionary is constrained to be a uniform normalized tight frame, that is, $\tilde\Omega^T\tilde\Omega = I_N$, the identity matrix of dimension $N$. Such a constraint clearly limits the search space for the optimal dictionary.

Recently, it was argued in [21] that in the analysis model one should emphasize the zeros of $\Omega x$. Based on this argument, (4) was reformulated in [21] as

$$\{\Omega, X, \{\Lambda_i\}_{i=1}^{L}\} = \arg\min_{\tilde\Omega, \tilde X, \{\tilde\Lambda_i\}_{i=1}^{L}} \|Y - \tilde X\|_F^2$$
$$\text{s.t. } \tilde\Omega_{\tilde\Lambda_i}\tilde x_i = 0,\ \operatorname{Rank}(\tilde\Omega_{\tilde\Lambda_i}) = N - r,\ i = 1,\cdots,L;\quad \|\tilde\omega_m^\dagger\|_2 = 1,\ \forall m \qquad (10)$$
where $q_m^\dagger$ denotes the $m$th row vector of a matrix $Q$, $\Lambda_i$ is the co-support of $x_i$, defined as the set of indices of the $l$ rows of $\Omega$ that are orthogonal to $x_i$, $\Omega_{\Lambda_i}$ is the sub-matrix of $\Omega$ that contains only the rows indexed by $\Lambda_i$, and $r$ is the dimension of the subspace to which $x_i$ belongs.

An algorithm, called analysis K-SVD, was proposed in [21] to address (10). The analysis K-SVD, denoted AK-SVD, was developed in a manner similar to that adopted by its counterpart K-SVD [8] for the synthesis model and, as shown in [21], outperformed the algorithm reported in [16]. It has been observed, however, that AK-SVD is very time consuming, and therefore more efficient algorithms are needed.

In order to restrict the solution set of the transform learning problem (6) to an admissible set, the constraint $\tilde W^T\tilde W = I_N$ was applied in [25], which forces the rows of the transform $\tilde W$ to form a tight frame. However, the tight frame constraint alone is insufficient to prevent undesired solutions of (6) when learning an over-complete transform $W$ ($J > N$), for which it is highly possible that practical algorithms result in a local minimum of the form $W = [W_1^T\ \ 0]^T$ with $W_1$ a tight frame. In [24], the overcomplete transform learning (6) was attacked with the following penalty function embedded into the cost function:
$$\nu\big(\|\tilde W\|_F^2 - \log\det(\tilde W^T\tilde W)\big) + \eta\sum_{n\ne n'}|\tilde w_n^\dagger\,\tilde w_{n'}^{\dagger T}|^q$$
where $\nu$ and $\eta$ are weighting factors and $q$ is usually set to 1 or 2. An iterative procedure using the alternating minimization strategy was proposed to solve the corresponding problem, where a gradient-based algorithm is utilized for updating the transform. Clearly, such a method is time consuming and may suffer from local minimum issues due to the use of a gradient-based algorithm. Furthermore, the possibility of having identical rows in $\tilde W$ can be reduced by the second term of the penalty, but zero rows may occur if $\tilde W$ is not (row) normalized.

B. Problem formulation

Based on the previous discussions, the optimal sparsifying transform learning problem can be formulated as
$$\{W, S\} = \arg\min_{\tilde W, \tilde S}\|\tilde S - \tilde W Y\|_F^2,\quad \text{s.t. } \|\tilde s_m\|_0 \le \kappa,\ \forall m;\ \tilde W^T\tilde W \succ 0 \qquad (11)$$
subject to
$$\|\tilde w_k^\dagger\|_2 = 1,\ \forall k \qquad (12)$$

To implement (11), we consider the following cost function

$$\mathcal{L}(\tilde W, \tilde S) \triangleq \|\tilde S - \tilde W Y\|_F^2 + \lambda\mathcal{F}(\tilde W) \qquad (13)$$
where $\mathcal{F}(\tilde W)$ is a penalty term (i.e., a regularizer) defined as
$$\mathcal{F}(\tilde W) \triangleq \nu\|\tilde W\|_F^2 - \log\det(\tilde W^T\tilde W) \qquad (14)$$

Both $\lambda$ and $\nu$ are positive parameters to be discussed later. The problem to be addressed in this paper is finally formulated as
$$\{W, S\} = \arg\min_{\tilde W, \tilde S}\mathcal{L}(\tilde W, \tilde S),\quad \text{s.t. } \|\tilde s_m\|_0 \le \kappa,\ \forall m;\ \|\tilde w_k^\dagger\|_2 = 1,\ \forall k \qquad (15)$$

Comment 2.1:
• The additional regularizer $\mathcal{F}(\tilde W)$ in (13) serves as a constraint on the space in which the true transform $W$ is searched. The term $\|\tilde W\|_F^2$ ensures that $\tilde W$ is not too large, while $-\log\det(\tilde W^T\tilde W)$ guarantees that $\tilde W$ is not too small, and hence the full-rank constraint is met. The combination of the two terms ensures that the optimal dictionary is searched within a well-behaved space such that the undesired scaling problem is avoided;
• As will be seen in the next subsection, the minimum of this regularizer is attained on a set of tight frames. Therefore, if (4) has a solution $W$ close to a tight frame, it can be well estimated by (15);
• Comparing (15) with (8), one realizes that (15) can be used for optimal over-complete sparsifying transform learning. Besides, instead of $-\log\det(\tilde W)$, $-\log\det(\tilde W^T\tilde W)$ is adopted in our approach. By doing so, the constraint $\det(\tilde W) > 0$ is no longer needed.

C. Analysis dictionaries and transforms

Consider the analysis-model-based dictionary learning (4).

Define $E \triangleq Y - X$ and $\tilde E \triangleq \Omega Y - S$ as the representation error and the sparsification error, respectively, where $\Omega X = S$. Then
$$\Omega Y - S = \Omega(Y - X) \qquad (16)$$
It is assumed in the sequel that the true dictionary $\Omega$ is of full rank. Such an assumption ensures that the row vectors of $\Omega$ span the entire signal space $\Re^{N\times 1}$.


The following lemma specifies a class of true dictionaries $\Omega$ for which the equivalence between (11) and (4) holds.

Lemma 1. (11) is equivalent to (4) if the true dictionary $\Omega$ is a $c$-tight frame (TF) ($c > 0$), that is, $\Omega$ has an SVD of the form
$$\Omega = cU\begin{bmatrix}I_N & 0\end{bmatrix}^TV^T \qquad (17)$$
where both $U$ and $V$ are orthonormal matrices of proper dimension.

Proof: With (16) and (17), we have
$$\|S - \Omega Y\|_F^2 = \|\Omega(Y - X)\|_F^2 = c^2\big\|U\begin{bmatrix}I_N & 0\end{bmatrix}^TV^T(Y - X)\big\|_F^2 = c^2\big\|\begin{bmatrix}I_N & 0\end{bmatrix}^TV^T(Y - X)\big\|_F^2 = c^2\|V^T(Y - X)\|_F^2 = c^2\|Y - X\|_F^2$$
Then, Lemma 1 follows immediately from the fact that the two constraints $\|s_m\|_0 \le \kappa$ and $\|\Omega x_m\|_0 \le \kappa$ are equivalent as $S = \Omega X$.

Comment 2.2: As seen from Lemma 1, the equivalence between the two problems holds if the true dictionary $\Omega$ is a $c$-tight frame, which is in general not true. However, it is observed that for a given $Y$ the problem defined by (4) has in general more than one solution $(\Omega, X)$. In fact, define $D_\alpha \triangleq \mathrm{diag}(\alpha_1,\cdots,\alpha_m,\cdots,\alpha_J)$ and $D_\beta \triangleq \mathrm{diag}(\beta_1,\cdots,\beta_m,\cdots,\beta_L)$. It can be seen that if $(\Omega_0, X_0)$ is a solution, so is $(\Omega, X)$ with
$$\Omega = D_\alpha\Omega_0 T,\qquad X = T^{-1}X_0 D_\beta$$
where $D_\alpha$ is any given non-singular diagonal matrix, while both $T \in \Re^{N\times N}$ (non-singular) and $D_\beta$ are arbitrary except that $\det(D_\beta) \ne 0$ and $\|Y - X\|_F = \|Y - X_0\|_F$. This means that there exist many degrees of freedom in the solution set of (4), and by forcing the solution whose dictionary $\Omega$ is closest to a TF, (11) is expected to yield a transform $W$ that is a good estimate of a solution of the analysis dictionary learning problem (4).

D. Rule of thumb for choosing $\lambda$ and $\nu$

Now, we present a result regarding the minimization of $\mathcal{F}(\tilde W)$. A similar result was given in [22].

Lemma 2. Let $\mathcal{F}(\tilde W)$ be defined in (14). Then,
$$\mathcal{F}(\tilde W) \ge N[1 + \log(\nu)] \triangleq \eta_0 \qquad (18)$$
with equality achieved if and only if $\tilde W$ is a $\sqrt{\nu^{-1}}$-tight frame.

Proof: Let $\tilde W = U\begin{bmatrix}\Sigma_N & 0\end{bmatrix}^TV^T$ be the SVD of $\tilde W$, where $\Sigma_N = \mathrm{diag}(\sigma_1,\cdots,\sigma_k,\cdots,\sigma_N)$ with $\sigma_k \ge \sigma_{k+1} > 0,\ \forall k$. Clearly, $\mathcal{F}(\tilde W) = \sum_{k=1}^{N}[\nu\sigma_k^2 - \log(\sigma_k^2)] \triangleq \sum_{k=1}^{N}f(\sigma_k)$. Note that $\frac{df(\sigma_k)}{d\sigma_k} = 2\nu\sigma_k - \frac{2}{\sigma_k}$ and $\sigma_k > 0$. It follows from $\frac{df(\sigma_k)}{d\sigma_k} = 0$ that the minimizer of $f(\sigma_k)$ is given by
$$\sigma_k^* = \sqrt{\nu^{-1}} \ \Rightarrow\ f(\sigma_k^*) = 1 + \log(\nu) \le f(\sigma_k),\ \forall k$$
and hence (18) holds. Clearly, equality is attained if and only if $\sigma_k = \sqrt{\nu^{-1}},\ \forall k$.
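As a quick numerical sanity check of Lemma 2 (a hedged sketch with arbitrarily chosen $J$, $N$ and $\nu$, not part of the original derivation), one can compare $\mathcal{F}(\tilde W)$ for a random transform against a $\sqrt{\nu^{-1}}$-tight frame built from its singular vectors; the latter should attain the bound $N[1 + \log\nu]$.

```python
import numpy as np

def F(W, nu):
    """Regularizer F(W) = nu*||W||_F^2 - log det(W^T W), as in (14)."""
    sign, logdet = np.linalg.slogdet(W.T @ W)
    return nu * np.linalg.norm(W, 'fro')**2 - logdet

J, N, nu = 50, 25, 0.04
W = np.random.randn(J, N)

# Build a sqrt(1/nu)-tight frame with the same singular vectors as W
U, _, Vt = np.linalg.svd(W, full_matrices=False)
W_tf = np.sqrt(1.0 / nu) * U @ Vt            # all singular values equal sqrt(1/nu)

bound = N * (1.0 + np.log(nu))               # eta_0 in (18)
print(F(W, nu) >= bound)                     # True for any full-rank W
print(np.isclose(F(W_tf, nu), bound))        # True: the bound is attained
```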

Comment 2.3:
• Based on Lemma 2, the cost function $\mathcal{L}(\tilde W, \tilde S) = \|\tilde S - \tilde W Y\|_F^2 + \lambda\mathcal{F}(\tilde W)$ is bounded from below by $\lambda\eta_0$;
• It follows from Lemmas 1 and 2 that if the reference analysis operator happens to be a tight frame, then the solution of (15) should yield the same operator no matter what $\lambda$ is taken;
• As indicated in Comment 2.2, the solution $W$ can be expected to be close to a tight frame. This suggests that $\lambda$ should be taken much larger than one, since when $\lambda$ is chosen very large, minimizing $\mathcal{L}(\tilde W, \tilde S)$ can be approximated by minimizing $\mathcal{F}(\tilde W)$;
• As seen from (15), we need to normalize the dictionary $\tilde W$ such that $\|\tilde w_j^\dagger\|_2 = 1,\ \forall j$. Under this constraint, the constant $c$ in (17) turns out to be $\sqrt{J/N}$: with unit-norm rows, $\|W\|_F^2 = J$, while a $c$-tight frame has $\|W\|_F^2 = c^2 N$. Since, by Lemma 2, the minimum of $\mathcal{F}$ is attained at a $\sqrt{\nu^{-1}}$-tight frame, matching $\sqrt{\nu^{-1}} = \sqrt{J/N}$ with $J \ge N$ for an over-complete analysis dictionary suggests that $\nu < 1$ should be ensured.

Before turning to the next section, we should point out that although the proposed approach is not directly related to the overcomplete analysis dictionary learning (4) in the way that (10) is, it can provide a good estimate of the true dictionary in a very efficient way, which is believed to be useful for improving the analysis K-SVD algorithm.

III. THE PROPOSED ALGORITHM

This section is devoted to attacking the optimal sparsifying transform learning problem by deriving an algorithm to solve (15).

A. The algorithm

Note that the problem defined in (15) is non-convex. A practical approach to such problems is to adopt the alternating minimization strategy [10]-[12]. The basic idea is to update the dictionary $\tilde W$ and the (column-)sparse matrix $\tilde S$ alternately. The proposed algorithm is outlined below:

AlgProposed

Initialization: Set the training data matrix $Y$, the number of iterations $N_{ite}$, and the parameters $\lambda, \nu$; initialize $\tilde W$ with $W_0$.$^3$

Begin: $k = 1, 2, \cdots, N_{ite}$

$^3$There are several ways to initialize $\tilde W$. A simple way is to set $\tilde W = \mathrm{randn}(J, N)$. An alternative is, for the $m$th row ($m = 1, 2, \ldots, J$), to first randomly choose $N-1$ columns from the training data matrix $Y$ to form a submatrix $\bar Y$, then set $\tilde w_m^\dagger = e - e\bar Y\bar Y^\ddagger$, where $e = \mathrm{randn}(1, N)$ and $\bar Y^\ddagger$ denotes the pseudoinverse of $\bar Y$, and finally normalize each row to unit $l_2$ norm. In the simulations, we choose the second approach to initialize $\tilde W$ in order to have a fair comparison with [21].


• Step I: Update $\tilde S$ such that $\mathcal{L}(W_{k-1}, \tilde S)$ is minimized under the constraint $\|\tilde s_m\|_0 \le \kappa,\ \forall m$. This is equivalent to
$$S_k = \arg\min_{\tilde S}\|\tilde S - W_{k-1}Y\|_F^2,\quad \text{s.t. } \|\tilde s_m\|_0 \le \kappa,\ \forall m \qquad (19)$$
• Step II: With $S_k$ just obtained, update $\tilde W$ by solving
$$W_k = \arg\min_{\tilde W}\mathcal{L}(\tilde W, S_k) \qquad (20)$$
• Step III: Normalize $W_k$ via $W_k \leftarrow T_{sc}W_k$ so that all the rows of $W_k$ are of unit length. This is achieved by setting the $(j,j)$th entry of the diagonal matrix $T_{sc}$ equal to the inverse of the length of the $j$th row of $W_k$, for $j = 1, 2, \cdots, J$.

End: Output $W = W_{N_{ite}}$, the estimate of the true dictionary $W$.

Now, let us elaborate further on the three steps involved in the proposed algorithm.

The first step involves updating $\tilde S$ with (19), whose solution $S_k$, as mentioned before, can be obtained directly from the product $V \triangleq W_{k-1}Y$ in the following way:

Begin: $m = 1, 2, \cdots, L$
• Determine the threshold $\tau_m$, equal to the absolute value of the $\kappa$th largest coefficient (magnitude-wise) in the $m$th column of $V$;
• for $j = 1, 2, \cdots, J$, set the $(j,m)$th element of $S_k$ to
$$\begin{cases} v_{jm}, & |v_{jm}| \ge \tau_m\\ 0, & |v_{jm}| < \tau_m\end{cases} \qquad (21)$$
where $v_{jm}$ denotes the $(j,m)$th element of the matrix $V$.
End: Output $S_k$.$^4$

$^4$In case there are more than $\kappa$ non-zero elements in a column of $S_k$ (i.e., ties at the threshold), just keep any set of the first $\kappa$ largest ones and set the others to zero.

Now, let us consider (20) involved in Step II. Define
$$M \triangleq SY^T,\qquad Z \triangleq \lambda\nu I_N + YY^T \qquad (22)$$
It can be shown that the cost function $\mathcal{L}(\tilde W, S)$ can be rewritten as
$$\mathcal{L}(\tilde W, S) = \|S\|_F^2 - 2\,\mathrm{tr}[M\tilde W^T] + \mathrm{tr}[\tilde W Z\tilde W^T] - \lambda\log\det(\tilde W^T\tilde W) \triangleq \mathcal{L}_1(\tilde W)$$
where $\mathrm{tr}[\cdot]$ denotes the trace of a matrix and $\mathcal{L}_1(\tilde W)$ is used to emphasize the fact that $S$ is fixed in the minimization to be considered.

Therefore, (20) is just a special case of the following minimization problem
$$W^* = \arg\min_{\tilde W}\mathcal{L}_1(\tilde W) \qquad (23)$$

Theorem 1. Let $M \in \Re^{J\times N}$, $Z \in \Re^{N\times N}$ and $S$ be three matrices independent of the matrix $\tilde W \in \Re^{J\times N}$ with $J \ge N$, let $Z = U_z\Lambda^2 U_z^T$ be an SVD of the (symmetric) positive-definite $Z$, and define $Z_{sqrt} \triangleq U_z\Lambda U_z^T$. Furthermore, let
$$\bar M \triangleq MZ_{sqrt}^{-1} = U_0\begin{bmatrix}\Pi^2\\ 0\end{bmatrix}V_0^T$$
be an SVD of $\bar M$, where $\Pi = \mathrm{diag}(\pi_1,\cdots,\pi_k,\cdots,\pi_N)$. A solution to (23) is then given by
$$W^* = U_0\begin{bmatrix}\Sigma^2\\ 0\end{bmatrix}V_0^T Z_{sqrt}^{-1} \qquad (24)$$
where $\Sigma^2 = \mathrm{diag}(\sigma_1^2,\cdots,\sigma_k^2,\cdots,\sigma_N^2)$ with
$$\sigma_k^2 = \frac{\pi_k^2 + \sqrt{\pi_k^4 + 4\lambda}}{2},\quad \forall k \qquad (25)$$

The detailed proof can be found in Appendix A. As seen from (22), no matter what $Y$ is, the corresponding $Z$ matrix in our problem is always (symmetric) positive-definite as long as $\lambda > 0$ and $\nu > 0$. Therefore, (20) in Step II of the proposed algorithm can be solved using Theorem 1.
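For illustration, a minimal Python sketch of this closed-form update, following our reading of (22)-(25) (matrix names mirror the text; this is not the authors' released implementation):

```python
import numpy as np

def update_transform(Y, S, lam, nu):
    """Closed-form solution of (20): minimize ||S - W Y||_F^2 + lam*(nu*||W||_F^2 - log det(W^T W))."""
    N = Y.shape[0]
    M = S @ Y.T                                    # (22)
    Z = lam * nu * np.eye(N) + Y @ Y.T             # (22), symmetric positive definite
    # symmetric square root of Z and its inverse
    eigval, Uz = np.linalg.eigh(Z)
    Z_sqrt_inv = Uz @ np.diag(1.0 / np.sqrt(eigval)) @ Uz.T
    # SVD of M_bar = M Z_sqrt^{-1}; its singular values play the role of pi_k^2
    U0, pi2, V0t = np.linalg.svd(M @ Z_sqrt_inv, full_matrices=False)
    # singular values of W Z_sqrt from (25)
    sigma2 = (pi2 + np.sqrt(pi2**2 + 4.0 * lam)) / 2.0
    return U0 @ np.diag(sigma2) @ V0t @ Z_sqrt_inv  # (24)
```

In AlgProposed, this routine would play the role of Step II, with `S` taken from the sparse coding step and the result row-normalized afterwards (Step III).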

Step III is to normalize the transform such that $W_k$ has all its rows of unit length. This avoids zero rows in the transform $\tilde W$ and the scaling ambiguity [22]. Such a scaling has been widely used in sparsifying dictionary design.

Comment 3.1: As noted, the proposed algorithm shares the same framework as the one used in [22]. It should be pointed out that, besides the fact that our algorithm is more general in the sense that it can deal with analysis dictionaries (transforms) of any dimension rather than only square non-singular ones, the main difference between the two lies in how (20) in Step II is solved. In [22], a gradient-based algorithm was used to attack (20). Such an algorithm suffers in general from slow convergence and, more seriously, from convergence to a local minimum of the problem. Therefore, our proposed algorithm is more efficient, and faster convergence is expected.

B. Convergence

Consider the situation where the row normalization (i.e., Step III) is not executed in the proposed AlgProposed. As the minimizations involved in both Steps I and II are solved exactly, the cost function $\mathcal{L}(\tilde W, \tilde S)$ is ensured to be non-increasing. Given that the cost function is bounded from below, the proposed AlgProposed without Step III ensures that the cost function $\mathcal{L}(\tilde W, \tilde S)$ converges. As mentioned before, the obtained transform $\tilde W$ may have zero rows during the iterative procedure. In order to avoid this phenomenon, Step III has to be applied when the transform is over-complete, and in that case the cost function is found empirically to converge as well.

As argued in [22], convergence of the cost function does not imply convergence of the iterates. No theoretical results have been obtained regarding convergence of the iterates of the proposed algorithm. However, extensive experiments show that the proposed algorithm does exhibit good convergence behavior.


C. Computation efficiency

A thorough comparison between the algorithm proposed in [22], denoted AlgRB, and some relevant algorithms in terms of computational cost can be found in [22]. Since our proposed algorithm shares the same framework as AlgRB and is derived as an alternative to AlgAKSVD, the one recently reported in [21], we mainly focus on comparing our proposed AlgProposed and AlgAKSVD. For the sake of completeness, we discuss the computational cost of our proposed AlgProposed and AlgAKSVD as follows.

Our algorithm is an iterative method. Each iteration consists of three steps: sparse coding, dictionary update, and dictionary normalization. In the sparse coding step, we simply keep the $\kappa$ largest entries of each column of the matrix $\tilde W Y$ by full sorting. This, as indicated in [27], requires $O(JL\log N)$ operations. Knowing that computing $\tilde W Y$ requires $O(JNL)$ operations, one concludes that the sparse coding step needs roughly $O(JL(N + \log N))$ flops. In the dictionary update step, what we mainly do is compute the SVD of a matrix $\bar M$, so this step costs approximately $O(NJ^2 + N^3)$ flops, while the normalization step needs $O(JN)$ flops. Thus the total cost per iteration of the proposed algorithm is the sum of the three, which is dominated by $O(JNL)$ under the assumption $N \le J \ll L$.

As to AlgAKSVD, there are two steps involved in each iteration: sparse coding and dictionary update, and the computational cost is dominated by the sparse coding step. As given in [21], the sparse coding step requires $O(LN^3J)$ operations if the optimized backward greedy (OBG) algorithm is utilized. Therefore, the overall computational complexity scales as $O(LN^3J)$.

Based on the computations above, one can see that our proposed algorithm is much more efficient than AlgAKSVD. This is confirmed by the experiments presented in the next section.

D. Signal denoising

In AlgAKSVD, for a given measurement $y_i$ and a (row-normalized) dictionary $\Omega$, the clean signal $x_i$ is estimated by solving the following problem
$$\{\hat x_i, \Lambda_i\} = \arg\min_{\tilde x, \tilde\Lambda}\|\tilde x - y_i\|_2 \quad \text{s.t. } \Omega_{\tilde\Lambda}\tilde x = 0,\ \operatorname{Rank}(\Omega_{\tilde\Lambda}) = N - r$$
where $\tilde\Lambda \subset \{1, 2, \ldots, J\}$ is a set of row indices and $r$ is the dimension of the subspace to which the signal belongs. The OBG algorithm is used to attack such a problem, and it has to be run $L$ times in each iteration in order to obtain the clean signal matrix $X$ and the final estimate of the true dictionary $\Omega$. That is why AlgAKSVD is very time consuming.

In AlgProposed, instead of $\tilde X$, the (column-)sparse matrix $\tilde S$ is used. As mentioned before, the latter can be updated efficiently using (21). Taking the transform obtained by AlgProposed as the analysis dictionary, we can run the OBG to get the corresponding estimate $\hat x_i$ of $x_i$ for $i = 1, 2, \cdots, L$.

Alternatively, following the same approach as in [22], we can consider the following
$$\{W, X, S\} = \arg\min_{\tilde W, \tilde X, \tilde S}\mathcal{J}(\tilde W, \tilde X, \tilde S),\quad \text{s.t. } \|\tilde s_m\|_0 \le \kappa,\ \forall m \qquad (26)$$
where the objective function $\mathcal{J}(\tilde W, \tilde X, \tilde S)$ is defined as
$$\mathcal{J}(\tilde W, \tilde X, \tilde S) \triangleq \|\tilde W\tilde X - \tilde S\|_F^2 + \lambda\mathcal{F}(\tilde W) + \mu\|Y - \tilde X\|_F^2 \qquad (27)$$
with $\mathcal{F}(\tilde W)$ the same as defined before and $\mu > 0$ a constant determining the importance of the representation error.

This is a simultaneous optimization of the dictionary $\tilde W$, the clean signals $\tilde X$, and the sparse coefficients $\tilde S$. It is easy to see that this problem can be addressed using alternating minimization, and an algorithm similar to AlgProposed can be derived. In fact, such an algorithm can be obtained from AlgProposed by adding: i) $X_0 = Y$ to the Initialization, and ii) a new step between Step I and Step II for updating the clean signals $\tilde X$:
$$X_k = \arg\min_{\tilde X}\mathcal{J}(W_{k-1}, \tilde X, S_k)$$
The above is equivalent to minimizing $\|W_{k-1}\tilde X - S_k\|_F^2 + \mu\|Y - \tilde X\|_F^2$ with respect to $\tilde X$, a standard least-squares problem.
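A hedged sketch of this $\tilde X$-update; the closed-form normal-equations solution below is our own elementary derivation from (27) and is not spelled out in the text:

```python
import numpy as np

def update_clean_signals(W, S, Y, mu):
    """Minimize ||W X - S||_F^2 + mu*||Y - X||_F^2 over X via the normal equations."""
    N = W.shape[1]
    A = W.T @ W + mu * np.eye(N)        # always invertible for mu > 0
    return np.linalg.solve(A, W.T @ S + mu * Y)
```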

In this paper, we focus on the OBG-based denoising approach, as comparisons are made between the proposed AlgProposed and AlgAKSVD.

IV. EXPERIMENTAL RESULTS

In this section, we present experiments to illustrate the performance of our proposed algorithm AlgProposed and compare it with AlgAKSVD given in [21]. For convenience, the dictionaries obtained using these two algorithms are denoted $\Omega_{Proposed}$ and $\Omega_{AKSVD}$, respectively. All the experiments are performed on a computer with an Intel Core i7 CPU at 3.40 GHz, 16 GB of memory, and a 64-bit Windows 7 operating system.

A. Synthetic experiments

In order to have a fair comparison, the setup used here is similar to that in [21]. Let $J = 50$, $N = 25$ and $L = 50{,}000$. A $J\times N$ matrix $W$ is generated with normally distributed entries and then $\Omega = D_\alpha W$, where $D_\alpha$ is the diagonal matrix such that all the rows of $\Omega$ have unit $l_2$-norm. This $\Omega$ serves as the true over-complete analysis dictionary. Then, a set of $L$ vectors $x_i$ of dimension $N\times 1$ is produced, each living in a 4-dimensional subspace. Each vector $x_i$ is obtained as follows: we first construct a sub-matrix $\Omega^{(i)} \in \Re^{21\times N}$ by randomly choosing 21 rows from $\Omega$, then compute an orthonormal basis $\{\psi_k^{(i)}\}$ for the null space of $\Omega^{(i)}$, which can be obtained from the SVD of $\Omega^{(i)}$. With $\Psi^{(i)} = [\psi_1^{(i)}\ \psi_2^{(i)}\ \psi_3^{(i)}\ \psi_4^{(i)}]$, we calculate $x_i = \Psi^{(i)}s_i$, where each element of $s_i$ is independently drawn from a zero-mean, unit-variance (i.i.d.) Gaussian distribution. Finally, each vector $x_i$ is normalized so that $\|x_i\|_2 = 1$ for all $i = 1, 2, \cdots, L$.


We can see that $\Omega x_i$ has at least 21 zero components for all $x_i$ ($i = 1, 2, \cdots, L$) generated in this way.
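For concreteness, one possible implementation of this data-generation procedure is sketched below in Python (the helper name and randn-based details are our assumptions, not the authors' code):

```python
import numpy as np

def generate_cosparse_signals(Omega, L, cosupport_size=21, subspace_dim=4, rng=None):
    """Draw L unit-norm signals, each orthogonal to `cosupport_size` random rows of Omega.

    Note: subspace_dim is assumed to equal N - cosupport_size (4 = 25 - 21 in the text).
    """
    rng = np.random.default_rng() if rng is None else rng
    J, N = Omega.shape
    X = np.zeros((N, L))
    for i in range(L):
        rows = rng.choice(J, size=cosupport_size, replace=False)
        # orthonormal basis of the null space of Omega[rows]: last right-singular vectors
        _, _, Vt = np.linalg.svd(Omega[rows], full_matrices=True)
        Psi = Vt[-subspace_dim:].T                    # N x subspace_dim
        x = Psi @ rng.standard_normal(subspace_dim)
        X[:, i] = x / np.linalg.norm(x)
    return X
```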

Let $I_\kappa : \Re^J \to \Re^J$ denote the hard-thresholding operator that keeps the $\kappa$ largest (in magnitude) elements of a vector. We define the effective sparsity $\kappa$ of a length-$J$ vector $s$ as
$$\kappa = \min\tilde\kappa \ \ \text{s.t. } \|I_{\tilde\kappa}(s)\|_2 \ge 0.99\|s\|_2. \qquad (28)$$
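A direct reading of (28) in Python (illustrative sketch only):

```python
import numpy as np

def effective_sparsity(s, frac=0.99):
    """Smallest kappa such that the kappa largest-magnitude entries carry frac of the l2 norm."""
    mags = np.sort(np.abs(s))[::-1]            # descending magnitudes
    energy = np.sqrt(np.cumsum(mags**2))       # ||I_k(s)||_2 for k = 1..J
    target = frac * np.linalg.norm(s)
    return int(np.argmax(energy >= target) + 1)
```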

Fig. 1 shows the histogram of the effective sparsities of $\Omega x_i$ ($i = 1, 2, \ldots, L$). As can be seen, the effective sparsities vary from 17 to 25 and are mostly concentrated near 21.

Fig. 1. Histogram of the effective sparsities of $\Omega x_i$ ($i = 1, 2, \ldots, L$) for $J = 50$, $N = 25$, $L = 50000$ (horizontal axis: $\kappa$; vertical axis: number of signals).

The measurements $y_i$ are generated as
$$y_i = x_i + \epsilon_i,\ \forall i$$
where each entry of the additive noise vector $\epsilon_i$ is i.i.d. zero-mean white Gaussian noise of variance $\sigma^2$. Define the signal-to-noise ratio (SNR) as
$$\rho_{snr} \triangleq 10\log_{10}\frac{E[\|x_i\|_2^2]}{E[\|\epsilon_i\|_2^2]}$$
where $E[\cdot]$ denotes the mathematical expectation operator. When $\sigma = \frac{0.2}{\sqrt{N}} = 0.04$, $\rho_{snr} = 10\log_{10}\frac{1}{N\sigma^2} = 10\log_{10}(25) \approx 14$ dB.

We then estimate the analysis dictionary $\Omega$ using AlgAKSVD and AlgProposed with $X = Y$ (see (9)), respectively, for both the noise-free case (i.e., $\sigma = 0$) and the noisy case with noise level $\rho_{snr} = 14$ dB. In both cases, we fix $\kappa = 21$ and $N_{ite} = 200$, while $\nu = 0.01$, $\lambda = 300$ for AlgProposed.$^5$ Both AlgProposed and AlgAKSVD are initialized with a dictionary $\Omega_0$ in which each row is generated as follows: we first randomly choose $N-1$ columns from the training data matrix $Y$ to form a submatrix $\bar Y$; then set $\tilde\omega_m^\dagger = e - e\bar Y\bar Y^\ddagger$, where $e = \mathrm{randn}(1,N)$ and $\bar Y^\ddagger$, as defined before, denotes the pseudoinverse of $\bar Y$; finally, we normalize each row to unit $l_2$ norm [21].

Convergence behavior of AlgProposed

The convergence behavior of our proposed AlgProposed is shown in Figs. 2-3 in terms of the objective function $\mathcal{L}(\Omega_k, S_k)$ and the relative change of the iterates, defined as $\|\Omega_k - \Omega_{k-1}\|_F/\|\Omega_{k-1}\|_F$.

$^5$As noted above, $\Omega x_i$ has at least 21 zero components for all $x_i$. Thus, we can choose $\kappa$ such that $\kappa \le J - 21 = 29$. It is illustrated in Fig. 1 that the effective sparsities are concentrated near 21, and therefore we choose $\kappa = 21$.

Fig. 2. Evolution of the objective function $\mathcal{L}(\Omega_k, S_k)$ for $J = 50$, $N = 25$, $L = 50000$ (noise-free and noisy, $\rho_{snr} = 14$ dB).

Fig. 3. Evolution of the relative change of the iterates $\|\Omega_k - \Omega_{k-1}\|_F/\|\Omega_{k-1}\|_F$ for $J = 50$, $N = 25$, $L = 50000$ (noise-free and noisy, $\rho_{snr} = 14$ dB).

Comment 4.1:
• The results are self-explanatory. As expected, the objective function $\mathcal{L}(\Omega_k, S_k)$ decreases smoothly in both the noise-free and the noisy case. This is confirmed in Fig. 2, from which it is seen that the noisy case yields a larger $\mathcal{L}(\Omega_k, S_k)$ than the noise-free case because of its larger sparsification error $\|S_k - \Omega_k Y\|_F^2$;
• The convergence behavior of the iterates $\Omega_k$ is depicted in Fig. 3. One observes that the iterates of the proposed algorithm show a very good trend towards convergence in both cases.

Figs. 4 and 5 show the distribution of the singular values of the two analysis dictionaries $\Omega_{AKSVD}$ and $\Omega_{Proposed}$. As observed, the ratio between the maximal and the minimal singular values of $\Omega_{Proposed}$ is almost the same as that of $\Omega$ and smaller than that of $\Omega_{AKSVD}$. This implies that our proposed algorithm, as expected, yields a dictionary closer to a TF than $\Omega_{AKSVD}$, owing to the penalty term $\mathcal{F}(\tilde\Omega)$.

Performance evaluation of the obtained dictionaries

The averaged representation error is defined as $\|Y - X_k\|_F/\sqrt{LN}$ for the dictionary $\Omega_k$, where $X_k$ is the estimate of the clean signals, which is available within AlgAKSVD and can be obtained using the OBG for AlgProposed. The performance of a given dictionary for denoising is usually measured with the averaged denoising error, defined as $\|X_k - X\|_F/\sqrt{LN}$, where $X$ is the matrix of true (clean) signals.


Fig. 4. Distribution of the singular values of the reference dictionary $\Omega$, $\Omega_{AKSVD}$, and $\Omega_{Proposed}$ for the noise-free case.

Fig. 5. Distribution of the singular values of the reference dictionary $\Omega$, $\Omega_{AKSVD}$, and $\Omega_{Proposed}$ for the noisy case ($\rho_{snr} = 14$ dB).

Figs. 6 and 7 show a comparison of the two algorithms in terms of the averaged representation error and the averaged denoising error, respectively.$^6$

Fig. 6. Evolution of the averaged representation error $\|Y - X_k\|_F/\sqrt{NL}$ versus iterations with $J = 50$, $N = 25$, $L = 50000$ (AlgAKSVD and AlgProposed, noise-free and noisy cases).

Now, let us look at how close $\Omega_{Proposed}$ and $\Omega_{AKSVD}$ are to the reference dictionary $\Omega$. We say a row $\omega_m^\dagger$ of the true analysis dictionary $\Omega$ is recovered in a learnt dictionary $\hat\Omega$ if
$$\min_n\big(1 - |\hat\omega_n^\dagger\,\omega_m^{\dagger T}|\big) < 0.01 \qquad (29)$$
where $\hat\omega_n^\dagger$ denotes the $n$th row of $\hat\Omega$.

$^6$In Fig. 7, the plots for the noise-free case are not presented as they are the same as those shown in Fig. 6 due to the fact that $Y = X$.

Fig. 7. Evolution of the averaged denoising error $\|X_k - X\|_F/\sqrt{LN}$ versus iterations with $J = 50$, $N = 25$, $L = 50000$ (AlgAKSVD and AlgProposed, noisy case $\rho_{snr} = 14$ dB).

Denote by $J_k$ the number of recovered rows in $\Omega_k$. The percentage of recovered rows in $\Omega_k$, denoted $\rho_{prr}(k)$, is calculated as $\rho_{prr}(k) \triangleq J_k/J$. This is one of the performance indicators used in [21] for evaluating a learnt dictionary.
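The recovery test (29) and $\rho_{prr}(k)$ can be evaluated as in the following sketch (an illustration under the assumption that both dictionaries have unit-norm rows, as enforced by Step III):

```python
import numpy as np

def percent_recovered_rows(Omega_true, Omega_learnt, tol=0.01):
    """Fraction of rows of Omega_true matched (up to sign) by some row of Omega_learnt, cf. (29)."""
    # |cosine| similarity between every learnt row and every true row (rows assumed l2-normalized)
    C = np.abs(Omega_learnt @ Omega_true.T)
    recovered = (1.0 - C.max(axis=0)) < tol        # best match for each true row
    return recovered.mean()
```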

With the signals generated above and the same setup as that used in [21], except $N_{ite} = 200$ instead of 100, we run each of AlgProposed and AlgAKSVD and obtain a sequence of $N_{ite}$ analysis dictionaries $\Omega_k$. Fig. 8 shows the evolution of the percentage of recovered rows for the two algorithms.

Fig. 8. Evolution of the percentage of recovered rows $\rho_{prr}(k)$ versus iterations with $J = 50$, $N = 25$, $L = 50000$ (AlgAKSVD and AlgProposed, noise-free and noisy cases).

Fig. 9. Evolution of the averaged representation error $\|Y - X_k\|_F/\sqrt{NL}$ versus iterations with $J = 40$, $N = 20$, $L = 10000$.


Fig. 10. Evolution of the averaged denoising error $\|X_k - X\|_F/\sqrt{LN}$ versus iterations with $J = 40$, $N = 20$, $L = 10000$.

Fig. 11. Evolution of the percentage of recovered rows $\rho_{prr}(k)$ versus iterations with $J = 40$, $N = 20$, $L = 10000$.

Comment 4.2:
• It is seen that our proposed algorithm, perhaps surprisingly, outperforms AlgAKSVD in terms of the averaged representation error, the averaged denoising error, and the percentage of recovered rows in both cases;
• According to Eqn. (6) of [21], the averaged denoising error in the oracle setup (that is, the true dictionary and all the co-supports are known) is given by $\sqrt{\frac{r}{N}}\,\sigma$, which is equal to $0.4\sigma$ for this example. As observed from Fig. 7, the averaged denoising error of both algorithms seems to converge to a value close to $\sigma = 0.04$. This indicates that the two algorithms yield local minima of the problem, though ones very close to the reference.

Figs. 9-11, as another example, display the results for the case of $N = 20$, $J = 40$, $L = 10000$ with the vectors $x_i$ living in a 3-dimensional subspace, leading to $\kappa = N - 3 = 17$. Here, we set $\lambda = 200$ and $\nu = 0.01$ for our proposed algorithm. The results are self-explanatory.

The most significant improvement of our proposed algorithm over AlgAKSVD is the computational efficiency. This can be seen from Fig. 12, which shows the relationship between the time (in seconds) used and the number of training samples $L$ for the setting $N = 25$, $J = 50$ and $N_{ite} = 200$. As an example, one notes that for $L = 50000$, AlgAKSVD requires about $10^5$ seconds, while our proposed algorithm needs less than $3\times 10^2$ seconds.

Fig. 12. Relationship between the time (in seconds) consumed and the number of training samples $L$ (noise-free case, AlgAKSVD vs. AlgProposed).

B. Experiments on image denoising

Now, let us consider image denoising for five widely used images: 'Lena', 'Barbara', 'Boats', 'House', and 'Peppers'.

For a given noisy image with noise level $\sigma \in \{5, 10, 15, 20\}$ (which means each pixel is contaminated by i.i.d. zero-mean white Gaussian noise of variance $\sigma^2$), we generate a training set consisting of $L = 20{,}000$ overlapping $8\times 8$ patch examples. Then, we run AlgProposed and AlgAKSVD with $\lambda = 100$, $\nu = 0.01$, $r = 7$, $\kappa = 30$ and $N_{ite} = 20$ to learn analysis dictionaries of size $J\times N = 128\times 64$.

We apply the learnt dictionaries to patch-based image denoising. Each noisy patch is denoised using error-based OBG for sparse coding, the same as in [21]. The denoising results in terms of image PSNR (in dB) are given in Table I.

TABLE I
DENOISING PERFORMANCE (IMAGE PSNR IN dB) EVALUATED WITH FIVE DIFFERENT IMAGES AT DIFFERENT NOISE LEVELS $\sigma$ FOR AlgProposed AND AlgAKSVD

σ    Algorithm     Lena    Barbara  Boats   House   Peppers
5    AlgAKSVD      37.32   36.88    36.60   38.04   36.89
     AlgProposed   37.54   37.11    36.65   38.11   36.98
10   AlgAKSVD      33.35   32.00    32.07   33.89   32.11
     AlgProposed   33.46   32.85    32.07   33.88   32.29
15   AlgAKSVD      31.05   29.22    29.62   31.35   29.53
     AlgProposed   31.30   29.73    29.91   31.60   29.94
20   AlgAKSVD      29.45   27.30    27.92   29.58   27.79
     AlgProposed   29.49   28.11    28.54   29.64   28.18

It is observed that the denoising performance of AlgProposed is in general better than that of AlgAKSVD in terms of PSNR. Fig. 13 shows the visual effect of the denoising results on the image 'Barbara' with a noise level of $\sigma = 10$.

V. CONCLUSIONS

In this paper, we have investigated the problem of overcomplete transform learning for the analysis model. An alternating minimization-based iterative procedure has been proposed to solve the formulated optimal transform design problem. A closed-form solution has been derived for updating the transform. The superiority of the proposed approach in terms of the averaged representation and denoising errors, the percentage of recovered rows, and, more significantly, the computational efficiency has been demonstrated with a series of experiments using both synthetic and real data.


Fig. 13. 'Barbara': (a) the original; (b) noisy image with $\sigma = 10$ (PSNR = 28.13 dB); (c) denoised image using $\Omega_{AKSVD}$ (PSNR = 32.00 dB); (d) denoised image using $\Omega_{Proposed}$ (PSNR = 32.85 dB).

The analysis model and the transform model make the dictionary and transform learning problems admit multiple solutions. As observed, both AK-SVD and our proposed algorithm demonstrate good convergence to a local minimum. Deeper studies along the lines of [28], [29] are needed to provide theoretical guarantees of convergence to a global minimum and to develop high-performance algorithms for sparsifying dictionary and transform design. Furthermore, it is expected that the performance of the analysis K-SVD algorithm in [21] can be improved by embedding the proposed dictionary update. Investigations along these lines are ongoing.

Appendix A - Proof of Theorem 1

Proof: According to Laplace's formula,
$$\det(Q) = \sum_{j=1}^{N} q_{ij}(-1)^{i+j}M_{ij},\quad \forall i \in \{1, 2, \cdots, N\}$$
where $M_{ij}$ is the $(i,j)$th minor of the matrix $Q$, i.e., the determinant of the $(N-1)\times(N-1)$ matrix that results from $Q$ by removing the $i$th row and the $j$th column, while $(-1)^{i+j}M_{ij}$ is known as the $(i,j)$th co-factor of $Q$. Clearly, $\frac{\partial\det(Q)}{\partial q_{kl}} = (-1)^{k+l}M_{kl}$ and hence
$$\frac{\partial\log\det(Q)}{\partial q_{kl}} = \frac{(-1)^{k+l}M_{kl}}{\det(Q)},\quad \forall k, l$$
as long as $\det(Q) > 0$.

Therefore, the gradient of $f(Q) \triangleq \log\det(Q)$ with respect to $Q$ is given by
$$\nabla_Q f(Q) = \frac{\bar Q^T}{\det(Q)} = Q^{-T}$$
where $\bar Q$ is the adjugate matrix of $Q$, whose $(i,j)$th element is the $(j,i)$th co-factor of $Q$ and which satisfies $\bar Q Q = \det(Q)I_N$.

It can then be shown that for $g(\tilde W) \triangleq f(Q)$ with $Q = \tilde W^T\tilde W$,
$$\nabla_{\tilde W}\,g(\tilde W) = \tilde W\big[\nabla_Q f(Q) + (\nabla_Q f(Q))^T\big] = 2\tilde W(\tilde W^T\tilde W)^{-1}$$
and hence
$$\frac{d\mathcal{L}_1(\tilde W)}{d\tilde W} = -2M + 2\tilde W Z - 2\lambda\tilde W(\tilde W^T\tilde W)^{-1}$$
The optimal $W$ should satisfy $\frac{d\mathcal{L}_1(\tilde W)}{d\tilde W}\big|_{\tilde W = W} = 0$, which leads to $WZW^TW - MW^TW - \lambda W = 0$. Equivalently,
$$[\bar W\bar W^T - \bar M\bar W^T - \lambda I_J]\bar W = 0$$
where $\bar M \triangleq MZ_{sqrt}^{-1}$ and $\bar W \triangleq WZ_{sqrt}$, with $Z_{sqrt}$ the square-root matrix of $Z$ defined before. Let

$$\bar W = U_w\begin{bmatrix}\Sigma^2\\ 0\end{bmatrix}V_w^T,\qquad \bar M = U_0\begin{bmatrix}\Pi^2\\ 0\end{bmatrix}V_0^T$$
be SVDs of $\bar W$ and $\bar M$, respectively. The above equation becomes
$$\left(\begin{bmatrix}\Sigma^4 & 0\\ 0 & 0\end{bmatrix} - U_w^T U_0\begin{bmatrix}\Pi^2\\ 0\end{bmatrix}V_0^T V_w\begin{bmatrix}\Sigma^2 & 0\end{bmatrix} - \lambda I_J\right)\begin{bmatrix}\Sigma^2\\ 0\end{bmatrix} = 0$$
A solution $(U_w, V_w, \Sigma^2)$ to the above is
$$V_w = V_0,\quad U_w = U_0,\quad \Sigma^2 = \mathrm{diag}(\sigma_1^2,\cdots,\sigma_k^2,\cdots,\sigma_N^2)$$
with $\sigma_k^2 > 0$ constrained by
$$\sigma_k^4 - \pi_k^2\sigma_k^2 - \lambda = 0,\quad \forall k$$
which yields (25), and hence (24) follows. This completes the proof.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work was supported by NSFC Grants 61273195, 61304124, 61473262, and 61503339.

REFERENCES

[1] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, 2010.
[2] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthonormal) dictionaries via l1 minimization," Proc. Nat. Acad. Sci., vol. 100, no. 5, pp. 2197-2202, Mar. 2003.
[3] I. Tosic and P. Frossard, "Dictionary learning: what is the right representation for my signal?" IEEE Signal Process. Mag., pp. 27-38, Mar. 2011.


[4] D. L. Donoho, "Compressed sensing," IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289-1306, Sept. 2006.
[5] E. J. Candes, J. Romberg, and T. Tao, "Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information," IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489-509, Feb. 2006.
[6] E. J. Candes and M. B. Wakin, "An introduction to compressive sampling," IEEE Signal Process. Mag., vol. 25, no. 2, pp. 21-30, Mar. 2008.
[7] K. Engan, S. O. Aase, and J. H. Hakon-Housoy, "Method of optimal directions for frame design," in IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 5, pp. 2443-2446, 1999.
[8] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311-4322, 2006.
[9] L. Zelnik-Manor, K. Rosenblum, and Y. C. Eldar, "Dictionary optimization for block-sparse representations," IEEE Trans. Signal Process., vol. 60, pp. 2386-2395, 2012.
[10] J. A. Tropp, I. S. Dhillon, R. W. Heath Jr., and T. Strohmer, "Designing structured tight frames via alternating projection," IEEE Trans. Inf. Theory, vol. 51, no. 1, pp. 188-209, 2005.
[11] I. S. Dhillon, R. W. Heath Jr., T. Strohmer, and J. A. Tropp, "Constructing packings in Grassmannian manifolds via alternating projection," Experimental Mathematics, vol. 17, no. 1, pp. 9-35, 2008.
[12] M. Yaghoobi, L. Daudet, and M. E. Davies, "Parametric dictionary design for sparse coding," IEEE Trans. Signal Process., vol. 57, no. 12, pp. 4800-4810, 2009.
[13] J. Tropp, "Greed is good: algorithmic results for sparse approximation," IEEE Trans. Inf. Theory, vol. 50, no. 10, pp. 2231-2242, Oct. 2004.
[14] W. Dai and O. Milenkovic, "Subspace pursuit for compressive sensing signal reconstruction," IEEE Trans. Inf. Theory, vol. 55, no. 5, pp. 2230-2249, 2009.
[15] J. A. Tropp and S. J. Wright, "Computational methods for sparse solution of linear inverse problems," Proc. IEEE, vol. 98, no. 6, pp. 948-958, Jun. 2010.
[16] B. Ophir, M. Elad, N. Bertin, and M. D. Plumbley, "Sequential minimal eigenvalues - an approach to analysis dictionary learning," in Proc. EUSIPCO, Barcelona, Spain, Sep. 2011.
[17] M. Yaghoobi, S. Nam, R. Gribonval, and M. E. Davies, "Analysis operator learning for overcomplete cosparse representations," in Proc. EUSIPCO, Barcelona, Spain, Sep. 2011.
[18] M. Yaghoobi, S. Nam, R. Gribonval, and M. E. Davies, "Noise aware analysis operator learning for approximately cosparse signals," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 5409-5412, 2012.
[19] M. Yaghoobi, S. Nam, R. Gribonval, and M. E. Davies, "Constrained overcomplete analysis operator learning for cosparse signal modeling," IEEE Trans. Signal Process., vol. 61, no. 9, pp. 2341-2354, May 2013.
[20] R. Rubinstein, T. Faktor, and M. Elad, "K-SVD dictionary-learning for the analysis sparse model," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 5405-5408, 2012.
[21] R. Rubinstein, T. Peleg, and M. Elad, "Analysis K-SVD: a dictionary-learning algorithm for the analysis sparse model," IEEE Trans. Signal Process., vol. 61, no. 3, pp. 661-677, Feb. 2013.
[22] S. Ravishankar and Y. Bresler, "Learning sparsifying transforms," IEEE Trans. Signal Process., vol. 61, no. 5, pp. 1072-1086, Mar. 2013.
[23] S. Ravishankar and Y. Bresler, "Closed-form solutions within sparsifying transform learning," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 5378-5382, 2013.
[24] S. Ravishankar and Y. Bresler, "Learning overcomplete sparsifying transforms for signal processing," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 3088-3092, 2013.
[25] J. Cai, H. Ji, Z. Shen, and G. Ye, "Data-driven tight frame construction and image denoising," Appl. Comput. Harmon. Anal., vol. 37, no. 1, pp. 89-105, 2014.
[26] L. Shen, M. Papadakis, I. Kakadiaris, I. Konstantini, D. Kouri, and D. Hoffman, "Image denoising using a tight frame," IEEE Trans. Image Process., vol. 15, no. 5, pp. 1254-1263, 2006.
[27] J. A. Fill and S. Janson, "Quicksort asymptotics," J. Algorithms, vol. 44, no. 1, pp. 4-28, 2002.
[28] K. Schnass, "On the identifiability of overcomplete dictionaries via the minimisation principle underlying K-SVD," Appl. Comput. Harmon. Anal., vol. 37, pp. 464-491, 2014.
[29] R. Gribonval and K. Schnass, "Dictionary identification - sparse matrix-factorization via l1-minimization," IEEE Trans. Inf. Theory, vol. 56, no. 7, pp. 3523-3539, 2010.