IEEE TRANSACTIONS ON INFORMATION THEORY
A Unified Formulation of Gaussian Versus Sparse Stochastic Processes—Part I: Continuous-Domain Theory

Michael Unser, Fellow, IEEE, Pouya D. Tafti, and Qiyu Sun
Abstract—We introduce a general distributional framework that results in a unifying description and characterization of a rich variety of continuous-time stochastic processes. The cornerstone of our approach is an innovation model that is driven by some generalized white noise process, which may be Gaussian or not (e.g., Laplace, impulsive Poisson, or alpha-stable). This allows for a conceptual decoupling between the correlation properties of the process, which are imposed by the whitening operator L, and its sparsity pattern, which is determined by the type of noise excitation. The latter is fully specified by a Lévy measure. We show that the range of admissible innovation behavior varies between the purely Gaussian and super-sparse extremes. We prove that the corresponding generalized stochastic processes are well-defined mathematically provided that the (adjoint) inverse of the whitening operator satisfies some Lp bound for p ≥ 1. We present a novel operator-based method that yields an explicit characterization of all Lévy-driven processes that are solutions of constant-coefficient stochastic differential equations. When the underlying system is stable, we recover the family of stationary CARMA processes, including the Gaussian ones. The approach remains valid when the system is unstable and leads to the identification of potentially useful generalizations of the Lévy processes, which are sparse and non-stationary. Finally, we show that these processes admit a sparse representation in some matched wavelet domain and provide a full characterization of their transform-domain statistics.

Index Terms—XXXXX.
I. INTRODUCTION

In recent years, the research focus in signal processing has shifted away from the classical linear paradigm, which is intimately linked with the theory of stationary Gaussian processes [1], [2]. Instead of considering Fourier transforms and performing quadratic optimization, researchers are presently favoring wavelet-like representations and have adopted sparsity as a design paradigm [3]–[8]. The property that a signal admits a sparse expansion can be exploited elegantly
Manuscript received September 21, 2012; revised October 7, 2013; accepted December 13, 2013. This work was supported in part by the Swiss National Science Foundation under Grant 200020-144355, in part by the European Commission under Grant ERC-2010-AdG-267439-FUNSP, and in part by the National Science Foundation under Grant DMS 1109063.
M. Unser and P. D. Tafti are with the Biomedical Imaging Group, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland (e-mail: [email protected]; [email protected]).
Q. Sun is with the Department of Mathematics, University of Central Florida, Orlando, FL 32816 USA (e-mail: [email protected]).
Communicated by V. Borkar, Associate Editor for Communication Networks.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIT.2014.2298453
for compressive sensing, which is presently a very active area of research (cf. the special issue of the Proceedings of the IEEE [9], [10]). The concept is equally helpful for solving inverse problems and has resulted in significant algorithmic advances for the efficient resolution of large-scale ℓ1-norm minimization problems [11]–[13].
The current formulations of compressed sensing and sparse signal recovery are fundamentally deterministic. Also, they are predominantly discrete and based on finite-dimensional mathematics, with the notable exception of the works of Eldar [14], Adcock and Hansen [15]. By drawing on the analogy with the classical theory of signal processing, it is likely that further progress may be achieved by adopting a statistical (or estimation-theoretic) point of view for the description of sparse signals in the analog domain. This stands as our primary motivation for the investigation of the present class of continuous-time stochastic processes, the greater part of which is sparse by construction. These processes are specified as a superset of the Gaussian ones, which is essential for maintaining backward compatibility with traditional statistical signal processing.
The present construction is a generalization of a classical idea in communication theory and signal processing, which is to view a stochastic process as a filtered version of a white noise (a.k.a. innovation) [16]. The fundamental aspect here is that the modeling is done in the continuous domain, which, as we shall see, imposes strong constraints on the class of admissible innovations; that is, the generalized white noise that constitutes the input of the innovation model. The second ingredient is a powerful operational calculus (the generalization of the idea of filtering) for solving stochastic differential equations (SDE), including unstable ones, which is essential for inducing interesting (non-stationary) behaviors such as self-similarity. The combination of these ingredients results in the specification of an extended class of stochastic processes that are either Gaussian or sparse, at the exclusion of any other type. The proposed theory has a unifying character in that it connects a number of contemporary topics in signal processing, statistics and approximation theory:
- sparsity (in relation to compressed sensing) [3], [4]
- signals with a finite rate of innovation [17], [18]
- the classical theory of Gaussian stationary processes [1], [16]
- non-Gaussian continuous-domain modeling of signals
- splines, wavelets and linear system theory [5], [23].
Most importantly, it explains why certain classes of processes admit a sparse representation in a matched wavelet-like basis (see the introductory example in Section II where the Haar transform outperforms the classical Karhunen-Loève transform). Since these models are the natural functional extension of the Gaussian stationary processes, they may stimulate the development of novel algorithms for statistical signal processing. This has already been demonstrated in the context of biomedical image reconstruction [24], the derivation of statistical priors for discrete-domain signal representation [25], optimal signal denoising [26], and MMSE interpolation [27].
Because the proposed model is intrinsically linear, we have adopted a formulation that relies on generalized functions, rather than the traditional mathematical concepts (random measures and Itô integrals) from the theory of stochastic differential equations [21], [22], [28]. We are then taking advantage of the theory of generalized stochastic processes of Gelfand (arguably, the second most famous Soviet mathematician after Kolmogorov) and some powerful tools of functional analysis (the Minlos-Bochner theorem) [29] that are not widely known to engineers or statisticians. While this may look like an unnecessary abstraction at first sight, it is very much in line with the intuition of an engineer who prefers to work with analog filters and convolution operators rather than with stochastic integrals. We are then able to use the whole machinery of linear system theory and the power of the characteristic functional to derive the statistics of the signal in any (linearly) transformed domain.
The paper is organized as follows. The basic flavor of the innovation model is conveyed in Section II by focusing on a first-order differential system, which results in the generation of Gaussian and non-Gaussian AR(1) stochastic processes. We use this model to illustrate that a properly-matched wavelet transform can outperform the classical Karhunen-Loève transform (or the DCT) for the compression of (non-Gaussian) signals. In Section III, we review the foundations of Gelfand's theory of generalized stochastic processes. In particular, we characterize the complete class of admissible continuous-time white noise processes (innovations) and give some arguments as to why the non-Gaussian brands are inherently sparse. In Section IV, we give a high-level description of the general innovation model and provide a novel operator-based method for the solution of SDE. In Section V, we make use of Gelfand's formalism to fully characterize our extended class of (non-Gaussian) stochastic processes including the special cases of CARMA and Nth-order generalized Lévy processes. We also derive the statistics of the wavelet-domain representation of these signals, which allows for a common (stationary) treatment of the two latter classes of processes, irrespective of any stability consideration. Finally, in Section VI, we turn back to our introductory example by moving into the unstable regime (single pole at the origin), which yields a non-conventional system-theoretic interpretation of classical Lévy processes [28], [30], [31]. We also point out the structural similarity between the increments of Lévy processes and their Haar wavelet coefficients. For higher-order illustrations of sparse processes, we refer to our companion paper [32], which is specifically devoted to the study of the discrete-time implications of the theory and the way to best decouple (e.g., "sparsify") such processes. The notation, which is common to both papers, is summarized in [32, Table II].
II. MOTIVATION: GAUSSIAN VS. NON-GAUSSIAN AR(1) PROCESSES
A continuous-time Gaussian AR(1) (or Gauss-Markov) process can be formally generated by applying a first-order analog filter to a Gaussian white noise process w:

sα(t) = (ρα ∗ w)(t)    (1)

where ρα(t) = 𝟙+(t)e^{αt} with Re(α) < 0 and 𝟙+(t) is the unit-step function. Next, we observe that ρα = (D − αId)^{−1}δ where δ is the Dirac impulse and where D = d/dt and Id are the derivative and identity operators, respectively. These operators as well as the inverse are to be interpreted in the distributional sense (see Section III-A). This suggests that sα satisfies the "innovation" model (cf. [1], [16])

(D − αId)sα(t) = w(t),    (2)

or, equivalently, the stochastic differential equation (cf. [22])

dsα(t) − αsα(t)dt = dW(t),

where W(t) = ∫_0^t w(τ)dτ is a standard Brownian motion (or Wiener process) excitation. In the statistical literature, the solution of the above first-order SDE is often called the Ornstein-Uhlenbeck process.
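The innovation model (1)–(2) can be illustrated numerically. The sketch below is not part of the paper's formal development: it simulates a sampled path through the first-order recursion s[k] = e^{αΔ}s[k−1] + u[k], where the step size Δ = 1, the unit-variance Gaussian driving sequence, and the Cauchy variant (an SαS example with α = 1) are all illustrative choices rather than quantities fixed by the text.

```python
import numpy as np

def ar1_path(alpha, n, innovation, dt=1.0, seed=None):
    """Simulate a sampled first-order innovation model via the recursion
    s[k] = exp(alpha*dt) * s[k-1] + u[k], with u[k] an i.i.d. sequence."""
    rng = np.random.default_rng(seed)
    r = np.exp(alpha * dt)
    u = innovation(rng, n)          # i.i.d. driving (innovation) sequence
    s = np.zeros(n)
    for k in range(1, n):
        s[k] = r * s[k - 1] + u[k]
    return s

gauss = lambda rng, n: rng.standard_normal(n)
cauchy = lambda rng, n: rng.standard_cauchy(n)   # heavy-tailed, "sparse" excitation

s_g = ar1_path(-0.1, 100_000, gauss, seed=0)     # Gaussian AR(1) path
s_c = ar1_path(-0.1, 100_000, cauchy, seed=0)    # sparse (Cauchy-driven) path

# lag-1 correlation of the Gaussian path approaches e^{-0.1}
rho = np.corrcoef(s_g[:-1], s_g[1:])[0, 1]
```

The Cauchy-driven path visibly exhibits rare large excursions, which is the qualitative signature of sparsity discussed in the sequel.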
Let (sα[k] = sα(t)|_{t=k})_{k∈Z} denote the sampled version of the continuous-time process. Then, one can show that sα[·] is a discrete AR(1) autoregressive process that can be whitened by applying the first-order linear predictor

sα[k] − e^α sα[k − 1] = u[k]    (3)

where u[·] (prediction error) is an i.i.d. Gaussian sequence. Alternatively, one can decorrelate the signal by computing its discrete cosine transform (DCT), which is known to be asymptotically equivalent to the Karhunen-Loève transform (KLT) of the process [33], [34]. Eq. (3) provides the basis for classical linear predictive coding (LPC), while the decorrelation property of the DCT is often invoked to justify the popular JPEG transform-domain coding scheme [35].
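As a sanity check (again, an illustration rather than material from the paper), the predictor (3) can be applied to a simulated sampled path: the prediction error u[k] = sα[k] − e^α sα[k−1] should come out uncorrelated. The unit sampling step and unit-variance driving noise are assumed for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n = -0.1, 200_000
r = np.exp(alpha)                       # one-step correlation e^alpha

# sampled Gaussian AR(1) process with unit-variance driving noise
s = np.zeros(n)
for k in range(1, n):
    s[k] = r * s[k - 1] + rng.standard_normal()

# first-order linear prediction error, Eq. (3)
u = s[1:] - r * s[:-1]

lag1 = np.corrcoef(u[:-1], u[1:])[0, 1]   # close to 0: u[.] is white
```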
In this paper, we are concerned with the non-Gaussian counterpart of this story, which, as we shall see, will result in the identification of sparse processes. The idea is to retain the simplicity of the classical innovation model, while substituting the continuous-time Gaussian noise with some generalized Lévy innovation (to be properly defined in the sequel). This translates into Eqs. (1)–(3) remaining valid, except that the underlying random variates are no longer Gaussian. The more significant finding is that the KLT (or its discrete approximation by the DCT) is no longer optimal for producing the best M-term approximation of the signal. This is illustrated in
Fig. 1. Wavelets vs. KLT (or DCT) for the M-term approximation of Gaussian vs. sparse AR(1) processes with α = −0.1: (a) classical Gaussian scenario, (b) sparse scenario with symmetric Cauchy innovations. The E-spline wavelets are matched to the innovation model. The displayed results (relative quadratic error as a function of M/N) are averages over 1000 realizations for AR(1) signals of length N = 1024; the performance of DCT and KLT is indistinguishable.
Fig. 1, which compares the performance of various transforms for the compression of two kinds of AR(1) processes with correlation e^{−0.1} ≈ 0.90: Gaussian vs. sparse, where the latter innovation follows a Cauchy distribution. The key observation is that the E-spline wavelet transform, which is matched to the operator L = D − αId, provides the best results in the non-Gaussian scenario over the whole range of experimentation [cf. Fig. 1(b)], while the outcome in the Gaussian case is as predicted by the classical theory with the KLT being superior. Examples of orthogonal E-spline wavelets at two successive scales are shown in Fig. 2 next to their Haar counterparts. We selected the E-spline wavelets because of their ability to decouple the process, which follows from their operator-like behavior: ψi = L∗φi where i is the scale index and φi a suitable smoothing kernel [36, Theorem 2]. Unlike their conventional cousins, they are not dilated versions of each other, but rather extrapolations in the sense that the slope of the exponential segments remains the same at all scales. They can, however, be computed efficiently using a perfect-reconstruction filterbank with scale-dependent filters [36].
The equivalence with traditional wavelet analysis (Haar) and finite-differencing (as used in the computation of total variation) for signal "sparsification" is achieved by letting α → 0. The catch, however, is that the underlying system becomes unstable! Fortunately, the problem can be fixed, but it calls for an advanced mathematical treatment that is beyond the traditional formulation of stationary processes. The remainder of the paper is devoted to giving a proper sense to what has just been described informally, and to extending the approach to the whole class of ordinary differential operators, including the non-stable scenarios. The non-trivial outcome, as we shall see, is that many non-stable systems are linked with non-stationary stochastic processes. These, in
Fig. 2. Comparison of operator-like and conventional wavelet basis functions at two successive scales: (a) first-order E-spline wavelets with α = −0.5. (b) Haar wavelets. The vertical axis is rescaled for full-range display.
turn, can be stationarized and "sparsified" by application of a suitable wavelet transformation. The companion paper [32] is focused on the discrete aspects of the theory, including the generalization of (3) for decoupling purposes and the full characterization of the underlying processes.
III. MATHEMATICAL BACKGROUND

The purpose of this section is to introduce the distributional formalism that is required for the proper definition of the continuous-time white noise that is the driving term of (1) and its generalization. We start with a brief summary of some required notions in functional analysis, which also serves to set the notation. We then introduce the fundamental concept of the characteristic functional, which constitutes the foundation of Gelfand's theory of generalized stochastic processes. We proceed by giving the complete characterization of the possible types of continuous-domain white noises—not necessarily Gaussian—which will be used as universal input for our innovation models. We conclude the section by showing that the non-Gaussian brands of noises that are allowed by Gelfand's formulation are intrinsically sparse, a property that has not been emphasized before (to the best of our knowledge).
A. Functional and Distributional Context

The Lp-norm of a function f = f(t) is ‖f‖p = (∫_R |f(t)|^p dt)^{1/p} for 1 ≤ p < ∞ and ‖f‖∞ = ess sup_{t∈R} |f(t)| for p = +∞, with the corresponding Lebesgue space being denoted by Lp = Lp(R). The concept is extendable for characterizing the rate of decay of functions. To that end, we introduce the weighted Lp,α spaces with α ∈ R+:

Lp,α = {f ∈ Lp : ‖f‖p,α < +∞}

where the α-weighted Lp-norm of f is defined as

‖f‖p,α = ‖(1 + |·|^α) f(·)‖p.

Hence, the statement f ∈ L∞,α implies that f(t) decays at least as fast as 1/|t|^α as t tends to ±∞; more precisely, that |f(t)| ≤ ‖f‖∞,α / (1 + |t|^α) almost everywhere. In particular, this allows us to infer that L∞,1/p+ε ⊂ Lp for any ε > 0 and p ≥ 1. Another obvious inclusion is Lp,α ⊆ Lp,α0 for any
α ≥ α0. In the limit, we end up with the space of rapidly-decreasing functions R = {f : ‖f‖∞,m < +∞, ∀m ∈ Z+}, which is included in all the others.¹

We use ϕ = ϕ(t) to denote a generic function in Schwartz's class S of rapidly-decaying and infinitely-differentiable test functions. Specifically, Schwartz's space is defined as

S = {ϕ ∈ C∞ : ‖D^n ϕ‖∞,m < +∞, ∀m, n ∈ Z+},

with the operator notation D^n = d^n/dt^n and the convention that D^0 = Id (identity). S is a complete topological vector space with respect to the topology induced by the family of semi-norms ‖D^n · ‖∞,m with m, n ∈ Z+. Its topological dual is the space of tempered distributions S′; a distribution φ ∈ S′ is a continuous linear functional on S that is characterized by a duality product rule φ(ϕ) = ⟨φ, ϕ⟩ = ∫_R φ(t)ϕ(t)dt with ϕ ∈ S, where the right-hand-side expression has a literal interpretation as an integral only when φ(t) is a true function of t. The prototypical example of a tempered distribution is the Dirac distribution δ, which is defined as δ(ϕ) = ⟨δ, ϕ⟩ = ϕ(0). In the sequel, we will drop the explicit dependence of the distribution on the generic test function ϕ ∈ S and simply write φ, φ(·), or even φ(t) (with an abuse of notation) where t is the generic time index. For instance, we shall denote the shifted Dirac impulse² by δ(· − t0), or δ(t − t0), which is the conventional notation used by engineers.
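Returning to the weighted spaces introduced at the start of this subsection, the inclusion L∞,1/p+ε ⊂ Lp can be verified by a one-line estimate; the derivation below is supplied for convenience and is not part of the original text:

```latex
% f \in L_{\infty,\alpha} with \alpha = 1/p + \varepsilon gives the pointwise bound
% |f(t)| \le \|f\|_{\infty,\alpha}/(1+|t|^{\alpha}) a.e., hence
\|f\|_p^p \;=\; \int_{\mathbb{R}} |f(t)|^p \,\mathrm{d}t
\;\le\; \|f\|_{\infty,\alpha}^p \int_{\mathbb{R}} \frac{\mathrm{d}t}{\bigl(1+|t|^{\alpha}\bigr)^p}
\;<\; \infty,
% since the last integrand decays like |t|^{-\alpha p} with \alpha p = 1 + \varepsilon p > 1.
```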
Let T be a continuous³ linear operator that maps S into itself (or eventually some enlarged topological space such as Lp). It is then possible to extend the action of T over S′ (or an appropriate subset of it) based on the definition ⟨Tφ, ϕ⟩ = ⟨φ, T∗ϕ⟩ for φ ∈ S′, if T∗ is the adjoint of T, which maps ϕ to another test function T∗ϕ ∈ S continuously. An important example is the Fourier transform, whose classical definition is F{f}(ω) = f̂(ω) = ∫_R f(t)e^{−jωt} dt. Since F is an S-continuous operator, it is extendable to S′ based on the adjoint relation ⟨Fφ, ϕ⟩ = ⟨φ, Fϕ⟩ for all ϕ ∈ S (generalized Fourier transform).

A linear, shift-invariant (LSI) operator that is well-defined over S can always be written as a convolution product

T_LSI{ϕ} = h ∗ ϕ = ∫_R h(τ)ϕ(· − τ)dτ

where h = T_LSI{δ} is the impulse response of the system. The adjoint operator is the convolution with the time-reversed version of h:

h∨(t) ≜ h(−t).

The better-known categories of LSI operators are the BIBO-stable filters and the ordinary differential operators. While the latter are not BIBO-stable, they do work well with test functions.
¹The topology of R is defined by the family of semi-norms ‖·‖∞,m, m = 1, 2, 3, . . .
²The precise definition is ⟨δ(· − t0), ϕ⟩ = ϕ(t0) for all ϕ ∈ S.
³An operator T is continuous from a sequential topological vector space V into another one iff ϕk → ϕ in the topology of V implies that Tϕk → Tϕ in the topology (or norm) of the second space. If the two spaces coincide, we say that T is V-continuous.
1) Lp-Stable LSI Operators: The BIBO-stable filters correspond to the case where h ∈ L1, or, more generally, when h corresponds to a complex-valued Borel measure of bounded variation. The latter extension allows for discrete filters of the form hd = Σ_{n∈Z} d[n]δ(· − n) with (d[n]) ∈ ℓ1. We will refer to these filters as Lp-stable because they specify bounded operators in all the Lp spaces (by Young's inequality). Lp-stable convolution operators satisfy the properties of commutativity, associativity, and distributivity with respect to addition.
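The Lp-stability claim can be made concrete in the discrete setting, where Young's inequality reads ‖h ∗ x‖p ≤ ‖h‖1 ‖x‖p for h ∈ ℓ1 and x ∈ ℓp. The short numerical check below is illustrative only; the exponentially damped filter h and the choice of exponents are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
h = rng.standard_normal(50) * np.exp(-0.2 * np.arange(50))  # absolutely summable filter
x = rng.standard_normal(500)

y = np.convolve(h, x)                     # full discrete convolution (h * x)[n]

def lp_norm(v, p):
    """Discrete lp-norm of a finite sequence."""
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

# Young's inequality: ||h*x||_p <= ||h||_1 * ||x||_p for every p >= 1
bounds = {p: (lp_norm(y, p), np.sum(np.abs(h)) * lp_norm(x, p))
          for p in (1.0, 2.0, 4.0)}
```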
2) S-Continuous LSI Operators: For an Lp-stable filter to yield a Schwartz function as output, it is necessary that its impulse response (continuous or discrete) be rapidly-decaying. In fact, the condition h ∈ R (which is much stronger than integrability) ensures that the filter is S-continuous. The nth-order derivative D^n and its adjoint D^{n∗} = (−1)^n D^n are in the same category. The nth-order weak derivative of the tempered distribution φ is defined as D^n φ(ϕ) = ⟨D^n φ, ϕ⟩ = ⟨φ, D^{n∗}ϕ⟩ for any ϕ ∈ S. The latter operator—or, by extension, any polynomial of distributional derivatives P_N(D) = Σ_{n=1}^N a_n D^n with constant coefficients a_n ∈ C—maps S′ into itself. The class of these differential operators enjoys the same properties as its classical counterpart: shift-invariance, commutativity, associativity and distributivity.
B. Notion of Generalized Stochastic Process

Classically, a stochastic process is a random function s(t), t ∈ R, whose statistical description is provided by the probability law of its point values {s(t1), s(t2), . . . , s(tN)} for any finite sequence of time instants {tn}_{n=1}^N. The implicit assumption there is that one has a mechanism for probing the value of the function s at any time t ∈ R, which is only achievable approximately in the real physical world.

The leading idea in Gelfand and Vilenkin's theory of generalized stochastic processes is to replace the point measurements {s(tn)} by a series of scalar products {⟨s, ϕn⟩} with suitable "test" functions ϕ1, . . . , ϕN ∈ S [29]. The physical motivation that these authors give is that Xn = ⟨s, ϕn⟩ may represent the reading of a finite-resolution detector whose output is some "averaged" value ∫_R s(t)ϕn(t)dt, which is a more plausible form of probing than ideal sampling. The additional hypothesis is that the linear measurement X = ⟨s, ϕ⟩ depends continuously on ϕ and that the quantities Xn = ⟨s, ϕn⟩ obtained for different test functions {ϕn} are mutually compatible. Mathematically, this translates into defining a generalized stochastic process as a continuous linear random functional on some topological vector space such as S.
Let s be such a generalized process. We first observe that the scalar product X1 = ⟨s, ϕ1⟩ with a given test function ϕ1 is a conventional (scalar) random variable that is characterized by its probability density function (pdf) pX1(x1); the latter is in one-to-one correspondence (via the Fourier transform) with the characteristic function p̂X1(ω1) = E{e^{jω1X1}} = ∫_R e^{jω1x1} pX1(x1)dx1 = E{e^{j⟨s,ω1ϕ1⟩}}, where E{·} is the expectation operator. The same applies for the 2nd-order pdf pX1,X2(x1, x2) associated with a pair of test functions ϕ1 and ϕ2, which is the inverse Fourier transform of the 2-D characteristic function p̂X1,X2(ω1, ω2) = E{e^{j⟨s,ω1ϕ1+ω2ϕ2⟩}}, and so forth if one wants to specify higher-order dependencies.
The foundation for the theory of generalized stochastic processes is that one can deduce the complete statistical information about the process from the knowledge of its characteristic form

P̂s(ϕ) = E{e^{j⟨s,ϕ⟩}}    (4)

which is a continuous, positive-definite functional over S such that P̂s(0) = 1. Since the variable ϕ in P̂s(ϕ) is completely generic, it provides the equivalent of an infinite-dimensional generalization of the characteristic function. Indeed, any finite-dimensional version can be recovered by direct substitution of ϕ = ω1ϕ1 + · · · + ωNϕN in P̂s(ϕ), where the ϕn are fixed and where ω = (ω1, . . . , ωN) takes the role of the N-dimensional Fourier variable.

In fact, Gelfand's theory rests upon the principle that specifying an admissible functional P̂s(ϕ) is equivalent to defining the underlying generalized stochastic process (Bochner-Minlos theorem). To explain this remarkable result, we start by recalling the fundamental notion of positive-definiteness for univariate functions [37].
Definition 1: A complex-valued function f of the real variable ω is said to be positive-definite iff

Σ_{m=1}^N Σ_{n=1}^N f(ωm − ωn) ξm ξ*n ≥ 0

for every possible choice of ω1, . . . , ωN ∈ R, ξ1, . . . , ξN ∈ C and N ∈ Z+.

This is equivalent to the requirement that the N × N matrix F whose elements are given by [F]mn = f(ωm − ωn) is positive semi-definite (that is, non-negative definite) for all N, no matter how the ωn's are chosen.
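The matrix restatement of Definition 1 can be probed numerically. As an illustration (not taken from the paper), we pick p(ω) = e^{−ω²/2}, the characteristic function of a standard Gaussian, build the matrix [F]mn = p(ωm − ωn) on an arbitrary frequency grid, and confirm that its spectrum is non-negative; the grid and tolerance are illustrative choices.

```python
import numpy as np

# p is the characteristic function of a standard Gaussian and is therefore
# positive-definite (Bochner's theorem).
def p(w):
    return np.exp(-w ** 2 / 2)

omega = np.linspace(-3.0, 3.0, 25)            # arbitrary frequencies w_1, ..., w_N
F = p(omega[:, None] - omega[None, :])        # [F]_{mn} = p(w_m - w_n)

eigs = np.linalg.eigvalsh(F)                  # F is real symmetric here
min_eig = eigs.min()                          # >= 0 up to round-off
```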
Bochner's theorem states that a bounded, continuous function p is positive-definite if and only if it is the Fourier transform of a positive and finite Borel measure P:

p(ω) = ∫_R e^{jωx} P(dx).

In particular, Bochner's theorem implies that a function p̂X(ω) is a valid characteristic function—that is, p̂X(ω) = E{e^{jωX}} = ∫_R e^{jωx} PX(dx) = ∫_R e^{jωx} pX(x)dx, where X is a random variable with probability measure PX (or pdf pX)—iff p̂X is continuous, positive-definite and such that p̂X(0) = 1.

The power of functional analysis is that these concepts carry over to functionals on some abstract nuclear space X, the prime example being Schwartz's class S of smooth and rapidly-decreasing test functions [29].
Definition 2: A complex-valued functional F(ϕ) defined over the function space X is said to be positive-definite iff

Σ_{m=1}^N Σ_{n=1}^N F(ϕm − ϕn) ξm ξ*n ≥ 0

for every possible choice of ϕ1, . . . , ϕN ∈ X, ξ1, . . . , ξN ∈ C and N ∈ Z+.
Definition 3: A functional F : X → R (or C) is said to be continuous (with respect to the topology of the function space X) if, for any convergent sequence (ϕi) in X with limit ϕ ∈ X, the sequence F(ϕi) converges to F(ϕ); that is,

lim_i F(ϕi) = F(lim_i ϕi).
Theorem 1 (Minlos-Bochner): Given a functional P̂s(ϕ) on a nuclear space X that is continuous, positive-definite and such that P̂s(0) = 1, there exists a unique probability measure 𝒫s on the dual space X′ such that

P̂s(ϕ) = E{e^{j⟨s,ϕ⟩}} = ∫_{X′} e^{j⟨s,ϕ⟩} 𝒫s(ds),

where ⟨s, ϕ⟩ is the dual pairing map. One further has the guarantee that all finite-dimensional probability measures derived from P̂s(ϕ) by setting ϕ = ω1ϕ1 + · · · + ωNϕN are mutually compatible.

The characteristic form therefore uniquely specifies the generalized stochastic process s = s(ϕ) (via the infinite-dimensional probability measure 𝒫s) in essentially the same way as the characteristic function fully determines the probability measure of a scalar or multivariate random variable.
C. White Noise Processes (Innovations)

We define a white noise w as a generalized random process that is stationary and whose measurements for non-overlapping test functions are independent. A remarkable aspect of the theory of generalized stochastic processes is that it is possible to deduce the complete class of such noises based on functional considerations only [29]. To that end, Gelfand and Vilenkin consider the generic class of functionals of the form

P̂w(ϕ) = exp(∫_R f(ϕ(t)) dt)    (5)

where f is a continuous function on the real line and ϕ is a test function from some suitable space. This functional specifies an independent noise process if P̂w is continuous and positive-definite and P̂w(ϕ1 + ϕ2) = P̂w(ϕ1)P̂w(ϕ2) whenever ϕ1 and ϕ2 have non-overlapping support. The latter property is equivalent to having f(0) = 0 in (5). Gelfand and Vilenkin then go on to prove that the complete class of functionals of the form (5) with the required mathematical properties (continuity, positive-definiteness and factorizability) is obtained by choosing f to be a Lévy exponent, as defined below.
Definition 4: A complex-valued continuous function f(ω) is a valid Lévy exponent if and only if f(0) = 0 and gτ(ω) = e^{τf(ω)} is a positive-definite function of ω for all τ ∈ R+.

In doing so, they actually establish a one-to-one correspondence between the characteristic forms of independent noise processes (5) and the family of infinitely divisible laws whose characteristic function takes the form p̂X(ω) = e^{f(ω)} = E{e^{jωX}} [38], [39]. While Definition 4 is hard to exploit directly, the good news is that there exists a complete constructive characterization of Lévy exponents, which is a classical result in probability theory:
Theorem 2 (Lévy-Khintchine Formula): f(ω) is a valid Lévy exponent if and only if it can be written as

f(ω) = jb′1ω − b2ω²/2 + ∫_{R\{0}} [e^{jaω} − 1 − jaω 𝟙_{|a|<1}(a)] V(da)    (6)

where b′1 ∈ R and b2 ∈ R+ are some constants and V is a Lévy measure, that is, a (positive) Borel measure on R\{0} such that

∫_{R\{0}} min(1, a²) V(da) < ∞.    (7)

The notation 𝟙_E(a) refers to the indicator function that takes the value 1 if a ∈ E and zero otherwise. Theorem 2 is fundamental to the classical theories of infinite-divisible laws and Lévy processes [28], [31], [39]. To further our mathematical understanding of the Lévy-Khintchine formula (6), we note that e^{jaω} − 1 − jaω 𝟙_{|a|<1}(a) ∼ −a²ω²/2 as a → 0. This ensures that the integral is convergent even when the Lévy measure V is singular at the origin to the extent allowed by the admissibility condition (7). If the Lévy measure is finite or symmetrical (i.e., V(E) = V(−E) for any E ⊂ R), it is then also possible to use the equivalent, simplified form of the Lévy exponent

f(ω) = jb1ω − b2ω²/2 + ∫_{R\{0}} (e^{jaω} − 1) V(da)    (8)

with b1 = b′1 − ∫_{0<|a|<1} a V(da). The bottom line is that a particular brand of independent noise process is thereby completely characterized by its Lévy exponent or, equivalently, its Lévy triplet (b1, b2, v), where v is the so-called Lévy density associated with V such that

V(E) = ∫_E v(a) da
for any Borel set E ⊆ R. With this latter convention, the489
three primary types of innovations encountered in the signal490
processing and statistics literature are specified as follows:491
1) Gaussian: Lévy triplet (0, b_2, 0), so that f(ω) = −b_2ω²/2 and

P_w(ϕ) = e^{−(b_2/2)‖ϕ‖²_{L_2}}.    (9)

2) Compound Poisson: Lévy triplet (0, 0, λ p_A(a)), where λ > 0 is the Poisson rate and p_A is the amplitude pdf, so that f(ω) = λ∫_R (e^{jaω} − 1) p_A(a) da and

P_w(ϕ) = exp(λ∫_R ∫_R (e^{jaϕ(t)} − 1) p_A(a) da dt).    (10)

3) Symmetric alpha-stable (SαS): Lévy triplet (0, 0, v) with v(a) = C_α/|a|^{α+1}, where 0 < α < 2 and C_α = sin(πα/2)/π is a suitable normalization constant, so that

f_α(ω) = −|ω|^α/α!,
P_w(ϕ) = e^{−(1/α!)‖ϕ‖^α_{L_α}}.    (11)

The latter follows from the fact that −|ω|^α/α! is the generalized Fourier transform of C_α/|t|^{α+1} with the convention that α! = Γ(α + 1), where Γ is Euler's Gamma function [41].
While none of these innovations has a classical interpretation as a random function of t, we can at least provide an explicit description of the Poisson noise as an infinite random sequence of Dirac impulses (cf. [18, Theorem 1])

w_λ(t) = Σ_k A_k δ(t − t_k)

where the t_k are random locations that are uniformly distributed over R with density λ, and where the weights A_k are i.i.d. random variables with pdf p_A(a). Remarkably, this is the only innovation process in the family that has a finite rate of innovation [17]; it is, however, by far not the only one that is sparse, as explained next.
D. Gaussian Versus Sparse Categorization
To get a better understanding of the underlying class of white noises w, we propose to probe them through some localized analysis window ϕ, which yields a conventional i.i.d. random variable X = ⟨w, ϕ⟩ with some pdf p_ϕ(x). The most convenient choice is to pick the rectangular analysis window ϕ(t) = rect(t) = 𝟙_{[−1/2, 1/2]}(t) when ⟨w, rect⟩ is well-defined. By using the fact that e^{jaω rect(t)} − 1 = e^{jaω} − 1 for t ∈ [−1/2, 1/2], and zero otherwise, we find that the characteristic function of X is simply given by

p̂_rect(ω) = P_w(ω · rect) = exp(f(ω)),

which corresponds to the generic (Lévy-Khintchine) form associated with an infinitely divisible distribution [31], [39], [42]. The above result makes the mapping between generalized white noise processes and classical infinitely divisible (id) laws⁴ explicit: the "canonical" id pdf of w, p_id(x) = p_rect(x), is obtained by observing the noise through a rectangular window. Conversely, given the Lévy exponent of an id distribution, f(ω) = log(F{p_id}(ω)), we can specify a corresponding innovation process w via the characteristic form P_w(ϕ) by merely substituting the frequency variable ω by the generic test function ϕ(t), adding an integration over R, and taking the exponential as in (5).
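This two-way mapping can be verified numerically for a specific id law. The sketch below (an illustration, assuming the Laplace law p_id(x) = (1/2)e^{−|x|}) starts from the corresponding Lévy exponent f(ω) = −log(1 + ω²), so that exp(f(ω)) = 1/(1 + ω²), and recovers p_id by Fourier inversion:

```python
import numpy as np
from scipy.integrate import quad

def p_id(x):
    # (1/2*pi) * int exp(f(w)) e^{-jwx} dw reduces, by symmetry, to
    # (1/pi) * int_0^inf cos(w x) / (1 + w^2) dw
    val, _ = quad(lambda w: 1.0 / (1.0 + w * w), 0, np.inf,
                  weight='cos', wvar=x)
    return val / np.pi

for x in (0.4, 1.3, 2.5):
    print(p_id(x), 0.5 * np.exp(-abs(x)))   # numerical vs. closed form
```

The inversion returns (1/2)e^{−|x|} to quadrature accuracy, as the mapping predicts.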
We note, in passing, that sparsity in signal processing may refer to two distinct notions. The first is that of a finite rate of innovation; i.e., a finite (but perhaps random) number of innovations per unit of time and/or space, which results in a mass at zero in the histogram of observations. The second possibility is to have a large, even infinite, number of innovations, but with the property that a few large innovations
⁴A random variable X with pdf p_X(x) is said to be infinitely divisible (id) if for any n ∈ Z^+ there exist i.i.d. random variables X_1, …, X_n with pdf, say, p_n(x) such that X = X_1 + ⋯ + X_n in law.
UNSER et al.: UNIFIED FORMULATION OF GAUSSIAN SPARSE STOCHASTIC PROCESSES
dominate the overall behavior. In this case the histogram of observations is distinguished by its 'heavy tails'. (A combination of the two is also possible, for instance in a compound Poisson process with a heavy-tailed amplitude distribution. For such a process one may observe a change of behavior in passing from one dominant type of sparsity to the other.) Our framework permits us to consider both types of sparsity, in the former case with compound Poisson models and in the latter with heavy-tailed infinitely divisible innovations.
To make our point, we consider two distinct scenarios.

1) Finite Variance Case: We first assume that the second moment m_2 = ∫_{R\{0}} a² V(da) of the Lévy measure V in (6) is finite. This allows us to rewrite the classical Lévy-Khintchine representation as

f(ω) = jc_1ω − b_2ω²/2 + ∫_{R\{0}} [e^{jaω} − 1 − jaω] V(da)

with c_1 = b'_1 + ∫_{|a|>1} a V(da), where the Poisson part of the functional is now fully compensated. Indeed, we are guaranteed that the above integral is convergent because |e^{jaω} − 1 − jaω| ≲ |aω|² as a → 0 and |e^{jaω} − 1 − jaω| ∼ |aω| as a → ±∞. An interesting non-Poisson example of an infinitely divisible probability law that falls into this category (with non-finite V) is the Laplace distribution with Lévy triplet (0, 0, v(a) = e^{−|a|}/|a|) and p_id(x) = (1/2)e^{−|x|}. This model is particularly relevant for sparse signal processing because it provides a tight connection between Lévy processes and total-variation regularization [18, Section VI].
Now, if the Lévy measure is finite with ∫_{R\{0}} V(da) = λ < ∞, the admissibility condition yields ∫_{R\{0}} |a| V(da) < ∞, which allows us to pull the bias correction out of the integral. The representation then simplifies to (8). This implies that we can decompose X into the sum of two independent Gaussian and compound Poisson random variables. The variances of the Gaussian and Poisson components are σ² = b_2 and ∫_{R\{0}} a² V(da), respectively. The Poisson component is sparse because its pdf exhibits a mass distribution e^{−λ}δ(x) at the origin, meaning that the chances of getting zero are overwhelmingly higher than those of getting any other value, especially for smaller values of λ > 0. It is therefore justifiable to use 0 < e^{−λ} < 1 as our Poisson sparsity index.
2) Infinite Variance Case: We now turn our attention to the case where the second moment of the Lévy measure is unbounded, which we like to label as the "super-sparse" one. To substantiate this claim, we invoke the Ramachandran-Wolfe theorem, which states that the pth moment E{|X|^p} with p ∈ R^+ of an infinitely divisible distribution is finite iff ∫_{|a|>1} |a|^p V(da) < ∞ [43], [44]. For p ≥ 2, the latter is equivalent to ∫_{R\{0}} |a|^p V(da) < ∞ because of the admissibility condition (7). It follows that the cases that are not covered by the previous scenario (including the Gaussian + Poisson model) necessarily give rise to distributions whose moments of order p are unbounded for p ≥ 2. The prototypical representatives of such heavy-tail distributions are the alpha-stable ones or, by extension, the broad family of infinitely divisible probability laws that are in their domain of attraction. Note that these distributions all fulfill the stringent conditions for ℓ_p compressibility [45], [46].
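The Ramachandran-Wolfe criterion can be probed numerically on the SαS density (a sketch with hypothetical α = 1.2; the normalization constant is dropped since only finiteness matters). The truncated tail-moment integrals stabilize as the cutoff grows exactly when p < α:

```python
from scipy.integrate import quad

alpha = 1.2                      # hypothetical stability index
results = {}
for p in (0.5, 1.0, 1.5, 2.0):
    # truncated tail-moment integral 2 * int_1^A a^p * a^-(1+alpha) da,
    # evaluated for growing cutoffs A; the limit is finite iff p < alpha
    results[p] = [2 * quad(lambda a: a ** (p - 1 - alpha), 1, A)[0]
                  for A in (1e2, 1e4, 1e6)]
    print(p, results[p])
```

The printed sequences converge for p = 0.5 and p = 1.0 but diverge (polynomially in the cutoff) for p = 1.5 and p = 2.0, consistent with E{|X|^p} = ∞ for p ≥ α.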
IV. INNOVATION APPROACH TO CONTINUOUS-TIME STOCHASTIC PROCESSES
Specifying a stochastic process through an innovation model (or an equivalent stochastic differential equation) is attractive conceptually, but it presupposes that we can provide an inverse operator (in the form of an integral transform) that transforms the innovation back into the initial stochastic process. This is the reason why, after laying out general conditions for existence, we shall spend the greater part of our effort investigating suitable inverse operators.
A. Stochastic Differential Equations

Our aim is to define the generalized process with whitening operator L : S' → S' and Lévy exponent f as the solution of the stochastic linear differential equation

Ls = w,    (12)

where w is an innovation process, as described in Section III-C. This definition is obviously only usable if we can construct an inverse operator T = L⁻¹ that solves this equation. For the cases where the inverse is not unique, we will need to select one preferential operator, which is equivalent to imposing specific boundary conditions. We are then able to formally express the stochastic process as a transformed version of a white noise

s = L⁻¹w.    (13)

The requirement for such a solution to be consistent with (12) is that the operator satisfies the right-inverse property LL⁻¹ = Id over the underlying class of tempered distributions. By using the adjoint relation ⟨s, ϕ⟩ = ⟨L⁻¹w, ϕ⟩ = ⟨w, L⁻¹*ϕ⟩, we can then transfer the action of the operator onto the test function inside the characteristic form and obtain a complete statistical characterization of the so-defined generalized stochastic process

P_s(ϕ) = P_{L⁻¹w}(ϕ) = P_w(L⁻¹*ϕ),    (14)

where P_w is given by (5) (or one of the specific forms in the list at the end of Section III-C) and where we are implicitly requiring that the adjoint L⁻¹* is mathematically well-defined (continuous) over S, and that its composition with P_w is well-defined for all ϕ ∈ S.
In order to realize the above idea mathematically, it is usually easier to proceed backwards: one specifies an operator T that satisfies the left-inverse property ∀ϕ ∈ S, TL*ϕ = ϕ, and that is continuous (i.e., bounded in the proper norm(s)) over the chosen class of test functions. One then characterizes the adjoint of T, which is the operator T* : S' → S' (or an appropriate subset thereof) such that, for a given φ ∈ S',

∀ϕ ∈ S, ⟨φ, ϕ⟩ = ⟨LT*φ, ϕ⟩ = ⟨φ, TL*ϕ⟩,

where the last equality uses TL* = Id. Finally, we set L⁻¹ = T*, which yields the proper distributional definition of the right inverse of L in (13).
B. General Conditions for Existence

To validate the proposed innovation model, we need to ensure that the solution s = L⁻¹w is a bona fide generalized stochastic process. In order to simplify the analysis, we shall restrict our attention to an appropriate subclass of Lévy exponents.

Definition 5: A Lévy exponent f with derivative f' is p-admissible with 1 ≤ p ≤ 2 if there exists a positive constant C such that |f(ω)| + |ω| · |f'(ω)| ≤ C|ω|^p for all ω ∈ R.

Note that this p-admissibility condition is not very constraining and that it is satisfied by the great majority of members of the Lévy-Khintchine family (see Section III-C). For instance, in the compound Poisson case, we can show that |ω| · |f'(ω)| ≤ λ|ω| E{|A|} and |f(ω)| ≤ λ|ω| E{|A|} by using the fact that |e^{jx} − 1| ≤ |x|; this implies that the bound in Definition 5 with p = 1 is always satisfied provided that the first (absolute) moment of the amplitude pdf p_A(a) in (10) is finite. Similarly, all symmetric Lévy exponents with −f''(0) < ∞ (finite variance case) are p-admissible with p = 2, the prototypical example being the Gaussian. The only cases we are aware of that do not fulfill the condition are the alpha-stable noises with 0 < α < 1, which are notorious for their exotic behavior.
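As a quick grid illustration of Definition 5 (a sketch using the Laplace exponent f(ω) = −log(1 + ω²), which belongs to the symmetric, finite-variance class), the defining ratio stays bounded, consistent with p-admissibility for p = 2:

```python
import numpy as np

w = np.linspace(1e-6, 1e3, 2_000_000)
f = -np.log(1 + w**2)            # Laplace Levy exponent
fp = -2 * w / (1 + w**2)         # its derivative
ratio = (np.abs(f) + w * np.abs(fp)) / w**2
print(ratio.max())               # bounded (approaches 3 as w -> 0)
```

The supremum of the ratio is attained as ω → 0, where |f(ω)| ∼ ω² and |ω f'(ω)| ∼ 2ω², so the bound of Definition 5 holds with C = 3 and p = 2.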
The first advantage of imposing p-admissibility is that it allows us to extend the set of acceptable analysis functions from S to L_p, which is crucial if we intend to do conventional signal processing.

Theorem 3: If the Lévy exponent f is p-admissible, then the characteristic form

P_w(ϕ) = exp(∫_R f(ϕ(t)) dt)

is a continuous, positive-definite functional over L_p.
Proof: Since the exponential function is continuous, it is sufficient to consider the functional

F(ϕ) = log P_w(ϕ) = ∫_R f(ϕ(t)) dt,

which is such that F(0) = 0. To show that F(ϕ) (and hence P_w(ϕ)) is well-defined over L_p, we note that

|F(ϕ)| ≤ ∫_R |f(ϕ(t))| dt ≤ C‖ϕ‖_p^p,

which follows from the p-admissibility condition. The positive definiteness of P_w(ϕ) over S is a direct consequence of f being a Lévy exponent and is therefore also transferable to L_p. For the interested reader, this can be shown quite easily by proving that F(ϕ) is conditionally positive-definite of order one (see [20]).

The only remaining work is to establish the L_p-continuity of F(ϕ). To that end, we observe that

|f(u) − f(v)| = |∫_v^u f'(t) dt|
≤ C |∫_v^u |t|^{p−1} dt|    (by the assumption on f)
≤ C max(|u|^{p−1}, |v|^{p−1}) |u − v|
≤ C (|v|^{p−1} + |u − v|^{p−1}) |u − v|.    (by the triangle inequality)
Next, we pick a convergent sequence {ϕ_n}_{n=1}^∞ in L_p, whose limit is denoted by ϕ. The convergence in L_p is expressed as

lim_{n→∞} ‖ϕ_n − ϕ‖_p = 0.    (15)

We then have

|∫_R f(ϕ_n(t)) dt − ∫_R f(ϕ(t)) dt|
≤ C ∫_R (|ϕ(t)|^{p−1} |ϕ_n(t) − ϕ(t)| + |ϕ_n(t) − ϕ(t)|^p) dt
≤ C (‖ϕ‖_p^{p−1} ‖ϕ_n − ϕ‖_p + ‖ϕ_n − ϕ‖_p^p)    (by Hölder's inequality)
→ 0 as n → ∞,    (by (15))

which proves the continuity of the functional P_w on L_p.
Thanks to this result, we can then rely on the Minlos-Bochner theorem (Theorem 1) to state basic conditions on T = L⁻¹* that ensure that s = T*w is a well-defined generalized process over S'.

Theorem 4 (Existence of Generalized Process): Let f be a valid Lévy exponent and T be an operator acting on ϕ ∈ S such that any one of the conditions below is met:
1) T is a continuous linear map from S into itself;
2) T is a continuous linear map from S into L_p and the Lévy exponent f is p-admissible.
Then, P_s(ϕ) = exp(∫_R f(Tϕ(t)) dt) is a continuous, positive-definite functional on S such that P_s(0) = 1.

Proof: We already know that P_w is a continuous functional on S (resp., on L_p when f is p-admissible) by construction. This, together with the assumption that T is a continuous operator on S (resp., from S to L_p), implies that the composed functional P_s(ϕ) := P_w(Tϕ) is continuous on S.

Given the functions ϕ_1, …, ϕ_N in S and some complex coefficients ξ_1, …, ξ_N,

Σ_{1≤m,n≤N} P_s(ϕ_m − ϕ_n) ξ_m ξ̄_n
= Σ_{1≤m,n≤N} P_w(T(ϕ_m − ϕ_n)) ξ_m ξ̄_n
= Σ_{1≤m,n≤N} P_w(Tϕ_m − Tϕ_n) ξ_m ξ̄_n    (by the linearity of the operator T)
≥ 0.    (by the positivity of P_w over S or L_p)

This proves the positive definiteness of the functional P_s on S. Lastly, P_s(0) = P_w(T0) = P_w(0) = 1.
The final fundamental issue relates to the interpretation of s = L⁻¹w as an ordinary stochastic process; that is, a random function s(t) of the time variable t. This presupposes that the shaping operator L⁻¹ performs a minimal amount of smoothing since the driving term of the model, w, is too rough
jω₀ with ω₀ ∈ R. This third case, which is incompatible with the conventional formulation of stationary processes, is most interesting theoretically because it opens the door to important extensions such as Lévy processes, as we shall see in Section V. Here, we will show that marginally-stable systems can be handled within our generalized framework as well, thanks to the introduction of appropriate inverse operators.

The first natural candidate for (D − jω₀Id)⁻¹ is the inverse filter whose frequency response is

ρ̂_{jω₀}(ω) = 1/(j(ω − ω₀)) + πδ(ω − ω₀).

It is a convolution operator whose time-domain definition is

I_{ω₀}ϕ(t) = (ρ_{jω₀} ∗ ϕ)(t) = e^{jω₀t} ∫_{−∞}^t e^{−jω₀τ} ϕ(τ) dτ.    (19)
Its impulse response ρ_{jω₀}(t) is causal and compatible with Definition (18), but not (rapidly) decaying. The adjoint of I_{ω₀} is given by

I*_{ω₀}ϕ(t) = (ρ∨_{jω₀} ∗ ϕ)(t) = e^{−jω₀t} ∫_t^{+∞} e^{jω₀τ} ϕ(τ) dτ.    (20)

While I_{ω₀}ϕ(t) and I*_{ω₀}ϕ(t) are both well-defined when ϕ ∈ L_1, the problem is that these inverse filters are not BIBO-stable since their impulse responses, ρ_{jω₀}(t) and ρ∨_{jω₀}(t), are not in L_1. In particular, one can easily see that I_{ω₀}ϕ (resp., I*_{ω₀}ϕ) with ϕ ∈ S is generally not in L_p with 1 ≤ p < +∞, unless ϕ̂(ω₀) = 0 (resp., ϕ̂(−ω₀) = 0). The conclusion is that I*_{ω₀} fails to be a bounded operator over the class of test functions S.
This leads us to introduce a "corrected" version of the adjoint inverse operator I*_{ω₀}:

I*_{ω₀,t₀}ϕ(t) = I*_{ω₀}{ϕ − ϕ̂(−ω₀)e^{−jω₀t₀}δ(· − t₀)}(t)
= I*_{ω₀}ϕ(t) − ϕ̂(−ω₀)e^{−jω₀t₀}ρ∨_{jω₀}(t − t₀),    (21)

where t₀ ∈ R is a fixed location parameter and where ϕ̂(−ω₀) = ∫_R e^{jω₀t}ϕ(t) dt is the complex sinusoidal moment associated with the frequency ω₀. The idea is to correct for
⁵ρ is a Green function of (D − α_nId) iff (D − α_nId)ρ = δ; the complete set of solutions is given by ρ(t) = ρ_{α_n}(t) + Ce^{α_n t}, which is the sum of the causal Green function ρ_{α_n}(t) plus an arbitrary exponential component that is in the null space of the operator.
the lack of decay of I*_{ω₀}ϕ(t) as t → −∞ by subtracting a properly weighted version of the impulse response of the operator. An equivalent Fourier-based formulation is provided by the formula at the bottom of Table I; the main difference with the corresponding expression for I_{ω₀}ϕ is the presence of a regularization term in the numerator that prevents the integrand from diverging at ω = ω₀. The next step is to identify the adjoint of I*_{ω₀,t₀}, which is achieved via the following inner-product manipulation:

⟨ϕ, I*_{ω₀,t₀}φ⟩ = ⟨ϕ, I*_{ω₀}φ⟩ − φ̂(−ω₀)e^{−jω₀t₀}⟨ϕ, ρ∨_{jω₀}(· − t₀)⟩
= ⟨I_{ω₀}ϕ, φ⟩ − ⟨e^{jω₀·}, φ⟩ e^{−jω₀t₀} I_{ω₀}ϕ(t₀)    (using (19))
= ⟨I_{ω₀}ϕ, φ⟩ − ⟨e^{jω₀(·−t₀)} I_{ω₀}ϕ(t₀), φ⟩.

Since the above is equal to ⟨I_{ω₀,t₀}ϕ, φ⟩ by definition, we obtain that

I_{ω₀,t₀}ϕ(t) = I_{ω₀}ϕ(t) − e^{jω₀(t−t₀)} I_{ω₀}ϕ(t₀).    (22)

Interestingly, this operator imposes the boundary condition I_{ω₀,t₀}ϕ(t₀) = 0 via the subtraction of a sinusoidal component that is in the null space of the operator (D − jω₀Id), which gives a direct interpretation of the location parameter t₀. Observe that expressions (21) and (22) define linear operators, albeit not shift-invariant ones, in contrast with the classical inverse operators I_{ω₀} and I*_{ω₀}.
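These operators are straightforward to discretize. The sketch below (numpy-based, with hypothetical parameters ω₀ = 2, t₀ = 0, and a Gaussian test function) approximates (19) by cumulative summation, applies the null-space correction of (22), and checks both the boundary condition at t₀ and the right-inverse relation (D − jω₀Id)I_{ω₀,t₀}ϕ = ϕ:

```python
import numpy as np

w0, t0 = 2.0, 0.0                          # hypothetical parameters
t = np.linspace(-10, 10, 16001)
dt = t[1] - t[0]
phi = np.exp(-t**2)                        # smooth, rapidly decaying test function

# eq. (19): I_{w0} phi(t) = e^{j w0 t} * int_{-inf}^t e^{-j w0 tau} phi(tau) dtau
Iphi = np.exp(1j * w0 * t) * np.cumsum(np.exp(-1j * w0 * t) * phi) * dt

# eq. (22): subtract the null-space sinusoid so that the output vanishes at t0
k0 = np.argmin(np.abs(t - t0))
Iphi_c = Iphi - np.exp(1j * w0 * (t - t0)) * Iphi[k0]

# right-inverse check: (D - j w0 Id) applied to the corrected output
D = np.gradient(Iphi_c, dt)
resid = np.max(np.abs(D - 1j * w0 * Iphi_c - phi)[10:-10])
print(abs(Iphi_c[k0]), resid)
```

The boundary value at t₀ is exactly zero by construction, and the residual of the right-inverse relation is at the level of the discretization error.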
For analysis purposes, it is convenient to relate the proposed inverse operators to the anti-derivatives corresponding to the case ω₀ = 0. To that end, we introduce the modulation operator

M_{ω₀}ϕ(t) = e^{jω₀t}ϕ(t),

which is a unitary map on L_2 with the property that M⁻¹_{ω₀} = M_{−ω₀}.

Proposition 1: The inverse operators defined by (19), (20), (22), and (21) satisfy the modulation relations

I_{ω₀}ϕ(t) = M_{ω₀} I_0 M⁻¹_{ω₀}ϕ(t),
I*_{ω₀}ϕ(t) = M⁻¹_{ω₀} I*_0 M_{ω₀}ϕ(t),
I_{ω₀,t₀}ϕ(t) = M_{ω₀} I_{0,t₀} M⁻¹_{ω₀}ϕ(t),
I*_{ω₀,t₀}ϕ(t) = M⁻¹_{ω₀} I*_{0,t₀} M_{ω₀}ϕ(t).

Proof: These follow from the modulation property of the Fourier transform (i.e., F{M_{ω₀}ϕ}(ω) = F{ϕ}(ω − ω₀)) and the observations that I_{ω₀}δ(t) = ρ_{jω₀}(t) = M_{ω₀}ρ_0(t) and I*_{ω₀}δ(t) = ρ∨_{jω₀}(t) = M_{−ω₀}ρ∨_0(t) with ρ_0(t) = 𝟙₊(t) (the unit step function).
The important functional property of I*_{ω₀,t₀} is that it essentially preserves decay and integrability, while I_{ω₀,t₀} fully retains signal differentiability. Unfortunately, it is not possible to have the two simultaneously unless I_{ω₀}ϕ(t₀) and ϕ̂(−ω₀) are both zero.

Proposition 2: If f ∈ L_{∞,α} with α > 1, then there exists a constant C_{t₀} such that

|I*_{ω₀,t₀} f(t)| ≤ C_{t₀} ‖f‖_{∞,α} / (1 + |t|^{α−1}),

which implies that I*_{ω₀,t₀} f ∈ L_{∞,α−1}.
Proof: Since modulation does not affect the decay properties of a function, we can invoke Proposition 1 and concentrate on the investigation of the anti-derivative operator I*_{0,t₀}. Without loss of generality, we can also pick t₀ = 0 and transfer the bound to any other finite value of t₀ by adjusting the value of the constant C_{t₀}. Specifically, for t < 0, we write this inverse operator as

I*_{0,0} f(t) = I*_0 f(t) − f̂(0)
= ∫_t^{+∞} f(τ) dτ − ∫_{−∞}^{+∞} f(τ) dτ
= −∫_{−∞}^t f(τ) dτ.

This implies that

|I*_{0,0} f(t)| = |∫_{−∞}^t f(τ) dτ| ≤ ‖f‖_{∞,α} ∫_{−∞}^t dτ/(1 + |τ|^α) ≤ (2^α/(α − 1)) ‖f‖_{∞,α}/(1 + |t|^{α−1})

for all t < 0. For t > 0, I*_{0,0} f(t) = ∫_t^{+∞} f(τ) dτ, so that the above upper bounds remain valid.
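Proposition 2 can also be probed numerically (a sketch with the hypothetical choices α = 3 and f(t) = 1/(1 + |t|^α), so that ‖f‖_{∞,α} = 1): the output of I*_{0,0}, weighted by (1 + |t|^{α−1}), stays below the constant 2^α/(α − 1) of the proof.

```python
import numpy as np

alpha = 3.0                             # hypothetical decay exponent (> 1)
t = np.linspace(-200, 200, 800001)
dt = t[1] - t[0]
f = 1.0 / (1.0 + np.abs(t)**alpha)      # ||f||_{inf,alpha} = 1

F = np.cumsum(f) * dt                   # running integral from the left end
I_star = F[-1] - F                      # I*_0 f(t) = int_t^inf f
I00 = I_star - F[-1] * (t <= 0)         # subtract f_hat(0) * rho0_check(t)
bound_ratio = np.abs(I00) * (1 + np.abs(t)**(alpha - 1))
print(bound_ratio.max(), 2**alpha / (alpha - 1))
```

The observed maximum is well below the theoretical constant, consistent with the one-order loss of inverse-polynomial decay stated in the proposition.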
The interpretation of the above result is that the inverse operator I*_{ω₀,t₀} reduces inverse-polynomial decay by one order. Proposition 2 actually implies that the operator preserves the rapid decay of the Schwartz functions, which are included in L_{∞,α} for any α ∈ R^+. It also guarantees that I*_{ω₀,t₀}ϕ belongs to L_p for any Schwartz function ϕ. However, I*_{ω₀,t₀} will spoil the global smoothness properties of ϕ because it introduces a discontinuity at t₀, unless ϕ̂(−ω₀) is zero, in which case the output remains in the Schwartz class. This allows us to state the following theorem, which summarizes the higher-level part of those results for further reference.

Theorem 6: The operator I*_{ω₀,t₀} defined by (21) is a continuous linear map from R into R (the space of bounded functions with rapid decay). Its adjoint I_{ω₀,t₀} is given by (22) and has the property that I_{ω₀,t₀}ϕ(t₀) = 0. Together, these operators satisfy the complementary left- and right-inverse relations

I*_{ω₀,t₀}(D − jω₀Id)*ϕ = ϕ
(D − jω₀Id) I_{ω₀,t₀}ϕ = ϕ

for all ϕ ∈ S.
Having a tight control on the action of I*_{ω₀,t₀} over S allows us to extend the right-inverse operator I_{ω₀,t₀} to an appropriate subset of tempered distributions φ ∈ S' according to the rule ⟨I_{ω₀,t₀}φ, ϕ⟩ = ⟨φ, I*_{ω₀,t₀}ϕ⟩. Our complete set of inverse operators is summarized in Table I together with their equivalent Fourier-based definitions, which are also interpretable in the generalized sense of distributions. The first three entries of the table are standard results from the theory of linear systems (e.g., [49, Table 4.1]), while the other operators are specific to this work.
D. Solution of Generic Stochastic Differential Equation

We now have all the elements to solve the generic stochastic linear differential equation

Σ_{n=0}^{N} a_n D^n s = Σ_{m=0}^{M} b_m D^m w    (23)
TABLE I
FIRST-ORDER DIFFERENTIAL OPERATORS AND THEIR INVERSES
where the a_n and b_m are arbitrary complex coefficients with the normalization constraint a_N = 1. While this reminds us of the textbook formula of an ordinary Nth-order differential system, the non-standard aspect in (23) is that the driving term is an innovation process w, which is generally not defined pointwise, and that we are not imposing any stability constraint. Eq. (23) thus covers the general case (12) where L is a shift-invariant operator with the rational transfer function
simplification using the left-inverse property of L⁻¹*. The statement about independence follows from the exponential nature of the characteristic function and the property that f(0) = 0, which allows for the factorization of the characteristic function when the supports of the kernels are distinct (independence of the noise at every point). The correlation formula is obtained by direct application of the first result in Property 1 with ϕ_1 = ψ_{i_1,k_1} = L*φ_{i_1}(· − 2^{i_1}k_1) and ϕ_2 = ψ_{i_2,k_2} = L*φ_{i_2}(· − 2^{i_2}k_2).

⁷A technical remark is in order here: the substitution of a non-smooth function such as φ_i ∈ R in the characteristic noise functional P_w is legitimate provided that the domain of continuity of the functional can be extended from S to R, or, even less restrictively, to L_p when f is p-admissible (see Theorem 3).
These results provide a complete characterization of the statistical distribution of sparse stochastic processes in some matched wavelet domain. They also indicate that the representation is intrinsically sparse since the transformed-domain statistics are infinitely divisible. Practically, this translates into the wavelet-domain pdfs being heavier-tailed than a Gaussian (unless the process is Gaussian) (cf. argumentation in Section III-D).

To make matters more explicit, we consider the case where the innovation process is SαS. The application of Property 2 with f(ω) = −|ω|^α/α! yields f_i(ω) = −|σ_iω|^α/α! with dispersion parameter σ_i = ‖φ_i‖_{L_α}. This proves that the wavelet coefficients of a generalized SαS stochastic process follow SαS distributions, with the spread of the pdf at scale i being determined by the L_α norm of the corresponding wavelet smoothing kernels. This strongly suggests that, for α < 2, the process is compressible in the sense that the essential part of the "energy content" is carried by a tiny fraction of wavelet coefficients, as illustrated in Fig. 1.
It should be noted, however, that the quality of the decoupling is strongly dependent upon the spread of the wavelet smoothing kernels φ_i, which should be chosen to be maximally localized for best performance. In the case of the first-order system (cf. example in Section II), the basis functions for fixed i are non-overlapping, which implies that the wavelet coefficients within a given scale are independent. This is not so across scales because of the cone-shaped region where the supports of the kernels φ_{i_1} and φ_{i_2} overlap, which induces dependencies. Incidentally, the inter-scale correlation of wavelet coefficients is often exploited for improving coding performance [50] and signal reconstruction by imposing joint sparsity constraints [51].
VI. LÉVY PROCESSES REVISITED

We now illustrate our method by specifying classical Lévy processes, denoted by W(t), via the solution of the (mar-
where β^{(0,0)} = β_0 ∗ β_0 is the causal B-spline of degree 1 (triangle function). Since D* = −D, this confirms that the underlying smoothing kernels are dilated versions of a B-spline of degree 1. Moreover, since the wavelet-domain sampling is critical, there is no overlap of the basis functions within a given scale, which implies that the wavelet coefficients are independent on a scale-by-scale basis (cf. Property 3). If we now compare the situation with that of the Lévy increments, we observe that the wavelet analysis involves one more layer of smoothing of the innovation with β_0 (due to the factorization property of β^{(0,0)}), which slightly complicates the statistical calculations.
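The factorization β^{(0,0)} = β_0 ∗ β_0 is easy to verify numerically (a minimal sketch; the grid step is an arbitrary choice): the convolution of two unit boxes on [0, 1) is a triangle supported on [0, 2) peaking at 1.

```python
import numpy as np

dt = 1e-3
b0 = np.ones(int(1 / dt))                 # beta_0 sampled on [0, 1)
b1 = np.convolve(b0, b0) * dt             # beta^{(0,0)} = beta_0 * beta_0
x = np.arange(b1.size) * dt
print(x[np.argmax(b1)], b1.max())         # peak near x = 1 with height near 1
```

This is the causal B-spline of degree 1 referred to above.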
While the smoothing effect on the innovation is qualitatively the same in both instances, there are fundamental differences, too. In the wavelet case, the underlying discrete transform is orthogonal, but the coefficients are not fully decoupled because of the inter-scale dependencies, which are unavoidable, as explained in Section V-D. By contrast, the decoupling of the Lévy increments is perfect, but the underlying discrete transform (finite-difference transform) is non-orthogonal. In our companion paper, we shall see how this latter strategy is extendable to the much broader family of sparse processes via the definition of the generalized increment process.
C. Examples of Lévy Processes

Realizations of four different Lévy processes are shown in Fig. 3 together with their Lévy triplets (b_1, b_2, v(a)). The first signal is a Brownian motion (a.k.a. Wiener process) that is obtained by integration of a white Gaussian noise. This classical process is known to be nowhere differentiable in the classical sense, despite the fact that it is continuous everywhere (almost surely), as are all the members of the Lévy family. While the sampled version of Δ_0W is i.i.d. in all cases, it does not yield a sparse representation in this first instance because the underlying distribution remains Gaussian. The second process, which may be termed Lévy-Laplace motion, is specified by the Lévy density v(a) = e^{−|a|}/|a|, which is not in L_1. By taking the inverse Fourier transform of (34), we can show that its increment process has a Laplace distribution [18]; note that
Fig. 3. Examples of Lévy motions W(t) with increasing degrees of sparsity. (a) Brownian motion with Lévy triplet (0, 1, 0). (b) Lévy-Laplace motion with (0, 0, e^{−|a|}/|a|). (c) Compound Poisson process with (0, 0, λe^{−a²/2}/√(2π)) and λ = 1/32. (d) Symmetric Lévy flight with (0, 0, 1/|a|^{α+1}) and α = 1.2.
this type of generalized Gaussian model is often used to justify sparsity-promoting signal-processing techniques based on ℓ_1 minimization [52]–[54]. The third, piecewise-constant signal is a compound Poisson process. It is intrinsically sparse since a good proportion of its increments is zero by construction (with probability e^{−λ}). Interestingly, this is the only type of Lévy process that fulfills the finite-rate-of-innovation property [17]. The fourth example is an alpha-stable Lévy motion (a.k.a. Lévy flight) with α = 1.2. Here, the distribution of Δ_0W is heavy-tailed (SαS) with unbounded moments for p > α. Although this may not be obvious from the picture, this is the sparsest process of the lot because it is ℓ_α-compressible in the strongest sense [45]. Specifically, we can compress the sequence so as to preserve any prescribed portion r < 1 of its average ℓ_α energy by retaining an arbitrarily small fraction of samples as the length of the signal goes to infinity.
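Paths qualitatively similar to the four motions of Fig. 3 can be generated from i.i.d. increments. The sketch below uses standard constructions that are not taken from the paper (variance-gamma subordination for the Laplace case, a Bernoulli approximation of the Poisson jump counts, scipy's levy_stable sampler) with illustrative parameters:

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)
n = 4096
dt = 1.0 / n                                   # unit time interval

# (a) Brownian motion: Gaussian increments of variance dt
brownian = np.cumsum(rng.normal(0.0, np.sqrt(dt), n))

# (b) Levy-Laplace motion as a variance-gamma process: Brownian motion of
# variance 2 per unit time, subordinated by a Gamma(t, 1) process, has
# characteristic function (1 + w^2)^(-t), i.e., Laplace increments at t = 1
g = rng.gamma(dt, 1.0, n)
laplace = np.cumsum(rng.normal(0.0, np.sqrt(2.0 * g)))

# (c) compound Poisson: at most one Gaussian jump per bin (Bernoulli
# approximation of the Poisson count, valid for lam * dt << 1)
lam = 32.0
jumps = rng.normal(size=n) * (rng.random(n) < lam * dt)
poisson = np.cumsum(jumps)

# (d) symmetric Levy flight: SalphaS increments with scale dt^(1/alpha)
flight = np.cumsum(levy_stable.rvs(1.2, 0.0, scale=dt**(1 / 1.2),
                                   size=n, random_state=rng))

frac_flat = np.mean(jumps == 0.0)              # sparsity of the Poisson increments
print(frac_flat, np.exp(-lam * dt))
```

The fraction of flat (zero-increment) bins of the compound Poisson path matches the e^{−λdt} mass predicted by the sparsity-index discussion of Section III-D.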
D. Link With Conventional Stochastic Calculus

Thanks to (30), we can view a white noise w = DW as the weak derivative of some classical Lévy process W(t), which is well-defined pointwise (almost everywhere). This provides us with further insights on the range of admissible innovation processes of Section III-C, which constitute the driving terms of the general stochastic differential equation (12). This fundamental observation also makes the connection with stochastic calculus⁸ [55], [56], which avoids the notion of white noise by relying on the use of stochastic integrals of the form

s(t) = ∫_R h(t, t') dW(t'),

where W is a random (signed) measure associated with some canonical Brownian motion (or, by extension, a Lévy process) and where h(t, t') is an integration kernel that formally corresponds to our inverse operator L⁻¹ (see Theorem 5).
⁸The Itô integral of conventional stochastic calculus is based on Brownian motion, but the concept can also be generalized to Lévy driving terms using the more advanced theory of semimartingales [55].
VII. CONCLUSION

We have set the foundations of a unifying framework that gives access to the broadest possible class of continuous-time stochastic processes specifiable by linear, shift-invariant equations, which is beneficial for signal-processing purposes. We have shown that these processes admit a concise representation in a wavelet-like basis. We have applied our framework to the description of the classical Lévy processes, which, in our view, provide the simplest and most basic examples of sparse processes, despite the fact that they are non-stationary. We have also hinted at the link between Lévy increments and splines, which is the theme that we shall develop in full generality next [32].

We have demonstrated that the proposed class of stochastic models and the corresponding mathematical machinery (Fourier analysis, characteristic functional, and B-spline calculus) lends itself well to the derivation of transform-domain statistics. The formulation suggests a variety of new processes whose properties are compatible with the currently dominant paradigm in the field, which is focused on the notion of sparsity. In that respect, the sparse processes that are best matched to conventional wavelets⁹ are those generated by N-fold integration (with proper boundary conditions) of a non-Gaussian innovation. These processes, which are the solution of an unstable SDE (pole of multiplicity N at the origin), are intrinsically self-similar (fractal) and non-stationary. Last but not least, the formulation is backward-compatible with the classical theory of Gaussian stationary processes.
ACKNOWLEDGMENT

The authors are thankful to Prof. Robert Dalang (EPFL Chair of Probabilities), Julien Fageot, and Dr. Arash Amini for helpful discussions.
⁹A wavelet with N vanishing moments can always be rewritten as ψ = D^N φ with φ ∈ L2(R), where the operator L = D^N is scale-invariant.
UNSER et al.: UNIFIED FORMULATION OF GAUSSIAN SPARSE STOCHASTIC PROCESSES 17
REFERENCES

[1] A. Papoulis, Probability, Random Variables, and Stochastic Processes. New York, NY, USA: McGraw-Hill, 1991.
[2] R. Gray and L. Davisson, An Introduction to Statistical Signal Processing. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[3] E. J. Candès and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal Process. Mag., vol. 25, no. 2, pp. 21–30, Mar. 2008.
[4] A. M. Bruckstein, D. L. Donoho, and M. Elad, “From sparse solutions of systems of equations to sparse modeling of signals and images,” SIAM Rev., vol. 51, no. 1, pp. 34–81, 2009.
[5] S. Mallat, A Wavelet Tour of Signal Processing: The Sparse Way, 3rd ed. San Diego, CA, USA: Academic Press, 2009.
[6] J.-L. Starck, F. Murtagh, and J. M. Fadili, Sparse Image and Signal Processing. Cambridge, U.K.: Cambridge Univ. Press, 2010.
[55] P. Protter, Stochastic Integration and Differential Equations. New York, NY, USA: Springer-Verlag, 2004.
[56] P. J. Brockwell, “Lévy-driven CARMA processes,” Ann. Inst. Statist. Math., vol. 53, no. 1, pp. 113–124, 2001.
18 IEEE TRANSACTIONS ON INFORMATION THEORY
Michael Unser (M’89–SM’94–F’99) received the M.S. (summa cum laude) and Ph.D. degrees in electrical engineering in 1981 and 1984, respectively, from the École Polytechnique Fédérale de Lausanne (EPFL), Switzerland. From 1985 to 1997, he worked as a scientist with the National Institutes of Health, Bethesda, USA. He is now a full professor and Director of the Biomedical Imaging Group at EPFL.

His main research area is biomedical image processing. He has a strong interest in sampling theories, multiresolution algorithms, wavelets, the use of splines for image processing, and, more recently, stochastic processes. He has published about 250 journal papers on those topics.

Dr. Unser is currently a member of the editorial boards of the IEEE J. SELECTED TOPICS IN SIGNAL PROCESSING, Foundations and Trends in Signal Processing, SIAM J. Imaging Sciences, and the PROCEEDINGS OF THE IEEE. He co-organized the first IEEE International Symposium on Biomedical Imaging (ISBI 2002) and was the founding chair of the technical committee of the IEEE-SP Society on Bio Imaging and Signal Processing (BISP).

He received three Best Paper Awards (1995, 2000, 2003) from the IEEE Signal Processing Society and two IEEE Technical Achievement Awards (2008 SPS and 2010 EMBS). He is a EURASIP Fellow and a member of the Swiss Academy of Engineering Sciences.
Pouya D. Tafti was born in Tehran in 1981. He received his BSc degree in electrical engineering from Sharif University of Technology, Tehran, in 2003, his MASc in electrical and computer engineering from McMaster University, Hamilton, Ontario, in 2006, and his PhD in computer, information, and communication sciences from EPFL, Lausanne, in 2011. From 2006 to 2012, he was with the Biomedical Imaging Group at EPFL, where he worked on vector field imaging and statistical models for signal and image processing. He currently resides in Germany, where he works as a data scientist.
Qiyu Sun received the BSc and PhD degrees in mathematics from Hangzhou University, China, in 1985 and 1990, respectively. He is a full professor with the Department of Mathematics, University of Central Florida. His prior positions were with Zhejiang University (China), the National University of Singapore, Vanderbilt University, and the University of Houston.

His research interests include sampling theory, Wiener’s lemma, wavelet and frame theory, linear and nonlinear inverse problems, and Fourier analysis. He has published more than 100 papers on mathematics and signal processing, and has written a book, An Introduction to Multiband Wavelets (Zhejiang University Press, 2001), with Ning Bi and Daren Huang. He is on the editorial boards of the journals Advances in Computational Mathematics, Numerical Functional Analysis and Optimization, and Sampling Theory in Signal and Image Processing.
AUTHOR QUERIES

AQ:1 = Please provide the expansion for “CARMA.”
AQ:2 = Please supply index terms/keywords for your paper. To download the IEEE Taxonomy, go to http://www.ieee.org/documents/taxonomy_v101.pdf.
AQ:3 = Fig. 3 is not cited in body text. Please indicate where it should be cited.
AQ:4 = Please provide the volume no., issue no., and page range for ref. [32].
A Unified Formulation of Gaussian Versus Sparse Stochastic Processes—Part I: Continuous-Domain Theory

Michael Unser, Fellow, IEEE, Pouya D. Tafti, and Qiyu Sun
Abstract—We introduce a general distributional framework that results in a unifying description and characterization of a rich variety of continuous-time stochastic processes. The cornerstone of our approach is an innovation model that is driven by some generalized white noise process, which may be Gaussian or not (e.g., Laplace, impulsive Poisson, or alpha-stable). This allows for a conceptual decoupling between the correlation properties of the process, which are imposed by the whitening operator L, and its sparsity pattern, which is determined by the type of noise excitation. The latter is fully specified by a Lévy measure. We show that the range of admissible innovation behavior varies between the purely Gaussian and super-sparse extremes. We prove that the corresponding generalized stochastic processes are well-defined mathematically provided that the (adjoint) inverse of the whitening operator satisfies some Lp bound for p ≥ 1. We present a novel operator-based method that yields an explicit characterization of all Lévy-driven processes that are solutions of constant-coefficient stochastic differential equations. When the underlying system is stable, we recover the family of stationary CARMA processes, including the Gaussian ones. The approach remains valid when the system is unstable and leads to the identification of potentially useful generalizations of the Lévy processes, which are sparse and non-stationary. Finally, we show that these processes admit a sparse representation in some matched wavelet domain and provide a full characterization of their transform-domain statistics.

Index Terms—XXXXX.
I. INTRODUCTION

IN RECENT years, the research focus in signal processing has shifted away from the classical linear paradigm, which is intimately linked with the theory of stationary Gaussian processes [1], [2]. Instead of considering Fourier transforms and performing quadratic optimization, researchers are presently favoring wavelet-like representations and have adopted sparsity as design paradigm [3]–[8]. The property that a signal admits a sparse expansion can be exploited elegantly
Manuscript received September 21, 2012; revised October 7, 2013; accepted December 13, 2013. This work was supported in part by the Swiss National Science Foundation under Grant 200020-144355, in part by the European Commission under Grant ERC-2010-AdG-267439-FUNSP, and in part by the National Science Foundation under Grant DMS 1109063.
M. Unser and P. D. Tafti are with the Biomedical Imaging Group, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland (e-mail: [email protected]; [email protected]).
Q. Sun is with the Department of Mathematics, University of Central Florida, Orlando, FL 32816 USA (e-mail: [email protected]).
Communicated by V. Borkar, Associate Editor for Communication Networks.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIT.2014.2298453
for compressive sensing, which is presently a very active area of research (cf. the special issue of the Proceedings of the IEEE [9], [10]). The concept is equally helpful for solving inverse problems and has resulted in significant algorithmic advances for the efficient resolution of large-scale ℓ1-norm minimization problems [11]–[13].
The current formulations of compressed sensing and sparse signal recovery are fundamentally deterministic. Also, they are predominantly discrete and based on finite-dimensional mathematics, with the notable exception of the works of Eldar [14] and of Adcock and Hansen [15]. By drawing on the analogy with the classical theory of signal processing, it is likely that further progress may be achieved by adopting a statistical (or estimation-theoretic) point of view for the description of sparse signals in the analog domain. This stands as our primary motivation for the investigation of the present class of continuous-time stochastic processes, the greater part of which is sparse by construction. These processes are specified as a superset of the Gaussian ones, which is essential for maintaining backward compatibility with traditional statistical signal processing.
The present construction is a generalization of a classical idea in communication theory and signal processing, which is to view a stochastic process as a filtered version of a white noise (a.k.a. innovation) [16]. The fundamental aspect here is that the modeling is done in the continuous domain, which, as we shall see, imposes strong constraints on the class of admissible innovations; that is, the generalized white noise that constitutes the input of the innovation model. The second ingredient is a powerful operational calculus (the generalization of the idea of filtering) for solving stochastic differential equations (SDE), including unstable ones, which is essential for inducing interesting (non-stationary) behaviors such as self-similarity. The combination of these ingredients results in the specification of an extended class of stochastic processes that are either Gaussian or sparse, at the exclusion of any other type. The proposed theory has a unifying character in that it connects a number of contemporary topics in signal processing, statistics, and approximation theory:
- sparsity (in relation to compressed sensing) [3], [4];
- signals with a finite rate of innovation [17], [18];
- the classical theory of Gaussian stationary processes [1], [16];
- non-Gaussian continuous-domain modeling of signals;
- splines, wavelets, and linear system theory [5], [23].
Most importantly, it explains why certain classes of processes admit a sparse representation in a matched wavelet-like basis (see the introductory example in Section II, where the Haar transform outperforms the classical Karhunen-Loève transform). Since these models are the natural functional extension of the Gaussian stationary processes, they may stimulate the development of novel algorithms for statistical signal processing. This has already been demonstrated in the context of biomedical image reconstruction [24], the derivation of statistical priors for discrete-domain signal representation [25], optimal signal denoising [26], and MMSE interpolation [27].
Because the proposed model is intrinsically linear, we have adopted a formulation that relies on generalized functions, rather than the traditional mathematical concepts (random measures and Itô integrals) from the theory of stochastic differential equations [21], [22], [28]. We are then taking advantage of the theory of generalized stochastic processes of Gelfand (arguably, the second most famous Soviet mathematician after Kolmogorov) and some powerful tools of functional analysis (the Minlos-Bochner theorem) [29] that are not widely known to engineers or statisticians. While this may look like an unnecessary abstraction at first sight, it is very much in line with the intuition of an engineer who prefers to work with analog filters and convolution operators rather than with stochastic integrals. We are then able to use the whole machinery of linear system theory and the power of the characteristic functional to derive the statistics of the signal in any (linearly) transformed domain.
The paper is organized as follows. The basic flavor of the innovation model is conveyed in Section II by focusing on a first-order differential system, which results in the generation of Gaussian and non-Gaussian AR(1) stochastic processes. We use this model to illustrate that a properly-matched wavelet transform can outperform the classical Karhunen-Loève transform (or the DCT) for the compression of (non-Gaussian) signals. In Section III, we review the foundations of Gelfand’s theory of generalized stochastic processes. In particular, we characterize the complete class of admissible continuous-time white noise processes (innovations) and give some argumentation as to why the non-Gaussian brands are inherently sparse. In Section IV, we give a high-level description of the general innovation model and provide a novel operator-based method for the solution of SDEs. In Section V, we make use of Gelfand’s formalism to fully characterize our extended class of (non-Gaussian) stochastic processes, including the special cases of CARMA and Nth-order generalized Lévy processes. We also derive the statistics of the wavelet-domain representation of these signals, which allows for a common (stationary) treatment of the two latter classes of processes, irrespective of any stability consideration. Finally, in Section VI, we turn back to our introductory example by moving into the unstable regime (single pole at the origin), which yields a non-conventional system-theoretic interpretation of the classical Lévy processes [28], [30], [31]. We also point out the structural similarity between the increments of Lévy processes and their Haar wavelet coefficients. For higher-order illustrations of sparse processes, we refer to our companion paper [32], which is specifically devoted to the study of the discrete-time implications of the theory and the way to best decouple (e.g., “sparsify”) such processes. The notation, which is common to both papers, is summarized in [32, Table II].
II. MOTIVATION: GAUSSIAN VS. NON-GAUSSIAN AR(1) PROCESSES

A continuous-time Gaussian AR(1) (or Gauss-Markov) process can be formally generated by applying a first-order analog filter to a Gaussian white noise process w:

sα(t) = (ρα ∗ w)(t)    (1)

where ρα(t) = 1₊(t) e^{αt} with Re(α) < 0 and 1₊(t) is the unit-step function. Next, we observe that ρα = (D − αId)^{−1}δ, where δ is the Dirac impulse and where D = d/dt and Id are the derivative and identity operators, respectively. These operators as well as the inverse are to be interpreted in the distributional sense (see Section III-A). This suggests that sα satisfies the “innovation” model (cf. [1], [16])

(D − αId) sα(t) = w(t),    (2)

or, equivalently, the stochastic differential equation (cf. [22])

dsα(t) − α sα(t) dt = dW(t),

where W(t) = ∫_0^t w(τ) dτ is a standard Brownian motion (or Wiener process) excitation. In the statistical literature, the solution of the above first-order SDE is often called the Ornstein-Uhlenbeck process.
Let (sα[k] = sα(t)|_{t=k})_{k∈Z} denote the sampled version of the continuous-time process. Then, one can show that sα[·] is a discrete AR(1) autoregressive process that can be whitened by applying the first-order linear predictor

sα[k] − e^α sα[k − 1] = u[k]    (3)

where u[·] (the prediction error) is an i.i.d. Gaussian sequence. Alternatively, one can decorrelate the signal by computing its discrete cosine transform (DCT), which is known to be asymptotically equivalent to the Karhunen-Loève transform (KLT) of the process [33], [34]. Eq. (3) provides the basis for classical linear predictive coding (LPC), while the decorrelation property of the DCT is often invoked to justify the popular JPEG transform-domain coding scheme [35].
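The discretization and whitening steps above can be checked numerically. The sketch below is a minimal illustration (our own variable names and unit sampling step): it draws the exact AR(1) discretization s[k] = e^α s[k−1] + u[k] of the Ornstein-Uhlenbeck process and verifies that the lag-1 correlation of the samples is e^α, while the residual of the first-order predictor (3) is uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = -0.1          # pole of the analog filter, Re(alpha) < 0
N = 200_000           # number of unit-spaced samples

# Exact discretization at integer times: s[k] = e^alpha s[k-1] + u[k],
# where the variance of u matches the noise integrated over one interval.
a = np.exp(alpha)
var_u = (1.0 - np.exp(2 * alpha)) / (-2 * alpha)
u = rng.normal(0.0, np.sqrt(var_u), N)
s = np.empty(N)
s[0] = rng.normal(0.0, np.sqrt(1.0 / (-2 * alpha)))  # stationary start
for k in range(1, N):
    s[k] = a * s[k - 1] + u[k]

def lag1_corr(x):
    x = x - x.mean()
    return np.dot(x[1:], x[:-1]) / np.dot(x, x)

# The sampled process is AR(1) with correlation e^alpha ...
print(lag1_corr(s))
# ... and the predictor s[k] - e^alpha s[k-1] whitens it.
residual = s[1:] - a * s[:-1]
print(lag1_corr(residual))
```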
In this paper, we are concerned with the non-Gaussian counterpart of this story, which, as we shall see, will result in the identification of sparse processes. The idea is to retain the simplicity of the classical innovation model, while substituting the continuous-time Gaussian noise by some generalized Lévy innovation (to be properly defined in the sequel). This translates into Eqs. (1)–(3) remaining valid, except that the underlying random variates are no longer Gaussian. The more significant finding is that the KLT (or its discrete approximation by the DCT) is no longer optimal for producing the best M-term approximation of the signal. This is illustrated in Fig. 1, which compares the performance of various transforms for the compression of two kinds of AR(1) processes with correlation e^{−0.1} ≈ 0.90: Gaussian vs. sparse, where the latter innovation follows a Cauchy distribution. The key observation is that the E-spline wavelet transform, which is matched to the operator L = D − αId, provides the best results in the non-Gaussian scenario over the whole range of experimentation [cf. Fig. 1(b)], while the outcome in the Gaussian case is as predicted by the classical theory, with the KLT being superior.

Fig. 1. Wavelets vs. KLT (or DCT) for the M-term approximation of Gaussian vs. sparse AR(1) processes with α = −0.1: (a) classical Gaussian scenario; (b) sparse scenario with symmetric Cauchy innovations. The E-spline wavelets are matched to the innovation model. The displayed results (relative quadratic error as a function of M/N) are averages over 1000 realizations for AR(1) signals of length N = 1024; the performance of DCT and KLT is indistinguishable.
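A miniature version of this experiment can be run with generic tools. The sketch below is a simplified stand-in for the paper's protocol, not a reproduction of it: it uses the orthonormal Haar transform in place of the matched E-spline wavelets, an orthonormal DCT-II as a proxy for the KLT, and a discrete AR(1) recursion in place of the sampled continuous-time model. The qualitative effect it illustrates is that the wavelet transform far outperforms the DCT for Cauchy-driven (sparse) signals.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, trials = 1024, 102, 50        # keep ~10% of the coefficients
a = np.exp(-0.1)                    # AR(1) correlation ~ 0.90

def haar(x):
    """Orthonormal Haar analysis (length of x must be a power of two)."""
    c, detail = x.astype(float), []
    while c.size > 1:
        detail.append((c[0::2] - c[1::2]) / np.sqrt(2))
        c = (c[0::2] + c[1::2]) / np.sqrt(2)
    return np.concatenate([c] + detail[::-1])

k = np.arange(N)                    # orthonormal DCT-II matrix
dct = np.sqrt(2.0 / N) * np.cos(np.pi * np.outer(k, k + 0.5) / N)
dct[0] /= np.sqrt(2)

def mterm_err(coeffs, m):
    """Relative quadratic error of the best m-term approximation."""
    e = np.sort(coeffs**2)[::-1]
    return e[m:].sum() / e.sum()

def ar1(innov):
    s = np.empty(N)
    s[0] = innov[0]
    for n in range(1, N):
        s[n] = a * s[n - 1] + innov[n]
    return s

err = {"gauss": {"haar": [], "dct": []}, "cauchy": {"haar": [], "dct": []}}
for _ in range(trials):
    for name, u in (("gauss", rng.normal(size=N)),
                    ("cauchy", rng.standard_cauchy(N))):
        s = ar1(u)
        err[name]["haar"].append(mterm_err(haar(s), M))
        err[name]["dct"].append(mterm_err(dct @ s, M))

for name in err:
    print(name, {b: float(np.mean(v)) for b, v in err[name].items()})
```

For the heavy-tailed innovations, each large spike affects only about log2(N) Haar coefficients but spreads over all N DCT coefficients, which is why the wavelet M-term error collapses while the DCT error does not.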
Examples of orthogonal E-spline wavelets at two successive scales are shown in Fig. 2, next to their Haar counterparts. We selected the E-spline wavelets because of their ability to decouple the process, which follows from their operator-like behavior: ψi = L∗φi, where i is the scale index and φi a suitable smoothing kernel [36, Theorem 2]. Unlike their conventional cousins, they are not dilated versions of each other, but rather extrapolations, in the sense that the slope of the exponential segments remains the same at all scales. They can, however, be computed efficiently using a perfect-reconstruction filterbank with scale-dependent filters [36].
The equivalence with traditional wavelet analysis (Haar) and finite-differencing (as used in the computation of total variation) for signal “sparsification” is achieved by letting α → 0. The catch, however, is that the underlying system becomes unstable! Fortunately, the problem can be fixed, but it calls for an advanced mathematical treatment that is beyond the traditional formulation of stationary processes. The remainder of the paper is devoted to giving a proper sense to what has just been described informally, and to extending the approach to the whole class of ordinary differential operators, including the non-stable scenarios. The non-trivial outcome, as we shall see, is that many non-stable systems are linked with non-stationary stochastic processes. These, in turn, can be stationarized and “sparsified” by application of a suitable wavelet transformation. The companion paper [32] is focused on the discrete aspects of the theory, including the generalization of (3) for decoupling purposes and the full characterization of the underlying processes.

Fig. 2. Comparison of operator-like and conventional wavelet basis functions at two successive scales: (a) first-order E-spline wavelets with α = −0.5; (b) Haar wavelets. The vertical axis is rescaled for full-range display.
III. MATHEMATICAL BACKGROUND

The purpose of this section is to introduce the distributional formalism that is required for the proper definition of the continuous-time white noise that is the driving term of (1) and its generalization. We start with a brief summary of some required notions in functional analysis, which also serves us to set the notation. We then introduce the fundamental concept of the characteristic functional, which constitutes the foundation of Gelfand’s theory of generalized stochastic processes. We proceed by giving the complete characterization of the possible types of continuous-domain white noises—not necessarily Gaussian—which will be used as universal input for our innovation models. We conclude the section by showing that the non-Gaussian brands of noises that are allowed by Gelfand’s formulation are intrinsically sparse, a property that has not been emphasized before (to the best of our knowledge).
A. Functional and Distributional Context

The Lp-norm of a function f = f(t) is ‖f‖p = (∫_R |f(t)|^p dt)^{1/p} for 1 ≤ p < ∞ and ‖f‖∞ = ess sup_{t∈R} |f(t)| for p = +∞, with the corresponding Lebesgue space being denoted by Lp = Lp(R). The concept is extendable for characterizing the rate of decay of functions. To that end, we introduce the weighted Lp,α spaces with α ∈ R+:

Lp,α = { f ∈ Lp : ‖f‖p,α < +∞ }

where the α-weighted Lp-norm of f is defined as

‖f‖p,α = ‖(1 + |·|^α) f(·)‖p.

Hence, the statement f ∈ L∞,α implies that f(t) decays at least as fast as 1/|t|^α as t tends to ±∞; more precisely, that |f(t)| ≤ ‖f‖∞,α / (1 + |t|^α) almost everywhere. In particular, this allows us to infer that L∞,1/p+ε ⊂ Lp for any ε > 0 and p ≥ 1. Another obvious inclusion is Lp,α ⊆ Lp,α₀ for any α ≥ α₀. In the limit, we end up with the space of rapidly-decreasing functions R = { f : ‖f‖∞,m < +∞, ∀m ∈ Z+ }, which is included in all the others.¹
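The decay criterion is easy to probe numerically. The sketch below uses the hypothetical choice f(t) = 1/(1 + t²), for which α = 2 > 1/p + ε with p = 1, so the inclusion L∞,1/p+ε ⊂ Lp predicts that f is integrable (its exact L1-norm is π).

```python
import numpy as np

alpha = 2.0
t = np.linspace(-1e4, 1e4, 2_000_001)   # uniform grid, step 0.01
f = 1.0 / (1.0 + t**2)

# alpha-weighted sup-norm ||f||_{inf,alpha} = sup (1 + |t|^alpha)|f(t)|;
# for this particular f the weighted function is identically 1.
sup_norm = np.max((1.0 + np.abs(t) ** alpha) * np.abs(f))
print(sup_norm)

# The pointwise bound |f(t)| <= ||f||_{inf,alpha} / (1 + |t|^alpha) holds:
print(np.all(np.abs(f) <= sup_norm / (1.0 + np.abs(t) ** alpha) + 1e-12))

# Riemann-sum approximation of ||f||_1, finite as predicted (exact value: pi)
l1 = np.sum(f) * (t[1] - t[0])
print(l1)
```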
We use ϕ = ϕ(t) to denote a generic function in Schwartz’s class S of rapidly-decaying and infinitely-differentiable test functions. Specifically, Schwartz’s space is defined as

S = { ϕ ∈ C^∞ : ‖D^n ϕ‖∞,m < +∞, ∀m, n ∈ Z+ },

with the operator notation D^n = d^n/dt^n and the convention that D^0 = Id (identity). S is a complete topological vector space with respect to the topology induced by the series of semi-norms ‖D^n · ‖∞,m with m, n ∈ Z+. Its topological dual is the space of tempered distributions S′; a distribution φ ∈ S′ is a continuous linear functional on S that is characterized by a duality product rule φ(ϕ) = ⟨φ, ϕ⟩ = ∫_R φ(t)ϕ(t) dt with ϕ ∈ S, where the right-hand-side expression has a literal interpretation as an integral only when φ(t) is a true function of t. The prototypical example of a tempered distribution is the Dirac distribution δ, which is defined as δ(ϕ) = ⟨δ, ϕ⟩ = ϕ(0). In the sequel, we will drop the explicit dependence of the distribution on the generic test function ϕ ∈ S and simply write φ, φ(·), or even φ(t) (with an abuse of notation), where t is the generic time index. For instance, we shall denote the shifted Dirac impulse² by δ(· − t₀), or δ(t − t₀), which is the conventional notation used by engineers.
Let T be a continuous³ linear operator that maps S into itself (or eventually some enlarged topological space such as Lp). It is then possible to extend the action of T over S′ (or an appropriate subset of it) based on the definition ⟨Tφ, ϕ⟩ = ⟨φ, T∗ϕ⟩ for φ ∈ S′, if T∗ is the adjoint of T, which maps ϕ to another test function T∗ϕ ∈ S continuously. An important example is the Fourier transform, whose classical definition is F{f}(ω) = f̂(ω) = ∫_R f(t) e^{−jωt} dt. Since F is an S-continuous operator, it is extendable to S′ based on the adjoint relation ⟨Fφ, ϕ⟩ = ⟨φ, Fϕ⟩ for all ϕ ∈ S (generalized Fourier transform).

A linear, shift-invariant (LSI) operator that is well-defined over S can always be written as a convolution product

T_LSI{ϕ} = h ∗ ϕ = ∫_R h(τ) ϕ(· − τ) dτ

where h = T_LSI{δ} is the impulse response of the system. The adjoint operator is the convolution with the time-reversed version of h:

h^∨(t) ≡ h(−t).
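This adjoint relation has an exact finite-dimensional analogue that can be verified directly: for sequences, convolution by h and cross-correlation by h (convolution by the time-reversed h) are adjoint with respect to the Euclidean inner product. A minimal sketch with random finite-support sequences (sizes are our own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
h = rng.normal(size=7)              # impulse response of the LSI operator T
phi = rng.normal(size=50)           # "test function" phi (finite support)
psi = rng.normal(size=50 + 7 - 1)   # matches the support of h * phi

# <T phi, psi> with T = convolution by h ...
lhs = np.dot(np.convolve(h, phi), psi)
# ... equals <phi, T* psi>, where T* is convolution by the time-reversed h,
# i.e., cross-correlation of psi with h.
rhs = np.dot(phi, np.correlate(psi, h, mode="valid"))
print(lhs, rhs)
```

The two inner products agree up to floating-point round-off, which is the discrete counterpart of ⟨h ∗ φ, ϕ⟩ = ⟨φ, h^∨ ∗ ϕ⟩.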
The better-known categories of LSI operators are the Lp-stable convolution filters and the ordinary differential operators. While the latter are not BIBO-stable, they do work well with test functions.
¹The topology of R is defined by the family of semi-norms ‖·‖∞,m, m = 1, 2, 3, . . .
²The precise definition is ⟨δ(· − t₀), ϕ⟩ = ϕ(t₀) for all ϕ ∈ S.
³An operator T is continuous from a sequential topological vector space V into another one iff ϕk → ϕ in the topology of V implies that Tϕk → Tϕ in the topology (or norm) of the second space. If the two spaces coincide, we say that T is V-continuous.
1) Lp-Stable LSI Operators: The BIBO-stable filters correspond to the case where h ∈ L1, or, more generally, where h corresponds to a complex-valued Borel measure of bounded variation. The latter extension allows for discrete filters of the form h_d = ∑_{n∈Z} d[n] δ(· − n) with d[n] ∈ ℓ1. We will refer to these filters as Lp-stable because they specify bounded operators in all the Lp spaces (by Young’s inequality). Lp-stable convolution operators satisfy the properties of commutativity, associativity, and distributivity with respect to addition.
2) S-Continuous LSI Operators: For an Lp-stable filter to yield a Schwartz function as output, it is necessary that its impulse response (continuous or discrete) be rapidly-decaying. In fact, the condition h ∈ R (which is much stronger than integrability) ensures that the filter is S-continuous. The nth-order derivative D^n and its adjoint D^{n∗} = (−1)^n D^n are in the same category. The nth-order weak derivative of the tempered distribution φ is defined as D^n φ(ϕ) = ⟨D^n φ, ϕ⟩ = ⟨φ, D^{n∗} ϕ⟩ for any ϕ ∈ S. The latter operator—or, by extension, any polynomial of distributional derivatives P_N(D) = ∑_{n=1}^N a_n D^n with constant coefficients a_n ∈ C—maps S′ into itself. The class of these differential operators enjoys the same properties as its classical counterpart: shift-invariance, commutativity, associativity, and distributivity.
B. Notion of Generalized Stochastic Process

Classically, a stochastic process is a random function s(t), t ∈ R, whose statistical description is provided by the probability law of its point values {s(t₁), s(t₂), . . . , s(t_N)} for any finite sequence of time instants {t_n}_{n=1}^N. The implicit assumption there is that one has a mechanism for probing the value of the function s at any time t ∈ R, which is only achievable approximately in the real physical world.

The leading idea in Gelfand and Vilenkin’s theory of generalized stochastic processes is to replace the point measurements {s(t_n)} by a series of scalar products {⟨s, ϕ_n⟩} with suitable “test” functions ϕ₁, . . . , ϕ_N ∈ S [29]. The physical motivation that these authors give is that X_n = ⟨s, ϕ_n⟩ may represent the reading of a finite-resolution detector whose output is some “averaged” value ∫_R s(t) ϕ_n(t) dt, which is a more plausible form of probing than ideal sampling. The additional hypothesis is that the linear measurement X = ⟨s, ϕ⟩ depends continuously on ϕ and that the quantities X_n = ⟨s, ϕ_n⟩ obtained for different test functions {ϕ_n} are mutually compatible. Mathematically, this translates into defining a generalized stochastic process as a continuous linear random functional on some topological vector space such as S.
Let s be such a generalized process. We first observe that the scalar product X₁ = ⟨s, ϕ₁⟩ with a given test function ϕ₁ is a conventional (scalar) random variable that is characterized by its probability density function (pdf) p_{X₁}(x₁); the latter is in one-to-one correspondence (via the Fourier transform) with the characteristic function p̂_{X₁}(ω₁) = E{e^{jω₁X₁}} = ∫_R e^{jω₁x₁} p_{X₁}(x₁) dx₁ = E{e^{j⟨s, ω₁ϕ₁⟩}}, where E{·} is the expectation operator. The same applies for the 2nd-order pdf p_{X₁,X₂}(x₁, x₂) associated with a pair of test functions ϕ₁ and ϕ₂, which is the inverse Fourier transform of the 2-D characteristic function p̂_{X₁,X₂}(ω₁, ω₂) = E{e^{j⟨s, ω₁ϕ₁+ω₂ϕ₂⟩}}, and so forth if one wants to specify higher-order dependencies.
The foundation for the theory of generalized stochastic processes is that one can deduce the complete statistical information about the process from the knowledge of its characteristic form

P̂_s(ϕ) = E{e^{j⟨s,ϕ⟩}}    (4)

which is a continuous, positive-definite functional over S such that P̂_s(0) = 1. Since the variable ϕ in P̂_s(ϕ) is completely generic, it provides the equivalent of an infinite-dimensional generalization of the characteristic function. Indeed, any finite-dimensional version can be recovered by direct substitution of ϕ = ω₁ϕ₁ + · · · + ω_Nϕ_N in P̂_s(ϕ), where the ϕ_n are fixed and where ω = (ω₁, · · · , ω_N) takes the role of the N-dimensional Fourier variable.

In fact, Gelfand’s theory rests upon the principle that specifying an admissible functional P̂_s(ϕ) is equivalent to defining the underlying generalized stochastic process (Bochner-Minlos theorem). To explain this remarkable result, we start by recalling the fundamental notion of positive-definiteness for univariate functions [37].
Definition 1: A complex-valued function f of the real variable ω is said to be positive-definite iff

∑_{m=1}^N ∑_{n=1}^N f(ω_m − ω_n) ξ_m ξ̄_n ≥ 0

for every possible choice of ω₁, . . . , ω_N ∈ R, ξ₁, . . . , ξ_N ∈ C, and N ∈ Z+.

This is equivalent to the requirement that the N × N matrix F whose elements are given by [F]_{mn} = f(ω_m − ω_n) is positive semi-definite (that is, non-negative definite) for all N, no matter how the ω_n’s are chosen.
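Definition 1 can be probed numerically by forming the matrix [F]_{mn} = f(ω_m − ω_n) and inspecting its eigenvalues. The sketch below (with our own illustrative choices of f) confirms that the Gaussian characteristic function yields a positive semi-definite matrix for random frequencies, while a rectangular f, which is not the Fourier transform of a positive measure, fails the test:

```python
import numpy as np

def gram(f, omegas):
    """Matrix [F]_{mn} = f(omega_m - omega_n) from Definition 1."""
    w = np.asarray(omegas, dtype=float)
    return f(w[:, None] - w[None, :])

rng = np.random.default_rng(2)
omegas = rng.uniform(-5, 5, 40)

# exp(-w^2/2) is the characteristic function of a standard Gaussian,
# hence positive-definite: every such matrix is positive semi-definite.
F_gauss = gram(lambda w: np.exp(-w**2 / 2), omegas)
print(np.linalg.eigvalsh(F_gauss).min())   # >= 0 up to round-off

# The rectangle f(w) = 1_{|w|<1} is NOT positive-definite; for the
# frequencies below, the matrix has the eigenvalue 1 - sqrt(2) < 0.
F_rect = gram(lambda w: (np.abs(w) < 1).astype(float), [0.0, 0.9, 1.8])
print(np.linalg.eigvalsh(F_rect).min())
```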
Bochner’s theorem states that a bounded, continuous function p̂ is positive-definite if and only if it is the Fourier transform of a positive and finite Borel measure P:

p̂(ω) = ∫_R e^{jωx} dP(x).

In particular, Bochner’s theorem implies that a function p̂_X(ω) is a valid characteristic function—that is, p̂_X(ω) = E{e^{jωX}} = ∫_R e^{jωx} P_X(dx) = ∫_R e^{jωx} p_X(x) dx, where X is a random variable with probability measure P_X (or pdf p_X)—iff p̂_X is continuous, positive-definite, and such that p̂_X(0) = 1.
The power of functional analysis is that these concepts carry over to functionals on some abstract nuclear space X, the prime example being Schwartz’s class S of smooth and rapidly-decreasing test functions [29].

Definition 2: A complex-valued functional F(ϕ) defined over the function space X is said to be positive-definite iff

∑_{m=1}^N ∑_{n=1}^N F(ϕ_m − ϕ_n) ξ_m ξ_n∗ ≥ 0

for every possible choice of ϕ₁, . . . , ϕ_N ∈ X, ξ₁, . . . , ξ_N ∈ C, and N ∈ Z+.
Definition 3: A functional F : X → R (or C) is said to be continuous (with respect to the topology of the function space X) if, for any convergent sequence (ϕ_i) in X with limit ϕ ∈ X, the sequence F(ϕ_i) converges to F(ϕ); that is, lim_i F(ϕ_i) = F(lim_i ϕ_i).
Theorem 1 (Minlos-Bochner): Given a functional Ps(ϕ) on a nuclear space X that is continuous, positive-definite, and such that Ps(0) = 1, there exists a unique probability measure Ps on the dual space X′ such that

Ps(ϕ) = E{e^{j⟨s,ϕ⟩}} = ∫_{X′} e^{j⟨s,ϕ⟩} dPs(s),

where ⟨s, ϕ⟩ is the dual pairing map. One further has the guarantee that all finite-dimensional probability measures derived from Ps(ϕ) by setting ϕ = ω_1ϕ_1 + · · · + ω_Nϕ_N are mutually compatible.
The characteristic form therefore uniquely specifies the generalized stochastic process s = s(ϕ) (via the infinite-dimensional probability measure Ps) in essentially the same way as the characteristic function fully determines the probability measure of a scalar or multivariate random variable.
C. White Noise Processes (Innovations)
We define a white noise w as a generalized random process that is stationary and whose measurements for non-overlapping test functions are independent. A remarkable aspect of the theory of generalized stochastic processes is that it is possible to deduce the complete class of such noises based on functional considerations only [29]. To that end, Gelfand and Vilenkin consider the generic class of functionals of the form

Pw(ϕ) = exp(∫_R f(ϕ(t)) dt)    (5)

where f is a continuous function on the real line and ϕ is a test function from some suitable space. This functional specifies an independent noise process if Pw is continuous and positive-definite and Pw(ϕ_1 + ϕ_2) = Pw(ϕ_1)Pw(ϕ_2) whenever ϕ_1 and ϕ_2 have non-overlapping support. The latter property is equivalent to having f(0) = 0 in (5). Gelfand and Vilenkin then go on to prove that the complete class of functionals of the form (5) with the required mathematical properties (continuity, positive-definiteness, and factorizability) is obtained by choosing f to be a Lévy exponent, as defined below.
Definition 4: A complex-valued continuous function f(ω) is a valid Lévy exponent if and only if f(0) = 0 and g_τ(ω) = e^{τ f(ω)} is a positive-definite function of ω for all τ ∈ R+.
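Definition 4 can be probed numerically. For the toy Lévy exponent f(ω) = λ(e^{jω} − 1) (a compound Poisson exponent with unit-amplitude jumps; the value of λ and the test frequencies below are our own arbitrary choices), g_τ = e^{τf} is the characteristic function of a Poisson(τλ) variable, so the associated test matrices should come out positive semi-definite for every τ > 0:

```python
import numpy as np

# Toy check (our construction): f(w) = lam*(exp(jw) - 1) is a Levy exponent,
# so g_tau(w) = exp(tau*f(w)) must be positive-definite for every tau > 0.
rng = np.random.default_rng(1)
lam = 0.7
f = lambda w: lam * (np.exp(1j * w) - 1)
w = rng.uniform(-10, 10, size=10)
ok = True
for tau in (0.1, 1.0, 5.0):
    G = np.exp(tau * f(w[:, None] - w[None, :]))  # [G]_mn = g_tau(w_m - w_n)
    ok &= np.linalg.eigvalsh(G).min() >= -1e-9    # Hermitian, PSD up to round-off
print(ok)
```

The matrix G is Hermitian by construction since g_τ(−ω) = g_τ(ω)^*, so `eigvalsh` applies.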
In doing so, they actually establish a one-to-one correspondence between the characteristic form of an independent noise process (5) and the family of infinitely divisible laws whose characteristic function takes the form p_X(ω) = e^{f(ω)} = E{e^{jωX}} [38], [39]. While Definition 4 is hard to exploit directly, the good news is that there exists a complete, constructive characterization of Lévy exponents, which is a classical result in probability theory:
IEEE TRANSACTIONS ON INFORMATION THEORY
Theorem 2 (Lévy-Khintchine Formula): f(ω) is a valid Lévy exponent if and only if it can be written as

f(ω) = jb′_1ω − b_2ω²/2 + ∫_{R\{0}} [e^{jaω} − 1 − jaω 1_{|a|<1}(a)] V(da)    (6)

where b′_1 ∈ R and b_2 ∈ R+ are some constants and V is a Lévy measure, that is, a (positive) Borel measure on R\{0} such that

∫_{R\{0}} min(1, a²) V(da) < ∞.    (7)
The notation 1_Ω(a) refers to the indicator function that takes the value 1 if a ∈ Ω and zero otherwise. Theorem 2 is fundamental to the classical theories of infinitely divisible laws and Lévy processes [28], [31], [39]. To further our mathematical understanding of the Lévy-Khintchine formula (6), we note that e^{jaω} − 1 − jaω 1_{|a|<1}(a) ∼ −(1/2) a²ω² as a → 0. This ensures that the integral is convergent even when the Lévy measure V is singular at the origin to the extent allowed by the admissibility condition (7). If the Lévy measure is finite or symmetrical (i.e., V(E) = V(−E) for any E ⊂ R), it is then also possible to use the equivalent, simplified form of Lévy exponent

f(ω) = jb_1ω − b_2ω²/2 + ∫_{R\{0}} (e^{jaω} − 1) V(da)    (8)

with b_1 = b′_1 − ∫_{0<|a|<1} a V(da). The bottom line is that a particular brand of independent noise process is thereby completely characterized by its Lévy exponent or, equivalently, its Lévy triplet (b_1, b_2, v) where v is the so-called Lévy density associated with V such that

V(E) = ∫_E v(a) da
for any Borel set E ⊆ R. With this latter convention, the three primary types of innovations encountered in the signal processing and statistics literature are specified as follows:

1) Gaussian: Lévy triplet (0, b_2, 0), so that

f(ω) = −b_2ω²/2,
Pw(ϕ) = e^{−(b_2/2)‖ϕ‖²_{L2}};    (9)

2) Impulsive Poisson: Lévy triplet (0, 0, λ p_A(a)), where λ > 0 is the rate parameter and p_A is the amplitude pdf, so that

f(ω) = λ ∫_R (e^{jaω} − 1) p_A(a) da;    (10)

3) Symmetric alpha-stable (SαS): Lévy triplet (0, 0, v(a)) with v(a) = C_α/|a|^{α+1}, where 0 < α < 2 and C_α = sin(πα/2)/π is a suitable normalization constant, so that

f_α(ω) = −|ω|^α/α!,
Pw(ϕ) = e^{−(1/α!)‖ϕ‖^α_{Lα}}.    (11)

The latter follows from the fact that −|ω|^α/α! is the generalized Fourier transform of C_α/|t|^{α+1} with the convention that α! = Γ(α + 1) where Γ is Euler's Gamma function [41].
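The simplified form (8) is easy to corroborate by simulation. As a sketch (our own Monte-Carlo check, with assumed parameters), take a finite Lévy measure V(da) = λ p_A(a) da with standard-Gaussian amplitudes; then f(ω) = λ(e^{−ω²/2} − 1), and exp(f(ω)) must match the empirical characteristic function of the corresponding compound Poisson variable:

```python
import numpy as np

# Monte-Carlo check (our own, with assumed parameters) of the simplified
# form (8) for V(da) = lam * p_A(a) da with standard Gaussian amplitudes:
# f(w) = lam * (exp(-w^2/2) - 1), and exp(f(w)) must equal E{exp(jwX)}
# for the corresponding compound Poisson variable X.
rng = np.random.default_rng(5)
lam, w, n = 2.0, 1.3, 500_000
k = rng.poisson(lam, n)                    # number of jumps per realization
X = np.sqrt(k) * rng.standard_normal(n)    # sum of k N(0,1) amplitudes, in law
mc = np.exp(1j * w * X).mean()             # empirical characteristic function
th = np.exp(lam * (np.exp(-w**2 / 2) - 1)) # exp(f(w)) from (8)
print(abs(mc - th) < 0.01)
```

The shortcut X = √k · Z is valid because a sum of k i.i.d. N(0, 1) amplitudes is N(0, k) in law.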
While none of these innovations has a classical interpretation as a random function of t, we can at least provide an explicit description of the Poisson noise as an infinite random sequence of Dirac impulses (cf. [18, Theorem 1])
w_λ(t) = ∑_k a_k δ(t − t_k)
where the t_k are random locations that are uniformly distributed over R with density λ, and where the weights a_k are i.i.d. random variables with pdf p_A(a). Remarkably, this is the only innovation process in the family that has a finite rate of innovation [17]; however, it is, by far, not the only one that is sparse, as explained next.
D. Gaussian Versus Sparse Categorization
To get a better understanding of the underlying class of white noises w, we propose to probe them through some localized analysis window ϕ, which will yield a conventional i.i.d. random variable X = ⟨w, ϕ⟩ with some pdf p_ϕ(x). The most convenient choice is to pick the rectangular analysis window ϕ(t) = rect(t) = 1_{[−1/2,1/2]}(t) when ⟨w, rect⟩ is well-defined. By using the fact that e^{jaω rect(t)} − 1 = e^{jaω} − 1 for t ∈ [−1/2, 1/2], and zero otherwise, we find that the characteristic function of X is simply given by

p_rect(ω) = Pw(ω · rect(t)) = exp(f(ω)),

which corresponds to the generic (Lévy-Khintchine) form associated with an infinitely divisible distribution [31], [39], [42].
The above result makes the mapping between generalized white noise processes and classical infinitely divisible (id) laws4 explicit: the "canonical" id pdf of w, p_id(x) = p_rect(x), is obtained by observing the noise through a rectangular window. Conversely, given the Lévy exponent of an id distribution, f(ω) = log(F{p_id}(ω)), we can specify a corresponding innovation process w via the characteristic form Pw(ϕ) by merely substituting the frequency variable ω by the generic test function ϕ(t), adding an integration over R, and taking the exponential as in (5).
We note, in passing, that sparsity in signal processing may refer to two distinct notions. The first is that of a finite rate of innovation, i.e., a finite (but perhaps random) number of innovations per unit of time and/or space, which results in a mass at zero in the histogram of observations. The second possibility is to have a large, even infinite, number of innovations, but with the property that a few large innovations
4 A random variable X with pdf p_X(x) is said to be infinitely divisible (id) if for any n ∈ Z+ there exist i.i.d. random variables X_1, . . . , X_n with pdf, say, p_n(x) such that X = X_1 + · · · + X_n in law.
dominate the overall behavior. In this case the histogram of observations is distinguished by its 'heavy tails'. (A combination of the two is also possible, for instance in a compound Poisson process with a heavy-tailed amplitude distribution. For such a process one may observe a change of behavior in passing from one dominant type of sparsity to the other.) Our framework permits us to consider both types of sparsity, in the former case with compound Poisson models and in the latter with heavy-tailed infinitely divisible innovations.
To make our point, we consider two distinct scenarios.
1) Finite Variance Case: We first assume that the second moment m_2 = ∫_{R\{0}} a² V(da) of the Lévy measure V in (6) is finite. This allows us to rewrite the classical Lévy-Khintchine representation as

f(ω) = jc_1ω − b_2ω²/2 + ∫_{R\{0}} [e^{jaω} − 1 − jaω] V(da)

with c_1 = b′_1 + ∫_{|a|>1} a V(da), and where the Poisson part of the functional is now fully compensated. Indeed, we are guaranteed that the above integral is convergent because |e^{jaω} − 1 − jωa| ≲ |aω|² as a → 0 and |e^{jaω} − 1 − jωa| ∼ |aω| as a → ±∞. An interesting non-Poisson example of infinitely divisible probability laws that falls into this category (with non-finite V) is the Laplace distribution with Lévy triplet (0, 0, v(a) = e^{−|a|}/|a|) and p_id(x) = (1/2) e^{−|x|}. This model is particularly relevant for sparse signal processing because it provides a tight connection between Lévy processes and total-variation regularization [18, Section VI].
Now, if the Lévy measure is finite, ∫_R V(da) = λ < ∞, the admissibility condition yields ∫_{R\{0}} |a| V(da) < ∞, which allows us to pull the bias correction out of the integral. The representation then simplifies to (8). This implies that we can decompose X into the sum of two independent Gaussian and compound Poisson random variables. The variances of the Gaussian and Poisson components are σ² = b_2 and ∫_R a² V(da), respectively. The Poisson component is sparse because its pdf exhibits a mass distribution e^{−λ}δ(x) at the origin, meaning that the chances of getting zero are overwhelmingly higher than for any continuous amplitude distribution, especially for smaller values of λ > 0. It is therefore justifiable to use 0 ≤ e^{−λ} < 1 as our Poisson sparsity index.
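The point mass at the origin is readily visible in simulation. The following Monte-Carlo sketch (our own, with an assumed rate λ and Gaussian amplitudes) checks that a compound Poisson observation over a unit window is exactly zero with probability e^{−λ}:

```python
import numpy as np

# Monte-Carlo sketch (our own, lam assumed): a compound Poisson observation
# X over a unit window carries a point mass of size exp(-lam) at the origin,
# the "Poisson sparsity index" of the text.
rng = np.random.default_rng(2)
lam, n = 0.5, 500_000
k = rng.poisson(lam, n)                    # jump count in the window
X = np.sqrt(k) * rng.standard_normal(n)    # sum of k Gaussian amplitudes, in law
print(abs((X == 0).mean() - np.exp(-lam)) < 0.005)
```

X vanishes exactly when no impulse falls in the window (k = 0), since a nonzero Gaussian sum is almost surely nonzero.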
2) Infinite Variance Case: We now turn our attention to the case where the second moment of the Lévy measure is unbounded, which we like to label as the "super-sparse" one. To substantiate this claim, we invoke the Ramachandran-Wolfe theorem, which states that the pth moment E{|X|^p} with p ∈ R+ of an infinitely divisible distribution is finite iff ∫_{|a|>1} |a|^p V(da) < ∞ [43], [44]. For p ≥ 2, the latter is equivalent to ∫_{R\{0}} |a|^p V(da) < ∞ because of the admissibility condition (7). It follows that the cases that are not covered by the previous scenario (including the Gaussian + Poisson model) necessarily give rise to distributions whose moments of order p are unbounded for p ≥ 2. The prototypical representatives of such heavy-tail distributions are the alpha-stable ones or, by extension, the broad family of infinitely divisible probability laws that are in their domain of attraction. Note that these distributions all fulfill the stringent conditions for ℓ_p compressibility [45], [46].
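The practical meaning of heavy-tailed "super-sparsity" can be illustrated with a simple experiment of our own construction: compare how much of the ℓ_2 energy of an i.i.d. sequence is carried by its largest 1% of samples, for a Gaussian (finite variance) versus a Cauchy (alpha-stable with α = 1) law:

```python
import numpy as np

# Illustration (our own construction): heavy-tailed innovations are
# compressible because a few large samples carry most of the energy.
rng = np.random.default_rng(3)
n = 100_000

def top_share(x, frac=0.01):
    e = np.sort(x**2)[::-1]                 # energies, largest first
    return e[: int(n * frac)].sum() / e.sum()

gauss = top_share(rng.standard_normal(n))   # finite-variance reference
cauchy = top_share(rng.standard_cauchy(n))  # alpha-stable, alpha = 1
print(gauss < 0.2 < cauchy)                 # True: Cauchy energy is concentrated
```

For the Gaussian sequence the top 1% carries only a small fraction of the energy, whereas for the Cauchy sequence a handful of extreme samples dominate the sum, in line with the compressibility results of [45], [46].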
IV. INNOVATION APPROACH TO CONTINUOUS-TIME STOCHASTIC PROCESSES
Specifying a stochastic process through an innovation model (or an equivalent stochastic differential equation) is attractive conceptually, but it presupposes that we can provide an inverse operator (in the form of an integral transform) that transforms the innovation back into the initial stochastic process. This is the reason why, after laying out general conditions for existence, we shall spend the greater part of our effort investigating suitable inverse operators.
A. Stochastic Differential Equations
Our aim is to define the generalized process with whitening operator L : S′ → S′ and Lévy exponent f as the solution of the stochastic linear differential equation

Ls = w,    (12)

where w is an innovation process, as described in Section III-C. This definition is obviously only usable if we can construct an inverse operator T = L^{−1} that solves this equation. For the cases where the inverse is not unique, we will need to select one preferential operator, which is equivalent to imposing specific boundary conditions. We are then able to formally express the stochastic process as a transformed version of a white noise:

s = L^{−1}w.    (13)
The requirement for such a solution to be consistent with (12) is that the operator satisfies the right-inverse property LL^{−1} = Id over the underlying class of tempered distributions. By using the adjoint relation ⟨s, ϕ⟩ = ⟨L^{−1}w, ϕ⟩ = ⟨w, L^{−1∗}ϕ⟩, we can then transfer the action of the operator onto the test function inside the characteristic form and obtain a complete statistical characterization of the so-defined generalized stochastic process

Ps(ϕ) = P_{L^{−1}w}(ϕ) = Pw(L^{−1∗}ϕ),    (14)

where Pw is given by (5) (or one of the specific forms in the list at the end of Section III-C) and where we are implicitly requiring that the adjoint L^{−1∗} is mathematically well-defined (continuous) over S, and that its composition with Pw is well-defined for all ϕ ∈ S.
In order to realize the above idea mathematically, it is usually easier to proceed backwards: one specifies an operator T that satisfies the left-inverse property ∀ϕ ∈ S, TL∗ϕ = ϕ, and that is continuous (i.e., bounded in the proper norm(s)) over the chosen class of test functions. One then characterizes the adjoint of T, which is the operator T∗ : S′ → S′ (or an appropriate subset thereof) such that, for a given φ ∈ S′,

∀ϕ ∈ S, ⟨φ, ϕ⟩ = ⟨LT∗φ, ϕ⟩ = ⟨φ, TL∗ϕ⟩,

where the last equality uses TL∗ = Id. Finally, we set L^{−1} = T∗, which yields the proper distributional definition of the right inverse of L in (13).
B. General Conditions for Existence

To validate the proposed innovation model, we need to ensure that the solution s = L^{−1}w is a bona fide generalized stochastic process. In order to simplify the analysis, we shall restrict our attention to an appropriate subclass of Lévy exponents.

Definition 5: A Lévy exponent f with derivative f′ is p-admissible with 1 ≤ p ≤ 2 if there exists a positive constant C such that |f(ω)| + |ω| · |f′(ω)| ≤ C|ω|^p for all ω ∈ R.
Note that this p-admissibility condition is not very constraining and that it is satisfied by the great majority of members of the Lévy-Khintchine family (see Section III-C). For instance, in the compound Poisson case, we can show that |ω| · |f′(ω)| ≤ λ|ω| E{|A|} and |f(ω)| ≤ λ|ω| E{|A|} by using the fact that |e^{jx} − 1| ≤ |x|; this implies that the bound in Definition 5 with p = 1 is always satisfied provided that the first (absolute) moment of the amplitude pdf p_A(a) in (10) is finite. Similarly, all symmetric Lévy exponents with −f′′(0) < ∞ (finite variance case) are p-admissible with p = 2, the prototypical example being the Gaussian. The only cases we are aware of that do not fulfill the condition are the alpha-stable noises with 0 < α < 1, which are notorious for their exotic behavior.
The first advantage of imposing p-admissibility is that it allows us to extend the set of acceptable analysis functions from S to L_p, which is crucial if we intend to do conventional signal processing.
Theorem 3: If the Lévy exponent f is p-admissible, then the characteristic form Pw(ϕ) = exp(∫_R f(ϕ(t)) dt) is a continuous, positive-definite functional over L_p.
Proof: Since the exponential function is continuous, it is sufficient to consider the functional

F(ϕ) = log Pw(ϕ) = ∫_R f(ϕ(t)) dt,

which is such that F(0) = 0. To show that F(ϕ) (and hence Pw(ϕ)) is well-defined over L_p, we note that

|F(ϕ)| ≤ ∫_R |f(ϕ(t))| dt ≤ C‖ϕ‖_p^p,

which follows from the p-admissibility condition. The positive-definiteness of Pw(ϕ) over S is a direct consequence of f being a Lévy exponent and is therefore also transferable to L_p. For the interested reader, this can be shown quite easily by proving that F(ϕ) is conditionally positive-definite of order one (see [20]).
The only remaining work is to establish the L_p-continuity of F(ϕ). To that end, we observe that

|f(u) − f(v)| = |∫_v^u f′(t) dt|
≤ C |∫_v^u |t|^{p−1} dt|    (by the assumption on f)
≤ C max(|u|^{p−1}, |v|^{p−1}) |u − v|
≤ C (|v|^{p−1} + |u − v|^{p−1}) |u − v|.    (by the triangle inequality)
Next, we pick a convergent sequence {ϕ_n}_{n=1}^∞ in L_p, whose limit is denoted by ϕ. The convergence in L_p is expressed as

lim_{n→∞} ‖ϕ_n − ϕ‖_p = 0.    (15)

We then have

|∫_R f(ϕ_n(t)) dt − ∫_R f(ϕ(t)) dt|
≤ C ∫_R |ϕ(t)|^{p−1} |ϕ_n(t) − ϕ(t)| + |ϕ_n(t) − ϕ(t)|^p dt
≤ C (‖ϕ‖_p^{p−1} ‖ϕ_n − ϕ‖_p + ‖ϕ_n − ϕ‖_p^p)    (by Hölder's inequality)
→ 0 as n → ∞,    (by (15))

which proves the continuity of the functional Pw on L_p.
Thanks to this result, we can then rely on the Minlos-Bochner theorem (Theorem 1) to state basic conditions on T = L^{−1∗} that ensure that s = T∗w is a well-defined generalized process over S′.
Theorem 4 (Existence of Generalized Process): Let f be a valid Lévy exponent and T be an operator acting on ϕ ∈ S such that any one of the conditions below is met:
1) T is a continuous linear map from S into itself;
2) T is a continuous linear map from S into L_p and the Lévy exponent f is p-admissible.
Then, Ps(ϕ) = exp(∫_R f(Tϕ(t)) dt) is a continuous, positive-definite functional on S such that Ps(0) = 1.
Proof: We already know that Pw is a continuous functional on S (resp., on L_p when f is p-admissible) by construction. This, together with the assumption that T is a continuous operator on S (resp., from S to L_p), implies that the composed functional Ps(ϕ) := Pw(Tϕ) is continuous on S.

Given the functions ϕ_1, . . . , ϕ_N in S and some complex coefficients ξ_1, . . . , ξ_N,

∑_{1≤m,n≤N} Ps(ϕ_m − ϕ_n) ξ_m ξ_n^*
= ∑_{1≤m,n≤N} Pw(T(ϕ_m − ϕ_n)) ξ_m ξ_n^*
= ∑_{1≤m,n≤N} Pw(Tϕ_m − Tϕ_n) ξ_m ξ_n^*    (by the linearity of the operator T)
≥ 0.    (by the positivity of Pw over S or L_p)

This proves the positive-definiteness of the functional Ps on S. Lastly, Ps(0) = Pw(T0) = Pw(0) = 1.
The final fundamental issue relates to the interpretation of s = L^{−1}w as an ordinary stochastic process; that is, a random function s(t) of the time variable t. This presupposes that the shaping operator L^{−1} performs a minimal amount of smoothing since the driving term of the model, w, is too rough
jω_0 with ω_0 ∈ R. This third case, which is incompatible with the conventional formulation of stationary processes, is most interesting theoretically because it opens the door to important extensions such as Lévy processes, as we shall see in Section V. Here, we will show that marginally-stable systems can be handled within our generalized framework as well, thanks to the introduction of appropriate inverse operators.
The first natural candidate for (D − jω_0 Id)^{−1} is the inverse filter whose frequency response is

ρ̂_{jω_0}(ω) = 1/(j(ω − ω_0)) + πδ(ω − ω_0).

It is a convolution operator whose time-domain definition is

I_{ω_0}ϕ(t) = (ρ_{jω_0} ∗ ϕ)(t) = e^{jω_0 t} ∫_{−∞}^t e^{−jω_0 τ} ϕ(τ) dτ.    (19)
Its impulse response ρ_{jω_0}(t) is causal and compatible with Definition (18), but not (rapidly) decaying. The adjoint of I_{ω_0} is given by

I^∗_{ω_0}ϕ(t) = (ρ^∨_{jω_0} ∗ ϕ)(t) = e^{−jω_0 t} ∫_t^{+∞} e^{jω_0 τ} ϕ(τ) dτ.    (20)
While I_{ω_0}ϕ(t) and I^∗_{ω_0}ϕ(t) are both well-defined when ϕ ∈ L_1, the problem is that these inverse filters are not BIBO-stable since their impulse responses, ρ_{jω_0}(t) and ρ^∨_{jω_0}(t), are not in L_1. In particular, one can easily see that I_{ω_0}ϕ (resp., I^∗_{ω_0}ϕ) with ϕ ∈ S is generally not in L_p with 1 ≤ p < +∞, unless ϕ̂(ω_0) = 0 (resp., ϕ̂(−ω_0) = 0). The conclusion is that I^∗_{ω_0} fails to be a bounded operator over the class of test functions S.
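The decay failure is easy to see numerically. The following grid-based sketch (our own illustration, for the anti-derivative case ω_0 = 0 with a Gaussian test function) shows that I^∗_0ϕ(t) = ∫_t^∞ ϕ(τ) dτ tends to the nonzero constant ϕ̂(0) as t → −∞, so it belongs to no L_p with p < ∞:

```python
import numpy as np

# Numerical illustration (our own) of the decay problem with the adjoint (20)
# for w0 = 0: I*_0 phi(t) = int_t^inf phi(tau) dtau tends to the constant
# phihat(0) = int phi != 0 as t -> -inf.
dt = 1e-3
t = np.arange(-10, 10, dt)
phi = np.exp(-t**2)                        # Gaussian test function, phihat(0) = sqrt(pi)
Istar = (phi.sum() - np.cumsum(phi)) * dt  # right-tail integral, int_t^inf phi
print(abs(Istar[0] - np.sqrt(np.pi)) < 1e-2)  # no decay at the left end: ~ sqrt(pi)
```

This is precisely the behavior that the corrected operator of (21) is designed to remove.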
This leads us to introduce some "corrected" version of the adjoint inverse operator I^∗_{ω_0}:

I^∗_{ω_0,t_0}ϕ(t) = I^∗_{ω_0}{ϕ − ϕ̂(−ω_0) e^{−jω_0 t_0} δ(· − t_0)}(t)
= I^∗_{ω_0}ϕ(t) − ϕ̂(−ω_0) e^{−jω_0 t_0} ρ^∨_{jω_0}(t − t_0),    (21)

where t_0 ∈ R is a fixed location parameter and where ϕ̂(−ω_0) = ∫_R e^{jω_0 t} ϕ(t) dt is the complex sinusoidal moment associated with the frequency ω_0. The idea is to correct for
5 ρ is a Green function of (D − α_n Id) iff (D − α_n Id)ρ = δ; the complete set of solutions is given by ρ(t) = ρ_{α_n}(t) + C e^{α_n t}, which is the sum of the causal Green function ρ_{α_n}(t) plus an arbitrary exponential component that is in the null space of the operator.
the lack of decay of I^∗_{ω_0}ϕ(t) as t → −∞ by subtracting a properly weighted version of the impulse response of the operator. An equivalent Fourier-based formulation is provided by the formula at the bottom of Table I; the main difference with the corresponding expression for I_{ω_0}ϕ is the presence of a regularization term in the numerator that prevents the integrand from diverging at ω = ω_0. The next step is to identify the adjoint of I^∗_{ω_0,t_0}, which is achieved via the following inner-product manipulation:
⟨ϕ, I^∗_{ω_0,t_0}φ⟩ = ⟨ϕ, I^∗_{ω_0}φ⟩ − φ̂(−ω_0) e^{−jω_0 t_0} ⟨ϕ, ρ^∨_{jω_0}(· − t_0)⟩
= ⟨I_{ω_0}ϕ, φ⟩ − ⟨e^{jω_0 ·}, φ⟩ e^{−jω_0 t_0} I_{ω_0}ϕ(t_0)    (using (19))
= ⟨I_{ω_0}ϕ, φ⟩ − ⟨e^{jω_0(·−t_0)} I_{ω_0}ϕ(t_0), φ⟩.

Since the above is equal to ⟨I_{ω_0,t_0}ϕ, φ⟩ by definition, we obtain that

I_{ω_0,t_0}ϕ(t) = I_{ω_0}ϕ(t) − e^{jω_0(t−t_0)} I_{ω_0}ϕ(t_0).    (22)
Interestingly, this operator imposes the boundary condition I_{ω_0,t_0}ϕ(t_0) = 0 via the subtraction of a sinusoidal component that is in the null space of the operator (D − jω_0 Id), which gives a direct interpretation of the location parameter t_0. Observe that expressions (21) and (22) define linear operators, albeit not shift-invariant ones, in contrast with the classical inverse operators I_{ω_0} and I^∗_{ω_0}.
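Both properties of (22), the boundary condition at t_0 and the fact that it remains a right inverse of (D − jω_0 Id), can be verified on a grid. This is a discretization sketch of our own (grid step, frequency, and test function are arbitrary choices), with the derivative approximated by finite differences:

```python
import numpy as np

# Grid-based sketch (our own) of the corrected right inverse (22):
# I_{w0,t0} phi(t) = I_{w0} phi(t) - exp(j*w0*(t - t0)) * I_{w0} phi(t0)
# vanishes at t0, and (D - j*w0*Id) still maps it to phi, since the
# subtracted sinusoid lies in the null space of (D - j*w0*Id).
w0, t0, dt = 2.0, 0.0, 1e-3
t = np.arange(-10, 10, dt)
phi = np.exp(-t**2)                                              # test function
Iphi = np.exp(1j*w0*t) * np.cumsum(np.exp(-1j*w0*t) * phi) * dt  # eq. (19)
i0 = np.argmin(np.abs(t - t0))                                   # grid index of t0
Icorr = Iphi - np.exp(1j*w0*(t - t0)) * Iphi[i0]                 # eq. (22)
lhs = np.gradient(Icorr, dt) - 1j*w0*Icorr                       # (D - j*w0*Id) applied
print(abs(Icorr[i0]) < 1e-9,
      np.max(np.abs(lhs[5:-5] - phi[5:-5])) < 1e-2)              # boundary + inverse
```

The residual in the second check is pure discretization error from the cumulative sum and the finite-difference derivative.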
For analysis purposes, it is convenient to relate the proposed inverse operators to the anti-derivatives corresponding to the case ω_0 = 0. To that end, we introduce the modulation operator

M_{ω_0}ϕ(t) = e^{jω_0 t}ϕ(t),

which is a unitary map on L_2 with the property that M^{−1}_{ω_0} = M_{−ω_0}.
Proposition 1: The inverse operators defined by (19), (20), (22), and (21) satisfy the modulation relations

I_{ω_0}ϕ(t) = M_{ω_0} I_0 M^{−1}_{ω_0} ϕ(t),
I^∗_{ω_0}ϕ(t) = M^{−1}_{ω_0} I^∗_0 M_{ω_0} ϕ(t),
I_{ω_0,t_0}ϕ(t) = M_{ω_0} I_{0,t_0} M^{−1}_{ω_0} ϕ(t),
I^∗_{ω_0,t_0}ϕ(t) = M^{−1}_{ω_0} I^∗_{0,t_0} M_{ω_0} ϕ(t).
Proof: These follow from the modulation property of the Fourier transform (i.e., F{M_{ω_0}ϕ}(ω) = F{ϕ}(ω − ω_0)) and the observations that I_{ω_0}δ(t) = ρ_{jω_0}(t) = M_{ω_0}ρ_0(t) and I^∗_{ω_0}δ(t) = ρ^∨_{jω_0}(t) = M_{−ω_0}ρ^∨_0(t) with ρ_0(t) = 1_+(t) (the unit step function).
The important functional property of I^∗_{ω_0,t_0} is that it essentially preserves decay and integrability, while I_{ω_0,t_0} fully retains signal differentiability. Unfortunately, it is not possible to have the two simultaneously unless I_{ω_0}ϕ(t_0) and ϕ̂(−ω_0) are both zero.
Proposition 2: If f ∈ L_{∞,α} with α > 1, then there exists a constant C_{t_0} such that

|I^∗_{ω_0,t_0} f(t)| ≤ C_{t_0} ‖f‖_{∞,α} / (1 + |t|^{α−1}),

which implies that I^∗_{ω_0,t_0} f ∈ L_{∞,α−1}.
Proof: Since modulation does not affect the decay properties of a function, we can invoke Proposition 1 and concentrate on the investigation of the anti-derivative operator I^∗_{0,t_0}. Without loss of generality, we can also pick t_0 = 0 and transfer the bound to any other finite value of t_0 by adjusting the value of the constant C_{t_0}. Specifically, for t < 0, we write this inverse operator as

I^∗_{0,0} f(t) = I^∗_0 f(t) − f̂(0)
= ∫_t^{+∞} f(τ) dτ − ∫_{−∞}^{+∞} f(τ) dτ
= −∫_{−∞}^t f(τ) dτ.

This implies that

|I^∗_{0,0} f(t)| = |∫_{−∞}^t f(τ) dτ| ≤ ‖f‖_{∞,α} ∫_{−∞}^t 1/(1 + |τ|^α) dτ
≤ (2α/(α−1)) ‖f‖_{∞,α} / (1 + |t|^{α−1})

for all t < 0. For t > 0, I^∗_{0,0} f(t) = ∫_t^∞ f(τ) dτ, so that the above upper bounds remain valid.
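For a concrete instance of Proposition 2 (a numerical illustration of our own), take f(t) = 1/(1 + t²), so that f ∈ L_{∞,α} with α = 2. For t < 0 the closed form is I^∗_{0,0} f(t) = −(π/2 + arctan t), which decays like 1/|t|, i.e., one inverse-polynomial order less than f:

```python
import numpy as np

# Numerical illustration (our own) of Proposition 2 with f(t) = 1/(1 + t^2),
# i.e. f in L_{inf,2}: for t < 0, I*_{0,0} f(t) = -(pi/2 + arctan(t)),
# which decays like 1/|t|, so I*_{0,0} f is in L_{inf,1}.
t = -np.linspace(1, 1000, 2000)               # negative half-line sample
g = -(np.pi / 2 + np.arctan(t))               # closed form of I*_{0,0} f on t < 0
print(np.all(np.abs(g) * (1 + np.abs(t)) < 4))  # True: bounded in weighted norm
```

The weighted product |g(t)|(1 + |t|) staying bounded is exactly the statement that I^∗_{0,0} f ∈ L_{∞,α−1} with α − 1 = 1.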
The interpretation of the above result is that the inverse operator I^∗_{ω_0,t_0} reduces inverse-polynomial decay by one order. Proposition 2 actually implies that the operator will preserve the rapid decay of the Schwartz functions, which are included in L_{∞,α} for any α ∈ R+. It also guarantees that I^∗_{ω_0,t_0}ϕ belongs to L_p for any Schwartz function ϕ. However, I^∗_{ω_0,t_0} will spoil the global smoothness properties of ϕ because it introduces a discontinuity at t_0, unless ϕ̂(−ω_0) is zero, in which case the output remains in the Schwartz class. This allows us to state the following theorem, which summarizes the higher-level part of those results for further reference.
Theorem 6: The operator I^∗_{ω_0,t_0} defined by (21) is a continuous linear map from R into R (the space of bounded functions with rapid decay). Its adjoint I_{ω_0,t_0} is given by (22) and has the property that I_{ω_0,t_0}ϕ(t_0) = 0. Together, these operators satisfy the complementary left- and right-inverse relations

I^∗_{ω_0,t_0}(D − jω_0 Id)^∗ϕ = ϕ
(D − jω_0 Id) I_{ω_0,t_0}ϕ = ϕ

for all ϕ ∈ S.
Having a tight control on the action of I^∗_{ω_0,t_0} over S allows us to extend the right-inverse operator I_{ω_0,t_0} to an appropriate subset of tempered distributions φ ∈ S′ according to the rule ⟨I_{ω_0,t_0}φ, ϕ⟩ = ⟨φ, I^∗_{ω_0,t_0}ϕ⟩. Our complete set of inverse operators is summarized in Table I together with their equivalent Fourier-based definitions, which are also interpretable in the generalized sense of distributions. The first three entries of the table are standard results from the theory of linear systems (e.g., [49, Table 4.1]), while the other operators are specific to this work.
D. Solution of Generic Stochastic Differential Equation
We now have all the elements to solve the generic stochastic linear differential equation

∑_{n=0}^{N} a_n D^n s = ∑_{m=0}^{M} b_m D^m w    (23)
TABLE I
FIRST-ORDER DIFFERENTIAL OPERATORS AND THEIR INVERSES
where the a_n and b_m are arbitrary complex coefficients with the normalization constraint a_N = 1. While this reminds us of the textbook formula of an ordinary Nth-order differential system, the non-standard aspect in (23) is that the driving term is an innovation process w, which is generally not defined pointwise, and that we are not imposing any stability constraint. Eq. (23) thus covers the general case (12) where L is a shift-invariant operator with the rational transfer function
simplification using the left-inverse property of L^{−1∗}. The statement about independence follows from the exponential nature of the characteristic function and the property that f(0) = 0, which allows for the factorization of the characteristic function when the supports of the kernels are distinct

7 A technical remark is in order here: the substitution of a non-smooth function such as φ_i ∈ R in the characteristic noise functional Pw is legitimate provided that the domain of continuity of the functional can be extended from S to R, or, even less restrictively, to L_p when f is p-admissible (see Theorem 3).
(independence of the noise at every point). The correlation formula is obtained by direct application of the first result in Property 1 with ϕ_1 = ψ_{i_1,k_1} = L^∗φ_{i_1}(· − 2^{i_1}k_1) and ϕ_2 = ψ_{i_2,k_2} = L^∗φ_{i_2}(· − 2^{i_2}k_2).
These results provide a complete characterization of the statistical distribution of sparse stochastic processes in some matched wavelet domain. They also indicate that the representation is intrinsically sparse since the transformed-domain statistics are infinitely divisible. Practically, this translates into the wavelet-domain pdfs being heavier-tailed than a Gaussian, unless the process is Gaussian (cf. argumentation in Section III-D).
To make matters more explicit, we consider the case where the innovation process is SαS. The application of Property 2 with f(ω) = −|ω|^α/α! yields f_i(ω) = −|σ_i ω|^α/α! with dispersion parameter σ_i = ‖φ_i‖_{Lα}. This proves that the wavelet coefficients of a generalized SαS stochastic process follow SαS distributions, with the spread of the pdf at scale i being determined by the L_α norm of the corresponding wavelet smoothing kernel. This strongly suggests that, for α < 2, the process is compressible in the sense that the essential part of the "energy content" is carried by a tiny fraction of wavelet coefficients, as illustrated in Fig. 1.
It should be noted, however, that the quality of the decoupling is strongly dependent upon the spread of the wavelet smoothing kernels φ_i, which should be chosen to be maximally localized for best performance. In the case of the first-order system (cf. example in Section II), the basis functions for fixed i are non-overlapping, which implies that the wavelet coefficients within a given scale are independent. This is not so across scales because of the cone-shaped region where the supports of the kernels φ_{i_1} and φ_{i_2} overlap, which induces dependencies. Incidentally, the inter-scale correlation of wavelet coefficients is often exploited for improving coding performance [50] and signal reconstruction by imposing joint sparsity constraints [51].
VI. LÉVY PROCESSES REVISITED

We now illustrate our method by specifying classical Lévy processes, denoted by W(t), via the solution of the (mar-
where β^{(0,0)} = β^0 ∗ β^0 is the causal B-spline of degree 1 (triangle function). Since D^∗ = −D, this confirms that the underlying smoothing kernels are dilated versions of a B-spline of degree 1. Moreover, since the wavelet-domain sampling is critical, there is no overlap of the basis functions within a given scale, which implies that the wavelet coefficients are independent on a scale-by-scale basis (cf. Property 3). If we now compare the situation with that of the Lévy increments, we observe that the wavelet analysis involves one more layer of smoothing of the innovation with β^0 (due to the factorization property of β^{(0,0)}), which slightly complicates the statistical calculations.
While the smoothing effect on the innovation is qualitatively the same in both instances, there are fundamental differences, too. In the wavelet case, the underlying discrete transform is orthogonal, but the coefficients are not fully decoupled because of the inter-scale dependencies, which are unavoidable, as explained in Section V-D. By contrast, the decoupling of the Lévy increments is perfect, but the underlying discrete transform (finite-difference transform) is non-orthogonal. In our companion paper, we shall see how this latter strategy is extendable to the much broader family of sparse processes via the definition of the generalized increment process.
C. Examples of Lévy Processes
Realizations of four different Lévy processes are shown in Fig. 3 together with their Lévy triplets (b_1, b_2, v(a)). The first signal is a Brownian motion (a.k.a. Wiener process) that is obtained by integration of a white Gaussian noise. This classical process is known to be nowhere differentiable in the classical sense, despite the fact that it is continuous everywhere (almost surely), as are all the members of the Lévy family. While the sampled version of Δ_0 W is i.i.d. in all cases, it does not yield a sparse representation in this first instance because the underlying distribution remains Gaussian. The second process, which may be termed Lévy-Laplace motion, is specified by the Lévy density v(a) = e^{−|a|}/|a|, which is not in L_1. By taking the inverse Fourier transform of (34), we can show that its increment process has a Laplace distribution [18]; note that
Fig. 3. Examples of Lévy motions W(t) with increasing degrees of sparsity. (a) Brownian motion with Lévy triplet (0, 1, 0). (b) Lévy-Laplace motion with (0, 0, e^{-|a|}/|a|). (c) Compound Poisson process with (0, 0, λ e^{-a²/2}/√(2π)) and λ = 1/32. (d) Symmetric Lévy flight with (0, 0, 1/|a|^{α+1}) and α = 1.2.
this type of generalized Gaussian model is often used to justify sparsity-promoting signal processing techniques based on ℓ1 minimization [52]–[54]. The third piecewise-constant signal is a compound Poisson process. It is intrinsically sparse since a good proportion of its increments is zero by construction (with probability e^{-λ}). Interestingly, this is the only type of Lévy process that fulfills the finite-rate-of-innovation property [17]. The fourth example is an alpha-stable Lévy motion (a.k.a. Lévy flight) with α = 1.2. Here, the distribution of Δ0W is heavy-tailed (SαS) with unbounded moments of order p > α. Although this may not be obvious from the picture, this is the sparsest process of the lot because it is ℓα-compressible in the strongest sense [45]. Specifically, we can compress the sequence so as to preserve any prescribed proportion r < 1 of its average ℓα energy by retaining an arbitrarily small fraction of samples as the length of the signal goes to infinity.
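All four examples can be simulated by cumulating i.i.d. increments drawn from the corresponding distributions. The sketch below (our own construction in NumPy; the Chambers-Mallows-Stuck recipe is a standard way to draw SαS variates and is not taken from the paper) generates one path per model and checks the built-in sparsity of the compound Poisson increments:

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam, alpha = 1000, 1 / 32, 1.2

# i.i.d. unit-interval increments; cumulative sums give the sample paths.
gauss = rng.standard_normal(n)                 # (a) Brownian motion
laplace = rng.laplace(size=n)                  # (b) Levy-Laplace motion
# (c) compound Poisson: an increment is 0 with probability exp(-lam),
#     otherwise a sum of Gaussian-amplitude jumps.
jumps = rng.poisson(lam, size=n)
poisson = np.array([rng.standard_normal(k).sum() for k in jumps])
# (d) symmetric alpha-stable increments (Chambers-Mallows-Stuck method)
U = rng.uniform(-np.pi / 2, np.pi / 2, size=n)
E = rng.exponential(size=n)
stable = (np.sin(alpha * U) / np.cos(U) ** (1 / alpha)
          * (np.cos((1 - alpha) * U) / E) ** ((1 - alpha) / alpha))

paths = {name: np.cumsum(inc) for name, inc in
         {"brownian": gauss, "laplace": laplace,
          "poisson": poisson, "stable": stable}.items()}

# Fraction of exactly-zero compound Poisson increments: close to exp(-1/32).
print(np.mean(poisson == 0))
```

Only the compound Poisson increments are exactly sparse; the stable increments are instead heavy-tailed, which is the sense in which they are ℓα-compressible.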
D. Link With Conventional Stochastic Calculus
Thanks to (30), we can view a white noise w = Ẇ as the weak derivative of some classical Lévy process W(t) that is well-defined pointwise (almost everywhere). This provides us with further insight into the range of admissible innovation processes of Section II-C, which constitute the driving terms of the general stochastic differential equation (12). This fundamental observation also makes the connection with stochastic calculus^8 [55], [56], which avoids the notion of white noise by relying on the use of stochastic integrals of the form

s(t) = ∫_R h(t, t′) dW(t′)

where W is a random (signed) measure associated to some canonical Brownian motion (or, by extension, a Lévy process) and where h(t, t′) is an integration kernel that formally corresponds to our inverse operator L^{-1} (see Theorem 5).
^8 The Itô integral of conventional stochastic calculus is based on Brownian motion, but the concept can also be generalized to Lévy driving terms using the more advanced theory of semimartingales [55].
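The stochastic integral above can be approximated numerically by an Itô-type Riemann sum over a fine grid, replacing dW(t′) by Gaussian increments of variance dt. The kernel h(t, t′) = e^{-(t-t′)} for t′ ≤ t used below is a hypothetical choice of ours (the causal Green's function of the first-order operator L = D + I, which yields a stationary CAR(1) process), not one prescribed by the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# Grid and Brownian increments: dW ~ N(0, dt) on each cell.
dt = 1e-3
t = np.arange(0, 5, dt)
dW = rng.standard_normal(t.size) * np.sqrt(dt)

def s_at(tk):
    """Riemann-Ito sum for s(tk) = sum_{t' <= tk} h(tk, t') dW(t')
    with the hypothetical causal kernel h(tk, t') = exp(-(tk - t'))."""
    mask = t <= tk
    return np.sum(np.exp(-(tk - t[mask])) * dW[mask])

samples = np.array([s_at(tk) for tk in (1.0, 2.0, 4.0)])
print(samples)
```

For a Lévy driving term, one would substitute non-Gaussian i.i.d. increments for dW; the Riemann-sum structure is unchanged.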
VII. CONCLUSION
We have set the foundations of a unifying framework that gives access to the broadest possible class of continuous-time stochastic processes specifiable by linear, shift-invariant equations, which is beneficial for signal processing purposes. We have shown that these processes admit a concise representation in a wavelet-like basis. We have applied our framework to the description of the classical Lévy processes, which, in our view, provide the simplest and most basic examples of sparse processes, despite the fact that they are non-stationary. We have also hinted at the link between Lévy increments and splines, which is the theme that we shall develop in full generality next [32].
We have demonstrated that the proposed class of stochastic models and the corresponding mathematical machinery (Fourier analysis, characteristic functional, and B-spline calculus) lend themselves well to the derivation of transform-domain statistics. The formulation suggests a variety of new processes whose properties are compatible with the currently dominant paradigm in the field, which is focused on the notion of sparsity. In that respect, the sparse processes that are best matched to conventional wavelets^9 are those generated by N-fold integration (with proper boundary conditions) of a non-Gaussian innovation. These processes, which are the solution of an unstable SDE (pole of multiplicity N at the origin), are intrinsically self-similar (fractal) and non-stationary. Last but not least, the formulation is backward compatible with the classical theory of Gaussian stationary processes.
ACKNOWLEDGMENT
The authors are thankful to Prof. Robert Dalang (EPFL Chair of Probabilities), Julien Fageot, and Dr. Arash Amini for helpful discussions.
^9 A wavelet with N vanishing moments can always be rewritten as ψ = D^N φ with φ ∈ L2(R), where the operator L = D^N is scale-invariant.
REFERENCES
[1] A. Papoulis, Probability, Random Variables, and Stochastic Processes. New York, NY, USA: McGraw-Hill, 1991.
[2] R. Gray and L. Davisson, An Introduction to Statistical Signal Processing. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[3] E. J. Candès and M. B. Wakin, "An introduction to compressive sampling," IEEE Signal Process. Mag., vol. 25, no. 2, pp. 21–30, Mar. 2008.
[4] A. M. Bruckstein, D. L. Donoho, and M. Elad, "From sparse solutions of systems of equations to sparse modeling of signals and images," SIAM Rev., vol. 51, no. 1, pp. 34–81, 2009.
[5] S. Mallat, A Wavelet Tour of Signal Processing: The Sparse Way, 3rd ed. San Diego, CA, USA: Academic Press, 2009.
[6] J.-L. Starck, F. Murtagh, and J. M. Fadili, Sparse Image and Signal Processing: Wavelets, Curvelets, Morphological Diversity. Cambridge, U.K.: Cambridge Univ. Press, 2010.
[55] P. Protter, Stochastic Integration and Differential Equations. New York, NY, USA: Springer-Verlag, 2004.
[56] P. J. Brockwell, "Lévy-driven CARMA processes," Ann. Inst. Statist. Math., vol. 53, no. 1, pp. 113–124, 2001.
Michael Unser (M'89–SM'94–F'99) received the M.S. (summa cum laude) and Ph.D. degrees in Electrical Engineering in 1981 and 1984, respectively, from the Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland. From 1985 to 1997, he worked as a scientist with the National Institutes of Health, Bethesda, MD, USA. He is now a full professor and Director of the Biomedical Imaging Group at the EPFL.
His main research area is biomedical image processing. He has a strong interest in sampling theories, multiresolution algorithms, wavelets, the use of splines for image processing, and, more recently, stochastic processes. He has published about 250 journal papers on those topics.
Dr. Unser is currently a member of the editorial boards of the IEEE J. SELECTED TOPICS IN SIGNAL PROCESSING, Foundations and Trends in Signal Processing, SIAM J. Imaging Sciences, and the PROCEEDINGS OF THE IEEE. He co-organized the first IEEE International Symposium on Biomedical Imaging (ISBI2002) and was the founding chair of the technical committee of the IEEE-SP Society on Bio Imaging and Signal Processing (BISP).
He received three Best Paper Awards (1995, 2000, 2003) from the IEEE Signal Processing Society and two IEEE Technical Achievement Awards (2008 SPS and 2010 EMBS). He is a EURASIP Fellow and a member of the Swiss Academy of Engineering Sciences.
Pouya D. Tafti was born in Tehran in 1981. He received his BSc degree in Electrical Engineering from Sharif University of Technology, Tehran, in 2003, his MASc in Electrical and Computer Engineering from McMaster University, Hamilton, Ontario, in 2006, and his PhD in Computer, Information, and Communication Sciences from EPFL, Lausanne, in 2011. From 2006 to 2012 he was with the Biomedical Imaging Group at EPFL, where he worked on vector field imaging and statistical models for signal and image processing. He currently resides in Germany, where he works as a data scientist.
Qiyu Sun received the BSc and PhD degrees in mathematics from Hangzhou University, China, in 1985 and 1990, respectively. He is a full professor with the Department of Mathematics, University of Central Florida. His prior positions were with Zhejiang University (China), the National University of Singapore, Vanderbilt University, and the University of Houston.
His research interests include sampling theory, Wiener's lemma, wavelet and frame theory, linear and nonlinear inverse problems, and Fourier analysis. He has published more than 100 papers on mathematics and signal processing, and written a book, An Introduction to Multiband Wavelets (Zhejiang University Press, 2001), with Ning Bi and Daren Huang. He is on the editorial boards of the journals Advances in Computational Mathematics, Numerical Functional Analysis and Optimization, and Sampling Theory in Signal and Image Processing.