Short-and-Sparse Deconvolution – A Geometric Approach

Yenson Lau∗,†, Qing Qu∗,‡, Han-Wen Kuo†, Pengcheng Zhou§, Yuqian Zhang¶, and John Wright†,‖

†Department of Electrical Engineering and Data Science Institute, Columbia University
‡Center for Data Science, New York University
§Department of Statistics and Center for Theoretical Neuroscience, Columbia University
¶Department of Computer Science, Cornell University
‖Department of Applied Physics and Applied Mathematics, Columbia University

July 27, 2020
Abstract

Short-and-sparse deconvolution (SaSD) is the problem of extracting localized, recurring motifs in signals with spatial or temporal structure. Variants of this problem arise in applications such as image deblurring, microscopy, neural spike sorting, and more. The problem is challenging in both theory and practice, as natural optimization formulations are nonconvex. Moreover, practical deconvolution problems involve smooth motifs (kernels) whose spectra decay rapidly, resulting in poor conditioning and numerical challenges. This paper is motivated by recent theoretical advances [ZLK+17, KZLW19], which characterize the optimization landscape of a particular nonconvex formulation of SaSD. These advances are used to derive a provable algorithm which exactly solves certain non-practical instances of the SaSD problem. We leverage the key ideas from this theory (sphere constraints, data-driven initialization) to develop a practical algorithm, which performs well on data arising from a range of application areas. We highlight key additional challenges posed by the ill-conditioning of real SaSD problems, and suggest heuristics (acceleration, continuation, reweighting) to mitigate them. Experiments demonstrate both the performance and generality of the proposed method.
Index terms — sparse blind deconvolution, convolutional dictionary learning, computational imaging, nonconvex optimization, alternating descent methods.
∗These authors contributed equally to this work.
1 Introduction

Signals in medical, scientific, and natural imaging can often be modeled as superpositions of basic, recurring motifs (see Figure 1 for an illustration). For example, in calcium imaging [SGHK03, GK12], the excitation of neurons produces short pulses of fluorescence, repeating at distinct firing times. In the material and biological sciences, repeated motifs often encode crucial information about the subject of interest; e.g., in nanomaterials these motifs correspond to defects in the crystal lattice due to doping [CSL+18]. In all of these applications, the motifs of interest are short, and they are sparsely distributed within the sample of interest. Signals with this short-and-sparse structure also arise in natural image processing: when a blurry image results from the resolution limit or a malfunction of the imaging procedure, it can be modeled as a short blur pattern applied to a visually plausible sharp image [CW98, RBZ06a, LWDF11a].
Figure 1: Natural signals with short-and-sparse structure. In calcium imaging (top), each neuronal spike induces a fluorescence pattern measuring a transient increase in calcium concentration. In photography (middle), photos with sharp edges (sparse in the gradient domain) are often obfuscated by blurring due to camera shake. In scanning tunneling microscopy (bottom), dopants embedded in some base material produce individual electronic signatures. In each of these cases, the observed signal can be modeled as a convolution between a short kernel and a sparse activation map.
Mathematically, an observed signal $y$ with this short-and-sparse (SaS) structure can be modeled as a convolution¹ of a short signal $a_0 \in \mathbb{R}^{n_0}$ and a much longer sparse signal $x_0 \in \mathbb{R}^m$ ($m \gg n_0$):

$$y = a_0 \circledast x_0. \tag{1.1}$$

In all of the above applications, the signals $a_0$ and $x_0$ are not known ahead of time. The short-and-sparse deconvolution (SaSD) problem asks us to recover these two signals from the observation $y$. This is a challenging inverse problem: natural optimization formulations are nonconvex and have many equivalent solutions. The kernel $a_0$ is often smooth, and hence attenuates high frequencies. Although study of the SaSD problem stretches back several decades and across several disciplines [Hay94, LB95, KH96], the need for efficient, reliable, and general purpose optimization methods remains.
One major challenge associated with developing methods for SaSD arises from our relatively limited understanding of the global geometric structure of nonconvex optimization problems. Our goal is to recover the ground truth $(a_0, x_0)$ (perhaps up to some trivial ambiguities), which typically requires us to obtain a globally optimal solution to a nonconvex optimization problem. This is impossible in general. Fortunately, recent theoretical evidence [ZKW18, KZLW19] guarantees that the SaSD problem can be solved efficiently under certain idealized assumptions.
¹For simplicity we follow the convention of [KZLW19] and use cyclic convolution throughout this paper, unless otherwise specified. The choice is superficial; any algorithms and results discussed here should also apply to linear convolution with minor modifications.
Using an appropriate selection of optimization domain and a specific initialization scheme, these results yield provable methods that solve certain instances of SaSD in polynomial time.

Unfortunately, practical SaSD problems raise additional challenges beyond the assumptions in theory, causing the provable methods [ZKW18, KZLW19] to fail on real problem instances. While the emphasis of [ZLK+17, KZLW19] is on theoretical guarantees, here we focus on practical computation. We show how to combine ideas from this theory with heuristics that better address the properties of practical deconvolution problems, to build a novel method that performs well on data arising in a range of application areas. Many of our design choices are natural and have a strong precedent in the literature. We will show how these natural choices help to cope with the (complicated!) geometry of practically occurring deconvolution problems. A critical issue in moving from theory to practice is the poor conditioning of naturally-occurring deconvolution problems: we show how to address this with a combination of ideas from sparse optimization, including momentum, continuation, and reweighting. The end result is a general purpose method, which we demonstrate on data for spike recovery [FZP17] and neuronal localization [PSG+16] from calcium imaging data, as well as fluorescence microscopy [RBZ06a].
Organization of the paper. The remainder of the paper is organized as follows. Section 2 introduces key aspects of SaSD, and Section 3 shows how they play out in a theoretical analysis of SaSD, culminating in a provable algorithm grounded in geometric intuition. In Section 4, we discuss how to combine this intuition with additional heuristics to create practical methods. Section 5 revisits and demonstrates these ideas in a simulated setting. Section 6 illustrates the performance of our method on data drawn from a number of applications. Finally, Section 7 reviews the literature and poses interesting future directions.
Reproducible research. Implementations of our algorithms can be found online:
https://github.com/qingqu06/sparse_deconvolution.
For more details of our work on SaSD, we refer interested
readers to our project website
https://deconvlab.github.io/.
2 Two Key Intuitions for SaS Deconvolution

We begin by describing two basic intuitions for SaS deconvolution, which play an important role in the geometry of optimization and the design of efficient methods.
Figure 2: Scaling-shift symmetry. The SaS convolution model exhibits a scaled shift symmetry: $\alpha s_\ell[a_0]$ and $\alpha^{-1} s_{-\ell}[x_0]$ have the same convolution as $a_0$ and $x_0$. Therefore, the ground truth $(a_0, x_0)$ can only be identified up to some scale and shift ambiguity.
(a) spiky (easiest): $\mu_s \approx 0$, $\theta \approx n_0^{-1/2}$; (b) generic (easy): $\mu_s \approx n_0^{-1/2}$, $\theta \approx n_0^{-3/4}$; (c) lowpass (difficult): $\mu_s \approx \mathrm{const}$, $\theta \approx n_0^{-1}$

Figure 3: Sparsity-coherence tradeoff [KZLW19]: examples with varying coherence parameter $\mu_s(a_0)$ and sparsity rate $\theta$ (i.e., the probability that a given entry is nonzero). Smaller shift-coherence $\mu_s(a_0)$ allows SaSD to be solved with higher $\theta$, and vice versa. In order of increasing difficulty: (a) when $a_0$ is a Dirac delta function, $\mu_s(a_0) = 0$; (b) when $a_0$ is sampled from a uniform distribution on the sphere $\mathbb{S}^{n_0-1}$, its shift-coherence is roughly $\mu_s(a_0) \approx n_0^{-1/2}$; (c) when $a_0$ is low-pass, $\mu_s(a_0) \to \mathrm{const.}$ as $n_0$ grows.
Symmetry structure. The SaS model exhibits a basic scaled shift symmetry: for any nonzero scalar $\alpha$ and cyclic shift $s_\ell[\cdot]$,

$$y = a_0 \circledast x_0 = (\pm\alpha\, s_\ell[a_0]) \circledast (\pm\alpha^{-1}\, s_{-\ell}[x_0]).$$

In other words, shifting $a_0$ to the right by $\ell$ samples and shifting $x_0$ to the left by the same amount leaves the convolution $a_0 \circledast x_0$ unchanged (see Figure 2). We can therefore only expect to recover the ground truth $(a_0, x_0)$ up to some scaling and some shift. As a result, natural optimization formulations for SaSD exhibit multiple global minimizers, corresponding to these scaled shifts of the ground truth. Due to the existence of multiple discrete global minimizers, natural formulations are nonconvex. Fortunately, this symmetry structure often leads to benign objective landscapes for optimization; two such examples for SaSD are [ZKW18, KZLW19].
Sparsity-coherence tradeoff. Clearly, not all SaSD problems are equally easy to solve. Problems with denser $x_0$ are more challenging. Moreover, there is a basic tradeoff between the sparsity of the spike train $x_0$ and the properties of the kernel $a_0$. If $a_0$ is smooth (e.g., Gaussian), then occurrences of $a_0$ would, on average, need to be relatively far apart to be distinguishable; i.e., $x_0$ would have to be sparser. Conversely, denser instances of $x_0$ should be allowable if $a_0$ is “spikier”.² One way of formalizing this tradeoff is through the shift-coherence of the kernel $a_0$, which measures the “similarity” between $a_0$ and its cyclic shifts:

$$\mu_s(a_0) := \max_{\ell \neq 0}\, \left| \left\langle \frac{a_0}{\|a_0\|_2},\; \frac{s_\ell[a_0]}{\|a_0\|_2} \right\rangle \right| \;\in\; [0, 1]. \tag{2.1}$$

As $\mu_s(a_0)$ increases, the shifts of $a_0$ become more correlated and hence closer together on the sphere. [KZLW19] uses this quantity to study the behavior of a particular nonconvex formulation of SaSD. For generic choices of $x_0$, such as $x_0 \sim \mathrm{BG}(\theta)$ drawn from a Bernoulli-Gaussian distribution, the sparsity-coherence tradeoff of [KZLW19] guarantees recoverability when the sparsity rate $\theta$ is sufficiently small relative to $\mu_s(a_0)$. Intuitively speaking, this implies that SaSD problems with smaller $\mu_s(a_0)$ tend to be “easier” to solve (Figure 3).
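The shift-coherence in Equation (2.1) is straightforward to compute numerically. Below is a minimal numpy sketch (illustrative only, not part of the released code) that evaluates $\mu_s$ over all nonzero cyclic shifts; the two examples at the bottom reflect cases (a) and (c) of Figure 3.

```python
import numpy as np

def shift_coherence(a0):
    """Shift-coherence mu_s(a0) of Eq. (2.1): the largest absolute inner
    product between the normalized kernel and its nonzero cyclic shifts."""
    a = a0 / np.linalg.norm(a0)
    # inner products with all nonzero cyclic shifts
    return max(abs(np.dot(a, np.roll(a, ell))) for ell in range(1, len(a)))

# A Dirac delta has mu_s = 0; a smooth (lowpass) Gaussian kernel has mu_s near 1.
delta = np.zeros(32); delta[0] = 1.0
gauss = np.exp(-0.5 * ((np.arange(32) - 15.5) / 4.0) ** 2)
print(shift_coherence(delta), shift_coherence(gauss))  # approx. 0 vs. close to 1
```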
²Similar tradeoffs occur in non-blind deconvolution, where $a_0$ is known (e.g., [CFG14]), and in other inverse problems.
(a) a single shift $s_{\ell_1}[a_0]$; (b) two shifts $s_{\ell_1}[a_0]$, $s_{\ell_2}[a_0]$; (c) multiple shifts

Figure 4: Geometry of the Approximate Bilinear Lasso loss $\varphi_{\mathrm{ABL}}(a)$ near superpositions of shifts of $a_0$ [KZLW19]. Top: function values of $\varphi_{\mathrm{ABL}}(a)$ visualized as height. Bottom: heat maps of $\varphi_{\mathrm{ABL}}(a)$ on the sphere $\mathbb{S}^{n-1}$. (a) The region near a single shift is strongly convex; (b) the region between two shifts contains a saddle point, with negative curvature pointing towards each shift and positive curvature pointing away; (c) region near the span of several shifts of $a_0$.
In the next section, we will use the idealized formulation of [KZLW19] to illustrate how these basic properties of the SaSD problem shape the landscape of optimization. In later sections, we will borrow these ideas to develop practical, general purpose methods. The major challenge in moving from theory to practice is in coping with highly coherent $a_0$: in most practical applications, $a_0$ is smooth and hence $\mu_s(a_0)$ is large.
3 Problem Formulation and Nonconvex Geometry

In this section, we summarize some recent algorithmic theory characterizing the optimization landscape of an idealized nonconvex formulation for SaSD [ZLK+17, KZLW19], with the goal of applying the geometric intuition from this theory towards designing practical optimization methods.
3.1 The Bilinear Lasso and its marginalization

A natural idea for solving SaSD is to minimize a reconstruction loss $\psi(a, x)$ between $a \circledast x$ and $y$, plus a sparsity-promoting regularizer $g(x)$ on $x$. This can be achieved, for instance, by minimizing the squared reconstruction error in combination with an $\ell_1$-penalty on $x$,

$$\min_{a,\, x}\; \Psi_{\mathrm{BL}}(a, x) := \tfrac{1}{2}\, \|y - a \circledast x\|_2^2 + \lambda \|x\|_1, \quad \text{s.t. } a \in \mathbb{S}^{n-1}. \tag{3.1}$$

This Bilinear Lasso (BL) problem resembles the Lasso estimator in statistics [Tib96], and is a nonconvex optimization problem. The sparsity of the solution for $x$ is controlled by the regularization penalty $\lambda$: a larger $\lambda$ leads to sparser $x$, and vice versa.³ We constrain $a$ to the sphere $\mathbb{S}^{n-1}$, which reduces the scaling ambiguity to a sign ambiguity. We also increase the dimension of $a$ to $n = 3n_0 - 2$; this creates an objective landscape that allows various descent methods to recover a full shift of $a_0$ and avoid any shift-truncation effects, upon the application of a simple data initialization scheme.

In this paper, we will also frequently refer to the marginalized Bilinear Lasso cost

$$\varphi_{\mathrm{BL}}(a) := \min_x \Psi_{\mathrm{BL}}(a, x) = \min_x\; \tfrac{1}{2}\, \|y - a \circledast x\|_2^2 + \lambda \|x\|_1. \tag{3.2}$$

It is clear that minimizing $\Psi_{\mathrm{BL}}(a, x)$ is equivalent to minimizing $\varphi_{\mathrm{BL}}(a)$ over $a \in \mathbb{S}^{n-1}$.

³[KZLW19] suggests a good choice $\lambda \in O(1/\sqrt{\theta n})$, where $\theta \in (0, 1)$ denotes the sparsity level.
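For concreteness, here is a small numpy sketch of the Bilinear Lasso objective in Equation (3.1), using FFT-based cyclic convolution (per footnote 1). The helper `cconv` zero-pads the kernel to the signal length and is reused in later sketches; this is illustrative code, not the released implementation.

```python
import numpy as np

def cconv(v, w, m):
    """Cyclic convolution of length m; v is zero-padded as needed."""
    return np.real(np.fft.ifft(np.fft.fft(v, m) * np.fft.fft(w, m)))

def bilinear_lasso(a, x, y, lam):
    """Psi_BL(a, x) from Eq. (3.1): squared reconstruction error plus l1 penalty."""
    r = y - cconv(a, x, len(y))
    return 0.5 * np.dot(r, r) + lam * np.sum(np.abs(x))
```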
3.2 Structured nonconvexity and geometric properties

To understand the nonconvex optimization landscape of the Bilinear Lasso, it is natural to study the marginalized objective in Equation (3.2). The benefit of this approach is twofold: (i) for a fixed $a$, the Lasso problem in Equation (3.2) is convex w.r.t. $x$, and (ii) the short kernel $a$ lives on a low dimensional manifold — the space $a \in \mathbb{S}^{n-1}$ is where measure concentrates when $x_0$ is generic random and has high dimension ($m \gg n_0$). Unfortunately, $\varphi_{\mathrm{BL}}(a)$ remains challenging for analysis; a major culprit is that the Lasso estimator in Equation (3.2) does not usually admit closed-form solutions.
Approximate Bilinear Lasso. When $a$ is incoherent ($\mu_s(a) \approx 0$), however, we approximately have $\|a \circledast x\|_2^2 \approx \|x\|_2^2$. Carrying this approximation through to Equation (3.1) yields an Approximate Bilinear Lasso (ABL) objective⁴ $\varphi_{\mathrm{ABL}}(a) = \min_x \Psi_{\mathrm{ABL}}(a, x)$, which satisfies $\varphi_{\mathrm{ABL}}(a) \approx \varphi_{\mathrm{BL}}(a)$ whenever $\mu_s(a) \approx 0$ [KZLW19]. For the purposes of our discussion, this objective serves as a valid simplification of the Bilinear Lasso when the true kernel is itself incoherent ($\mu_s(a_0) \approx 0$). Although such incoherence assumptions are stringent and impractical, $\varphi_{\mathrm{ABL}}(a)$ admits a simple analytical form and is more amenable to analysis as a result.
Geometry of $\varphi_{\mathrm{ABL}}$ in the span of a few shifts. Under the assumptions that $a_0$ is incoherent and $x_0$ is generic, $\varphi_{\mathrm{ABL}}(a)$ enjoys a number of nice properties on the sphere $\mathbb{S}^{n-1}$. In particular, Kuo et al. [KZLW19] provide a geometrical characterization of the optimization landscape of $\varphi_{\mathrm{ABL}}(a)$ near the span of several shifts⁵ of $a_0$:

1. Near a single shift of $a_0$. Within a local neighborhood of each shift $s_\ell[a_0]$, the optimization landscape of $\varphi_{\mathrm{ABL}}(a)$ exhibits strong convexity (Figure 4a), with a unique minimizer corresponding to a shift $s_\ell[a_0]$.

2. In the vicinity of two shifts. Near the span of two shifts,

$$S_{\{\ell_1, \ell_2\}} = \big\{ \alpha_1 s_{\ell_1}[a_0] + \alpha_2 s_{\ell_2}[a_0] : \alpha_1, \alpha_2 \in \mathbb{R} \big\} \cap \mathbb{S}^{n-1},$$

the only local minimizers are approximately $s_{\ell_1}[a_0]$ and $s_{\ell_2}[a_0]$. A saddle point $a_s$ exists at the symmetric superposition of the shifts (i.e., $a_s = \alpha_1 s_{\ell_1}[a_0] + \alpha_2 s_{\ell_2}[a_0]$ with $\alpha_1 \approx \alpha_2$), but can be escaped by taking advantage of the large negative curvature present⁶ (Figure 4b).

3. In the vicinity of multiple shifts. The geometric properties for two shifts carry over to those of multiple shifts of $a_0$. Any local minimizers over

$$S_I := \Big\{ \textstyle\sum_{\ell \in I} \alpha_\ell\, s_\ell[a_0] : \alpha_\ell \in \mathbb{R} \Big\} \cap \mathbb{S}^{n-1} \tag{3.3}$$

are again close to signed shifts (Figure 4c). Any saddle points present sit at symmetric superpositions of two or more shifts, and exhibit strong negative curvature in directions towards the participating shifts. Additionally, the function value of $\varphi_{\mathrm{ABL}}(a)$ increases when moving away from $S_I$.
[KZLW19] proves that these geometric properties of $\varphi_{\mathrm{ABL}}$ hold for sufficiently small⁷ $|I|$ whenever the sparsity-coherence tradeoff $n_0 \theta \lesssim \mu_s^{-1/2}(a_0)$ is satisfied. This bound is stringent, however, and shows that the ABL formulation is unsuited for practical applications, where $\mu_s(a_0)$ often approaches one as $n_0$ grows.

The benign optimization landscape of $\varphi_{\mathrm{ABL}}(a)$ has strong implications for optimization. Indeed, if we could initialize $a$ near $S_I$, iterates of many local descent methods such as [Gol80, CGT00, BAC18, NP06] can exploit gradient and negative curvature to remain near $S_I$, and eventually converge to the target solution – a signed shift of $a_0$.
⁴As our focus here is on solving the Bilinear Lasso, we intentionally omit the concrete forms of $\Psi_{\mathrm{ABL}}(a, x)$ and $\varphi_{\mathrm{ABL}}(a)$. Readers may refer to Section 2 of [KZLW19] for more details.
⁵When optimizing over $\mathbb{S}^{n-1}$, $n = 3n_0 - 2$, we denote the $\ell$-th (full) shift with the abuse of notation $s_\ell[a_0] = [\mathbf{0}_\ell;\, a_0;\, \mathbf{0}_{n-\ell-n_0}] \in \mathbb{S}^{n-1}$, for $\ell \in \{0, \dots, n - n_0\}$. Each shift is a length-$m$ cyclic shift of $a_0$, truncated to a length-$n$ window without removing any entries from $a_0$.
⁶Here, negative curvature means that the Hessian exhibits negative eigenvalues, such that the function can be decreased by following the negative eigenvector direction.
⁷It is sufficient for $|I| = O(\theta n_0)$, where $\theta$ is the probability that any entry of $x_0$ is nonzero [KZLW19].
Finding a good initialization is also deceptively simple: since $x_0$ is sparse, any length-$n_0$ truncation of the observation $y$ is itself approximately a superposition of a few shifts of $a_0$,

$$y = \sum_{\ell \in \mathrm{supp}(x_0)} (x_0)_\ell \cdot s_\ell[a_0]. \tag{3.4}$$

Therefore, if we simply choose $n_0$ consecutive entries of $y$ (e.g., $[y_i, y_{i+1}, \dots, y_{i+n_0-1}]$, $i \in [m - n_0]$) randomly from the observation $y$ and initialize $a$ by setting

$$a^{(0)} = P_{\mathbb{S}^{n-1}}\big( [\mathbf{0}_{n_0-1};\; y_i, y_{i+1}, \dots, y_{i+n_0-1};\; \mathbf{0}_{n_0-1}] \big), \tag{3.5}$$

then $a^{(0)} \in \mathbb{R}^n$ is close to a subsphere $S_I$ spanned by roughly $O(n_0 \theta)$ shifts of $a_0$. Moreover, any truncation effects are absorbed by the zero-padding in Equation (3.5). In [KZLW19], this initialization scheme is improved and made rigorous; interested readers may refer to Appendix B.3 for details.
(a) Approximate Bilinear Lasso $\varphi_{\mathrm{ABL}}(a)$; (b) Bilinear Lasso $\varphi_{\mathrm{BL}}(a)$

Figure 5: Approximate Bilinear Lasso vs. Bilinear Lasso. Given an incoherent true kernel $a_0 \sim U(\mathbb{S}^{n_0-1})$, we plot heat maps of the objective landscapes of (a) the Approximate Bilinear Lasso and (b) the Bilinear Lasso losses, restricted to the subsphere spanned by $a_0$, $s_1[a_0]$, and $s_2[a_0]$, shown as red dots on the heat map. The curvature properties of both objective landscapes are empirically similar at key locations, e.g., near and between shifts.
Optimization over the sphere. For both the Bilinear Lasso and the ABL, a unit-norm constraint on $a$ is enforced to break the scaling symmetry between $a_0$ and $x_0$. Choosing the $\ell_2$-norm, however, has surprisingly strong implications for optimization. The ABL objective, for example, is piecewise concave whenever $a$ is sufficiently far away from any shift of $a_0$, but the sphere induces positive curvature near individual shifts to create strong convexity. These two properties combine to ensure recoverability of $a_0$. In contrast, enforcing $\ell_1$-norm constraints often leads to spurious minimizers for deconvolution problems [LWDF11b, BVG13, ZLK+17].
Implications for the Bilinear Lasso. The ABL is an example of a formulation for SaSD possessing a (regionally) benign optimization landscape, which guarantees that efficient recovery is possible when $a_0$ is incoherent. Applications of sparse deconvolution, however, are often motivated by sharpening or resolution tasks [HBZ09, CFG14, CE16] where the motif $a_0$ is smooth and coherent ($\mu_s(a_0)$ is large). The ABL objective is a poor approximation of the Bilinear Lasso in such cases and therefore fails to yield practical algorithms.

In such cases, the Bilinear Lasso should be optimized directly, and Figure 5 shows that its loss surface does indeed share similar symmetry breaking properties with the ABL objective. In the next section, we apply the geometric intuition gained from the ABL formulation, in combination with a number of computational heuristics, to create an optimization method for SaSD that performs well in general problem instances.
4 Designing Practical Nonconvex Optimization Algorithms

Several algorithms for SaSD type problems have been developed for specific applications, such as image deblurring [LWDF11b, BDH+13, CE16], neuroscience [RPQ15, FZP17, SFB18], and image super-resolution [BK02, SGG+09, YWHM10].
In this section, however, we will instead leverage the intuition from Section 3 and build optimization methods for the Bilinear Lasso

$$\min_{a,\, x}\; \Psi_{\mathrm{BL}}(a, x) := \underbrace{\tfrac{1}{2}\, \|y - a \circledast x\|_2^2}_{\text{smooth } \psi(a, x)} \;+\; \underbrace{\lambda \|x\|_1}_{\text{nonsmooth } g(x)}, \quad \text{s.t. } a \in \mathbb{S}^{n-1}, \tag{4.1}$$

that perform well in general settings for SaSD, as the Bilinear Lasso more accurately accounts for the interactions present in $a \circledast x$ when $a_0$ is shift-coherent. In such situations, optimization of $\Psi_{\mathrm{BL}}$ will also suffer from slow convergence and poor resolution of $x_0$, which we will address in this section with a number of heuristics. This leads to an efficient and practical algorithm for solving sparse deconvolution problems.
4.1 Solving the Bilinear Lasso via an alternating descent method

Efficient global optimization of the nonconvex objective in Equation (4.1) is a nontrivial task, largely due to the existence of spurious local minima and saddle points. In the following, we introduce a simple first-order method dealing with these issues. As suggested by our discussion of the geometry of the Approximate Bilinear Lasso in Section 3, we avoid spurious minimizers using the data-driven initialization scheme introduced in Section 3.2. On the other hand, our study in Section 3 implies that all saddle points exhibit large negative curvature and can hence be effectively escaped by first-order methods⁸ [LPP+17, JGN+17, GBW18].
Starting from the data-driven initialization, we optimize the Bilinear Lasso using a first-order alternating descent method (ADM). The basic idea of our ADM algorithm is to alternate between taking first-order descent steps on $\Psi_{\mathrm{BL}}(a, x)$ w.r.t. one variable while the other is fixed:

Fix $a$ and take a descent step on $x$. At each iteration $k$, with fixed $a^{(k)}$, ADM first descends the objective $\Psi_{\mathrm{BL}}(a, x)$ by taking a proximal gradient step w.r.t. $x$ with an appropriate stepsize $t_k$,

$$x^{(k+1)} \leftarrow \mathrm{prox}_{\lambda t_k g}\Big( x^{(k)} - t_k \cdot \nabla_x \psi\big(a^{(k)}, x^{(k)}\big) \Big), \tag{4.2}$$

where $\mathrm{prox}_g(\cdot)$ denotes the proximal operator of $g(\cdot)$ [Nes13a]. Since the subproblem of minimizing $\Psi_{\mathrm{BL}}(a, x)$ only w.r.t. $x$ is the Lasso problem, the proximal step taken in Equation (4.2) is classical⁹ [BT09, PB+14].
Fix $x$ and take a descent step on $a$. Next, we fix the iterate $x^{(k+1)}$ and take a Riemannian gradient step [AMS09] w.r.t. $a$ over the sphere $\mathbb{S}^{n-1}$, with stepsize $\tau_k > 0$,

$$a^{(k+1)} \leftarrow P_{\mathbb{S}^{n-1}}\Big( a^{(k)} - \tau_k \cdot \mathrm{grad}_a\, \psi\big(a^{(k)}, x^{(k+1)}\big) \Big), \tag{4.3}$$

where $\mathrm{grad}_a\, \psi(a, x)$ denotes the Riemannian gradient of $\psi(a, x)$ w.r.t. $a$, and $P_{\mathbb{S}^{n-1}}(\cdot)$ is the projection operator onto the sphere $\mathbb{S}^{n-1}$. The Riemannian gradient $\mathrm{grad}_a\, \psi(a, x)$ can be interpreted as the standard gradient projected onto the (Euclidean) tangent space¹⁰ of $\mathbb{S}^{n-1}$ at the point $a$, and the projection operator $P_{\mathbb{S}^{n-1}}(\cdot)$ ensures that our iterate stays on the sphere.¹¹
ADM simply alternates between the steps of Equation (4.2) and Equation (4.3) until convergence, and can seamlessly incorporate other acceleration techniques that we discuss later in this section. We refer readers to Appendix B.1 for more implementation details.
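To make the two updates concrete, here is a minimal numpy sketch of one ADM iteration on Equations (4.2)–(4.3). It reuses `cconv` from the earlier sketch, adds the adjoint (cyclic correlation) `ccorr`, and uses fixed stepsizes `t`, `tau` in place of the backtracking line search used in practice; soft-thresholding is the proximal operator of the $\ell_1$ norm. This is an illustrative sketch, not the released implementation.

```python
import numpy as np

def ccorr(v, w, m):
    """Cyclic correlation of length m: the adjoint of v's convolution operator."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(v, m)) * np.fft.fft(w, m)))

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def adm_step(a, x, y, lam, t, tau):
    """One ADM iteration: proximal gradient in x (Eq. (4.2)), then a
    Riemannian gradient step in a followed by retraction (Eq. (4.3))."""
    m, n = len(y), len(a)
    # x-step: the gradient of psi w.r.t. x is the correlation of a with the residual
    r = cconv(a, x, m) - y
    x = soft_threshold(x - t * ccorr(a, r, m), lam * t)
    # a-step: Euclidean gradient w.r.t. a, restricted to the first n entries
    r = cconv(a, x, m) - y
    g = ccorr(x, r, m)[:n]
    g = g - np.dot(g, a) * a          # project onto the tangent space at a
    a = a - tau * g
    return a / np.linalg.norm(a), x   # retraction back onto the sphere
```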
The geometric intuition gained in Section 3 is based on the marginalized objective $\varphi_{\mathrm{BL}}(a)$ over the sphere $\mathbb{S}^{n-1}$, whereas here we simply take a descent step on $\Psi_{\mathrm{BL}}(a, x)$ w.r.t. $x$, rather than minimizing over $x$ explicitly, to reduce the computational complexity per iteration.
⁸In [ZLK+17] and [KZLW19], second-order trust-region [CGT00, BAC18] and curvilinear search [Gol80, GMWZ17] methods were employed for solving SaSD. Although second-order methods can also escape strict saddle points by directly exploiting the Hessian, they are much more expensive computationally and hence not practical for large datasets.
⁹Equation (4.2) can also be rewritten and interpreted as $x^{(k+1)} = x^{(k)} - t_k G_{t_k}(x^{(k)})$ with the composite gradient mapping $G_{t_k}$ [Nes13a]. $G_{t_k}$ behaves like the “gradient” on the smooth Moreau envelope of $\Psi_{\mathrm{BL}}(a, x)$, as a function of $x$.
¹⁰The tangent space is an $(n-1)$-dimensional Euclidean linear space, containing all the tangent vectors at $a \in \mathbb{S}^{n-1}$. We refer readers to [AMS09, Section 3] for more concrete definitions.
¹¹The Riemannian gradient step uses a specific manifold retraction operator on the sphere, which takes a point from the tangent space at some point $a$ and pushes it to a new point on the manifold. We refer interested readers to Section 3 of [AMS09] for more details.
Nonetheless, the sequence of gradients $\nabla_a \Psi_{\mathrm{BL}}(a^{(k)}, x^{(k)})$ approximates $\nabla \varphi_{\mathrm{BL}}(a^{(k)})$ as $k \to \infty$, since ADM is guaranteed to converge to a stationary point [BST14, PS16]. Therefore, ADM on $\Psi_{\mathrm{BL}}(a, x)$ eventually becomes equivalent to Riemannian gradient descent on $\varphi_{\mathrm{BL}}(a)$.
4.2 Heuristics for improving the geometry of the Bilinear Lasso

Although the Bilinear Lasso is able to account for the interactions between $a_0$ and $x_0$ under high coherence, the smooth term $\|a \circledast x - y\|_2^2$ nonetheless becomes ill-conditioned as $\mu_s(a_0)$ increases, leading to slow convergence on practical problem instances. Here we discuss a number of heuristics which help to obtain faster algorithmic convergence and produce better solutions in such settings.
(a) standard gradient descent; (b) with momentum acceleration

Figure 6: Momentum acceleration. The left figure shows the behavior of standard gradient descent, which oscillates on functions with ill-conditioned Hessians; the right figure shows that, by incorporating previous steps, momentum acceleration alleviates the oscillation effects and achieves faster convergence.
4.2.1 Accelerating first-order descent under high coherence

When $\mu_s(a_0)$ is large, the Hessian of $\Psi_{\mathrm{BL}}$ becomes ill-conditioned as $a$ converges to single shifts: the objective landscape contains “narrow valleys” in which first-order methods tend to exhibit severe oscillations (Figure 6a) [Nes13b]. For a nonconvex problem such as the Bilinear Lasso, iterates of first-order methods may encounter many narrow and flat valleys along the descent trajectory, resulting in slow convergence.
One remedy is to add momentum [Pol64, BT09] to standard first-order iterations. For example, when updating $x$, we can modify the iteration in Equation (4.2) by

$$w^{(k)} = x^{(k)} + \beta \cdot \underbrace{\big( x^{(k)} - x^{(k-1)} \big)}_{\text{inertial term}}, \tag{4.4}$$

$$x^{(k+1)} = \mathrm{prox}_{t_k g}\Big( w^{(k)} - t_k\, \nabla_x \psi\big(a^{(k)}, w^{(k)}\big) \Big). \tag{4.5}$$
Here, the inertial term incorporates the momentum from previous iterations, and $\beta \in (0, 1)$ controls the inertia.¹² In a similar fashion, we can modify the iteration [AMS09] for updating¹³ $a$ in Equation (4.3). We term the new algorithm the inertial alternating descent method (iADM), and refer readers to Appendix B.1.2 for more details.

As illustrated in Figure 6b, the additional inertial term improves convergence by substantially reducing oscillation effects for ill-conditioned problems. The acceleration provided by momentum methods for convex problems is well-known in practice.¹⁴ Recently, momentum methods have also been proven to improve convergence for nonconvex and nonsmooth problems [PS16, JNJ18].
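As a sketch, the momentum-accelerated $x$-update of Equations (4.4)–(4.5) modifies the earlier `adm_step` as follows, reusing `cconv`, `ccorr`, and `soft_threshold` from the previous sketches; again a fixed stepsize stands in for the backtracking used in practice.

```python
def iadm_x_step(x, x_prev, a, y, lam, t, beta=0.9):
    """Inertial x-update, Eqs. (4.4)-(4.5). Setting beta = 0 recovers
    the plain proximal gradient step of Eq. (4.2) (cf. footnote 12)."""
    m = len(y)
    w = x + beta * (x - x_prev)                     # inertial term, Eq. (4.4)
    r = cconv(a, w, m) - y
    return soft_threshold(w - t * ccorr(a, r, m), lam * t)   # Eq. (4.5)
```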
¹²Setting $\beta = 0$ here removes momentum and reverts to standard proximal gradient descent.
¹³It modifies iPALM [PS16] to perform updates on $a$ via retraction on the sphere.
¹⁴In the setting of a strongly convex and smooth function $f(z)$, the momentum method improves the iteration complexity from $O(\kappa \log(1/\varepsilon))$ to $O(\sqrt{\kappa} \log(1/\varepsilon))$, with $\kappa$ being the condition number, while leaving the computational complexity approximately unchanged [B+15].
(a) $\lambda = 5 \times 10^{-1}$; (b) $\lambda = 5 \times 10^{-2}$; (c) $\lambda = 5 \times 10^{-3}$

Figure 7: Low-dimensional function landscape of the Bilinear Lasso with varying $\lambda$. Each subfigure shows the objective $\varphi_{\mathrm{BL}}(a)$, with $a$ restricted to the subsphere $S_{\{0,1,2\}}$ defined in Equation (3.3), for varying choices of $\lambda > 0$. The kernel $a_0$ is incoherent and drawn uniformly from the sphere. The red dots denote the locations of the kernel and its shifts.
4.2.2 A practical method for SaSD based on homotopy continuation

It is also possible to improve optimization by modifying the objective $\Psi_{\mathrm{BL}}$ directly through the sparsity penalty $\lambda$. Variations of this idea appear in both [ZLK+17] and [KZLW19], and can also help to mitigate the effects of large shift-coherence in practical problems.

When solving (3.1) in the noise-free case, it is clear that larger choices of $\lambda$ encourage sparser solutions for $x$. Conversely, smaller choices of $\lambda$ place local minimizers of the marginal objective $\varphi_{\mathrm{BL}}(a) := \min_x \Psi_{\mathrm{BL}}(a, x)$ closer to signed shifts of $a_0$ by emphasizing reconstruction quality. When $\mu_s(a_0)$ is large, however, $\varphi_{\mathrm{BL}}$ becomes ill-conditioned as $\lambda \to 0$ due to the poor spectral conditioning of $a_0$, leading to severe flatness near local minimizers (Figure 7) and the creation of spurious local minimizers when noise is present. At the expense of precision, larger values of $\lambda$ limit $x$ to a small set of support patterns and simplify the landscape of $\varphi_{\mathrm{BL}}$. It is therefore important both for fast convergence and for accurate recovery that $\lambda$ be chosen appropriately.
When problem parameters – such as the severity of noise, or $n_0$ and $\theta$ – are not known a priori, a homotopy continuation method [HYZ08, WNF09, XZ13] can be used to obtain a range of solutions for SaSD. Using the initialization (3.5), a rough estimate $(\hat{a}^{(1)}, \hat{x}^{(1)})$ is first obtained by solving (3.1) with iADM using a large choice of $\lambda^{(1)}$; this estimate is refined by gradually decreasing $\lambda^{(n)}$ to produce the solution path $\{(\hat{a}^{(n)}, \hat{x}^{(n)};\, \lambda^{(n)})\}$. By ensuring that $x$ remains sparse along the solution path, homotopy provides the objective $\Psi_{\mathrm{BL}}$ with (restricted) strong convexity w.r.t. both $a$ and $x$ throughout optimization [ANW10]. As a result, homotopy achieves linear convergence for SaSD where sublinear convergence is expected otherwise (Figures 13 and 14).
Algorithm for SaSD. We summarize our discussion by presenting a practical algorithm for solving SaSD (Algorithm 1), which initializes $a$ using Equation (3.5) and subsequently finds a local minimizer of the Bilinear Lasso using homotopy continuation, combined with the accelerated first-order iADM method, with an appropriate choice of $\lambda$. We note, however, that it should be possible to substitute iADM with any first- or second-order descent method (e.g., the Riemannian trust-region method [ABG07, CSL+18]). We compare some of these different choices in Section 5.2.
For Algorithm 1, we usually set $\beta = 0.9$ to incorporate sufficient momentum for iADM; setting $\beta$ too large, however, can cause iADM to diverge. The stepsizes $t_k$ and $\tau_k$ in iADM are obtained by backtracking (linesearch) [NW06, PS16]. We often set the initial penalty $\lambda_0 = \big\| \mathcal{C}^*_{\iota_{n \to m} a^{(0)}}\, y \big\|_\infty$ large enough to ensure sparse $x$, and choose $\lambda_\star$ based on problem dimension and noise level (often $\lambda_\star = 0.1/\sqrt{n}$ is a good choice). Typically, a good choice is to set the decay parameter $\eta = 0.9$ and the precision factor $\delta = 0.1$. We refer readers to the Appendices for more implementation details.
4.3 Extension to convolutional dictionary learning

The optimization methods we introduced for SaSD can be naturally extended to tackle sparse blind deconvolution problems with multiple unknown kernels/motifs (a.k.a. convolutional dictionary learning [CF17, GCW18]), which has broad applications in microscopy data analysis [YHV17, ZCB+14, CSL+18] and neural spike sorting [ETS11, RPQ15, SFB18].
Algorithm 1 Solving SaSD with homotopy continuation
Input: measurement $y \in \mathbb{R}^m$; momentum parameter $\beta \in [0, 1)$; initial and final sparsity penalties $\lambda_0, \lambda_\star$ ($\lambda_0 > \lambda_\star$); penalty decay parameter $\eta \in (0, 1)$; precision factor $\delta \in (0, 1)$; tolerance $\varepsilon_\star$.
Output: final solution $(\hat{a}_\star, \hat{x}_\star)$.
Set the iteration number $K \leftarrow \lfloor \log(\lambda_\star / \lambda_0) / \log \eta \rfloor$.
Initialize $\hat{a}^{(0)} \in \mathbb{R}^n$ using Equation (3.5), $\hat{x}^{(0)} = \mathbf{0}_m$, $\lambda^{(0)} = \lambda_0$, and $\varepsilon^{(0)} = \delta \lambda^{(0)}$.
for $k = 1, \dots, K$ do
    Solve
    $$\min_{a \in \mathbb{S}^{n-1},\, x}\; \Psi_{\lambda^{(k-1)}}(a, x) := \tfrac{1}{2}\, \|y - a \circledast x\|_2^2 + \lambda^{(k-1)} \|x\|_1$$
    to precision $\varepsilon^{(k-1)} = \delta \lambda^{(k-1)}$ via iADM, using $(\hat{a}^{(k-1)}, \hat{x}^{(k-1)})$ as a warm start:
    $$(\hat{a}^{(k)}, \hat{x}^{(k)}) \leftarrow \mathrm{iADM}\big( \hat{a}^{(k-1)}, \hat{x}^{(k-1)};\; y,\, \lambda^{(k-1)},\, \beta \big).$$
    Update $\lambda^{(k)} \leftarrow \eta \lambda^{(k-1)}$.
end for
Final round: starting from $(\hat{a}^{(K)}, \hat{x}^{(K)})$, optimize $\Psi_{\lambda_\star}(a, x)$ with penalty $\lambda_\star$ to precision $\varepsilon_\star$ via
$$(\hat{a}_\star, \hat{x}_\star) \leftarrow \mathrm{iADM}\big( \hat{a}^{(K)}, \hat{x}^{(K)};\; y,\, \lambda_\star,\, \beta \big).$$
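In code, Algorithm 1 is a thin driver loop around an iADM solver. A minimal sketch, assuming a routine `iadm(a, x, y, lam, beta, eps)` that runs the accelerated updates of Section 4.1 from the given warm start until precision `eps` is met (a hypothetical helper, not the released implementation), together with `init_kernel` from the earlier sketch:

```python
import numpy as np

def sasd_homotopy(y, n0, lam0, lam_star, eta=0.9, delta=0.1, beta=0.9):
    """Homotopy continuation driver for Algorithm 1 (sketch)."""
    a = init_kernel(y, n0)               # data-driven init, Eq. (3.5)
    x = np.zeros(len(y))
    K = int(np.floor(np.log(lam_star / lam0) / np.log(eta)))
    lam = lam0
    for _ in range(K):
        a, x = iadm(a, x, y, lam, beta, eps=delta * lam)  # warm-started solve
        lam *= eta                       # decay the sparsity penalty
    return iadm(a, x, y, lam_star, beta, eps=delta * lam_star)  # final round
```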
Figure 8: Convolutional dictionary learning. Simultaneous recovery of multiple unknown kernels $\{a_{0k}\}_{k=1}^N$ and sparse activation maps $\{x_{0k}\}_{k=1}^N$ from $y = \sum_{k=1}^N a_{0k} \circledast x_{0k}$.
As illustrated in Figure 8, the new observation $y$ in this task is the sum of $N$ convolutions between short kernels $\{a_{0k}\}_{k=1}^N$ and sparse maps $\{x_{0k}\}_{k=1}^N$,

$$y = \sum_{k=1}^N a_{0k} \circledast x_{0k}, \qquad a_{0k} \in \mathbb{R}^{n_0},\; x_{0k} \in \mathbb{R}^m \quad (1 \le k \le N).$$
The natural extension of SaSD, then, is to recover $\{a_{0k}\}_{k=1}^N$ and $\{x_{0k}\}_{k=1}^N$ up to signed shift and permutation ambiguities, leading to the SaS convolutional dictionary learning (SaS-CDL) problem. The SaSD problem can be seen as a special case of SaS-CDL with $N = 1$. Based on the Bilinear Lasso formulation in Equation (4.1) for solving SaSD, we constrain all kernels $a_{0k}$ to the sphere, and consider the following nonconvex objective:

$$\min_{\{a_k\}_{k=1}^N,\, \{x_k\}_{k=1}^N}\; \frac{1}{2} \Big\| y - \sum_{k=1}^N a_k \circledast x_k \Big\|_2^2 + \lambda \sum_{k=1}^N \|x_k\|_1, \quad \text{s.t. } a_k \in \mathbb{S}^{n-1} \;(1 \le k \le N). \tag{4.6}$$
When the kernels $\{a_{0k}\}_{k=1}^N$ are sufficiently incoherent to each other, we anticipate that all local minima are near signed shifts of the ground truth. Similar to the idea of solving the Bilinear Lasso in Equation (4.1), we optimize Equation (4.6) via ADM and its variants, by taking alternating descent steps on $\{a_k\}_{k=1}^N$ and $\{x_k\}_{k=1}^N$ with one set fixed. We refer readers to Appendix B and Appendix C for more technical details.
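The objective in Equation (4.6) extends the earlier Bilinear Lasso sketch across kernels; a minimal version, reusing `cconv` from before and storing the kernels and activation maps as columns (illustrative only):

```python
import numpy as np

def cdl_objective(A, X, y, lam):
    """Objective (4.6) for SaS-CDL. A is n-by-N (columns on the sphere),
    X is m-by-N (sparse activation maps), y is the length-m observation."""
    m, N = X.shape
    recon = sum(cconv(A[:, k], X[:, k], m) for k in range(N))
    r = y - recon
    return 0.5 * np.dot(r, r) + lam * np.sum(np.abs(X))
```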
4.4 Additional modifications for practical settings

Here we briefly summarize some prevalent issues that appear with real datasets, and how our SaS model and the associated deconvolution method can be adjusted to deal with these additional challenges (see the sketch after this list for the latter two items).

• Linear vs. cyclic convolution. In this work, we follow the convention of [KZLW19] and mainly discuss SaSD in the context of cyclic convolution. Linear convolution, however, is a better model for many practical SaSD tasks (e.g., involving natural images or time series). Despite this, there is no loss of generality, as any statements about cyclic convolution can easily be carried over to linear convolution; by zero-padding $x$ appropriately, one can always rewrite a linear convolution as a cyclic convolution. This is also convenient practically, as convolution operations should be implemented using fast Fourier transform techniques (which map directly to cyclic convolution) to reduce the computational complexity of each iteration.

• Resolution of $x_0$ under noise. We introduce a reweighting technique [CWB08] to deal with noisy datasets. The method adaptively sets a large penalty on small entries of $x$ to suppress noisy small entries, and a small penalty on large entries to promote sparse solutions of $x$. We refer readers to Appendix B.3 for more algorithmic details.

• Dealing with extra data structure. In many problems, such as calcium imaging [PSG+16] and spike sorting [SFB18], the sparse spike train $x_0$ is nonnegative. As we shall see in Section 5, enforcing a nonnegativity constraint on $x$ in ADM often enables recovery of denser $x_0$. Additionally, practical measurements often contain an unknown low-frequency DC component $b$, such that $y = a_0 \circledast x_0 + b$. We add an extra minimization step to ADM to deal with $b$. We refer readers to Appendix B.3 for more technical details.
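As a sketch of the last two modifications (illustrative; the paper's versions are in Appendices B.2–B.3): the nonnegativity constraint turns the proximal step into a one-sided shrinkage, and reweighting in the spirit of [CWB08] replaces the scalar penalty with entrywise weights.

```python
import numpy as np

def soft_threshold_nonneg(z, t):
    """Proximal step for t*||x||_1 with the constraint x >= 0."""
    return np.maximum(z - t, 0.0)

def reweight(x, eps=1e-3):
    """Entrywise penalty weights ~ 1/(|x_i| + eps): small (likely noisy)
    entries are penalized heavily, large entries only lightly."""
    return 1.0 / (np.abs(x) + eps)
```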
5 Synthetic Experiments

In this section, we experimentally demonstrate several core ideas presented in this work on both incoherent and coherent kernels. Incoherent kernels are randomly drawn as $a_0 \sim \mathrm{Uniform}(\mathbb{S}^{n_0-1})$, which leads to $\mu_s(a_0) \in O\big(\sqrt{\log n_0 / n_0}\big)$, diminishing w.r.t. the dimension $n_0$. Coherent kernels are discretized from the Gaussian window function $a_0 = g_{n_0, 0.5}$, where $g_{n_0, \sigma} := P_{\mathbb{S}^{n_0-1}}\Big( \big[ \exp\big( -\tfrac{(2i - n_0 - 1)^2}{\sigma^2 (n_0 - 1)^2} \big) \big]_{i=1}^{n_0} \Big)$; in this case $\mu_s(a_0) \to 1$ as $n_0$ grows. This allows us to illustrate some of the difficulties of optimization encountered by the Bilinear Lasso under high coherence, as well as the effectiveness of the heuristics proposed in Section 4 for alleviating these difficulties.
5.1 Recovery of the true kernel under coherence

(a) incoherent kernel; (b) coherent kernel

Figure 9: Incoherent vs. coherent kernels. The subfigures present the optimization landscape of $\varphi_{\mathrm{BL}}(a)$ w.r.t. $a \in \mathbb{S}^{n-1}$ defined in (3.2), restricted to a subspace spanned by three shifts of $a_0$. The left figure shows the landscape for an incoherent kernel, and the right one for a coherent kernel. The red dots denote the locations of the shifts of the ground truth $a_0$.
Low-dimensional plots of function landscapes. As $\mu_s(a_0)$ increases, the shifts of $a_0$ lie closer together on the sphere. We show how this affects the optimization landscape of the Bilinear Lasso $\varphi_{\mathrm{BL}}(a)$ over $a \in \mathbb{S}^{n-1}$ by plotting the objective restricted to the subsphere spanned by three shifts¹⁵ of $a_0 \in \mathbb{S}^{n_0-1}$, with $n_0 = 20$, $m = 2 \times 10^3$, $\theta = n_0^{-3/4}$, and $\lambda = 0.5$. From Figure 9, we see that $\varphi_{\mathrm{BL}}$ exhibits a clear symmetry breaking structure between the shifts of $a_0$ in the incoherent case. As $\mu_s(a_0)$ increases, however, adjacent shifts of $a_0$ lie close together and symmetry breaking becomes more difficult. Practically speaking, recovering a precise shift of $a_0$ becomes less important when recovering smooth, highly coherent kernels. Nonetheless, Figure 9 suggests that the target minimizers of $\varphi_{\mathrm{BL}}$ cease to be well separated in these cases.
Recovery performance. Next, we corroborate our observation of the sparsity-coherence tradeoff by comparing recovery performance for incoherent vs. coherent kernels. We fix $m = 100 n_0$, and plot the probability of successful recovery, where success occurs if

$$\min_{\ell \in [2 n_0]}\, \big\{ 1 - \big| \big\langle a_0,\; \iota^*_{n_0 \to n}\, s_\ell[a_\star] \big\rangle \big| \big\} \le 10^{-2},$$

w.r.t. the dimension $n_0$ and sparsity level $\theta$. For each $(n_0, \theta)$, we randomly generate ten independent instances of the data $y = a_0 \circledast x_0$. Here $a_\star$ denotes the optimal solution produced by minimizing $\Psi_{\mathrm{BL}}$ with $\lambda = 10^{-2}/\sqrt{\theta n_0}$. From Figure 10, we see that successful recovery is likely when the sparsity $\theta$ is sufficiently small compared to $n_0$ in general. Furthermore, recovering $a_0$ in the coherent setting is noticeably more difficult than in the incoherent setting, and typically requires lower sparsity rates $\theta$. Finally, enforcing extra structure such as nonnegativity in appropriate settings enables recovery of denser $x_0$ (Figures 10a and 10c).
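The success criterion above slides a length-$n_0$ window over the recovered kernel and checks alignment with $a_0$; a small sketch, assuming $a_0$ and the recovered $a_\star$ are unit vectors (illustrative helper, not from the released code):

```python
import numpy as np

def recovery_success(a0, a_star, tol=1e-2):
    """Success test from Sec. 5.1: some length-n0 restriction of a
    shift of a_star has inner product at least 1 - tol with a0."""
    n0 = len(a0)
    best = max(abs(np.dot(a0, a_star[ell:ell + n0]))
               for ell in range(len(a_star) - n0 + 1))
    return 1.0 - best <= tol
```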
5.2 Demonstration of data-driven initialization and homotopy acceleration

Our next experiments study the effectiveness of the data-driven (DD) initialization from Equation (3.5) and the heuristics introduced in Section 4, namely momentum acceleration and homotopy. Throughout this subsection, we set the kernel length $n_0 = 100$ and the number of samples $m = 10^4$. We generate the data $y = a_0 \circledast x_0 + b \mathbf{1}_m$ with both coherent and incoherent $a_0$, $x_0 \sim \mathrm{BR}(\theta)$ with sparsity level $\theta = n_0^{-3/4}$, and $b$ a constant unknown bias. No noise is added. We stop each algorithm either when the preset maximum number of iterations is reached, or when the difference between two consecutive iterates (in $\ell_2$ norm) falls below the threshold $10^{-6}$.
Effectiveness of data-driven initialization. We compare the ADM and iADM methods using the data-driven initialization of Equation (3.5) vs. uniformly random initializations for $a$. From Figures 11 and 12, we see that both methods converge faster to solutions of higher quality with data-driven initialization, as a result of $a^{(0)}$ being initialized near the superposition of a few shifts of $a_0$.
Convergence with acceleration and homotopy. Next we compare the convergence speeds of ADM, with and without momentum (iADM), and homotopy continuation. We use Equation (3.5) to initialize $a$, and $x$ is initialized as zero. From Figures 13 and 14, we see that applying acceleration and homotopy leads to significant improvements over vanilla ADM in terms of convergence rate, especially when $a_0$ is coherent.
5.3 Comparison with existing methods

Finally, we compare iADM, and iADM with homotopy, against a number of existing methods for minimizing $\varphi_{\mathrm{BL}}$. The first is alternating minimization [KZLW19], which at each iteration $k$ minimizes over $a^{(k)}$ with $x^{(k)}$ fixed using accelerated (Riemannian) gradient descent with backtracking, and vice versa. The next method is the popular alternating direction method of multipliers (ADMM) [BPC+11]. Finally, we compare against iPALM [PS16] with backtracking, using the unit ball constraint on $a_0$ instead of the unit sphere.

For each method, we deconvolve signals with $n_0 = 50$, $m = 100$, and $\theta = n_0^{-3/4}$ for both coherent and incoherent $a_0$. For iADM, iADM with homotopy, and iPALM we set $\alpha = 0.3$. For homotopy, we set $\lambda^{(1)} = \max_\ell |\langle s_\ell[a^{(0)}], y \rangle|$, $\lambda_\star = 0.3\, \lambda^{(1)}/\sqrt{n_0}$, and $\delta = 0.5$. Furthermore, we set $\eta = 0.5$ or $\eta = 0.8$, and for ADMM we set the slack parameter to $\rho = 0.7$ or $\rho = 0.5$, for incoherent and coherent $a_0$ respectively.
¹⁵For the incoherent kernel, we generate $a_0$ with the last two entries zero, and consider the subspace spanned by $a_0$, $s_1[a_0]$, $s_2[a_0]$. For the coherent kernel, we consider the subspace spanned by $a_0$, $s_{\lceil n_0/3 \rceil}[a_0]$, $s_{\lceil 2n_0/3 \rceil}[a_0]$.
(a) incoherent $a_0$ and $x_0 \sim_{\text{i.i.d.}} \mathrm{BR}(\theta)$; (b) coherent $a_0$ and $x_0 \sim_{\text{i.i.d.}} \mathrm{BR}(\theta)$; (c) incoherent $a_0$ and $x_0 \sim_{\text{i.i.d.}} \mathrm{B}(\theta)$; (d) coherent $a_0$ and $x_0 \sim_{\text{i.i.d.}} \mathrm{B}(\theta)$

Figure 10: Phase transitions for solving SaS-BD: (a) shows the case when $a_0$ is incoherent and $x_0 \sim_{\text{i.i.d.}} \mathrm{BR}(\theta)$; (b) the case when $a_0$ is coherent and $x_0 \sim_{\text{i.i.d.}} \mathrm{BR}(\theta)$; (c) the case when $a_0$ is incoherent and $x_0 \sim_{\text{i.i.d.}} \mathrm{B}(\theta)$; (d) the case when $a_0$ is coherent and $x_0 \sim_{\text{i.i.d.}} \mathrm{B}(\theta)$. For $x_0 \sim_{\text{i.i.d.}} \mathrm{B}(\theta)$, a positivity constraint is enforced. In each subfigure, brighter means higher probability of successful recovery, while darker means higher probability of failure.
From Figure 15, we can see that ADMM performs better than iADM in the incoherent case, but becomes less reliable in the coherent case. In both cases, iADM with homotopy is the best performer. Finally, we observe roughly equal performance between iPALM and iADM.
6 Experiments on Real Applications

In this section, we experimentally demonstrate the effectiveness of the proposed methods for both SaSD and SaS-CDL on a wide variety of applications in computational imaging and neuroscience. Our goal here is not necessarily to outperform state of the art methods, which are often tailored to specific applications. Rather, we hope to provide evidence that the intuition and heuristics highlighted in Sections 3 and 4 are widely applicable, and can serve as a robust starting point for tackling SaS problems broadly in areas of imaging science.
(a) function value convergence; (b) iterate convergence

Figure 11: Comparison of initialization methods for solving SaS-BD with an incoherent random kernel $a_0$: (a) shows the convergence of the function value $\Psi_{\mathrm{BL}}(a, x)$; (b) shows the iterate convergence on $a$, where $a_\star$ denotes a shift correction of each iterate $a$. Here, ADM-DD and iADM-DD denote the ADM and iADM methods using data-driven initialization, and ADM-randn and iADM-randn denote the ADM and iADM methods using initializations drawn uniformly at random from the sphere $\mathbb{S}^{n_0-1}$.
(a) function value convergence; (b) iterate convergence

Figure 12: Comparison of initialization methods for solving SaS-BD with a coherent smooth Gaussian kernel $a_0$: (a) shows the convergence of the function value $\Psi_{\mathrm{BL}}(a, x)$; (b) shows the iterate convergence on $a$, where $a_\star$ denotes a shift correction of each iterate $a$. Here, ADM-DD and iADM-DD denote the ADM and iADM methods using data-driven initialization, and ADM-randn and iADM-randn denote the ADM and iADM methods using initializations drawn uniformly at random from the sphere $\mathbb{S}^{n_0-1}$.
6.1 Sparse deconvolution of time sequences in neuroscience

6.1.1 Sparse deconvolution of calcium imaging

It is well known that neurons process and transmit information via discrete spiking activity. Whenever a neuron fires, it produces a transient change in the chemical concentrations of its immediate environment. Transients in calcium (Ca²⁺) concentration, for example, can be measured using calcium fluorescence imaging. The resulting fluorescence signal can be modeled as the convolution between a short transient response $a_0$ and a spike train in the form of a nonnegative, sparse map $x_0$:
(a) function value convergence; (b) iterate convergence

Figure 13: Comparison of algorithm convergence for solving SaS-BD with an incoherent random kernel $a_0$: (a) shows the convergence of the function value $\Psi_{\mathrm{BL}}(a, x)$; (b) shows the iterate convergence on $a$, where $a_\star$ denotes a shift correction of each iterate $a$. The algorithms compared here are ADM, iADM, and their homotopy accelerations.
(a) function value convergence; (b) iterate convergence

Figure 14: Comparison of algorithm convergence for solving SaS-BD with a coherent smooth Gaussian kernel $a_0$: (a) shows the convergence of the function value $\Psi_{\mathrm{BL}}(a, x)$; (b) shows the iterate convergence on $a$, where $a_\star$ denotes a shift correction of each iterate $a$. The algorithms compared here are ADM, iADM, and their homotopy accelerations.
$$\underbrace{y}_{\text{raw fluorescence trace}} = \underbrace{a_0}_{\text{transient response}} \circledast \underbrace{x_0}_{\text{action potentials}} + \underbrace{b \mathbf{1}_m}_{\text{bias}} + \underbrace{n}_{\text{noise}}, \qquad x_0 \ge 0. \tag{6.1}$$
The task of recovering the spike train $x_0$ from such SaS signals is frequently of interest in neuroscience, and can naturally be cast as a SaSD problem. An advantage of this approach is its ability to simultaneously estimate the transient response (which is rarely known a priori). This is important when neurons exhibit dense bursts of spiking activity, which is an especially challenging setting for deconvolution tasks.
(a) incoherent $a_0$; (b) coherent $a_0$

Figure 15: Algorithmic comparison. (a) Convergence of various methods minimizing $\Psi_{\mathrm{BL}}$ with incoherent $a_0$, plotted against the number of FFT operations used (for computing convolutions). The y-axis denotes the log of the angle between $a^{(k)}$ and the nearest shift of $a_0$, and each marker denotes five iterations. (b) Convergence for coherent $a_0$.
Simulated data. Recent work [VPM+10, PSG+16, FZP17] suggests that the calcium dynamics $y$ can be well approximated by an autoregressive (AR) process of order $r$,

$$y(t) = \sum_{i=1}^r \gamma_i\, y(t - i) + x_0(t) + b + n_s(t),$$

where $x_0(t)$ is the number of spikes that the neuron fired at the $t$-th timestep, $n_s(t)$ is noise, and $\{\gamma_i\}_{i=1}^r$ are the autoregressive parameters. [PSG+16, FZP17] showed that the AR($r$) model is equivalent to Equation (6.1) with a parameterized kernel $a_0$. The order $r$ is chosen to be a small positive integer, usually $r = 1$ or $r = 2$. When $r = 1$, the AR(1) kernel is a one-sided exponential function

$$a_0(t) = \exp(-t/\tau), \quad t \ge 0, \tag{6.2}$$

for some $\tau > 0$. The AR(1) model serves as a good approximation of the calcium dynamics when the temporal resolution of the imaging sensors is low. In contrast, the AR(2) model is more accurate for high temporal resolution calcium dynamics, with

$$a_0(t) = \exp(-t/\tau_1) - \exp(-t/\tau_2), \quad t \ge 0, \tag{6.3}$$

where $\tau_1$ and $\tau_2$ are parameters with $\tau_1 > \tau_2 > 0$. As illustrated in Figure 16, for high temporal resolution calcium dynamics, the AR(2) model tends to be the better model, as it captures the short rise time of calcium transients by the difference of two exponential functions.
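The two kernel families of Equations (6.2)–(6.3) are easy to generate; a small sketch, using the simulation settings described below ($n_0 = 100$ samples at $f = 100$ Hz), with the kernels normalized onto the sphere as in the experiments:

```python
import numpy as np

def ar_kernel(n0, fs, tau1, tau2=None):
    """AR(1) kernel exp(-t/tau1) (Eq. (6.2)), or the AR(2) kernel
    exp(-t/tau1) - exp(-t/tau2) (Eq. (6.3)) when tau2 is given."""
    t = np.arange(n0) / fs
    a = np.exp(-t / tau1) if tau2 is None else np.exp(-t / tau1) - np.exp(-t / tau2)
    return a / np.linalg.norm(a)

a_ar1 = ar_kernel(100, 100.0, tau1=0.25)             # AR(1), tau = 0.25
a_ar2 = ar_kernel(100, 100.0, tau1=0.2, tau2=0.03)   # AR(2), tau1 = 0.2, tau2 = 0.03
```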
Here we demonstrate the effectiveness of the proposed methods on synthetic data for both the AR(1) and AR(2) models. We generate a sequence of simulated calcium dynamics $y$ of length $T = 100$ s at sampling rate $f = 100$ Hz (i.e., $m = 10^4$ samples in total). We generate the kernel $a_0 \in \mathbb{R}^{n_0}$ of length $T = 1$ s (i.e., $n_0 = 100$): for the AR(1) model, we set $\tau = 0.25$ in Equation (6.2); for the AR(2) model, we set $\tau_1 = 0.2$ and $\tau_2 = 0.03$ in Equation (6.3). Each kernel is normalized to lie on the sphere. The sparse spike train $x_0$ is generated from a Bernoulli distribution, $x_0 \sim_{\text{i.i.d.}} \mathrm{B}(\theta)$, with sparsity rate $\theta = n_0^{-4/5}$. We set the bias $b = 1$ and noise $n \sim \mathcal{N}(0, \sigma^2 I)$, $\sigma = 5 \times 10^{-2}$, in Equation (6.1).

We test and compare the proposed iADM and its reweighted variant (see Appendix B.2) for deconvolving the data, with $\lambda = 10^{-1}$. Reweighting is especially effective under noise contamination (Section 4.4), as demonstrated by Figure 16, where it provides more accurate predictions of the unknown neuron kernels for both the AR(1) and AR(2) models. From Figures 17 and 18, we can clearly see that deconvolution is more difficult under the AR(2) model. In such cases reweighting can significantly improve the resolution of spiking activity, allowing accurate estimation of firing times even under dense bursts.
(a) AR(1) model; (b) AR(2) model

Figure 16: Recovery of the transient response $a_0$ for calcium imaging. The left figure shows the kernel $a_0$ for the AR(1) model, and the right figure shows the kernel $a_0$ for the AR(2) model.
(a) raw data vs. estimated calcium dynamics; (b) spike train, iADM algorithm, $\min_\ell \|x_0 - s_\ell[x_\star]\|_2 = 6.2541$; (c) spike train, reweighted-iADM algorithm, $\min_\ell \|x_0 - s_\ell[x_\star]\|_2 = 2.7989$

Figure 17: Estimation of the spike train $x_0$ for the AR(1) model. The first subfigure shows the estimation of the calcium dynamics, the second shows the estimation of the spike train $x_0$ by the iADM algorithm, and the third demonstrates the reweighted variant of iADM. $\min_\ell \|x_0 - s_\ell[x_\star]\|_2$ denotes the distance between the target $x_0$ and the estimated solution $x_\star$. As we observe, the proposed methods can accurately predict the spiking locations even when spikes overlap.
Real calcium imaging dataset. Finally, we demonstrate the effectiveness of the proposed methods on a real calcium imaging dataset.¹⁶ The data has been resampled to a sampling rate of $f = 100$ Hz, and linear drifting trends are removed from the calcium traces using robust regression [TBF+16]. Since these measurements are contaminated by large system noise, as is often the case in realistic settings, we choose a large sparsity penalty $\lambda = 6 \times 10^{-1}$ for Equation (3.1). Figure 19 shows the kernel recovered by the proposed iADM and its reweighted variant. Figure 20 shows the estimated spike train. By comparison, the reweighting method appears to produce a better estimate of the spiking activity.

¹⁶The data is obtained from the spikefinder website, http://spikefinder.codeneuro.org/.
(a) raw data vs. estimated calcium dynamics; (b) spike train, iADM algorithm, $\min_\ell \|x_0 - s_\ell[x_\star]\|_2 = 14.2118$; (c) spike train, reweighted-iADM algorithm, $\min_\ell \|x_0 - s_\ell[x_\star]\|_2 = 15.6756$

Figure 18: Estimation of the spike train $x_0$ for the AR(2) model. The first subfigure shows the estimation of the calcium dynamics, the second shows the estimation of the spike train $x_0$ by the iADM algorithm, and the third demonstrates the reweighted variant of iADM. We use $\min_\ell \|x_0 - s_\ell[x_\star]\|_2$ to denote the distance between the target $x_0$ and the estimated solution $x_\star$. In comparison with the original iADM algorithm, the reweighting method is very effective in suppressing noise.
(a) iADM; (b) reweighted iADM

Figure 19: Recovery of the transient response $a_0$ for the real dataset. The left figure shows the kernel recovered by the iADM algorithm; the right figure shows the kernel recovered by its reweighted variant.
6.1.2 Spike sorting by convolutional dictionary learning

Electrophysiological recordings made by electrodes usually capture superpositions of waveforms generated by multiple neurons simultaneously [RPQ15]. The goal of spike sorting is to estimate the spiking times from the measurement and to decompose the spiking activities of the individual neurons. We refer interested readers to [Lew98, RPQ15] for a more detailed overview of this problem. Traditional spike sorting approaches [QNBS04, CMB+17, YSE+18, CRQ18] are often time consuming, lack standardization, and involve manual intervention, which makes it difficult to maintain data provenance and assess the quality of scientific results.
In the following, we introduce a fully automated approach based on SaS-CDL and nonconvex optimization; this is similar to the approach taken by [SFB18]. Mathematically, the measured waveform can be modeled as a superposition of convolutions of individual neuron waveform templates and their corresponding spike trains:
(a) raw data vs. estimated calcium dynamics; (b) spike train, iADM algorithm; (c) spike train, reweighted-iADM algorithm

Figure 20: Estimation of the spike train $x_0$ for the real calcium imaging dataset. The first subfigure shows the estimation of the calcium dynamics, the second shows the estimation of the spike train $x_0$ by the iADM algorithm, and the third demonstrates the reweighted variant of iADM.
$$\underbrace{y}_{\text{voltage signal}} = \sum_{k=1}^N\; \underbrace{a_{0k}}_{\text{waveform template}} \circledast \underbrace{x_{0k}}_{\text{sparse spike train}} + \underbrace{b \mathbf{1}_m}_{\text{bias}} + \underbrace{n}_{\text{noise}},$$
where the waveform templates $\{a_{0k}\}_{k=1}^N \subset \mathbb{R}^{n_0}$ correspond to different neurons, and therefore exhibit different kernel shapes. Given the signal $y$, the task of spike sorting is to recover all $\{a_{0k}\}_{k=1}^N$ and $\{x_{0k}\}_{k=1}^N$; this is a classic example of the SaS-CDL problem as discussed in Section 4.3.

The difficulty of spike sorting (or SaS-CDL) is captured not only by the shift-coherence of each individual waveform $a_{0k}$, but also by the shift-coherence between different waveforms from $\{a_{0k}\}_{k=1}^N$: the problem becomes harder as the cross-correlation of differing kernels increases. Let $A_0 = [a_{01}, \cdots, a_{0N}]$. Quantitatively, we can define the mutual incoherence of $A_0$ by

$$\mu_m(A_0) = \max_{1 \le i < j \le N}\; \big\| \mathcal{C}^*_{a_{0i}}\, a_{0j} \big\|_\infty,$$

which is essentially the largest shift-correlation between all pairs of kernels. The SaS-CDL problem becomes easy when $\mu_m(A_0)$ is small, and vice versa. In the following, we demonstrate the effectiveness of the proposed methods for spike sorting on one easy dataset (with small $\mu_m(A_0)$) and one difficult dataset (with large $\mu_m(A_0)$).
small µm pA0q) and one difficult dataset (with largeµm pA0q).
Wedemonstrate the proposed reweighting variant of
iADMalgorithmon a classical spike-sortingdataset17.The signal is
sampled at a frequency of f “ 24kHz, and each time sequence records
spiking activities of 3different types of neurons. The waveform
templates are constructed using a database of 594 different
averagespike shapes compiled from recordings in the neocortex and
basal ganglia. A more detailed description ofdataset can be found
in Section 4 of [QNBS04]. We test the proposed method on two signal
sequences oflengthm “ 105, each measures the spiking activities of
three different types of neurons with length n0 “ 72:one signal
sequence is easy to deconvolve with lowmutual coherence µm pA0q,
and another is relatively more
17It can be downloaded online at
https://vis.caltech.edu/~rodri/Wave_clus/Wave_clus_home.htm.
The data is contaminated by random noise, with noise level 0.05 (i.e., the standard deviation relative to the amplitude of the spike classes). The recovered waveforms and sparse spike trains for the “easy” case are shown in Figure 21 and Figure 22, respectively, and the results for the “difficult” case are shown in Figure 23 and Figure 24. As we observe, the proposed method successfully recovers the waveform templates and spiking locations for each type of neuron. As the latter “difficult” signal sequence contains neuron waveforms of similar shapes, we observe slightly more false alarms in spike detection.
(a) Neuron 1; (b) Neuron 2; (c) Neuron 3

Figure 21: Recovered neuron waveform templates for the “easy” dataset. The data contains three neurons with distinct waveforms, with noise level 0.05. Each subfigure corresponds to the recovered waveform template of one specific type of neuron.
6.2 Microscopy imaging and data analysis

Finally, we apply our proposed method to applications in microscopy, and demonstrate its effectiveness in image super-resolution and decomposition problem settings.
6.2.1 Sparse blind deconvolution for super-resolution fluorescence microscopy

Fluorescence microscopy is a widely used imaging method in biomedical research [Hel07, FST08], and has enabled numerous breakthroughs in neuroscience [GK12], biology, and biochemistry [LC11, NN14, BBM+16]. The spatial resolution of fluorescence microscopy is, however, limited by diffraction: the wavelength of the light (i.e., several hundred nanometers) is often larger than the typical molecular length scales in cells, preventing a detailed characterization of most subcellular structures.
A computational imaging technique recently developed to overcome this resolution limit is stochastic optical reconstruction microscopy¹⁸ (STORM) [RBZ06b, HWBZ08, HBZ10]. Instead of activating all the fluorophores at the same time, STORM randomly activates subsets of photoswitchable fluorescent probes to separate the fluorophores present into multiple frames of sparsely activated molecules (see Figure 26 and Figure 27). From the perspective of the sparsity-coherence tradeoff, this effectively reduces the sparsity level of $x_0$, making the deconvolution easier to solve. Therefore, if the locations of these molecules can be precisely determined computationally for each frame, synthesizing all the deconvolved frames produces a super-resolution microscopy image with near-nanoscale resolution.
For each frame, the localization task can be formulated as a sparse deconvolution problem, i.e.,
$$\underbrace{Y}_{\text{frame}} \;=\; \underbrace{A_0}_{\text{point spread function}} \circledast \underbrace{X_0}_{\text{sparse point sources}} \;+\; \underbrace{N}_{\text{noise}},$$
where we want to recover $X_0$ given $Y$. The classical approaches solve the problem by fitting the blurred spots with Gaussian point-spread functions (PSFs) using either maximum-likelihood or Bayesian estimation techniques [QLL+10, HUK11, ZZEH12].
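As a concrete illustration of the per-frame subproblem above, the sketch below runs a few iterations of proximal gradient descent (ISTA) on the sparse map $X$ for a fixed PSF estimate $A$. This is only a minimal stand-in for the $x$-step of the method used here, under our own assumptions: the function name `deconvolve_frame`, the circular FFT-based convolution, and the step-size rule are illustrative choices.

```python
import numpy as np
from numpy.fft import fft2, ifft2

def soft(Z, t):
    # soft-thresholding: the proximal operator of t * ||.||_1
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def deconvolve_frame(Y, A, lam=0.1, n_iter=200):
    """Estimate a sparse point-source map X with A fixed, via ISTA on
    0.5 * ||Y - A (*) X||_F^2 + lam * ||X||_1 (circular convolution)."""
    Ah = fft2(A, s=Y.shape)                        # PSF transfer function
    step = 1.0 / max(np.abs(Ah).max() ** 2, 1e-12)  # 1 / Lipschitz constant
    X = np.zeros_like(Y)
    for _ in range(n_iter):
        R = np.real(ifft2(Ah * fft2(X))) - Y        # residual A (*) X - Y
        G = np.real(ifft2(np.conj(Ah) * fft2(R)))   # gradient of the smooth term
        X = soft(X - step * G, step * lam)
    return X
```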
¹⁸ Similar methods with different names were developed at around the same time using different fluorophores and microscopes, such as photoactivated localization microscopy (PALM) [BPS+06] and fluorescence photoactivation localization microscopy (fPALM) [HGM06].
(a) raw data vs. estimated sequence
(b) spike train for Neuron 1
(c) spike train for Neuron 2
(d) spike train for Neuron 3
Figure 22: Detected spike trains for the “easy” dataset. The data contains three neurons with distinct waveforms, with noise level 0.05. The first subfigure shows the estimate of the raw data sequence. The second to fourth subfigures show the predicted spike train for each neuron, respectively.
(a) Neuron 1 (b) Neuron 2 (c) Neuron 3
Figure 23: Recovered neuron waveform templates for the “difficult” dataset. The data contains three neurons with similar waveforms, with noise level 0.05. Each subfigure corresponds to the recovered waveform template of one specific type of neuron.
These approaches suffer from several limitations: (i) estimation is computationally expensive and poor in quality when dense clusters of fluorophores are activated; (ii) for 3D imaging, the PSF exhibits aberration across the focal plane [SN06], making it almost impossible to estimate it directly from the data.
To deal with these challenges, we solve the single-molecule localization problem using our proposed method for SaSD to jointly estimate the PSF $A_0$ and the point source map $X_0$. Our frames come from the single-molecule localization microscopy (SMLM) benchmarking dataset¹⁹.
(a) raw data vs. estimated sequence
(b) spike train for Neuron 1
(c) spike train for Neuron 2
(d) spike train for Neuron 3
Figure 24: Detected spike trains for the “difficult” dataset. The data contains three neurons with similar waveforms, with noise level 0.05. The first subfigure shows the estimate of the raw data sequence. The second to fourth subfigures show the predicted spike train for each neuron, respectively.
We apply the reweighted iADM algorithm to the 2D real video sequence “Tubulin”, which contains 500 high-density frames. The fluorescence wavelength is 690 nanometers (nm), the imaging frequency is $f = 25$ Hz, and each frame is of size $128 \times 128$. The single-molecule localization problem is solved on the same $128 \times 128$ pixel grid²⁰, where each pixel has 100 nm resolution. Figure 25 shows the recovered PSF, Figure 26 presents the recovered activation map for each individual time frame, and Figure 27 presents the aggregated super-resolution image. These results show that our approach can automatically estimate the PSF and the activation map for each video frame, producing higher-resolution microscopy images without manual intervention.
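To make the aggregation step explicit, a minimal sketch (our own illustration, with hypothetical file names, reusing the `deconvolve_frame` helper from the sketch above) sums the per-frame activation maps into a single super-resolution image:

```python
import numpy as np

# frames: array of shape (500, 128, 128) holding the raw video;
# loading from the SMLM benchmark files is omitted here
frames = np.load("tubulin_frames.npy")   # hypothetical path
A_hat = np.load("psf_estimate.npy")      # hypothetical path: current PSF estimate

# deconvolve each frame, then aggregate the sparse activation maps over time
super_res = np.zeros(frames.shape[1:])
for Y in frames:
    super_res += deconvolve_frame(Y, A_hat, lam=0.1)
```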
(a) PSF in 2D (b) PSF in 3D
Figure 25: Estimated PSF for STORM imaging. The left-hand side shows the estimated $8 \times 8$ PSF in 2D; the right-hand side visualizes the PSF in 3D.
¹⁹ All the data can be downloaded at http://bigwww.epfl.ch/smlm/datasets/index.html.
²⁰ Usually, the localization problem is solved on a finer grid (e.g., a grid with 4–10 times better resolution) so that the resulting resolution can reach 20–30 nm. We will discuss potential methods to deal with this finer-grid SaSD problem in Section 7.2 as future work.
(a) Frame 1, Time = 0s (b) Frame 100, Time = 4s
(c) Frame 200, Time = 8s (d) Frame 300, Time = 12s
(e) Frame 400, Time = 16s (f) Frame 500, Time = 20s
Figure 26: Predicted activation map for each individual frame. For each subfigure, the left-hand side shows the original video frame, and the right-hand side presents the activation map predicted by our SaSD solver.
(a) original image (b) reconstructed image
Figure 27: Aggregated result for STORM imaging. The left-hand side shows the original microscopy image, and the right-hand side presents the super-resolution image obtained by our method. The pixel resolution is 100 nm.
6.2.2 Convolutional dictionary learning for microscopy data analytics
Recent advances in imaging and computational techniques have resulted in the ability to obtain microscopic data in unprecedented detail and volume. SaSD and its extensions are well-suited for extracting motifs and location information from such datasets in neuroscience, materials science, and beyond, as we have seen in Section 6.1.2. In certain microscopy settings, the observed image can also be decomposed as
$$\underbrace{Y}_{\text{microscopy image}} \;=\; \sum_{k=1}^{K} \underbrace{A_{0k}}_{\text{motif } k} \circledast \underbrace{X_{0k}}_{\text{activation map}} \;+\; \underbrace{N}_{\text{noise}},$$
and useful information can be obtained by solving the resulting 2D SaS-CDL problem [PSG+16, CSL+18]. In this section, we demonstrate our proposed method for SaS-CDL on two different imaging modalities.
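To fix notation in code, the following sketch (a toy illustration under our own naming conventions, not the authors' implementation) evaluates the SaS-CDL objective $\frac{1}{2}\|Y - \sum_k A_k \circledast X_k\|_F^2 + \lambda \sum_k \|X_k\|_1$ for a collection of kernels and activation maps, using 2D circular convolution:

```python
import numpy as np
from numpy.fft import fft2, ifft2

def sascdl_objective(Y, A, X, lam):
    """SaS-CDL objective with K kernels.

    Y : (m, m) observed image
    A : list of K small kernels (zero-padded via fft2's `s` argument)
    X : list of K activation maps, each of shape (m, m)
    """
    recon = np.zeros_like(Y)
    for Ak, Xk in zip(A, X):
        # accumulate the circular convolution A_k (*) X_k
        recon += np.real(ifft2(fft2(Ak, s=Y.shape) * fft2(Xk)))
    data_term = 0.5 * np.linalg.norm(Y - recon) ** 2
    sparsity_term = lam * sum(np.abs(Xk).sum() for Xk in X)
    return data_term + sparsity_term
```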
(a) two-photon calcium image $Y$
(b) estimated kernels $A_k$ $(k = 1, 2)$
(c) predicted activation maps $X_k$ $(k = 1, 2)$
(d) classified images $Y_k = A_k \circledast X_k$ $(k = 1, 2)$
Figure 28: Localization and classification for calcium microscopy images. (a) shows the original image; (b) shows the estimated kernel shape for the neuron (left) and dendrite (right); (c) presents the predicted activation map for the neuron (left) and dendrite (right); (d) presents the reconstructed image $Y_k = A_k \circledast X_k$ $(k = 1, 2)$ for the neuron (left) and dendrite (right).
Neuronal localization for 2D calcium imaging. Tracking the spike locations of neurons in 2D calcium imaging video sequences is a challenging task due to the presence of (non)rigid motion, overlapping sources, and irregular background noise [PSG+16, GFK+17, GFG+19]. Here we show how the SaS-CDL problem can serve as a basis for distinguishing between overlapping sources. Figure 28a shows a single $512 \times 512$ frame from the two-photon fluorescence calcium microscopy dataset obtained by the Allen Institute for Brain Science²¹.
²¹ The data can be found at http://observatory.brain-map.org/visualcoding/search/overview.
The frame shows the cross sections of two types of neuronal components, the somata and the dendrites, whose fluorophores are activated at the given time frame. It is clear that these two components are primarily distinguished by their size. We decompose the frame into somatic and dendritic components by solving SaS-CDL with the proposed method, giving us a rough estimate of the “average” somatic or dendritic motif (Figure 28b), as well as the location of each component (Figure 28c). This allows the image to be decomposed into images consisting of somata or dendrites exclusively (Figure 28d). SaS-CDL can therefore serve as the basis for a preprocessing step that removes undesired components, such as the dendrites, from a microscopy image. Furthermore, this deconvolution technique allows the individual activation maps to be tracked across video frames, opening a new way to correct nonrigid motion across frames by synthesizing all activation maps. We leave this as a promising future research direction.
(a) STM image $Y$
(b) estimated kernels $A_k$ $(k = 1, 2)$
(c) predicted activation maps $X_k$ $(k = 1, 2)$
(d) classified images $Y_k = A_k \circledast X_k$ $(k = 1, 2)$
Figure 29: Defect detection for STM images. (a) shows the original STM image; (b) shows the estimated kernel shapes for the defects; (c) presents the predicted activation maps for the defects; (d) presents the reconstructed images $Y_k = A_k \circledast X_k$ $(k = 1, 2)$ for the defects.
Defect detection in scanning tunneling microscopy (STM) images. Modern high-resolution microscopes, such as the scanning tunneling electron microscope, are commonly used to study specimens that have dense and aperiodic spatial structure [CLE93, RCG+07, RSP+09]. Extracting meaningful information from images obtained with such microscopes remains a formidable challenge [KBF+03]. For instance, Figure 29a presents an STM image (of size $128 \times 128$) of a Co-doped NaFeAs iron arsenide crystal lattice. A method for automatically acquiring the signatures of the defects (motifs) and their locations is highly desirable [CSL+18]. Here we apply our proposed method to solve SaS-CDL and extract both the defect signatures (Figure 29b) and their locations (Figure 29c), as well as to decompose the image into contributions from the individual defects (Figure 29d).
7 Conclusion and Discussion
7.1 Relationship to the literature and conclusion
Nonconvex optimization. Unlike convex optimization problems, nonconvex functions usually have numerous spurious local minima, and one may also encounter “flat” saddle points that are very difficult to escape [SQW15]. In theory, even finding a local minimum of a general nonconvex function is NP-hard [MK87], never mind the global minimum. However, recent advances in nonconvex optimization [SQW15, GHJY15] showed that typical nonconvex problems in practice are often more structured, so that they often have much more benign geometric landscapes than the worst case: (i) all saddle points can be efficiently escaped by using negative curvature information; (ii) the equivalent “good” solutions (created by the intrinsic symmetry) are often the global optimizers of the nonconvex objective. This type of benign geometric structure has been discovered for many nonconvex problems in signal processing and machine learning, such as phase retrieval [CLS15, SQW18, QZEW17], dictionary learning [QSW14, SQW16a, SQW16b], low-rank matrix recovery [GLM16, Chi16], (orthogonal) tensor decomposition [GHJY15], and phase synchronization [BAC18]. Inspired by the similarly benign geometric structure of a simplified nonconvex Dropped Quadratic formulation, this work provides an efficient and practical nonconvex optimization method for solving sparse blind deconvolution problems.
Blind deconvolution. The blind deconvolution problem is ill-posed in its most general form. Nonetheless, problems in practice often exhibit intrinsic low-dimensional structure, showing promise for efficient optimization. Motivated by a variety of applications, many low-dimensional models for (blind) deconvolution problems have been studied in the literature. [ARR14, Chi16, LS15, LLB16, KK17, AD18, Li18] studied the problem when the unknown signals $a_0$ and $x_0$ either live in known low-dimensional subspaces, or are sparse in some known dictionary. These results assume that the subspace/dictionary is chosen at random, so that the problem does not exhibit the signed shift ambiguity and can be provably solved via convex relaxation²². However, the random subspace/dictionary assumption is often unrealistic in practice. Recently, [WC16, LB18, QLZ19] considered sparse blind deconvolution with multiple measurements, where they show that the problem can be efficiently solved to global optimality when the kernel is invertible. In contrast, the SaS model studied in this work admits a much broader range of applications.
Because of the shift symmetry, the SaS model does not appear to be amenable to convexification, and it exhibits a more complicated nonconvex geometry. To tackle this problem, Wipf et al. [WZ14] impose ℓ2 regularization on $a_0$ and provide an empirically reliable algorithm. Zhang et al. [ZLK+17] study the geometry of a simplified nonconvex objective over the sphere, and prove that in the dilute limit in which $x_0$ is a single spike, all strict local minima are close to signed shift truncations of $a_0$. Zhang et al. [ZKW18] formulated the problem as an ℓ4 maximization problem over the sphere²³. They proved that, on a restricted region of the sphere, every local minimizer is near a truncated signed shift of $a_0$, when $a_0$ is well-conditioned and $x_0$ is sparse. Kuo et al. [KZLW19] study a Dropped Quadratic simplification of the Bilinear Lasso objective, which provably obtains exact recovery for an incoherent kernel $a_0$ and sparse $x_0$. However, both the ℓ4 maximization and Dropped Quadratic objectives are still quite far from practical formulations for solving SaSD. In contrast, as demonstrated in this work, optimizing the Bilinear Lasso formulation turns out to be much more effective in practice.
Geometry-inspired optimization methods for SaSD. Inspired by the benign geometric structure of the nonconvex objective, we proposed efficient nonconvex optimization methods that directly optimize the Bilinear Lasso. The new approach exploits the geometry by (i) using data-driven initializations to avoid spurious local minimizers, (ii) adopting momentum acceleration for coherent kernels, and (iii) adaptively shrinking the penalty parameter λ to achieve faster convergence and more accurate solutions. Our vanilla algorithm is a simple alternating descent method, inspired by the recent PALM methods [BST14, PS16]. In comparison with classical alternating minimization methods for sparse blind deconvolution [CW00, SM12, ZLK+17], our approach does not require solving expensive Lasso subproblems, and the iterates make fast progress towards the optimal solution. On the other hand, as our method is first-order in nature, it is much more efficient
²² Some recent work [LLSW18, MWCC17] shows this problem can also be provably solved via nonconvex approaches.
²³ A similar objective is considered for the multichannel sparse blind deconvolution problem [LB18].
than the second-order trust-region [CGT00, BAC18] and curvilinear search [Gol80] methods considered in [ZLK+17, KZLW19].
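A minimal sketch of this first-order scheme is given below. It alternates a proximal gradient step on $x$ with a Euclidean gradient step on $a$ retracted back to the sphere, adds heavy-ball momentum, and geometrically shrinks λ. This is only an illustrative summary under our own assumptions; the step-size rules, shrink factor, and helper names are not the exact implementation.

```python
import numpy as np

def soft(z, t):
    # soft-thresholding: the proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sasd_alternating_descent(y, n0, lam0=1e-1, lam_min=1e-3,
                             shrink=0.9, beta=0.9, n_iter=200, seed=0):
    """Alternating descent on 0.5*||y - a (*) x||_2^2 + lam*||x||_1,
    with a on the sphere, momentum, and lambda continuation (illustrative)."""
    rng = np.random.default_rng(seed)
    m = len(y)
    # data-driven initialization: a normalized length-n0 window of y
    i = rng.integers(0, m - n0)
    a = y[i:i + n0] / np.linalg.norm(y[i:i + n0])
    x = np.zeros(m - n0 + 1)
    a_prev, x_prev, lam = a.copy(), x.copy(), lam0
    for _ in range(n_iter):
        # heavy-ball extrapolation
        a_e = a + beta * (a - a_prev)
        a_e /= np.linalg.norm(a_e)
        x_e = x + beta * (x - x_prev)
        a_prev, x_prev = a.copy(), x.copy()
        # proximal gradient step on x (circular-FFT Lipschitz estimate)
        L_a = max(np.abs(np.fft.fft(a_e, m)).max() ** 2, 1e-12)
        r = np.convolve(a_e, x_e) - y
        g_x = np.correlate(r, a_e, mode="valid")
        x = soft(x_e - g_x / L_a, lam / L_a)
        # gradient step on a, retracted back to the sphere
        L_x = max(np.abs(np.fft.fft(x, m)).max() ** 2, 1e-12)
        r = np.convolve(a_e, x) - y
        g_a = np.correlate(r, x, mode="valid")
        a = a_e - g_a / L_x
        a /= np.linalg.norm(a)
        # continuation: geometrically shrink the sparsity penalty
        lam = max(shrink * lam, lam_min)
    return a, x
```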
Convolutional dictionary learning. Furthermore, our approach extends naturally to the SaS-CDL problem, in which multiple unknown kernels are present. By considering a nonconvex objective analogous to that of SaSD, our geometry-inspired algorithm empirically solves the SaS-CDL problem to global optimality in a very efficient manner. The new method joins recent algorithmic developments for solving CDL [CPR13, HA15, PRSE17, GCW18, LGCWY18, MCCM18, ZSE19]. Again, most²⁴ of the previous approaches [GCW18] deploy an alternating minimization strategy, which exactly solves the expensive Lasso subproblem at each iteration. In contrast, our method is simpler, more efficient, and more effective, as demonstrated by experiments on real datasets.
7.2 Discussion and future work
Moving forward, we believe this work has opened up several future directions that could be of great empirical and theoretical interest.
Geometric analysis of the Bilinear Lasso. The Bilinear Lasso is one of the most natural formulations for solving the SaSD problem. In light of our empirical success in solving the Bilinear Lasso, analyzing and understanding its global nonconvex landscape is of great importance. As discussed in Section 3, the Dropped Quadratic formulation studied in [KZLW19] has commonalities with the Bilinear Lasso: both exhibit local minima at signed shifts, and both exhibit negative curvature in symmetry-breaking directions. However, a major difference (and hence, a major challenge) is that gradient methods for the Bilinear Lasso do not retract to a subspace; they retract to a more complicated, nonlinear set. Given the empirical success demonstrated here, a better understanding of the geometric structure of the Bilinear Lasso is much needed. Such an understanding would also shed light on SaS-CDL with multiple unknown kernels.
Parameterized sparse blind deconvolution. In this work, we studied the blind deconvolution problem with no prior knowledge of the kernel/motif $a_0$. However, in many applications one can obtain side information, and the kernel is often determined by only a few parameters associated with the underlying physical processes. For example, in the calcium imaging problem we studied in Section 6, an autoregressive (AR) model is often used to characterize the spiking and decaying process of the kernel, which is determined by only one or two parameters [VPM+10, FZP17]. How to estimate these kernel parameters thus raises a challenging but interesting question. Our preliminary investigation shows that the nonconvex optimization landscape of this parameterized “semi-blind” sparse deconvolution problem also possesses benign geometric properties for certain types of kernels (see Figure 30 for an illustration).
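As a toy illustration of this parameterization (mirroring the setting of Figure 30, but with our own grid, normalization, and inner solver; the names `ar2_kernel` and `marginalized_loss` are hypothetical), one can evaluate the marginalized objective at any pair of time constants of a difference-of-exponentials kernel:

```python
import numpy as np

def ar2_kernel(tau1, tau2, n0=150):
    """Difference-of-exponentials kernel a(t) = exp(-t/tau1) - exp(-t/tau2),
    normalized to the sphere."""
    t = np.arange(n0)
    a = np.exp(-t / tau1) - np.exp(-t / tau2)
    return a / np.linalg.norm(a)

def marginalized_loss(y, tau1, tau2, lam=1e-3, n_iter=100):
    """Approximate Psi(tau) = min_x 0.5*||y - a(tau) (*) x||_2^2 + lam*||x||_1
    by a few ISTA iterations (illustrative inner solver)."""
    a = ar2_kernel(tau1, tau2)
    m = len(y)
    x = np.zeros(m - len(a) + 1)
    L = max(np.abs(np.fft.fft(a, m)).max() ** 2, 1e-12)  # Lipschitz bound
    for _ in range(n_iter):
        r = np.convolve(a, x) - y
        g = np.correlate(r, a, mode="valid")
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    r = np.convolve(a, x) - y
    return 0.5 * r @ r + lam * np.abs(x).sum()
```

Scanning `marginalized_loss` over a grid of $(\tau_1, \tau_2)$ values produces landscape plots of the kind shown in Figure 30.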
SaSD meets super-resolution. In many imaging applications, it is often desirable to solve blind deconvolution and super-resolution problems simultaneously. In other words, letting $D[\cdot]$ be a downsampling operator, we want to recover the high-resolution kernel $a_0$ and sparse activation map $x_0$ from a low-resolution measurement of the form $y = D[a_0 \circledast x_0]$. This type of problem appears often due to the resolution/hardware limits of the imaging system, so that fine details of both $a_0$ and $x_0$ are missing due to downsampling. For instance, in Section 6 we showed that the spatial resolution of fluorescence microscopy is constrained by the diffraction limit of the light [HBZ09]. If we could solve this super-resolution SaSD problem, we could obtain much higher resolution images of living cells in vivo. However, our preliminary investigations show that optimizing the natural nonconvex formulation
$$\min_{a,\,x} \; \frac{1}{2}\,\big\|y - D[a \circledast x]\big\|_2^2 + \lambda \|x\|_1, \quad \text{s.t.} \; a \in \mathbb{S}^{n-1}$$
tends to produce downsampled versions of $a_0$ and $x_0$. How to solve this problem is largely open and remains a very interesting question. One possibility is to enforce extra constraints on $a_0$, such as penalizing a TV norm to promote smoothness.
²⁴ The recent work [MCCM18] bears some similarities to ours. However, the problem setting and formulation are still quite different.
Figure 30: Nonconvex landscape of parameterized SaSD, with an AR(2) kernel and two unknown parameters. The kernel $a_0(t) = \exp(-t/\tau_1^\star) - \exp(-t/\tau_2^\star)$ is parameterized by two parameters $\tau_1^\star = 0.2$ and $\tau_2^\star = 0.1$. We generate the data $y = a_0(t) \circledast x_0$, where $x_0 \sim_{\text{i.i.d.}} \mathrm{B}(\theta)$ with $\theta = 10^{-2}$. We plot the marginalized function landscape of $\Psi_x(a) = \min_x \frac{1}{2}\|y - a(\tau) \circledast x\|_2^2 + \lambda\|x\|_1$ with respect to $\tau_1$ and $\tau_2$, where $\lambda = 10^{-3}$, $n_0 = 150$, and $m = 10^4$. The figures on the left- and right-hand sides are 3D and 2D plots of the function landscape, respectively. As we can see, the ground truth $(\tau_1^\star, \tau_2^\star)$ is the global minimizer of the nonconvex objective, but the landscape near the ground truth is very flat, making it very difficult to make progress in minimizing the objective.
Dealing with structured data. Data in practice often possess much richer structure than the basic SaS model we studied here. For instance, in calcium imaging, the signals obtained often have drifting/motion issues across time frames, and they also exhibit low-rank background DC components [PSG+16, GFK+17]. In STORM optical microscopy, there are rich spatial and temporal correlations within and between video frames [SMSE18]. Moreover, in many microscopy data analysis problems, the motif we want to locate often exhibits unknown deformations and random rotations, and its shape is often asymmetric. How to deal with these extra structures raises a variety of challenging problems for future research.
Acknowledgement
This work was funded by NSF 1343282, NSF CCF 1527809, and NSF IIS 1546411. We would like to thank Gongguo Tang, Shuyang Ling, Carlos Fernandez-Granda, Ruoxi Sun, and Liam Paninski for fruitful discussions. QQ also acknowledges support from a Microsoft PhD fellowship and the Moore-Sloan Foundation.
References
[ABG07] Pierre-Antoine Absil, Christopher G. Baker, and Kyle A. Gallivan. Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics, 7(3):303–330, 2007.
[AD18] Ali Ahmed and Laurent Demanet. Leveraging diversity and sparsity in blind deconvolution. IEEE Transactions on Information Theory, 64(6):3975–4000, 2018.
[AMS09] Pierre-Antoine Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
[ANW10] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems, pages 37–45, 2010.
[ARR14] Ali Ahmed, Benjamin Recht, and Justin Romberg. Blind deconvolution using convex programming. IEEE Transactions on Information Theory, 60(3):1711–1732, 2014.
[B+15] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
[BAC18] Nicolas Boumal, Pierre-Antoine Absil, and Coralia Cartis. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 39(1):1–33, 2018.
[BBM+16] Alistair N. Boettiger, Bogdan Bintu, Jeffrey R. Moffitt, Siyuan Wang, Brian J. Beliveau, Geoffrey Fudenberg, Maxim Imakaev, Leonid A. Mirny, Chao-ting Wu, and Xiaowei Zhuang. Super-resolution imaging reveals distinct chromatin folding for different epigenetic states. Nature, 529(7586):418, 2016.
[BDH+13] David Briers, Donald D. Duncan, Evan R. Hirst, Sean J. Kirkpatrick, Marcus Larsson, Wiendelt Steenbergen, Tomas Stromberg, and Oliver B. Thompson. Laser speckle contrast imaging: theoretical and practical limitations. Journal of Biomedical Optics, 18(6):066018, 2013.
[Bec17] Amir Beck. First-Order Methods in Optimization, volume 25. SIAM, 2017.
[BK02] Simon Baker and Takeo Kanade. Limits on super-resolution and how to break them. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1167–1183, 2002.
[BPC+11] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[BPS+06] Eric Betzig, George H. Patterson, Rachid Sougrat, O. Wolf Lindwasser, Scott Olenych, Juan S. Bonifacino, Michael W. Davidson, Jennifer Lippincott-Schwartz, and Harald F. Hess. Imaging intracellular fluorescent proteins at nanometer resolution. Science, 313(5793):1642–1645, 2006.
[BST14] Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459–494, 2014.
[BT09] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[BVG13] Alexis Benichoux, Emmanuel Vincent, and Rémi Gribonval. A fundamental pitfall in blind deconvolution with sparse and shift-invariant priors. In ICASSP - 38th International Conference on Acoustics, Speech, and Signal Processing, 2013.
[CE16] Patrizio Campisi and Karen Egiazarian. Blind Image Deconvolution: Theory and Applications. CRC Press, 2016.
[CF17] Il Yong Chun and Jeffrey A. Fessler. Convolutional dictionary learning: Acceleration and convergence. IEEE Transactions on Image Processing, 27(4):1697–1712, 2017.
[CFG14] Emmanuel J. Candès and Carlos Fernandez-Granda. Towards a mathematical theory of super-resolution. Communications on Pure and Applied Mathematics, 67(6):906–956, 2014.
[CGT00] Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. Trust Region Methods, volume 1. SIAM, 2000.
[Chi16] Yuejie Chi. Guaranteed blind sparse spikes deconvolution via lifting and convex optimization. IEEE Journal of Selected Topics in Signal Processing, 10(4):782–794, 2016.
[CLE93] M. F. Crommie, C. P. Lutz, and D. M. Eigler. Imaging standing waves in a two-dimensional electron gas. Nature, 363(6429):524, 1993.
[CLS15] Emmanuel J. Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, April 2015.
[CMB+17] Jason E. Chung, Jeremy F. Magland, Alex H. Barnett, Vanessa M. Tolosa, Angela C. Tooker, Kye Y. Lee, Kedar G. Shah, Sarah H. Felix, Loren M. Frank, and Leslie F. Greengard. A fully automated approach to spike sorting. Neuron, 95(6):1381–1394, 2017.
[CPR13] Rakesh Chalasani, Jose C. Principe, and Naveen Ramakrishnan. A fast proximal method for convolutional sparse coding. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–5. IEEE, 2013.
[CRQ18] Fernando Chaure, Hernan Gonzalo Rey, and Rodrigo Quian Quiroga. A novel and fully automatic spike sorting implementation with variable number of features. Journal of Neurophysiology, 2018.
[CSL+18] Sky C. Cheung, John Y. Shin, Yenson Lau, Zhengyu Chen, Ju Sun, Yuqian Zhang, John N. Wright, and Abhay N. Pasupathy. Dictionary learning in Fourier transform scanning tunneling spectroscopy. arXiv preprint arXiv:1807.10752, 2018.
[CW98] Tony F. Chan and Chiu-Kwong Wong. Total variation blind deconvolution. IEEE Transactions on Image Processing, 7(3):370–375, 1998.
[CW00] Tony F. Chan and Chiu-Kwong Wong. Convergence of the alternating minimization algorithm for blind deconvolution. Linear Algebra and its Applications, 316(1-3):259–285, 2000.
[CWB08] Emmanuel J. Candès, Michael B. Wakin, and Stephen P. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5-6):877–905, 2008.
[ETS11] Chaitanya Ekanadham, Daniel Tranchina, and Eero P. Simoncelli. A blind sparse deconvolution method for neural spike identification. In Advances in Neural Information Processing Systems, pages 1440–1448, 2011.
[FST08] Marta Fernández-Suárez and Alice Y. Ting. Fluorescent probes for super-resolution imaging in living cells. Nature Reviews Molecular Cell Biology, 9(12):929, 2008.
[FZP17] Johannes Friedrich, Pengcheng Zhou, and Liam Paninski. Fast online deconvolution of calcium imaging data. PLoS Computational Biology, 13(3):e1005423, 2017.
[GBW18] Dar Gilboa, Sam Buchanan, and John Wright. Efficient dictionary learning with gradient descent. arXiv preprint arXiv:1809.10313, 2018.
[GCW18] Cristina Garcia-Cardona and Brendt Wohlberg. Convolutional dictionary learning: A comparative review and new algorithms. IEEE Transactions on Computational Imaging, 4(3):366–381, 2018.
[GFG+19] Andrea Giovannucci, Johannes Friedrich, Pat Gunn, Jeremie Kalfon, Brandon L. Brown, Sue Ann Koay, Jiannis Taxidis, Farzaneh Najafi, Jeffrey L. Gauthier, Pengcheng Zhou, et al. CaImAn: an open source tool for scalable calcium imaging data analysis. eLife, 8:e38173, 2019.
[GFK+17] Andrea Giovannucci, Johannes Friedrich, Matt Kaufman, Anne Churchland, Dmitri Chklovskii, Liam Paninski, and Eftychios A. Pnevmatikakis. OnACID: Online analysis of calcium imaging data in real time. In Advances in Neural Information Processing Systems, pages 2381–2391, 2017.
[GHJY15] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.
[GK12] Christine Grienberger and Arthur Konnerth. Imaging calcium in neurons. Neuron, 73(5):862–885, 2012.
[GLM16] Rong Ge, Jason D. Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
[GMWZ17] Donald Goldfarb, Cun Mu, John Wright, and Chaoxu Zhou. Using negative curvature in solving nonlinear programs. Computational Optimization and Applications, 68(3):479–502, 2017.
[Gol80] Donald Goldfarb. Curvilinear path steplength algorithms for minimization which use directions of negative curvature. Mathematical Programming, 18(1):31–40, 1980.
[HA15] Furong Huang and Animashree Anandkumar. Convolutional dictionary learning through tensor factorization. In Feature Extraction: Modern Questions and Challenges, pages 116–129, 2015.
[Hay94] Simon S. Haykin. Blind Deconvolution. Prentice Hall, 1994.
[HBZ09] Bo Huang, Mark Bates, and Xiaowei Zhuang. Super-resolution fluorescence microscopy. Annual Review of Biochemistry, 78:993–1016, 2009.
[HBZ10]