Compressed Sensing, Sparsity, and Dimensionality in Neuronal Information Processing and Data Analysis

Surya Ganguli (1) and Haim Sompolinsky (2,3)

(1) Department of Applied Physics, Stanford University, Stanford, California 94305; email: [email protected]
(2) Edmond and Lily Safra Center for Brain Sciences, Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel; email: [email protected]
(3) Center for Brain Science, Harvard University, Cambridge, Massachusetts 02138
Annu. Rev. Neurosci. 2012. 35:485–508
First published online as a Review in Advance on April 5, 2012
The Annual Review of Neuroscience is online at neuro.annualreviews.org
This article's doi: 10.1146/annurev-neuro-062111-150410
Copyright © 2012 by Annual Reviews. All rights reserved
0147-006X/12/0721-0485$20.00
Keywords
random projections, connectomics, imaging, memory, communication, learning, generalization
Abstract
The curse of dimensionality poses severe challenges to both technical and conceptual progress in neuroscience. In particular, it plagues our ability to acquire, process, and model high-dimensional data sets. Moreover, neural systems must cope with the challenge of processing data in high dimensions to learn and operate successfully within a complex world. We review recent mathematical advances that provide ways to combat dimensionality in specific situations. These advances shed light on two dual questions in neuroscience. First, how can we as neuroscientists rapidly acquire high-dimensional data from the brain and subsequently extract meaningful models from limited amounts of these data? And second, how do brains themselves process information in their intrinsically high-dimensional patterns of neural activity as well as learn meaningful, generalizable models of the external world from limited experience?
Contents
INTRODUCTION
ADVANCES IN THE THEORY OF HIGH-DIMENSIONAL STATISTICS
  The Compressed Sensing Framework: Incoherence and Randomness
  L1 Minimization: A Nonlinear Recovery Algorithm
  Dimensionality Reduction by Random Projections
  Compressed Computation
  Approximate Sparsity and Noise
  Sparse Models of High-Dimensional Data
  Dictionary Learning
COMPRESSED SENSING OF THE BRAIN
  Rapid Functional Imaging
  Fluorescence Microscopy
  Gene-Expression Analysis
  Compressed Connectomics
COMPRESSED SENSING BY THE BRAIN
  Semantic Similarity and Random Projections
  Short-Term Memory in Neuronal Networks
SPARSE EXPANDED NEURONAL REPRESENTATIONS
  Neuronal Implementations of L1 Minimization
  Compression and Expansion in Long-Range Brain Communication
LEARNING IN HIGH-DIMENSIONAL SYNAPTIC WEIGHT SPACES
  Neural Learning of Classification
  Optimality and Sparsity of Synaptic Weights
DISCUSSION
  Dimensionality Reduction: CS versus Efficient Coding
  Expansion and Sparsification: Compressed Sensing versus Independent Components Analysis
  Beyond Linear Projections: Neuronal Nonlinearities
INTRODUCTION
For most of its history, neuroscience has made wonderful progress by considering problems whose descriptions require only a small number of variables. For example, Hodgkin & Huxley (1952) discovered the mechanism of the nerve impulse by studying the relationship between two variables: the voltage and the current across the cell membrane. But as we have started to explore more complex problems, such as the brain's ability to process images and sounds, neuroscientists have had to analyze many variables at once. For example, any given gray-scale image requires N analog variables, or pixel intensities, for its description, where N could be on the order of 1 million. Similarly, such images could be represented in the firing-rate patterns of many neurons, with each neuron's firing rate being a single analog variable. The number of variables required to describe a space of objects is known as the dimensionality of that space; i.e., the dimensionality of the space of all possible images of a given size equals the number of pixels, whereas the dimensionality of the space of all possible neuronal firing-rate patterns in a given brain area equals the number of neurons in that area. Thus our quest to understand how networks of neurons store and process information depends crucially on our ability to measure and understand the relationships between high-dimensional spaces of stimuli and neuronal activity patterns.
However, the problem of measuring and finding statistical relationships between patterns becomes more difficult as their dimensionality increases. This phenomenon is known as the curse of dimensionality. One approach to addressing this problem is to somehow reduce the number of variables required to describe the patterns in question, a process known as dimensionality reduction. We can do this, for example, with natural images, which are a highly restricted subset of all possible images, so that they can be described by many fewer variables than the number of pixels. In particular, natural images are often sparse in the sense that if you view them in the wavelet domain (roughly as a superposition of edges), only a very small number K of wavelet coefficients will have significant power, where K can be on the order of 20,000 for a 1-million-pixel image. This observation underlies JPEG compression, which computes all possible wavelet coefficients and keeps only the K largest (Taubman et al. 2002). Similarly, neuronal activity patterns that actually occur are often a highly restricted subset of all possible patterns (Ganguli et al. 2008a, Yu et al. 2009, Machens et al. 2010) in the sense that they often lie along a low K-dimensional manifold embedded in N-dimensional firing-rate space; by this we mean that only K numbers are required to uniquely specify any observed activity pattern across N neurons, where K can be much smaller than N. As a concrete example, consider the set of visual activity patterns in N neurons in response to a bar presented at a variety of orientations. As the orientation varies, the elicited firing-rate responses trace out a circle, or a one-dimensional manifold in N-dimensional space.

More generally, given a class of apparently high-dimensional stimuli, or neuronal activity patterns, how can either we or neural systems extract a small number of variables to describe these patterns without losing too much important information? Machine learning provides a variety of algorithms to perform this dimensionality reduction, but they are often computationally expensive in terms of running time. Moreover, how neuronal circuits could implement many of these algorithms is not clear. However, recent advances in an emerging field of high-dimensional statistics (Donoho 2000, Baraniuk 2011) have revealed a surprisingly simple yet powerful method of performing dimensionality reduction: One can randomly project patterns into a lower-dimensional space. To understand the central concept of a random projection (RP), it is useful to think of the shadow of a wire-frame object in three-dimensional space projected onto a two-dimensional screen by shining a light beam on the object. For poorly chosen angles of light, the shadow may lose important information about the wire-frame object. For example, if the axis of light is aligned with any segment of wire, that entire length of wire will have a single point as its shadow. However, if the axis of light is chosen randomly, it is highly unlikely that the same degenerate situation will occur; instead, every length of wire will have a corresponding nonzero length of shadow. Thus the shadow, obtained by this RP, generically retains much information about the wire-frame object.

In the context of image acquisition, an RP of an image down to an M-dimensional space can be obtained by taking M measurements of the image, where each measurement consists of a weighted sum of all the pixel intensities, and allowing the weights themselves to be chosen randomly (for example, drawn independently from a Gaussian distribution). Thus the original image (i.e., the wire-frame structure) is described by M measurements (i.e., its shadow) by projecting against a random set of weights (i.e., a random light angle). Now, the field of compressed sensing (CS) (Candes et al. 2006, Candes & Tao 2006, Donoho 2006; see Baraniuk 2007, Candes & Wakin 2008, Bruckstein et al. 2009 for reviews) shows that the shadow can contain enough information to reconstruct the original image (i.e., all N pixel values) as long as the original image is sparse enough.
In particular, if the space of the images in question can be described by K variables, then as long as M is slightly larger than K, CS provides an algorithm (called L1 minimization, described below) to reconstruct the image. Thus for typical images, we can simultaneously sense and compress 1-million-pixel images with ∼20,000 random measurements. As we review below, these CS results have significant implications for data acquisition in neuroscience.

Furthermore, in the context of neuronal information processing, an RP of neuronal activity in an upstream brain region consisting of N neurons can be achieved by synaptic mapping to a downstream region consisting of M < N neurons, where the downstream neurons' firing rates are obtained by linearly summing the firing rates of the upstream neurons through a set of random synaptic weights. Thus the downstream activity constitutes a shadow of the upstream activity through an RP determined by the synaptic weights (i.e., angle of light). As we review below, the theory of CS and RPs can provide a theoretical framework for understanding one of the most salient aspects of neuronal information processing: radical changes in the dimensionality, and sometimes sparsity, of neuronal representations, often within a single stage of synaptic transformation.

Finally, another application of CS is the problem of modeling high-dimensional data. This is challenging because such models have high-dimensional parameter spaces, necessitating many example data points to learn the correct parameter values. Neural systems face a similar challenge in searching high-dimensional synaptic weight spaces to learn generalizable rules from limited experience. We review how regularization techniques (Tibshirani 1996, Efron et al. 2004) closely related to CS allow statisticians and neural systems alike to rapidly learn sparse models of high-dimensional data from limited examples.
ADVANCES IN THE THEORY OF HIGH-DIMENSIONAL STATISTICS
Before we describe the applicability of CS and RPs to the acquisition and analysis of data and to neuronal information processing and learning, we first give in this section a more precise overview of recent results in high-dimensional statistics. We begin by giving an overview of the CS framework and define the mathematical notation we use throughout this review. Subsequently, a reader who is interested mainly in applications can skip the rest of this section. Here, we discuss how to recover sparse signals from small numbers of measurements, even in the presence of approximate sparsity and noise, and we discuss RPs and sparse regression in more detail. Finally, we discuss dictionary learning, an approach to finding bases in which ensembles of signals are sparse.
The Compressed Sensing Framework: Incoherence and Randomness
We now formalize the intuitions given in the introduction and describe the mathematical notation that we use throughout this review (see also Figure 1). We let u0 be an N-dimensional signal that we wish to measure. Thus u0 is a vector with components u0i for i = 1, ..., N, where each u0i can take an analog value. In the example of an image, u0i would be the gray-scale intensity of the ith pixel. The M linear measurements of u0 are of the form xµ = bµ · u0 for µ = 1, ..., M. Here we think of xµ as an analog outcome of measurement µ obtained by computing the overlap or dot product between the unknown signal u0 and a measurement vector bµ. We can summarize the relationship between the signal and the measurements via the matrix relationship x = Bu0. Here B is an M × N measurement matrix, whose µth row is the vector bµ, and x is a measurement vector whose µth component is xµ. Now the true signal u0 is sparse in a basis given by the columns of an N × N matrix C. By this we mean that u0 = Cs0, where s0 is a sparse N-dimensional vector, in the sense that it has a relatively small number K of nonzero elements, though we do not know ahead of time which K of the N components are nonzero.
Figure 1: Framework of compressed sensing (CS). A high-dimensional signal u0 is sparse in a basis given by the columns of a matrix C so that u0 = Cs0, where s0 is a sparse coefficient vector. Through a set of measurements given by the rows of B, u0 is compressed to a low-dimensional space of measurements x. If the measurements are incoherent with respect to the sparsity basis, then L1 minimization can recover a good estimate ŝ of the sparse coefficients s0 from x, and then an estimate of u0 can be recovered by expanding in the basis C.
For example, when u0 is an image in the pixel basis, s0 could be the wavelet coefficients of that same image, and the columns of C would comprise a complete basis of orthonormal wavelets. Finally, the overall relationship between the measurements and the sparse coefficients is given by x = As0, where A = BC. We often refer to A also as the measurement matrix.
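As a concrete illustration of this notation, the following sketch (our own illustration, not from the original article; the basis choice, dimensions, and variable names are arbitrary assumptions) builds a signal u0 = Cs0 that is sparse in an orthonormal cosine basis C and compresses it through a random Gaussian measurement matrix B, so that x = Bu0 = As0 with A = BC.

```python
import numpy as np

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
N, M, K = 256, 64, 8               # signal dimension, measurements, nonzero coefficients

# Orthonormal sparsity basis C: columns are DCT-II cosine vectors, normalized.
n = np.arange(N)
C = np.cos(np.pi * np.outer(n + 0.5, n) / N)
C /= np.linalg.norm(C, axis=0)

s0 = np.zeros(N)
s0[rng.choice(N, K, replace=False)] = rng.standard_normal(K)   # sparse coefficients
u0 = C @ s0                        # dense signal (e.g., pixel intensities)

B = rng.standard_normal((M, N)) / np.sqrt(M)   # random measurement matrix
x = B @ u0                         # M compressed measurements
A = B @ C                          # effective measurement matrix acting on s0
print(x.shape, A.shape)            # (64,) (64, 256)
```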
An important question is, given a sparsity basis C, what should we choose as our measurement basis B? Consider what might happen if we measured signals in the same basis in which they were sparse. For example, in the case of an image, one could directly measure M randomly chosen wavelet coefficients of the image in which M is just a little larger than K. The problem, of course, is that for any given image, it is highly unlikely that all the K coefficients with large power coincide with the M coefficients we chose to measure. So unless the number of measurements M equals the dimensionality of the image, N, we will inevitably miss important coefficients. In the wire-frame shadow example above, this is the analog of choosing a poor angle of light (i.e., measurement basis) that aligns with a segment of wire (i.e., sparsity basis), which causes information loss.

To circumvent this problem, one of the key ideas of CS is that we should make our measurements as different as possible from the domain in which the signal is sparse (i.e., shine light at an angle that does not align with any segment of wire frame). In particular, the measurements should have many nonzero elements in the domain in which the image is sparse. This notion of difference is captured by the mathematical definition of incoherence, or a small value of the maximal inner product between rows of B and columns of C, so that no measurement vector should look like any sparsity vector. CS provides mathematical guarantees that one can achieve perfect recovery with a number of measurements M that is only slightly larger than K, as long as the M measurement vectors are sufficiently incoherent with respect to the sparsity domain (Candes & Romberg 2007).

An important observation is that any set of measurement vectors, which are themselves random, will be incoherent with respect to any fixed sparsity domain. For example, the elements of each such measurement vector can be drawn independently from a Gaussian distribution. Intuitively, it is highly unlikely for a random vector to look like a sparsity vector (i.e., just as it is unlikely for a random light angle to align with a wire segment). One of the key results of CS is that with such random measurement vectors, only

$$ M > O(K \log(N/K)) \qquad (1) $$

measurements are needed to guarantee perfect signal reconstruction with high probability (Candes & Tao 2005, Baraniuk et al. 2008, Candes & Plan 2010).
Thus random measurements constitute a universal measurement strategy in the sense that they will work for signals that are sparse in any basis. Indeed, the sparsity basis need not even be known yet when the measurements are chosen. Its knowledge is required only after measurements are taken, during the nonlinear reconstruction process. And remarkably, investigators have further shown that no measurement matrices and no reconstruction algorithm can yield sparse signal recovery with substantially fewer measurements (Candes & Tao 2006, Donoho 2006) than that shown in Equation 1.
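The incoherence property is easy to check numerically. The sketch below is our illustration (the basis and dimensions are arbitrary assumptions): it compares the maximal inner product between measurement rows and sparsity-basis columns for a random Gaussian measurement matrix against measurements taken directly in the sparsity basis itself.

```python
import numpy as np

rng = np.random.default_rng(9)
N, M = 256, 64

C = np.linalg.qr(rng.standard_normal((N, N)))[0]          # an arbitrary orthonormal sparsity basis
B_rand = rng.standard_normal((M, N))
B_rand /= np.linalg.norm(B_rand, axis=1, keepdims=True)    # unit-norm random measurement vectors
B_same = C.T[:M, :]                                        # measuring in the sparsity basis itself

def coherence(B, C):
    # Maximal inner product between rows of B and columns of C.
    return np.max(np.abs(B @ C))

print("random measurements vs sparsity basis:", coherence(B_rand, C))   # small: incoherent
print("same basis as sparsity basis:        ", coherence(B_same, C))    # equals 1: maximally coherent
```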
L1 Minimization: A Nonlinear Recovery Algorithm
Given only our measurements x, how can we recover the unknown signal u0? One could potentially do this by inverting the relationship between measurements and signal by solving for an unknown candidate signal u in the equation x = Bu. This is a set of M equations, one for each measurement, with N unknowns, one for each component of the candidate signal u. If the number of independent measurements M is greater than or equal to the dimensionality N of the signal, then the set of equations x = Bu has a unique solution u = u0; thus, solving these equations will recover the true signal u0. However, if M < N, the set of equations x = Bu no longer has a unique solution. Indeed there is generically an N − M dimensional space of candidate signals u that satisfy the measurement constraints. How might we find the true signal u0 in this large space of candidate signals?

If we know nothing further about the true signal u0, then the situation is indeed hopeless. However, if u0 = Cs0 where s0 is sparse, we can try to exploit this prior knowledge as follows (see Figure 1). First, the measurements are linearly related to the sparse coefficients s0 through the M equations x = As0, where A = BC is an M × N matrix. Again, when M < N, there is a large N − M dimensional space of solutions s to the measurement constraint x = As. However, not all of them will be sparse, as we expect the true solution s0 to be. Thus one might try to construct an estimate ŝ of s0 by solving the optimization problem

$$ \hat{s} = \arg\min_{s} \sum_{i=1}^{N} V(s_i) \quad \text{subject to} \quad x = As, \qquad (2) $$

where V(s) is any cost function that penalizes nonzero values of s. A natural choice is V(s) = 0 if s = 0 and V(s) = 1 otherwise. With this choice, Equation 2 says that our estimate ŝ is obtained by searching, in the space of all candidate signals s that satisfy the measurement constraints x = As, for the one that has the smallest number of nonzero elements. This approach, while reasonable given the prior knowledge that the true signal s0 has a small number of nonzero coefficients, unfortunately yields a computationally intractable combinatorial optimization problem; to solve it, one must essentially search over all subsets of possible nonzero elements in s.
An alternative approach, adopted by CS, is to solve a related and potentially easier problem, by choosing V(s) = |s|. The quantity $\sum_{i=1}^{N} |s_i|$ is known as the L1 norm of s; hence, this method is called L1 minimization. The advantage of this choice is that the L1 norm is a convex function on the space of candidate signals, which implies that the optimization problem in Equation 2, with V(s) = |s|, has no (nonglobal) local minima, and there are efficient algorithms for finding the global minimum using methods of linear programming (Boyd & Vandenberghe 2004), message passing (Donoho et al. 2009), and neural circuit dynamics (see below). CS theory shows that with an appropriate choice of A, L1 minimization exactly recovers the true signal so that ŝ = s0, with a number of measurements that is roughly proportional to the number of nonzero elements in the source, K, which can be much smaller than the dimensionality N of the signal.
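As a minimal sketch of such a recovery (our illustration, using a generic linear-programming solver rather than any specific algorithm from the literature; all sizes are arbitrary), the program below recovers a K-sparse vector from M < N random measurements by casting the L1 minimization of Equation 2 as a linear program with the usual positive/negative split of s.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
N, M, K = 200, 80, 10                          # signal dimension, measurements, sparsity

s0 = np.zeros(N)
s0[rng.choice(N, K, replace=False)] = rng.standard_normal(K)   # K nonzero coefficients
A = rng.standard_normal((M, N)) / np.sqrt(M)   # random Gaussian measurement matrix
x = A @ s0                                     # noiseless measurements

# minimize sum(s_pos + s_neg) subject to A(s_pos - s_neg) = x, s_pos, s_neg >= 0
c = np.ones(2 * N)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=x, bounds=[(0, None)] * (2 * N))
s_hat = res.x[:N] - res.x[N:]

print("relative recovery error:", np.linalg.norm(s_hat - s0) / np.linalg.norm(s0))
```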
A popular and even simpler reconstruction algorithm is L2 minimization, in which V(s) = s² in Equation 2. This choice can arise as a consequence of oft-used Gaussian priors on the unknown signal and leads to an estimate that is simply linearly related to the measurements through the pseudoinverse relation $\hat{s} = A^{T}(AA^{T})^{-1}x$.
Figure 2: Geometry of compressed sensing (CS). (a) A geometric interpretation of L2 minimization. An unknown N = 2 dimensional sparse signal s0 with K = 1 nonzero components is measured using M = 1 linear measurements, yielding a one-dimensional space of candidate signals consistent with the measurement constraints (red line). The estimate ŝ is the candidate signal with the smallest L2 norm and can be found geometrically by expanding the locus of points with a fixed and increasing L2 norm (the olive circles) until the locus first intersects the allowed space of candidate signals. This intersection point is the L2 estimate ŝ, which is different from the true signal s0. (b) In the identical scenario as in panel a, L1 minimization recovers an estimate by expanding the locus of points with the same L1 norm (blue diamonds), and in this case, the expanding locus first intersects the space of candidate signals at the true signal s0 so that perfect recovery ŝ = s0 is achieved. Of course, a sparse signal could also have been located on the other coordinate axis, in which case L1 minimization would have failed to recover s0 accurately. (c) An unknown sparse signal s0 of dimension N = 200, with f = K/N = 0.2, i.e., 20% of its elements are nonzero. (d) An estimate ŝ (red dots) recovered from M = 120 random linear measurements of s0 (α = M/N = 0.6, or 60% subsampling) by L2 minimization, superimposed on the true signal s0. (e) From the same measurements in panel d, L1 minimization yields an estimate ŝ (red dots) that coincides with the true signal. Note that the parameters of f = 0.2 and α = 0.6 lie just above the phase boundary for perfect recovery in Figure 3.
Figure 2 provides heuristic intuition for the utility of L1 minimization and its superior performance over L2 minimization in the case of sparse signals.

An interesting observation is that the bound in Equation 1 represents a sufficient condition on the number of measurements M for perfect signal recovery. Alternately, recent work on the typical behavior of CS in the limit where M and N are large has revealed that the performance of CS is surprisingly insensitive to the details of the measurement matrix A and the unknown signal s0 and depends only on the degree of subsampling α = M/N and the signal sparsity f = K/N. In the α − f plane, there is a universal, critical phase boundary αc(f) such that if α > αc(f), then L1 minimization will typically yield perfect signal reconstruction, whereas if α < αc(f), it will yield a nonzero error (see Figure 3) (Donoho & Tanner 2005a,b, Donoho et al. 2009, Kabashima et al. 2009, Ganguli & Sompolinsky 2010b).
Figure 3: Phase transition in compressed sensing (CS) (reproduced from Ganguli & Sompolinsky 2010b). We use linear programming to solve Equation 2, 50 times for each value of α and f in increments of 0.01, with N = 500. The grey transition region shows when the fraction of times perfect recovery occurs is neither 0 nor 1. The red curve is the theoretical phase boundary αc(f). As f → 0, this boundary is of the form αc(f) = f log(1/f).
Dimensionality Reduction by Random Projections
The above CS results can be understood using the theory of RPs. Geometrically, the mapping x = As through a measurement matrix A can be thought of as a linear projection from a high N-dimensional space of signals down to a low M-dimensional space of measurements. In this geometric picture, the space of K-sparse signals consists of a low-dimensional (nonsmooth) manifold, which is the union of all K-dimensional linear spaces characterized by K nonzero values at specific locations, as in Figure 4a. Candes & Tao (2005) show that any projection that preserves the geometry of all K-sparse vectors allows one to reconstruct these vectors from the low-dimensional projection efficiently and robustly using L1 minimization. The power of compression by RPs lies in the fact that they preserve the geometrical structure of this manifold. In particular, Baraniuk et al. (2008) show that RPs down to an M = O(K log(N/K)) dimensional space preserve the distance between any pair of K-sparse signals up to a small distortion.

However, we can move beyond sparsity and consider how well RPs preserve the geometric structure of other signal or data patterns that lie on more general low-dimensional manifolds embedded in a high-dimensional space. An extremely simple manifold is a point cloud consisting of a finite set of points, as in Figure 4b. Suppose this cloud consists of P points sα, for α = 1, ..., P, embedded in an N-dimensional space, and we project them down to the points xα = Asα in a low M-dimensional space through an appropriately normalized RP. How small can we make M before the point cloud becomes distorted in the low-dimensional space, so that pairwise distances in the low-dimensional space are no longer similar to the corresponding distances in the high-dimensional space?

The celebrated Johnson-Lindenstrauss (JL) lemma (Johnson & Lindenstrauss 1984, Indyk & Motwani 1998, Dasgupta & Gupta 2003) provides a striking answer. It states that RPs with M > O(log P) will yield, with high probability, only a small distortion in distance between all pairs of points in the cloud. Thus the number of projected dimensions M need only be logarithmic in the number of points P, independent of the embedding dimension of the source data, N.
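To make the JL lemma concrete, the short sketch below (our illustration; all dimensions are arbitrary assumptions) randomly projects a cloud of P points from N dimensions down to M dimensions and reports the range of ratios between projected and original pairwise distances.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
N, M, P = 5000, 500, 200            # ambient dimension, projected dimension, number of points

S = rng.standard_normal((P, N))     # point cloud, one point per row
A = rng.standard_normal((M, N)) / np.sqrt(M)   # appropriately normalized random projection
X = S @ A.T                         # projected cloud, P x M

ratios = pdist(X) / pdist(S)        # projected distance / original distance, over all pairs
print("distance ratios range from %.3f to %.3f" % (ratios.min(), ratios.max()))
```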
Finally, we consider data distributed along a nonlinear K-dimensional manifold embedded in N-dimensional space, as in Figure 4c. An example might be a set of images of a single object observed under different lighting conditions, perspectives, rotations, and scales. Another example would be the set of neural firing-rate vectors in a brain region in response to a continuous family of stimuli. Baraniuk & Wakin (2009) and Baraniuk et al. (2010) show that M > O(K log(NC)) RPs preserve the geometry of the manifold with small distortion. Here C is a number related to the curvature of the manifold so that highly curved manifolds require more projections. Overall, these results show that surprisingly small numbers of RPs, which can be chosen without any knowledge of the data distribution, can preserve geometric structure in data.
Figure 4: Random projections. (a) A manifold of K-sparse signals (red) in N-dimensional space is randomly projected down to an M-dimensional space (here K = 1, N = 3, M = 2). (b,c) Projection of a point cloud and a nonlinear manifold, respectively.
Compressed Computation
Although CS emphasizes the reconstruction of sparse high-dimensional signals from low-dimensional projections, many important problems in signal processing and learning can be accomplished by performing computations directly in the low-dimensional space without the need to first reconstruct the high-dimensional signal. For example, regression (Zhou et al. 2009), signal detection (Duarte et al. 2006), classification (Blum 2006, Haupt et al. 2006, Davenport et al. 2007, Duarte et al. 2007), manifold learning (Hegde et al. 2007), and nearest neighbor finding (Indyk & Motwani 1998) can all be accomplished in a low-dimensional space given a relatively small number of RPs. Moreover, task performance is often comparable to what can be obtained by performing the task directly in the original high-dimensional space. The reason for this remarkable performance is that these computations rely on the distances between data points, which are preserved by RPs. Thus RPs provide one way to cope with the curse of dimensionality, and as we discuss below, this can have significant implications for neuronal information processing and data analysis.

Approximate Sparsity and Noise
Above, we have assumed a definition of sparsity in which an N-dimensional signal s0 has K < N nonzero elements, with the other elements being exactly 0. In reality, many of the coefficients of a signal may be small, but they are unlikely to be exactly zero.
We thus expect signals not to be exactly sparse but to be well approximated by a K-sparse vector s0K, which is obtained by keeping the K largest coefficients of s0 and setting the rest of them to 0. In addition, we have to allow for measurement noise so that x = As0 + z, where z is a noise vector whose µth component is zero mean Gaussian noise with a fixed variance.

In the presence of noise, it no longer makes sense to enforce perfectly the measurement constraints x = As. Instead, a common approach, known as the LASSO method, is to solve the alternate optimization problem

$$ \hat{s} = \arg\min_{s} \left\{ \|x - As\|^{2} + \lambda \sum_{i=1}^{N} V(s_i) \right\}, \qquad (3) $$

where V(s) = |s| (the absolute value function) and λ is a parameter to be optimized. The cost function minimized here allows deviations between As, which are the noise-free measurement outcomes generated by a candidate signal s, and the actual noisy measurements x. However, such deviations are penalized by the quadratic term in Equation 3.

Several works (see e.g., Candes et al. 2006, Wainwright 2009, Bayati et al. 2010, Candes & Plan 2010) have addressed the performance of the LASSO in the combined situation of noise and departures from perfect sparsity. The main outcome is roughly that for an appropriate choice of λ, which depends on the signal-to-noise ratio (SNR), the same conditions that guaranteed exact recovery of K-sparse signals by L1 minimization in the absence of noise also ensure good performance of the LASSO for approximately sparse signals in the presence of noise. In particular, whenever s0K is a good approximation to s0, the LASSO estimate ŝ in Equation 3 is a good approximation to s0, up to a level of precision that is allowed by the noise.
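A minimal numerical sketch of the LASSO in Equation 3 (our illustration; iterative soft-thresholding is one generic way to solve it, the parameter values are arbitrary, and the iteration minimizes the Equation 3 cost up to a rescaling of λ) recovers a sparse signal from noisy random measurements.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, K = 200, 100, 10

s0 = np.zeros(N)
s0[rng.choice(N, K, replace=False)] = rng.standard_normal(K)   # sparse underlying signal
A = rng.standard_normal((M, N)) / np.sqrt(M)
x = A @ s0 + 0.01 * rng.standard_normal(M)                      # noisy measurements x = A s0 + z

lam = 0.05                                     # regularization weight lambda
eta = 1.0 / np.linalg.norm(A, 2) ** 2          # step size set by the largest singular value of A
s = np.zeros(N)
for _ in range(500):
    r = s + eta * A.T @ (x - A @ s)                            # gradient step on the quadratic term
    s = np.sign(r) * np.maximum(np.abs(r) - eta * lam, 0.0)    # soft-threshold step for the L1 term

print("relative error:", np.linalg.norm(s - s0) / np.linalg.norm(s0))
```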
Sparse Models of High-Dimensional Data
L1-based minimization can also be applied to the modeling of high-dimensional data. A simple example is sparse linear regression. Suppose that our data set consists of M N-dimensional vectors, aµ, along with M scalar response variables xµ. The regression model assumes that on each observation µ, xµ = aµ · s0 + zµ, where s0 is an N-dimensional vector of unknown regression coefficients and zµ is Gaussian measurement noise. This can be summarized in the matrix equation x = As0 + z, where the M rows of the M × N matrix A are the N-dimensional data points, aµ. Now if the number of data points M is fewer than the dimensionality of the data N, it would seem hopeless to infer the regression coefficients. However, in many high-dimensional regression problems, we expect that the regression coefficients will be sparse. For example, aµ could be a vector of expression levels of N = O(1000) genes measured in a microarray under experimental condition µ, and xµ could be the response of a biological signal of interest. However, only a small fraction of genes are expected to regulate any given signal of interest, and hence we expect the regression coefficients s0 to be sparse.

This scenario is exactly equivalent to the case of CS with noise. Here the regression coefficients s0 play the role of an unknown sparse signal to be recovered, the input data points aµ play the role of the measurement vectors, and the scalar output or response xµ plays the role of the measurement outcome in CS. The same LASSO algorithm described in Equation 3 can be used to infer the regression coefficients (Tibshirani 1996). Here, the parameter λ is not set by the SNR but rather is chosen to minimize some measure of the prediction error on a new input. This estimate can be obtained through cross validation, for example. Efron et al. (2004) have proposed efficient algorithms to compute ŝ, optimizing over λ for a given data set (A, x).
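In practice, ready-made implementations handle both the optimization and the cross-validated choice of λ. The sketch below is our illustration using scikit-learn (the dimensions and noise level are arbitrary assumptions), in the spirit of the gene-expression example: M observed conditions, N candidate regressors, and only a few nonzero coefficients.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
N, M, K = 1000, 150, 8                          # regressors, observations, true nonzero coefficients

s0 = np.zeros(N)
s0[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
A = rng.standard_normal((M, N))                 # rows a_mu: e.g., expression levels per condition
x = A @ s0 + 0.1 * rng.standard_normal(M)       # noisy scalar responses x_mu = a_mu . s0 + z_mu

model = LassoCV(cv=5).fit(A, x)                 # regularization strength chosen by 5-fold cross-validation
print("nonzero coefficients found:", int(np.sum(np.abs(model.coef_) > 1e-3)))
```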
The technique of L1 regularization generalizes beyond linear regression to the problem of learning large statistical models with expected sparse parameter sets. Indeed it has been used successfully in learning logistic regression (Lee et al. 2006b) and in various graphical models (Lee et al. 2006a, Wainwright et al. 2007), as well as in point process models of neuronal spike trains (Kelly et al. 2010).
Dictionary Learning
As Equations 2 and 3 imply, to reconstruct a signal from a small number of random measurements using L1 minimization, we need to know A = BC, which means that we need to know the basis C in which the signal is sparse. What if we have to work with a new ensemble of signals and we do not yet know of a basis in which these signals are sparse?

One approach is to perform dictionary learning (Olshausen et al. 1996; Olshausen & Field 1996a,b, 1997) on the ensemble of signals. Suppose {xα} for α = 1, ..., P is a collection of P M-dimensional signals. We imagine that each signal is well approximated by a sparse linear combination of the columns of an unknown M × N matrix A, i.e., xα ≈ Asα for all α = 1, ..., P, where sα is an unknown sparse N-dimensional vector. We refer to the columns of A as the dictionary elements. Thus, the nonzero coefficients of sα indicate which dictionary elements linearly combine to form the signal xα. Here N can be larger than M, in which case we are looking for an overcomplete basis, or dictionary, to represent the ensemble of signals. Given our training signals xα, we wish to find the sparse codes sα and dictionary A. These can potentially be found by minimizing the following energy function:

$$ E(s^{1}, \ldots, s^{P}, A) = \sum_{\alpha=1}^{P} \left( \|x^{\alpha} - A s^{\alpha}\|^{2} + \lambda \|s^{\alpha}\|_{1} \right), \qquad (4) $$

where ‖sα‖1 denotes the L1 norm of sα. For each α, this second term enforces the sparsity of the code, whereas the first quadratic cost term enforces the fidelity of the code and the dictionary. Subsequent work (Kreutz-Delgado et al. 2003; Aharon et al. 2006a,b) has extended this basic formalism as well as derived efficient algorithms for solving Equation 4. Moreover, Aharon et al. (2006b), Isely et al. (2010), and Hillar & Sommer (2011) have recently shown that if the signals xα are indeed generated by sparse noiseless codes through a dictionary A, under certain conditions related to CS, dictionary learning will recover A, up to permutations and scalings of its columns.
COMPRESSED SENSING OF THE BRAIN

Rapid Functional Imaging
In many ways, magnetic resonance imaging (MRI) is a well-suited application for CS (Lustig et al. 2008). In MRI, a strong static magnetic field with a linear spatial gradient, ΔH, causes magnetic dipoles in a tissue sample to align with the magnetic field. A radio frequency excitation pulse then generates a transverse complex magnetic moment at location r, with amplitude m(r) and a phase φ(r) proportional to r · ΔH. Depending on the sample preparation, the amplitudes m(r) correlate with various local properties of interest. For example, in functional MRI, it correlates with the concentration of oxygenated hemoglobin, which in turn increases in response to neural activity. Thus, the measurement goal is to extract the spatial profile of m(r). A detector coil measures the spatial integral of the complex magnetization. Hence, it essentially measures a spatial Fourier transform of the profile with a Fourier wave vector k = (kx, ky, kz) ∝ ΔH.

The traditional approach to MR imaging has been to sample the image densely through a regular lattice in Fourier wave vector space, or k-space, by generating a sequence of static linear gradient fields and radio frequency pulses. If the Fourier space is sampled at the Nyquist-Shannon rate, then one can perform a linear reconstruction of the image m(r) simply by performing an inverse Fourier transform of the measurements. However, acquiring each Fourier sample can take time, so any method to reduce the number of such samples can dramatically reduce patient time in scanners, as well as increase the temporal resolution of dynamic imaging.

CS provides an interesting approach to reducing the number of measurements. In the CS framework, the measurement basis B in Figure 1 consists of Fourier modes. CS will work well if the MRI image is sparse in a basis C that is incoherent with respect to B. For example, many MRI images, such as angiograms, are sparse in the position, or pixel, basis.
For such images, one can subsample random trajectories in k-space and use nonlinear L1 reconstruction to recover the image. For appropriately chosen random trajectories, one can obtain high-quality images using a tenth of the number of measurements required in the traditional approach (Lustig et al. 2008). Similarly, brain images are often sparse in a wavelet basis, and for such images, random trajectories in k-space can be found that speed up the rate at which images can be acquired by a factor of 2.4 compared with the traditional approach (Lustig et al. 2007). Moreover, dynamic movies of oscillatory phenomena that are sparse in the temporal frequency domain can be obtained at high temporal resolution by sampling randomly both in k-space and in time (Parrish & Hu 1995).
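A toy one-dimensional version of this idea (our illustration; the signal, sampling pattern, and reconstruction loop are arbitrary stand-ins for a real CS-MRI pipeline) samples a pixel-sparse profile at a random subset of Fourier frequencies and reconstructs it by iterative soft-thresholding.

```python
import numpy as np

rng = np.random.default_rng(6)
N, M, K = 256, 64, 6                             # image size, k-space samples, nonzero pixels

m = np.zeros(N)
m[rng.choice(N, K, replace=False)] = 1.0 + rng.random(K)   # sparse "angiogram-like" profile

F = np.fft.fft(np.eye(N)) / np.sqrt(N)           # unitary discrete Fourier transform matrix
rows = rng.choice(N, M, replace=False)           # randomly chosen k-space frequencies
A = F[rows, :]                                   # M x N partial Fourier measurement matrix
x = A @ m                                        # complex k-space measurements

lam, eta = 0.02, 1.0                             # threshold and step size (||A|| <= 1 for a unitary subset)
s = np.zeros(N, dtype=complex)
for _ in range(300):
    r = s + eta * A.conj().T @ (x - A @ s)       # gradient step toward data consistency
    s = np.exp(1j * np.angle(r)) * np.maximum(np.abs(r) - eta * lam, 0.0)   # complex soft-threshold

print("relative reconstruction error:", np.linalg.norm(s.real - m) / np.linalg.norm(m))
```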
Fluorescence Microscopy
Simultaneously imaging the dynamics of multiple molecular species at both high spatial and temporal resolution is a central goal of cellular microscopy. CS-inspired technologies such as single-pixel cameras (Takhar et al. 2006, Duarte et al. 2008) combined with fluorescence microscopy techniques (Wilt et al. 2009, Taraska & Zagotta 2010) provide one promising route toward such a goal (Coskun et al. 2010; E. Candes, personal communication). In fluorescence imaging, multiple molecular species can be tagged with markers capable of emitting light at different frequencies. Imaging the molecules then requires two key steps: First, the sample must be illuminated with light, causing the tagged species to fluoresce, and second, the emitted photons from the fluorescent species must be detected. Traditionally, two main methods have been used to accomplish both steps. In widefield (WF) microscopy, the entire image is illuminated at once, and a large array of detectors records the emitted photons. In raster scan (RS) microscopy, each point of the image is illuminated in sequence, so only one detector is required to collect the emitted photons at any given time.

WF can achieve high temporal resolution but requires many photodetectors for high spatial resolution. This is problematic for imaging applications in which photons at many different frequencies, corresponding to different molecules, need to be simultaneously measured. This requires a prohibitively expensive high-density array of photodetectors that can perform hyperspectral imaging, i.e., measure many spectral channels at once. One could employ a single such detector in RS mode, but then achieving high spatial resolution comes at the cost of low temporal resolution because of the required number of raster scans.

The single-pixel-camera approach exploits the potential spatial sparsity of a fluorescence image to achieve both high spatial and temporal resolution. In this approach, the image is illuminated using a sequence of random light patterns. This can be achieved by a digital micromirror device (DMD), which consists of a spatial array of micrometer scale mirrors whose angles can be rapidly and individually adjusted. Light is reflected off this array into the sample, and on each trial, a different configuration of mirrors leads to a different pattern of illumination. A single hyperspectral photodetector (the single pixel) then measures the total emitted fluorescence. Owing to the randomness of the light patterns, the image can be reconstructed at the micrometer spatial resolution of the DMD using a number of measurements that is much smaller than the number of pixels (or resolvable spatial locations) in the image. Thus compressive imaging retains the relative speed and resolution of WF and the simplicity and achievable spectral range of RS. As such, this rapidly evolving method has the potential to open up new experimental windows into the dynamics of intracellular molecular cascades within neurons.
Gene-Expression Analysis
The use of microarrays to collect large-scale data sets of gene-expression levels across many brain regions is now a well-established enterprise in neuroscience. Suppose we want to measure a vector s0 of concentrations of N genetic sequences in a sample.
A microarray consists of N spots, indexed by i = 1, ..., N, where each spot i contains a unique complementary sequence that will specifically bind with the sequence i in the sample. All N genetic sequences of interest in the sample are fluorescently tagged and exposed to all the spots. Each spot binds a specific sequence, and after the excess unbound DNA is washed off, the vector of concentrations s0 can be read off by imaging the fluorescence levels of the spots.

Often this procedure is highly inefficient because any particular sample will contain only a few genetic sequences of interest, i.e., the concentration vector s0 is sparse. Dai et al. (2009) proposed a CS-based approach in which one can use M < N spots, where each spot contains a random subset of the N sequences of interest. Thus each spot, now indexed by µ = 1, ..., M, is characterized by an N-dimensional measurement vector aµ, where the component aµi reflects the binding affinity of sequence i in the sample to the contents of spot µ. After the CS microarray is exposed to the sample, the M-dimensional vector of fluorescence levels x is approximately related to the sample concentration s0 through the linear relation x = As0, where the rows of A are the measurement vectors aµ. Thus if each spot contains enough randomly chosen complementary sequences, such that the measurements are incoherent with regard to the basis of sequences, one can use the LASSO method in Equation 3 to recover the concentrations s0 from the fluorescence measurements x. Dai et al. (2009) do a thorough analysis of this basic framework. Overall, reducing the number of spots required to collect gene expression data reduces both the cost and the size of the array, as well as the amount of biological sample material required to make accurate concentration measurements.
Compressed Connectomics
The problem of reconstructing functional circuit connectivity from recordings of neuronal postsynaptic responses presents a considerable challenge to neuroscience. Consider, for example, a simple scenario in which we have a population of N neurons that are potentially presynaptic to a given neuron whose membrane voltage x we can record intracellularly. The synaptic strengths from the N neurons to the recorded neuron form an unknown N-dimensional vector s0. The traditional approach to estimating this set of synaptic strengths is to excite each potential presynaptic neuron one by one and record the resultant postsynaptic membrane voltage x. Each such measurement reveals the strength of one synapse. This brute-force approach is highly inefficient because the synaptic connectivity s0 is often sparse, with only K < N nonzero elements, where K/N is ∼10%. Thus most measurements would simply yield 0.

Hu & Chklovskii (2009) propose a CS-based approach to recovering s0 by randomly stimulating F neurons out of N on any given trial µ. This method corresponds to a random measurement matrix A characterized by F nonzero entries per row. Given that the true weight vector s0 is sparse, Hu & Chklovskii (2009) propose to use L1 minimization in Equation 2 to recover s0 from knowledge of the inputs A and outputs x. The authors find for a wide range of parameters that F/N = 0.1 minimizes the required number of measurements, M, and for this value of F, M = O(K log N) measurements are required to recover s0. Thus random stimulation of 10% of the population constitutes an effective measurement basis for CS of synaptic connectivity (Hu & Chklovskii 2009). Alternative ideas have been proposed for CS of connectivity using fluorescent synaptic markers (Mishchenko 2011).
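A minimal simulation of this random-stimulation idea (our illustration; the network size, sparsity, noise level, and the scikit-learn LASSO solver are arbitrary choices, not the authors' implementation) stimulates a random 10% of candidate presynaptic neurons on each trial and recovers the sparse weight vector by L1-regularized regression.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
N, M, K, F = 500, 150, 20, 50                    # candidate neurons, trials, true synapses, stimulated per trial

s0 = np.zeros(N)
s0[rng.choice(N, K, replace=False)] = np.abs(rng.standard_normal(K))   # sparse synaptic weights
A = np.zeros((M, N))
for mu in range(M):
    A[mu, rng.choice(N, F, replace=False)] = 1.0  # which neurons are stimulated on trial mu
x = A @ s0 + 0.05 * rng.standard_normal(M)        # noisy postsynaptic responses

s_hat = Lasso(alpha=0.01, max_iter=10000).fit(A, x).coef_
detected = np.abs(s_hat) > 1e-2
print("true synapses detected:", int(np.sum(detected & (s0 > 0))), "of", K)
```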
COMPRESSED SENSING BY THE BRAIN
The problem of storing, communicating, and processing high-dimensional neural activity patterns, or external stimuli, presents a fundamental challenge to any neural system.
This challenge is complicated by the widespread existence of convergent pathways, or bottlenecks, in which information stored in a large number of neurons is often compressed into a small number of axons, or neurons, in a downstream system. For example, 1 million optic nerve fibers carry information about the activity of 100 times as many photoreceptors. Only 1 million pyramidal tract fibers carry information from motor cortex to the spinal cord. And corticobasal ganglia pathways undergo a 10–1,000-fold convergence. In this section we review how the theory of CS and RPs yields theoretical insight into how efficient storage, communication, and computation are possible despite drastic reductions in the dimensionality of neural representations through information bottlenecks.

Semantic Similarity and Random Projections
How much can a neural system reduce the dimensionality of its activity patterns without incurring a large loss in its ability to perform relevant computations? A plausible minimal requirement is that any reduction through a convergent pathway should preserve the similarity structure of the neuronal representations at the source area. This requirement is motivated by the observation that in higher perceptual or association areas in the brain, semantically similar objects elicit similar neural activity patterns (Kiani et al. 2007). This similarity structure of the neural code is likely the basis of our ability to categorize objects and generalize appropriate responses to new objects (Rogers & McClelland 2004). Moreover, this similarity structure is remarkably preserved across monkeys and humans, for example, in image representations in the inferotemporal (IT) cortex (Kriegeskorte et al. 2008).

When a semantic task involves a finite number of activity patterns, or objects, the JL lemma discussed above implies that the required communication resources vary only logarithmically with the number of patterns, independent of how many neurons are involved in the source area. For example, suppose 20,000 images can be represented by the corresponding population activity patterns in the IT cortex. Then the similarity structure between all pairs of images can be preserved to 10% precision in a downstream area using only ∼1000 neurons. Furthermore, this result can be achieved with a very simple dimensionality-reduction scheme, namely by a random synaptic connectivity matrix. Moreover, any computation that relies on similarity structure, and can be solved by the IT cortex, can also be solved by the downstream region.

A more stringent challenge occurs when convergent pathways must preserve the similarity structure of not just a finite set of neuronal activity patterns, but an arbitrarily large, possibly infinite, number of patterns, as is likely the case in any pathway that represents information about continuous families of stimuli. The theories of CS and RPs of manifolds discussed above reveal that again drastic compression is possible if the corresponding neural patterns are sparse or lie on a low-dimensional manifold (for example, as in Figure 4a–c). In this case, the number of required neurons in a randomly connected downstream area is proportional to the intrinsic dimension of the ensemble of neural activity patterns and depends only weakly (logarithmically) on the number of neurons in the source area.

Hidden low-dimensional structure in neural activity patterns has been found in several systems (Ganguli et al. 2008a, Yu et al. 2009, Machens et al. 2010), and moreover, intrinsic spatiotemporal fluctuations exhibited in many models of recurrent neuronal circuits, including chaotic networks, are low dimensional (Rajan et al. 2010, Sussillo & Abbott 2009). The ubiquity of this low-dimensional structure in neuronal systems may be intimately related to the requirement of communication and computation through widespread anatomical bottlenecks.
Short-Term Memory in Neuronal Networks
Another bottleneck is posed by the task of working memory, where streams of sensory inputs must presumably be stored within the dynamic reverberations of neuronal circuits.
This is a bottleneck from time into space: Long temporal streams of input must be stored in the instantaneous spatial activity patterns of a limited number of neurons. The influential idea of attractor dynamics (Hopfield 1982) suggests how single stimuli can be stored as stable patterns of activity, or fixed points, but such simple fixed points are incapable of storing temporal sequences of information, like an ongoing sentence, song, or motion trajectory. More recent proposals (Jaeger 2001, Maass et al. 2002, Jaeger & Haas 2004) suggest that recurrent networks could store temporal sequences of inputs in their ongoing, transient activity. This new paradigm raises several theoretical questions about how long memory traces can last in such networks, as functions of the network size, connectivity, and input statistics. Several studies have addressed these questions in the case of simple linear neuronal networks and Gaussian input statistics. These studies show that the duration of memory traces in any network cannot exceed the number of neurons (in units of the intrinsic time constant) (Jaeger 2001, White et al. 2004) and that no network can outperform an equivalent delay line or a nonnormal network, characterized by a hidden feedforward structure (Ganguli et al. 2008b).

However, a more ethologically relevant temporal input statistic is that of a sparse, non-Gaussian sequence. Indeed a wide variety of temporal signals of interest are sparse in some basis, for example, human speech in a wavelet basis. Recent work (Ganguli & Sompolinsky 2010a) has derived a connection between CS and short-term memory by showing that recurrent neuronal networks can essentially perform online, dynamical compressed sensing of an incoming sparse sequence, yielding sequence memory traces that are longer than the number of neurons, again in units of the intrinsic time constant. In particular, neuronal circuits with M neurons can remember sparse sequences, which have a probability f of being nonzero at any given time, for an amount of time that is $O\!\left(\frac{M}{f \log(1/f)}\right)$. This enhanced capacity cannot be attained by purely feedforward networks, or random Gaussian network connectivities, but requires antisymmetric connectivity matrices that generate complex transient activity patterns and diverse temporal filtering properties.
SPARSE EXPANDED NEURONAL REPRESENTATIONS
In the previous section, we have discussed how CS and RPs can explain how convergent pathways can compress neuronal representations. However, in many computations, neural systems may need to expand these low-dimensional compressed representations back into high-dimensional sparse ones. For example, such representations reduce the overlap between activity patterns, thereby simplifying the tasks of learning, discrimination, categorization, noise filtering, and multiscale stimulus representation. Indeed, like convergence, the expansion of neural representations through divergent pathways is a widespread anatomical motif. For example, information in 1 million optic nerve fibers is expanded into more than 100 million primary visual cortical neurons. Also in the cerebellum, a small number of mossy fibers target a large number of granule cells, creating a 100-fold expansion.

How do neural circuits transform compressed dense codes into expanded sparse ones? A simple mechanism would be to project the dense activity patterns into a larger pool of neurons via random divergent projections and use high spiking thresholds to ensure sparsity of the target activity patterns. Indeed, Marr (1969) suggested this mechanism in his influential hypothesis that the granule cell layer in the cerebellar cortex performs sparse coding of dense stimulus representations in incoming mossy fibers to facilitate learning of sensorimotor associations at the Purkinje cell layer. Although random expansion may work for some computations, sparse codes are generally most useful when they represent essential sparse features of the compressed signal. In the next sections, we review how CS methods for generating sparse expanded representations, which faithfully capture hidden structures in compressed data, can operate within neural systems.
Figure 5: Neural L1 minimization and long-range brain communication. (a) A two-layer circuit for performing L1 minimization and dictionary learning. (b) Nonlinear transfer function from inputs to firing rates of neurons in the second layer in panel a. (c) A scheme for efficient long-range brain communication in which sparse activity s0 is compressed to a low-dimensional dense representation x in a source area and efficiently communicated downstream to a target area with a small number of axons, where it could be re-expanded into a new sparse representation u through a dictionary learning circuit as in panel a.
Neuronal Implementations of L1 Minimization
Given that solving the optimization problem in Equation 3 with V(s) = |s| has proven to be an efficient method for sparse signal reconstruction, whether neuronal circuits can perform this computation is a natural question. Here we describe one plausible two-layer circuit solution (see Figure 5a), proposed by Rozell et al. (2008) and inspired by gradient descent in s on the cost function in Equation 3. Suppose that the low, M-dimensional input x is represented in the first layer by the firing rates of a population of M neurons such that the µth input neuron has firing rate x_µ. Now suppose that the reconstructed sparse signal is represented by a larger population of N neurons, where s_i is the firing rate of neuron i. In this population, we denote the synaptic potential of each neuron by v_i, which determines the neuron's firing rate via a static nonlinearity F, s_i = F(v_i).
The synaptic connectivity from the M input neurons to the N second-layer neurons computing the sparse representation s is given by the N × M matrix A^T, such that the ith column of A, a_i, denotes the set of M synaptic weights from the input neurons to neuron i in the second layer. Finally, assume there is lateral inhibition between any pair of neurons i and j in the second layer, governed by synaptic weights L_ij, which are related to the feedforward weight vectors of the pair of neurons through L_ij = a_i · a_j. Then the internal dynamics of the second-layer neurons obey the differential equations

τ dv_i/dt = −v_i + a_i · x − Σ_{j≠i} L_ij s_j,   (5)
where x is the activity of the input layer. Rozell et al. (2008) found that, for an appropriate choice of the static nonlinearity, this dynamic is similar to a gradient descent on the cost function given by Equation 3. In particular, for L1 minimization, the static nonlinearity F is simply a threshold-linear function with threshold λ and gain 1 (see Figure 5b).
To obtain a qualitative understanding of this circuit, consider what happens when the second-layer activity pattern is initially inactive, so that s = 0, and an input x occurs in the first layer. Then the internal variable v_i(t) of each second-layer neuron i will charge up at a rate controlled by the overlap of the input x with the synaptic weight vector a_i, which is closely related to the receptive field (RF) of neuron i.
As neuron i's internal activation crosses the threshold λ, it starts to fire and inhibits neurons with RFs similar to a_i. This sets up a competitive dynamic in which a small number of neurons with RFs similar to the input x come to represent it, yielding a sparse representation ŝ of the input x, which is the solution to Equation 3. In the case of zero noise, the above circuit dynamic needs to be supplemented with an appropriate dynamic update of the threshold λ, which eventually approaches zero at the fixed point (Donoho et al. 2009). Finally, we note that several works (Olshausen et al. 1996, Perrinet 2010) have proposed synaptic Hebbian plasticity and homeostasis rules that supplement Equation 5 and allow the circuit to solve the full dictionary learning problem, Equation 4, without prior knowledge of A.
An intriguing feature of the above dynamic is that the inhibitory recurrent connections are tightly related to the feedforward excitatory drive. Koulakov & Rinberg (2011) suggest that exactly this computation may be implemented in the rodent olfactory bulb. They propose that reciprocal dendrodendritic synaptic coupling between mitral cells and granule cells yields an effective lateral inhibition between granule cells that is related to the feedforward drive from mitral cells to granule cells, in accordance with the requirements of Equation 5. Thus the composite olfactory circuit builds up a sparse code for odors in the granule cell population. Likewise, Hu et al. (2011) proposed that sparse coding is implemented within the amacrine/horizontal cell layers in the retina.
Compression and Expansion in Long-Range Brain Communication
A series of papers (Coulter et al. 2010, Isely et al. 2010, Hillar & Sommer 2011) have integrated the two dual aspects of CS theory (dimensionality reduction of sparse neural representations, and the recoding of stimuli in sparse overcomplete representations) into a theory of efficient long-range brain communication (see also Tarifi et al. 2011). According to this theory (see Figure 5c), each area in a long-range communication pathway has both dense and sparse representations. Local sparse representations are first compressed to communicate them using a small number of axons and potentially re-expanded in a downstream area.
Where in the brain might these transformations occur? Coulter et al. (2010) predict that this could occur within every cortical column, with compressive projections, possibly random, occurring between more superficial cortical layers and the output layer 5. A key testable physiological prediction would then be that activity in more superficial layers is sparser than activity in deeper output layers. Another possibility is the transformation from sparse, high-dimensional representations of space in the CA3/CA1 fields of the hippocampus to denser, lower-dimensional representations of space in the subiculum, which constitutes the major output structure of the hippocampus. A functional explanation for this representational dichotomy could be that the hippocampus is performing an RP from CA3/CA1 to the subiculum, thereby minimizing the number of axons required to communicate the results of hippocampal computations to the rest of the brain.
Overall, these works suggest more generally that random compression and sparse coding can be combined to yield computational strategies for efficient use of the limited bandwidth available for long-range brain communication.
LEARNING IN HIGH-DIMENSIONAL SYNAPTIC WEIGHT SPACES
Learning new skills and knowledge is thought to be achieved by continuous synaptic modifications that explore the space of possible neuronal circuits, selecting through experience those that are well adapted to the given task. We review how regularization techniques used by statisticians to learn high-dimensional statistical models from limited amounts of data can also be employed by synaptic learning rules to search efficiently the high-dimensional space of synaptic patterns and learn appropriate rules from limited experience.
Neural Learning of Classification
A simple model of neural decision making and classification is a single-layer feedforward network in which the postsynaptic potential of the readout neuron is a sum of the activity of its afferents, weighted by a set of synaptic weights, and the decision is signaled by firing or not firing, depending on whether the potential reaches threshold. Such a model is equivalent to the classical perceptron (Rosenblatt 1958). Computationally, this model classifies N-dimensional input patterns into two categories separated by a hyperplane determined by the synaptic weights. These weights are learned through experience-dependent modifications based on a set of M training input examples and their correct classifications. Of course, the goal of any organism is not to classify past experience correctly, but rather to generalize to novel experience. Thus an important measure of learning performance is the generalization error, or the probability of incorrectly classifying a novel input, and a central question of learning theory is how many examples M are required to achieve a good generalization error given N unknown synaptic weights that need to be learned.
This question has been studied exhaustively (Gardner 1988, Seung et al. 1992; see Engel & den Broeck 2001 for an overview), and the general consensus is that, for a wide variety of learning rules, a small generalization error can be achieved only when the number of examples M is larger than the number of synapses N. This result has striking implications because it suggests that learning may suffer from a curse of dimensionality: Given the large number of synapses involved in any task, this theory suggests we need an equally large number of training examples to learn any task.
Recent work (Lage-Castellanos et al. 2009) has considered the case in which a categorization task can be realized by a sparse synaptic weight vector, meaning that only a subset of inputs are task relevant, though which subset is a priori unknown. The authors showed that a simple learning rule that involves minimization of the classification error on the training set, plus an L1 regularization on the synaptic weights of the perceptron, yields a good generalization error even when the number of examples is less than the number of synapses. Thus a sparsity prior is one route to combat the curse of dimensionality in learning tasks that are realizable by a sparse rule.
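A small numerical sketch of this effect can be built with scikit-learn's L1-penalized logistic regression as a stand-in for the regularized perceptron rule described above; the sparse teacher, sparsity level, and regularization strength are all illustrative assumptions, and exact error values vary with the random draw.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
N, M, k = 500, 200, 10              # synapses, training examples, task-relevant inputs

# Sparse "teacher": only k of the N inputs determine the correct classification.
w_true = np.zeros(N)
w_true[rng.choice(N, k, replace=False)] = rng.standard_normal(k)

X_train = rng.standard_normal((M, N))
y_train = np.sign(X_train @ w_true)
X_test = rng.standard_normal((5000, N))
y_test = np.sign(X_test @ w_true)

# Compare a sparsity (L1) prior with a standard L2 prior, with M < N examples.
for penalty in ("l1", "l2"):
    clf = LogisticRegression(penalty=penalty, C=1.0, solver="liblinear")
    clf.fit(X_train, y_train)
    err = np.mean(clf.predict(X_test) != y_test)
    print(f"{penalty}-regularized classifier: generalization error ~ {err:.3f}")
```

With fewer examples than weights, the L1-regularized learner typically generalizes far better than the L2-regularized one on such sparse-rule tasks, consistent with the result described above.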
Optimality and Sparsity of Synaptic Weights
Consider again the perceptron learning to classify a finite set of M input patterns. In general, many synaptic weight vectors will classify these inputs correctly. We can, however, look for the optimal weight vector that maximizes the margin, or the minimal distance between input patterns and the category boundary. For such weights, the induced synaptic potentials are as far as possible from threshold, and the resultant classifications yield good generalization and noise tolerance (Vapnik 1998).
A remarkable theoretical result is that if synapses are constrained to be either excitatory or inhibitory, then near capacity, the optimal solution is sparse, with most of the synapses silent (Brunel et al. 2004), even if the input patterns themselves show no obvious sparse structure. This result has been proposed as a functional explanation for the abundance of silent synapses in the cerebellum and other brain areas.
When the signs of the weights are unconstrained, the optimal solutions are still sparse, but not in the basis of neurons. Instead, the optimal weight vector can be expressed as a linear combination of a small number of input patterns, known as support vectors, the number of support vectors being much smaller than their dimensionality. Indeed, several powerful learning algorithms, including support vector machines (SVMs) (see Burges 1998, Vapnik 1998, Smola 2000 for reviews), exploit this form of sparsity to achieve good generalization from relatively few high-dimensional examples.
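This form of sparsity can be seen directly with a standard SVM implementation; the following scikit-learn sketch fits a linear max-margin classifier to two well-separated clusters and counts the support vectors that define the boundary. The data-generation scheme and parameters are illustrative, and the exact support-vector count varies with the random draw.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
N, M = 400, 200                        # input dimension, number of training patterns

# Two classes drawn around well-separated centers +/- mu (a large-margin task).
mu = rng.standard_normal(N)
mu *= 6.0 / np.linalg.norm(mu)
y = np.concatenate([np.ones(M // 2), -np.ones(M // 2)])
X = y[:, None] * mu + rng.standard_normal((M, N))

# Linear max-margin classifier; its weight vector is a combination of the
# support vectors only.
clf = SVC(kernel="linear", C=100.0)
clf.fit(X, y)

print(f"{M} training patterns in {N} dimensions; "
      f"{clf.support_vectors_.shape[0]} support vectors define the boundary")
```

For such large-margin tasks, the learned weight vector is typically supported by only a small subset of the training patterns, far fewer than their dimensionality.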
Finally, because a sufficiently large number of RPs preserve Euclidean distances, they incur only a modest reduction in the margin of the optimal category boundary separating classes (Blum 2006). Hence, classification problems can also be learned directly in a low-dimensional space. In summary, there is an interesting interplay among sparsity, dimensionality, and the learnability of high-dimensional classification problems: Any such rapidly learnable problem (i.e., one with a large margin) is both (a) sparse, in the sense that its solution can be expressed in terms of a sparse linear combination of input patterns, and (b) low-dimensional, in the sense that it can be learned in a compressed space after an RP.
DISCUSSION
Dimensionality Reduction: CS versus Efficient Coding
Efficient coding theories (Barlow 1961, Atick 1992, Atick & Redlich 1992, Barlow 2001) suggest that information bottlenecks in the brain perform optimal dimensionality reduction by maximizing mutual information between the low-dimensional output and the high-dimensional input (Linsker 1990). The predictions of such information maximization theories depend on assumptions about input statistics, neural noise, and metabolic constraints. In particular, infomax theories of early vision, based on Gaussian signal and noise assumptions, predict that high-dimensional spatiotemporal patterns of photoreceptor activation should be projected onto the linear subspace of their largest principal components. Furthermore, the individual projection vectors, i.e., retinal ganglion cell (RGC) RFs, depend on the stimulus SNR; in particular, at a high SNR, RFs should decorrelate or whiten the stimulus. This is consistent with the center-surround arrangement of RFs, which removes much of the low-frequency correlations in natural images (Atick 1992, Atick & Redlich 1992, Borghuis et al. 2008).
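The whitening operation referred to here can be illustrated with a brief numerical example: a correlated Gaussian "stimulus" is decorrelated by a transform built from the eigendecomposition of its covariance. The covariance model and dimensions are illustrative assumptions, not a model of retinal processing.

```python
import numpy as np

rng = np.random.default_rng(5)
D, n = 20, 50000

# Correlated "stimulus" with a smoothly decaying covariance, a stand-in for
# the strong low-frequency correlations of natural inputs.
C = np.array([[0.9 ** abs(i - j) for j in range(D)] for i in range(D)])
X = rng.multivariate_normal(np.zeros(D), C, size=n)

# Whitening transform from the eigendecomposition of the sample covariance.
evals, evecs = np.linalg.eigh(np.cov(X.T))
W = evecs @ np.diag(evals ** -0.5) @ evecs.T    # symmetric (ZCA-style) whitening
Z = X @ W.T

residual = np.cov(Z.T) - np.eye(D)
print("largest residual correlation after whitening:", np.abs(residual).max())
```

After the transform, the covariance of the output is essentially the identity, i.e., the second-order correlations have been removed, which is the sense in which decorrelating RFs whiten the stimulus.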
What is the relation between infomax theories and CS? According to CS theory, for sparse inputs, close-to-optimal dimensionality reduction is achieved when the projection vectors are maximally incoherent with respect to the basis in which the stimulus is sparse. Assuming visual stimuli are approximately sparse in a wavelet or Gabor-like basis, incoherent projections are likely to be spatially distributed. If sparseness is a prominent feature of natural visual spatiotemporal signals, how can we reconcile the observed RGC center-surround RFs with the demand for incoherence? Incoherent or random projections are optimal for signal ensembles composed of a combination of a few feature vectors in which the identity of these vectors varies across signals. This may be an adequate description of natural images after whitening. However, natural images prior to whitening have strong second-order correlations, implying that they lie close to a low-dimensional linear space given by their principal components. Thus, the ensemble of natural images is characterized by both linear low-dimensional structure and sparse structure imposed by higher-order statistics. In such ensembles, whether sensory stimuli or neuronal activity patterns, when second-order correlations are strong enough, the optimal dimensionality reduction may indeed be close to that predicted by Gaussian-based infomax, as has been argued in recent work (Weiss et al. 2007).
Expansion and Sparsification: Compressed Sensing versus Independent Components Analysis
What does efficient coding theory predict regarding the recoding of signals through expansive transformations, for example, from the optic nerve to visual cortex? Several modern efficient coding theories, such as basis pursuit, independent components analysis (ICA), maximizing non-Gaussianity, and others, suggest that even after decorrelation, natural images include higher-order statistical dependencies that arise through linear mixing of statistically independent sources. The role of the cortical representation is to further reduce the
redundancy of the signal by separating the mixed signal into its independent causes (i.e., an unmixing operation), essentially generating a factorial statistical representation of the signal.
The application of ICA to natural images and movies yields, at the output layer, single-neuron response histograms that are considerably sparser than those in the input layer. These output units have Gabor-like RFs similar to those of simple cells in V1 (Olshausen et al. 1996, Bell & Sejnowski 1997, van Hateren & Ruderman 1998, van Hateren & van der Schaaf 1998, Simoncelli & Olshausen 2001, Hyvarinen 2010). ICA algorithms have also been applied to natural sounds (Lewicki 2002), yielding a set of temporal filters resembling auditory cortical RFs.
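The sparsification of response histograms by ICA can be demonstrated on synthetic data; the following sketch mixes heavy-tailed (sparse) independent sources into dense observations and shows that the components recovered by FastICA are much more kurtotic than the mixtures. The Laplacian source model and dimensions are illustrative assumptions, not natural-image statistics.

```python
import numpy as np
from scipy.stats import kurtosis
from sklearn.decomposition import FastICA

rng = np.random.default_rng(6)
n, D = 20000, 16

S = rng.laplace(size=(n, D))          # sparse (heavy-tailed) independent sources
A = rng.standard_normal((D, D))       # unknown mixing matrix
X = S @ A.T                           # dense, nearly Gaussian mixtures

ica = FastICA(n_components=D, random_state=0)
S_hat = ica.fit_transform(X)

# Excess kurtosis: near zero for the mixtures, large for the unmixed components.
print("mean excess kurtosis of mixtures:   ", kurtosis(X).mean())
print("mean excess kurtosis of ICA outputs:", kurtosis(S_hat).mean())
```

The recovered components have sharply peaked, heavy-tailed histograms, the same qualitative signature reported when ICA is applied to natural images.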
Although the algorithms and results of ICA and source extraction by CS are often similar, there are important differences. First, CS results in signals that are truly sparse, i.e., most of the coefficients are zero, whereas ICA algorithms generally yield signals with many small values, i.e., distributions with high kurtosis in which no coefficients vanish (Olshausen et al. 1996, Bell & Sejnowski 1997, Hyvarinen 2010). Second, ICA emphasizes the statistical independence of the unmixed sources (Barlow 2001). Sparseness is a special case; ICA can be applied to reconstruct dense sources as well. In contrast, signal extraction by CS relies only on the assumed approximate sparseness of the signal, and not on any statistical priors, and is similar in spirit to the seminal work of Olshausen et al. (1996). Indeed, a recent study suggests that sparseness may be a more useful notion than independence and that the success of ICA in some applications is due to its ability to generate sparse representations rather than to discover statistically independent features (Daubechies et al. 2009).
Beyond Linear Projections: Neuronal Nonlinearities
The abundance of nonlinearities in neuronal signaling raises the question of the relevance of the CS linear projections to neuronal information processing. One fundamental nonlinearity is the input-output relation between synaptic potentials and action potential firing of individual neurons. This nonlinearity is often approximated by the linear-nonlinear (LN) model (Dayan & Abbott 2001, Ostojic & Brunel 2011), in which the firing rate of a neuron, x, is related to its input activity a through x = σ(a · s0), where s0 is the neuron's spatiotemporal linear filter and σ(·) is a scalar sigmoidal function. As long as σ(·) is an invertible function of its input, the nonlinearity in the measurement can be undone to recover the fundamental linear relation between the synaptic input to the neuron and the source, given by As0; hence, the results of CS should hold. More generally, it will be an important challenge to evaluate the role of dimensionality reduction, expansion, and sparse coding in neuronal circuit models that incorporate additional nonlinearities, including nonlinear temporal coding of inputs, synaptic depression and facilitation, and nonlinear feedback dynamics through recurrent connections.
In summary, we have reviewed a relatively new set of surprising mathematical phenomena related to RPs of high-dimensional patterns. But far from being a set of intellectual curiosities, these phenomena have important practical implications for data acquisition and analysis and important conceptual implications for neuronal information processing. It is likely that more surprises await us, lurking in the properties of high-dimensional spaces and mappings, properties that could further change the way we measure, analyze, and understand the brain.
DISCLOSURE STATEMENT
The authors are not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.
ACKNOWLEDGMENTS
S.G. and H.S. thank the Swartz Foundation, Burroughs Wellcome Foundation, Israeli Science Foundation, Israeli Defense Ministry (MAFAT), the McDonnell Foundation, and the Gatsby Charitable Foundation for support, and we thank Daniel Lee for useful discussions.
LITERATURE CITED
Aharon M, Elad M, Bruckstein A. 2006a. K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Proc. 54(11):4311
Aharon M, Elad M, Bruckstein A. 2006b. On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them. Linear Algebr. Appl. 416(1):48–67
Atick J. 1992. Could information theory provide an ecological theory of sensory processing? Netw. Comput. Neural Syst. 3(2):213–51
Atick J, Redlich A. 1992. What does the retina know about natural scenes? Neural Comput. 4(2):196–210
Baraniuk R. 2007. Compressive sensing. Signal Proc. Mag. IEEE 24(4):118–21
Baraniuk R. 2011. More is less: signal processing and the data deluge. Science 331(6018):717–19
Baraniuk R, Cevher V, Wakin M. 2010. Low-dimensional models for dimensionality reduction and signal recovery: a geometric perspective. Proc. IEEE 98(6):959–71
Baraniuk R, Davenport M, DeVore R, Wakin M. 2008. A simple proof of the restricted isometry property for random matrices. Constr. Approx. 28(3):253–63
Baraniuk R, Wakin M. 2009. Random projections of smooth manifolds. Found. Comput. Math. 9(1):51–77
Barlow H. 1961. Possible principles underlying the transformation of sensory messages. In Sensory Communication, ed. WA Rosenblith, pp. 217–34. New York: Wiley
Barlow H. 2001. Redundancy reduction revisited. Netw. Comput. Neural Syst. 12(3):241–53
Bayati M, Bento J, Montanari A. 2010. The LASSO risk: asymptotic results and real world examples. Neural Inf. Process. Syst. (NIPS)
Bell A, Sejnowski T. 1997. The independent components of natural scenes are edge filters. Vis. Res. 37(23):3327–38
Blum A. 2006. Random projection, margins, kernels, and feature-selection. In Subspace, Latent Structure and Feature Selection, ed. C Saunders, M Grobelnik, S Gunn, J Shawe-Taylor, pp. 52–68. Heidelberg, Germ.: Springer
Borghuis B, Ratliff C, Smith R, Sterling P, Balasubramanian V. 2008. Design of a neuronal array. J. Neurosci. 28(12):3178–89
Boyd S, Vandenberghe L. 2004. Convex Optimization. New York: Cambridge Univ. Press
Bruckstein A, Donoho D, Elad M. 2009. From sparse solutions of systems of equations to sparse modeling of signals and images. Siam Rev. 51(1):34–81
Brunel N, Hakim V, Isope P, Nadal J, Barbour B. 2004. Optimal information storage and the distribution of synaptic weights: perceptron versus Purkinje cell. Neuron 43(5):745–57
Burges C. 1998. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2(2):121–67
Candes E, Plan Y. 2010. A probabilistic and RIPless theory of compressed sensing. IEEE Trans. Inf. Theory 57(11):7235–54
Candes E, Romberg J. 2007. Sparsity and incoherence in compressive sampling. Invers. Probl. 23(3):969–85
Candes E, Romberg J, Tao T. 2006. Stable signal recovery from incomplete and inaccurate measurements. Commun. Pure Appl. Math. 59(8):1207–23
Candes E, Tao T. 2005. Decoding by linear programming. IEEE Trans. Inf. Theory 51:4203–15
Candes E, Tao T. 2006. Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. Inf. Theory 52(12):5406–25
Candes E, Wakin M. 2008. An introduction to compressive sampling. IEEE Sig. Proc. Mag. 25(2):21–30
Coskun A, Sencan I, Su T, Ozcan A. 2010. Lensless wide-field fluorescent imaging on a chip using compressive decoding of sparse objects. Opt. Express 18(10):10510–23
Coulter W, Hillar C, Isley G, Sommer F. 2010. Adaptive compressed sensing—a new class of self-organizing coding models for neuroscience. Presented at IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 5494–97
Dai W, Sheikh M, Milenkovic O, Baraniuk R. 2009. Compressive sensing DNA microarrays. EURASIP J. Bioinf. Syst. Biol. 2009:162824
Dasgupta S, Gupta A. 2003. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms 22(1):60–65
Daubechies I, Roussos E, Takerkart S, Benharrosh M, Golden C, et al. 2009. Independent component analysis for brain fMRI does not select for independence. Proc. Natl. Acad. Sci. 106(26):10415–20
Davenport M, Duarte M, Wakin M, Laska J, Takhar D, et al. 2007. The smashed filter for compressive classification and target recognition. Proc. Comput. Imaging V SPIE Electron Imaging, San Jose, CA
Dayan P, Abbott L. 2001. Theoretical Neuroscience: Computational and Mathematical Modelling of Neural Systems. Cambridge, MA: MIT Press
Donoho D. 2000. High-dimensional data analysis: the curses and blessings of dimensionality. AMS Math Challenges Lecture, pp. 1–32
Donoho D. 2006. Compressed sensing. IEEE Trans. Inf. Theory 52(4):1289–306
Donoho D, Maleki A, Montanari A. 2009. Message-passing algorithms for compressed sensing. Proc. Natl. Acad. Sci. USA 106(45):18914–19
Donoho D, Tanner J. 2005a. Neighborliness of randomly projected simplices in high dimensions. Proc. Natl. Acad. Sci. USA 102:9452–57
Donoho D, Tanner J. 2005b. Sparse nonnegative solution of underdetermined linear equations by linear programming. Proc. Natl. Acad. Sci. USA 102:9446–51
Duarte M, Davenport M, Takhar D, Laska J, Sun T, et al. 2008. Single-pixel imaging via compressive sampling. Signal Proc. Mag. IEEE 25(2):83–91
Duarte M, Davenport M, Wakin M, Baraniuk R. 2006. Sparse signal detection from incoherent projections. Proc. Acoust. Speech Signal Process. (ICASSP) 3:III–III
Duarte M, Davenport M, Wakin M, Laska J, Takhar D, et al. 2007. Multiscale random projections for compressive classification. Presented at IEEE Int. Conf. Image Process. (ICIP) Int. Conf. 6:VI161–64, San Antonio, TX
Efron B, Hastie T, Johnstone I, Tibshirani R. 2004. Least angle regression. Ann. Stat. 32(2):407–99
Engel A, den Broeck CV. 2001. Statistical Mechanics of Learning. London: Cambridge Univ. Press
Ganguli S, Bisley J, Roitman J, Shadlen M, Goldberg M, Miller K. 2008a. One-dimensional dynamics of attention and decision making in LIP. Neuron 58(1):15–25
Ganguli S, Huh D, Sompolinsky H. 2008b. Memory traces in dynamical systems. Proc. Natl. Acad. Sci. USA 105(48):18970–74
Ganguli S, Sompolinsky H. 2010a. Short-term me