KAMA-NNs: low-dimensional rotation based neural networks

Krzysztof Choromanski* (Google Brain Robotics), Aldo Pacchiano* (UC Berkeley), Jeffrey Pennington* (Google Brain), Yunhao Tang* (Columbia University)

* Equal contribution.
Abstract
We present new architectures for feedforward neural networks built from products of learned or random low-dimensional rotations that offer substantial space compression and computational speedups in comparison to unstructured baselines. Models using them are also competitive with the baselines and often, due to the imposed orthogonal structure, outperform the baselines accuracy-wise. We propose to use our architectures in two settings. We show that in the non-adaptive scenario (random neural networks) they lead to asymptotically more accurate, space-efficient and faster estimators of the so-called PNG-kernels (for any activation function defining the PNG). This generalizes several recent theoretical results about orthogonal estimators (e.g. orthogonal JLTs, orthogonal estimators of angular kernels and more). In the adaptive setting we propose efficient algorithms for learning products of low-dimensional rotations and show how our architectures can be used to improve the space and time complexity of state of the art reinforcement learning (RL) algorithms (e.g. PPO, TRPO). Here they offer up to 7x compression of the network in comparison to the unstructured baselines and outperform reward-wise state of the art structured neural networks offering similar computational gains and based on low displacement rank matrices.
1 Introduction
Structured transforms play an important role in many machine learning algorithms. Several recently proposed scalable kernel methods using random feature maps [Rahimi and Recht, 2007]
apply structured matrices to either reduce the time & space complexity of kernels' estimators or improve their accuracy [Choromanska et al., 2016, Choromanski et al., 2018b, Bojarski et al., 2017, Choromanski et al., 2017, Choromanski et al., 2018a, Yu et al., 2016, Choromanski and Sindhwani, 2016, Vybíral, 2011, Zhang and Cheng, 2013]. Structured matrices are also applied in some of the fastest known cross-polytope LSH algorithms [Andoni et al., 2015] and neural networks [Sindhwani et al., 2015, Choromanski et al., 2018a, Choromanski et al., 2018c]. In the latter setting they were used in particular to scale up architectures for mobile speech recognition [Sindhwani et al., 2015], predictive state recurrent neural networks [Choromanski et al., 2018a] and, more recently, to encode policy architectures for RL tasks [Choromanski et al., 2018c]. Compressed neural networks encoded by structured matrices enable practitioners to train RL policies with evolution strategy algorithms (recently becoming a serious alternative to state of the art policy gradient methods [Salimans et al., 2017, Mania et al., 2018]) on a single machine instead of clusters of thousands of machines. Time & space complexity reduction is obtained by applying structured matrices for which matrix-vector multiplication can be conducted in sub-quadratic time with the use of the Fast Fourier Transform (e.g. low displacement rank matrices from [Choromanski and Sindhwani, 2016, Sindhwani et al., 2015, Choromanski et al., 2018c]) or the Fast Walsh-Hadamard Transform (e.g. random Hadamard matrices from [Andoni et al., 2015, Choromanski et al., 2017]).
However, time & space complexity reduction together with accuracy improvements over unstructured baselines were until recently theoretically proven only for random Hadamard matrices and only for the linear kernel (see: dimensionality reduction mechanisms in [Choromanski et al., 2017]). Furthermore, it is known that the obtained accuracy gains are due to orthogonality, and similar guarantees cannot be achieved for low displacement rank matrices. Other orthogonal transforms for which accuracy improvements were proven
(angular kernel estimators and, asymptotically, estimators of certain classes of RBF-kernels) were built from random orthogonal matrices constructed via Gram-Schmidt orthogonalization [Yu et al., 2016]. These do not offer any compression of the number of parameters or computational gains.
We propose here a class of structured neural network architectures, called KAMA-NNs, where matrices of connections can be decomposed as products of low-dimensional learned or random rotations. We show that the best features of all the aforementioned structured families are encapsulated in this class. First, it provides orthogonality and we show that in the non-adaptive scenario (random neural networks) it consequently leads to asymptotically more accurate, space-efficient and faster estimators of the so-called PNG (Pointwise Nonlinear Gaussian) kernels (for any activation function defining the PNG) and RBF-kernels. This generalizes several recent theoretical results about orthogonal estimators (e.g. orthogonal JLTs, orthogonal estimators for angular and Gaussian kernels and more). Furthermore, it achieves time & space complexity gains over unstructured baselines, matching or outperforming accuracy-wise low displacement rank matrices. Finally, in the adaptive setting, where the low-dimensional rotations are learned, it improves state of the art reinforcement learning algorithms (e.g. PPO, TRPO). In the RL setting our architectures offer up to 7x compression of the network in comparison to the unstructured baselines and outperform reward-wise state of the art structured neural networks offering similar computational gains and based on low displacement rank matrices [Choromanski et al., 2018c]. In the adaptive setting, we also give an interesting geometric interpretation of our algorithms. We explain how optimizing over a sequence of low-dimensional rotations is akin to performing coordinate descent/ascent on the manifold of all rotation matrices. This sheds light on the effectiveness of the KAMA-NN mechanism also in the adaptive setting.
We highlight our main contributions below:
• In Section 2 we formally introduce KAMA-NN architectures and discuss their space and time complexity.
• In Section 3 we discuss the capacity of models based on products of low-dimensional rotations in both the adaptive and non-adaptive settings, and the connection to random matrix theory.
• In Section 4 we establish the connection between random neural networks and PNG kernels and show that random KAMA-NNs lead to asymptotically more accurate estimators of these kernels.
• In Section 5 we analyze the adaptive mechanism, where the low-dimensional rotations defining KAMA-NN architectures are trained, and provide convergence results for optimizing certain classes of blackbox functions via KAMA-NNs.
• In Section 6 we give an exhaustive empirical evaluation of KAMA-NN architectures. In the non-adaptive setting (random KAMA-NNs) we perform an empirical study of the accuracy of PNG kernels' estimators with KAMA-NNs. In the adaptive setting we apply KAMA-NNs to encode RL policies and compare them to unstructured and other structured architectures for different policy gradient algorithms and on several RL tasks.
2 KAMA-NN architectures
KAMA-NNs are feedforward neural networks with matrices of connections constructed from products of 2-dimensional learned or random rotations called Givens rotations. For fixed I, J ∈ {0, 1, ..., d−1} (I ≠ J) and Θ ∈ [0, 2π) we define a Givens rotation G^Θ_{I,J} ∈ R^{d×d} as follows:

$$G^{\Theta}_{I,J}[i,j] = \begin{cases} 1, & \text{if } i = j \text{ and } i \notin \{I,J\} \\ 0, & \text{if } i \neq j \text{ and } \{i,j\} \neq \{I,J\} \\ \cos\Theta, & \text{if } i = j \text{ and } i \in \{I,J\} \\ \sin\Theta, & \text{if } i = J,\ j = I \\ -\sin\Theta, & \text{if } i = I,\ j = J \end{cases}$$
A Givens random rotation is a Givens rotation where Θ ∼ Unif[0, 2π) and I, J are chosen uniformly at random.
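To make the definition concrete, below is a minimal NumPy sketch (our illustration only; the names givens_matrix and random_givens are ours, and the dense matrix is materialized purely for exposition, since KAMA-NNs never need to form it explicitly):

    import numpy as np

    def givens_matrix(d, I, J, theta):
        # Dense d x d Givens rotation G^Theta_{I,J}, following the piecewise definition.
        G = np.eye(d)
        G[I, I] = G[J, J] = np.cos(theta)
        G[I, J] = -np.sin(theta)   # entry at i = I, j = J
        G[J, I] = np.sin(theta)    # entry at i = J, j = I
        return G

    def random_givens(d, rng):
        # Givens random rotation: Theta ~ Unif[0, 2pi), pair (I, J) uniform without replacement.
        I, J = rng.choice(d, size=2, replace=False)
        return int(I), int(J), rng.uniform(0.0, 2.0 * np.pi)

A quick sanity check that the definition indeed yields a rotation: givens_matrix(4, 0, 2, 0.3) multiplied by its transpose recovers the identity up to floating point error.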
Each matrix M ∈ R^{d1×d2} of connections of the KAMA-NN is obtained from the product of k Givens rotations of max(d1, d2) rows and columns each (where k may differ from layer to layer) by taking its first d1 rows (d2 columns) if d1 ≤ d2 (d1 > d2) and then renormalizing these min(d1, d2) rows (columns). The renormalization is conducted by multiplying these rows (columns) by fixed min(d1, d2) scalars: s_1, ..., s_{min(d1,d2)}. Random KAMA-NNs apply matrices M using Givens random rotations chosen independently, with the scalars s_1, ..., s_{min(d1,d2)} chosen independently from a given probabilistic 1D-distribution Φ. In general these, as well as the angles Θ and indices I, J of the Givens rotations, are learned. Notice that products of independent Givens random rotations are sometimes called Kac's random walk matrices, since they correspond to the Kac's random walk Markov chain [Kac, 1954]. KAMA stands for: Kac's Asymptotic Matrix Approximators, since, as we will see in Section 3, these constructions can be used to approximate many classes of matrices.
-
Krzysztof Choromanski*, Aldo Pacchiano*, Jeffrey Pennington*,
Yunhao Tang*
Space and time complexity gains: Matrix-vector multiplication with matrices M = G_1 · ... · G_k ∈ R^{d×d}, where the G_i are Givens rotations, can be conducted in time O(k), since multiplication by each Givens rotation can be trivially done in time O(1) (it requires exactly four scalar multiplications and two additions). Furthermore, the total number of parameters needed to encode the matrix M together with the renormalization parameters s_i (see: above discussion) equals 3k + d (two indices and one angle per Givens rotation, and d renormalization scalars). We will see later (see: Section 3) that in practice k = O(d log(d)) Givens matrices suffice to encode architectures capable of learning good quality models. Thus KAMA-NNs provide faster inference than unstructured counterparts and form a class of compact yet expressive neural network architectures.
3 Capacity of KAMA-NNs
Products of low-dimensional rotations have been the subject of voluminous research, partially because of their applications in physics [Kac, 1954, Janvresse, 2001, Mischler and Mouhot, 2013]. Kac's random walk, where transitions from previous to next d-dimensional states are defined by independent Givens random rotations, was introduced in [Kac, 1954]. It was shown that products of such rotations converge (in a certain sense) to the truly random rotation, yet are much more efficient to compute. For instance, it was recently proven that Kac's random walk on the d-sphere mixes in d log(d) steps [Pillai and Smith, 2015]. This result suggests that products of a relatively small number of low-dimensional rotations may serve as a good proxy for truly random rotation matrices sampled from the distribution corresponding to the Haar measure, while providing solid theoretical guarantees at the same time (as opposed to other structured random matrices giving computational speedups, such as random Hadamard matrices, for which only vague theoretical guarantees were given so far). This suggests straightforward applications in machine learning (for instance to produce fast random feature map based estimators of RBF or PNG kernels [Rahimi and Recht, 2007]), yet surprisingly, to the best of our knowledge, mechanisms based on Givens random rotations were so far proposed only in the context of dimensionality reduction and Johnson-Lindenstrauss Transforms [Ailon and Chazelle, 2006].
Not much is known either regarding the adaptive setting, where Givens rotations are learned. In [Mathieu and LeCun, 2014] learned Givens rotations were applied to approximate Hessian matrices for certain optimization problems. It is believed though that products of a relatively small number of learned Givens rotations can accurately approximate matrices from many classes of rotations. In particular, we will focus on the following family.

Definition 1. Denote by GIV(d) the class of all Givens rotations from R^{d×d}. For a constant C > 0, let G_C be a family of matrices in R^{d×d} defined as: G_C = {G_1 · ... · G_k : G_i ∈ GIV(d), i = 1, ..., k}, where k = ⌈Cd log(d)/2⌉.
Even though these families are not dense in the set of all rotation matrices, as we will show later, in practice they can accurately approximate many rotation matrices in both the adaptive and non-adaptive settings. The renormalization scalars s_i defined by us in Section 2 can then "stretch" certain dimensions of the input vectors rotated by such matrices. This mechanism has sufficient capacity to provide estimators of all PNG kernels corresponding to random neural networks that are accurate and superior to unstructured baselines, as we will see next. We then show that it also suffices in the adaptive setting to learn good quality RL policies.
4 Random KAMA-NNs
Consider a random neural network with input and output layers of size d, a nonlinearity f applied to the neurons in the output layer, and weights taken independently at random from the Gaussian distribution N(0, 1/√d). This is a standard choice for weight initialization in feedforward neural networks. We call it an unstructured random NN. Several recent results [Pennington and Worah, 2017], [Pennington et al., 2017], [Pennington et al., 2018] focus on understanding statistical properties of unstructured random NNs and their connection to random matrix theory. Recent work [Pennington et al., 2017, Xiao et al., 2018, Chen et al., 2018] shows also that orthogonal random initialization of neural networks leads to better learning profiles, even though not much is known about this phenomenon from the theoretical point of view. We shed light on it by showing that random KAMA-NNs, as well as previously analyzed random orthogonal constructions, lead to asymptotically (as d → ∞) more accurate estimators of the so-called PNG (Pointwise Nonlinear Gaussian) kernels.

Definition 2 (PNG-kernels). The PNG kernel (shortly: PNG) defined by the mapping f is a function K : R^d × R^d → R given as follows for x, y ∈ R^d:

$$K_f(\mathbf{x},\mathbf{y}) = \mathbb{E}_{\mathbf{g}\sim\mathcal{N}(0,\mathbf{I}_d)}\left[f(\mathbf{g}^{\top}\mathbf{x})\,f(\mathbf{g}^{\top}\mathbf{y})\right]. \quad (1)$$
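For intuition, the angular case invoked throughout the paper is a classical worked instance of Equation (1) (a standard identity, stated here for completeness): for f(x) = sgn(x), rotational invariance of g gives

$$K_{\mathrm{sgn}}(\mathbf{x},\mathbf{y}) = \mathbb{E}_{\mathbf{g}\sim\mathcal{N}(0,\mathbf{I}_d)}\left[\mathrm{sgn}(\mathbf{g}^{\top}\mathbf{x})\,\mathrm{sgn}(\mathbf{g}^{\top}\mathbf{y})\right] = 1 - \frac{2\theta_{\mathbf{x},\mathbf{y}}}{\pi},$$

where θ_{x,y} is the angle between x and y: the two signs disagree exactly when the hyperplane with normal g separates x and y, which happens with probability θ_{x,y}/π, so the expectation equals (1 − θ_{x,y}/π) − θ_{x,y}/π.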
PNG kernels are important in the analysis of unstructured random NNs since such networks can be equivalently thought of as transformations that translate one of the most basic similarity measures between feature vectors, namely the linear (dot-product) kernel, into the PNG corresponding to the particular mapping f. Indeed, consider a linear kernel between activation vectors a(x) and a(y) corresponding to given input vectors x, y ∈ R^d. The following is true:

$$a(\mathbf{x})^{\top}a(\mathbf{y}) = \frac{1}{d}\sum_{i=1}^{d} f(\mathbf{g}_i^{\top}\mathbf{x})\,f(\mathbf{g}_i^{\top}\mathbf{y}), \quad (2)$$
where g_i is the ith row of the connection matrix. Therefore unstructured random NNs become unbiased Monte Carlo (MC) estimators of the values of particular PNG kernels. We denote these unbiased estimators as URN_f(x, y). Notice that URNs choose the values of their weights independently from N(0, 1). The so-called orthogonal random NN is obtained by replacing the Gaussian matrix of connections G by its orthogonal variant G_ort. The matrix G_ort is obtained from G by conducting Gram-Schmidt orthogonalization and then using scalars s_1, ..., s_d to renormalize the rows of the obtained orthonormal matrix, where the s_i are sampled independently from ‖g‖_2 for g ∼ N(0, I_d). We will denote the corresponding estimator as ORN_f(x, y). Finally, if instead we use random KAMA-NNs constructed from k blocks and with scalars s_i chosen in the same way as for orthogonal random NNs, then the corresponding estimator will be denoted as KRN^k_f(x, y).
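The three estimators are easy to compare numerically (our sketch, reusing givens_matrix and kama_matrix from the Section 2 sketches; QR decomposition with the standard sign fix stands in for Gram-Schmidt orthogonalization):

    def png_estimate(x, y, G, f):
        # Eq. (2): (1/d) * sum_i f(g_i^T x) f(g_i^T y) over the rows g_i of G.
        return np.mean(f(G @ x) * f(G @ y))

    d = 64
    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    f = np.sign                                   # angular PNG kernel

    G_urn = rng.standard_normal((d, d))           # URN: i.i.d. Gaussian weights

    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    Q *= np.sign(np.diag(R))                      # sign fix makes Q Haar-distributed
    s = np.linalg.norm(rng.standard_normal((d, d)), axis=1)   # s_i ~ ||g||_2
    G_orn = s[:, None] * Q                        # ORN: orthonormal rows, renormalized

    k = int(5 * d * np.log(d))
    rotation = kama_matrix(d, d, k, rng, phi=lambda r, n: np.ones(n))
    G_krn = s[:, None] * rotation                 # KRN^k_f: k Givens rotations, same scalars

    for name, G in [("URN", G_urn), ("ORN", G_orn), ("KRN", G_krn)]:
        print(name, png_estimate(x, y, G, f))

Note that in a real KRN the rotation product is never densified; we do so here only to keep the comparison compact.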
For ε > 0, we denote by B(ε) the ball centered at 0 and of radius ε. Our main theoretical results in this section are given below.

Theorem 1 (orthogonal random NNs for PNGs). Let f : R → R be a function and let B ⊆ R^d be a bounded region (for instance the unit sphere). Then for every constant ε > 0 there exists a constant K(ε) > 0 such that for every x, y ∈ B \ B(ε) the following holds for d large enough:

$$\mathrm{MSE}(\mathrm{ORN}_f(\mathbf{x},\mathbf{y})) \leq \mathrm{MSE}(\mathrm{URN}_f(\mathbf{x},\mathbf{y})) - \frac{K(\epsilon)}{d}, \quad (3)$$

where MSE stands for the mean squared error.

If instead of orthogonal random NNs we use KAMA-NNs, then the following is true:

Theorem 2 (KAMA-NNs for PNGs). Let f : R → R be a function and let B ⊆ R^d be a bounded region (for instance the unit sphere). Then for every constant ε > 0 there exist constants K(ε), L(ε) > 0 such that for k = L(ε)d log(d) and for every x, y ∈ B \ B(ε) the following holds for d large enough:

$$\mathrm{MSE}(\mathrm{KRN}^{k}_f(\mathbf{x},\mathbf{y})) \leq \mathrm{MSE}(\mathrm{URN}_f(\mathbf{x},\mathbf{y})) - \frac{K(\epsilon)}{d}. \quad (4)$$
The above theorems show that orthogonal random NNs as well as random KAMA-NNs provide asymptotically (as d → ∞) more accurate estimators of PNG kernels for any nonlinear function f. Previously these results were known only for orthogonal random NNs with sin/cos nonlinear mappings corresponding to RBF kernels [Choromanski et al., 2018b] and with f(x) = sgn(x) corresponding to angular PNG kernels [Choromanski et al., 2017]. Not only do random KAMA-NNs give accuracy gains, but they also lead to faster inference (O(d log(d)) versus O(d^2) time) and compression of the model (O(d log(d)) versus O(d^2) space) that orthogonal random NNs are not capable of. Our empirical results in Section 6 confirm all our theoretical findings and show also that the mean squared error guarantees translate to gains on more downstream tasks.
5 Learning KAMA-NNs
5.1 Givens rotations and manifold coordinate ascent
The space of d × d orthogonal matrices with positive determinant forms a connected manifold known as the special orthogonal group, denoted SO(d) [Gallier and Xu, 2003]. This set is a (d−1)d/2-dimensional manifold with the property that each point M ∈ SO(d) has an associated tangent space, a (d−1)d/2-dimensional vector space where the tangent directions to SO(d) live. The tangent space, denoted T_M SO(d), is defined as T_M SO(d) = {A ∈ R^{d×d} : A = MΩ, Ω = −Ω^⊤}, the set of skew-symmetric matrices premultiplied by M.
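The skew-symmetry condition follows from a one-line computation (a standard argument, included here for completeness): differentiating the orthogonality constraint along a curve M(t) ∈ SO(d) gives

$$M(t)^{\top}M(t) = I \;\Rightarrow\; \dot{M}^{\top}M + M^{\top}\dot{M} = 0 \;\Rightarrow\; \Omega := M^{\top}\dot{M} = -\Omega^{\top},$$

so every tangent vector has the form Ṁ = MΩ with Ω skew-symmetric, and counting the free entries above the diagonal of Ω recovers the dimension (d−1)d/2.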
We show that maximizing a function F : SO(d) → R by performing coordinate gradient ascent over the manifold SO(d) naturally yields a solution that equals a product of Givens rotations.
5.1.1 Ascent and descent directions in the manifold
When moving along a manifold, the right generalization of a straight line between two points is the notion of geodesic curves. For any given point M ∈ SO(d) and direction MΩ ∈ T_M SO(d), there is a single geodesic that passes through M in direction MΩ. In the case of the special orthogonal group, these curves can be written in the parametric form γ_Ω : R → SO(d) such that γ_Ω(Θ) = M exp(ΘΩ), where exp(A) for A ∈ R^{d×d} denotes the matrix exponential of A. The map exp(·) takes any skew-symmetric matrix Ω to an orthogonal matrix. As a consequence, γ_Ω(Θ) = M exp(ΘΩ) is an orthogonal matrix for all Θ ∈ R provided M ∈ SO(d). Let {A_{I,J}}_{1≤I<J≤d} be the following basis of the space of skew-symmetric matrices:
$$A_{I,J}[i,j] = \begin{cases} 1 & \text{if } i = I,\ j = J \\ -1 & \text{if } i = J,\ j = I \\ 0 & \text{otherwise} \end{cases}$$

The exponentials of scalar multiples of these basis elements equal Givens rotations: exp(−ΘA_{I,J}) = G^Θ_{I,J}. As a result, the geodesic passing through M in direction A_{I,J} equals γ_{I,J}(Θ) = M G^Θ_{J,I} [Gallier and Xu, 2003].
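This identity is easy to verify numerically (our sketch; scipy.linalg.expm computes the matrix exponential, and givens_matrix is the helper from the Section 2 sketch):

    import numpy as np
    from scipy.linalg import expm

    d, I, J, theta = 5, 1, 3, 0.7
    A = np.zeros((d, d))
    A[I, J], A[J, I] = 1.0, -1.0                # basis element A_{I,J}
    print(np.allclose(expm(-theta * A), givens_matrix(d, I, J, theta)))   # True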
Let F : SO(d) → R be a differentiable function over the manifold of special orthogonal matrices. Similarly to the Euclidean definition, the directional derivative along A_{I,J} evaluated at M ∈ SO(d) is the scalar:

$$\nabla_{I,J}F(M) := \frac{d}{d\Theta}F(\gamma_{I,J}(\Theta))\Big|_{\Theta=0} = \frac{d}{d\Theta}F(M G^{\Theta}_{J,I})\Big|_{\Theta=0}.$$
The Riemannian gradient of F at M, denoted ∇F(M), is a matrix in T_M SO(d) of the form MΩ with Ω = Σ_{1≤I<J≤d} ∇_{I,J}F(M) A_{I,J}. A coordinate ascent step along a single basis direction A_{I,J} therefore multiplies the current iterate by a Givens rotation, so the iterates are exactly products of Givens rotations.
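The resulting scheme can be sketched as follows (our illustrative rendering of the coordinate ascent just described, with a finite-difference directional derivative; the random coordinate selection and fixed step size are our simplifications):

    def givens_coordinate_ascent(F, d, steps=1000, lr=0.1, h=1e-5, seed=0):
        # Maximize F over SO(d): pick a pair (I, J), estimate the directional
        # derivative d/dTheta F(M G^Theta_{J,I}) at Theta = 0 by central
        # differences, and move along the corresponding geodesic. M stays
        # exactly orthogonal because every update is a Givens rotation.
        rng = np.random.default_rng(seed)
        M = np.eye(d)
        for _ in range(steps):
            I, J = rng.choice(d, size=2, replace=False)
            g = (F(M @ givens_matrix(d, J, I, h)) -
                 F(M @ givens_matrix(d, J, I, -h))) / (2 * h)
            M = M @ givens_matrix(d, J, I, lr * g)
        return M

As a usage sketch, F(M) = trace(M^T R) for a target rotation R is maximized at M = R, so the routine recovers R as a product of Givens rotations.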
6 Experiments

6.1 The non-adaptive setting: random KAMA-NNs

We first consider the angular PNG kernel K_f(x, y) = E_{g∼N(0,I_d)}[f(g^⊤x)f(g^⊤y)] with f(x) = sgn(x). We compared the following estimators: KRN^k_f with k = 5d log(d) Givens random rotations, corresponding to random KAMA-NNs; the baseline URN using unstructured Gaussian matrices; the estimator ORN built on matrices G_ort; as well as estimators applying random Hadamard matrices (HORN) [Yu et al., 2016]. Experiments were conducted on the following datasets: boston and wine. Results are presented in Figure 1. We see that KAMA-NNs provide moderate accuracy gains over unstructured baselines. Next we show that these moderate gains translate to more substantial accuracy gains on more downstream tasks.
Approximating kernel matrices: Here we test the relative error of kernel matrix estimation via different angular kernel estimators based on random matrices, in particular those applying random KAMA-NNs.
For a given dataset X = {x_1, ..., x_N} and an angular kernel K_ang, denote by K(X) = {K_ang(x_i, x_j)}_{i,j∈{1,...,N}} the corresponding kernel matrix and by K̂(X) its approximate version obtained by using the values proposed by a given estimator. The relative error of the kernel matrix estimation is given as

$$\epsilon = \frac{\|K(\mathcal{X}) - \hat{K}(\mathcal{X})\|_F}{\|K(\mathcal{X})\|_F},$$

where ‖·‖_F stands for the Frobenius norm. We use the following datasets: g50, boston, cpu, insurance, wine and parkinson. Following [Choromanski et al., 2017], we plot the mean error obtained from r = 1000 repetitions of each mechanism. Kernel matrices are computed on a randomly selected subset of N = 550 datapoints from each dataset.
Results are presented in Figure 2. In all plots the different orthogonal mechanisms show similar performance (almost identical curves, substantially better than for the URN mechanism), while KAMA-NNs outperform the other orthogonal transforms speed-wise.
6.2 The adaptive setting: learning RL policies
We show that KAMA-NNs can substantially reduce the number of policy parameters in reinforcement learning (RL) benchmark tasks, while still providing good performance. We choose standard unstructured fully connected feedforward neural network policy architectures and structured neural network policies with Toeplitz matrices as baselines [Choromanski et al., 2018c]. With many more parameters, fully-connected policies can represent a much larger policy space, which facilitates easier optimization and leads to better performance in practice. On the other hand, Toeplitz policies greatly compress the parameter space, but at the cost of significant degradation of policy performance. We show that the KAMA-NN policy achieves a desirable middle ground between these two extremes: drastically reducing the number of parameters compared to a fully-connected policy, while achieving better performance than the Toeplitz policy and providing the same computational speed-ups.
Figure 2: Normalized Frobenius norm error for the angular PNG kernel matrix approximation. The following estimators are compared: the baseline using independent Gaussian vectors (URN), structured using random Hadamard matrices with renormalized rows (HORN), structured using k = 5d log(d) Givens random rotations (KAMA), and structured using matrices G_ort (ORN). Experiments are run on six datasets: (a) g50, (b) boston, (c) cpu, (d) insurance, (e) wine, (f) parkinson.

Our fully-connected neural network policies consist of two hidden layers, each with h = 64 hidden units for the PPO algorithm and h = 32 hidden units for the TRPO algorithm (see below). Let x and y = σ(Wx + b) be the activations at the first and second hidden layer respectively, where W ∈ R^{h×h} is a weight matrix, b ∈ R^h is a bias vector and σ(·) is the non-linear activation function. To construct a compact policy using the KAMA mechanism, we replace the unstructured weight matrix W by a sequence of K Givens rotations. We do the same for the first and last matrix of connections, but this time also apply the truncation mechanism, as described in Section 2. When using the Toeplitz mechanism, we replace all three matrices of connections by Toeplitz matrices.
Learning low-dimensional rotations: We now introduce the way we parameterize and learn rotations for the KAMA mechanism. Upon initialization, we randomly sample the 2D linear subspaces in which the rotations defined by the Givens matrices G_i are conducted, from the set of subspaces spanned by two vectors from the canonical basis {e_1, ..., e_n}. For each G_i only the angle Θ_i of the rotation is learned. Unstructured matrices are replaced by (truncated) matrices of the form G_1 G_2 ... G_K. All rotation angles Θ_i are learned by back-propagation. For each matrix we also learn the renormalization scalars s_i (see: Section 2).
Remark 1. Note that in the above setting we do not need to explicitly store the structured matrices S = G_1 G_2 ... G_K. It suffices to keep Θ_1, ..., Θ_K to efficiently compute Sx for any input x.
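For concreteness, the parameterization just described can be rendered as a small PyTorch module (entirely our sketch under the stated setup; the paper's experiments use the OpenAI baselines implementations, not this code, and the name KamaLayer is ours):

    import math
    import torch

    class KamaLayer(torch.nn.Module):
        # Square KAMA matrix S = G_1 G_2 ... G_K with fixed random coordinate
        # pairs, learned angles Theta_i and learned renormalization scalars s_i.
        def __init__(self, d, K):
            super().__init__()
            pairs = torch.stack([torch.randperm(d)[:2] for _ in range(K)])
            self.register_buffer("pairs", pairs)             # fixed 2D subspaces
            self.theta = torch.nn.Parameter(2 * math.pi * torch.rand(K))
            self.scale = torch.nn.Parameter(torch.ones(d))   # scalars s_i

        def forward(self, x):                                # x: (batch, d)
            for k in reversed(range(self.theta.shape[0])):   # rightmost factor acts first
                i, j = self.pairs[k]
                c, s = torch.cos(self.theta[k]), torch.sin(self.theta[k])
                out = x.clone()                              # out-of-place update keeps autograd happy
                out[:, i] = c * x[:, i] - s * x[:, j]
                out[:, j] = s * x[:, i] + c * x[:, j]
                x = out
            return self.scale * x

With fixed subspaces the layer stores only K angles and d scalars, matching the remark above; storing the index pairs as well gives the 3K + d count from Section 2.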
Algorithms and Tasks: We test our policies with state of the art RL algorithms: Trust Region Policy Optimization (TRPO) [Schulman et al., 2015] and Proximal Policy Optimization (PPO) [Schulman et al., 2017]. Using these two algorithms, we compare different policy architectures. All implementations use the OpenAI baselines [Dhariwal et al., 2017]. The benchmark tests are based on MuJoCo locomotion tasks provided by OpenAI Gym [Brockman et al., 2016, Todorov et al., 2012] and Roboschool [Schulman et al., 2017]. We take the following environments: Double Pendulum, Inverted Pendulum, Swimmer, Hopper, HalfCheetah.
6.2.1 Proximal Policy Optimization (PPO)
In Figure 3, we show training results on MuJoCo benchmark tasks with Proximal Policy Optimization (PPO) [Schulman et al., 2017]. Here we use K = 200 Givens rotations to construct all three structured matrices in the policy. We train the policy on each task for a fixed number of time steps and record the cumulative rewards during training. We show the mean ± std performance across 5 random seeds. As seen from Figure 3, across most tasks KAMA-NN policies achieve better performance than Toeplitz policies.
6.2.2 Trust Region Policy Optimization (TRPO)

In Figure 4, we show training results on MuJoCo benchmarks with Trust Region Policy Optimization (TRPO) [Schulman et al., 2015]. Here we use K = 100 Givens rotations to construct all three structured matrices. As before, we train the policy on each task for a fixed number of time steps and record the cumulative rewards during training. We show the mean ± std performance across 5 random seeds. As seen from Figure 4, across most tasks KAMA-NN policies achieve significantly better performance than Toeplitz policies.
6.2.3 Parameter Compression
By replacing unstructured matrices in the fully connected architecture, structured policies can achieve significant compression of the number of parameters. In settings where unstructured models are large, structured models can offer much faster inference during training and require much less storage. In Table 1, we list the ratio of the total number of parameters used by structured policies relative to unstructured policies. The two structured policies that we compare (built from KAMA-NNs and Toeplitz networks) provide the same computational speed-ups for the inference (a similar number of floating point multiplications).

Figure 3: Illustration of KAMA-NN policies on MuJoCo benchmarks with PPO. KAMA-NNs are compared with unstructured baselines and architectures based on low displacement rank matrices (Toeplitz). For each task we train the policy with PPO for a fixed number of steps and show the mean ± std performance. The vertical axis is the cumulative reward and the horizontal axis stands for the # of time steps. Panels: (a) DoublePen., (b) InvertedPen., (c) Swimmer, (d) Hopper, (e) HalfCheetah, (f) Walker.
On benchmark tasks, KAMA-NN based policies achieve 7x compression relative to the unstructured model. Though the Toeplitz policy reduces the number of parameters even further, the significant drop in performance observed in Figure 3 and Figure 4 is not desirable. We also show in the Appendix that we can further reduce the number of parameters in the KAMA-NNs by decreasing the number of Givens rotations in the first and third structured matrix, without affecting the learned policy and, at the same time, further compressing the model.
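A back-of-the-envelope count makes the compression concrete (our arithmetic, applied only to the h × h hidden-layer matrix using the 3K + d parameter count from Section 2; the full-policy ratios in Table 1 differ somewhat because the task-dependent first and last layers are also replaced):

    def kama_params(d, K):
        # Section 2 count: two indices + one angle per rotation, plus d scalars.
        return 3 * K + d

    for algo, h, K in [("PPO", 64, 200), ("TRPO", 32, 100)]:
        dense = h * h                      # unstructured hidden-layer matrix
        print(f"{algo}: {kama_params(h, K)} vs {dense} parameters "
              f"({kama_params(h, K) / dense:.0%})")
    # PPO: 664 vs 4096 parameters (16%)
    # TRPO: 332 vs 1024 parameters (32%)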
6.2.4 Ablation Analysis
One advantage of KAMA-NN architectures over structured architectures based on low displacement rank matrices (such as Toeplitz) is that they easily allow adjusting the trade-off between the capacity of the model and its compactness by simply varying the number of rotations K. Intuitively, when K is large, the architecture becomes more expensive to use and the performance improves; when K is small, it becomes more compact at the cost of worse performance.

Figure 4: Illustration of KAMA-NN policies on MuJoCo benchmarks with TRPO. The same setup as for the PPO experiments from Figure 3. Experiments marked (R) are taken from Roboschool. Panels: (a) Swimmer, (b) Walker, (c) Hopper, (d) HalfCheetah, (e) Hopper (R), (f) HalfCheetah (R).
We carry out an ablation study of the effect of the number of rotations K. In Figure 5, we show the training curves for varying K ∈ {10, 20, 50, 100, 200} with both PPO and TRPO on a set of benchmark tasks from OpenAI Gym. We see that the policy performance improves as the number of rotations K increases.
Table 1: Ratio of the total number of parameters used in the structured matrices relative to the unstructured model. KAMA-NN architectures apply K = 200 Givens rotations for PPO and K = 100 rotations for TRPO.

PPO        HalfCheetah   Walker   Hopper
KAMA-NN    15%           15%      17%
Toeplitz   7%            7%       8%

TRPO       HalfCheetah   Walker   Hopper
KAMA-NN    24%           24%      28%
Toeplitz   12%           12%      13%
Figure 5: Ablation study of the effect of the number of rotations K. The experiments are performed with both PPO and TRPO on a set of benchmark tasks from OpenAI Gym. The vertical axis is the cumulative reward and the horizontal axis stands for the # of time steps. Panels: (a) PPO-Hopper, (b) TRPO-Hopper, (c) PPO-HalfCheetah, (d) TRPO-HalfCheetah.
6.3 Learning low-dimensional subspaces for rotations

We also conducted experiments where not only the rotation angles Θ, but also the 2-dimensional subspaces in which the rotations are conducted, were learned. We did not observe any quality gains in comparison to the proposed algorithm (where the 2-dimensional subspaces were chosen randomly), showing empirically that models with random subspaces and learned angles have sufficient capacity.
7 Conclusions

We presented a new class of compact architectures for feedforward fully connected neural networks based on low-dimensional learned or random rotations. We empirically showed their advantages over the state of the art in both the adaptive (where the parameters are learned) and the non-adaptive regime on various tasks such as PNG-kernel approximation and RL policy learning. We further provided theoretical guarantees shedding new light on the effectiveness of (random) orthogonal compact transforms in machine learning. In particular, we showed that KAMA-NNs lead to asymptotically faster and more accurate estimators of PNG-kernels related to random neural networks. Our architectures provide practitioners with an easy way of adjusting the complexity (and thus also the capacity) of the neural network model to their needs by changing the number of low-dimensional rotations used (the building blocks of KAMA-NNs).
Acknowledgements. The authors would like to acknowledge the cloud credits provided by Amazon Web Services.
References

[Abadi et al., 2016] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). TensorFlow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283.

[Ailon and Chazelle, 2006] Ailon, N. and Chazelle, B. (2006). Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In STOC.

[Andoni et al., 2015] Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I. P., and Schmidt, L. (2015). Practical and optimal LSH for angular distance. In NIPS.

[Bojarski et al., 2017] Bojarski, M., Choromanska, A., Choromanski, K., Fagan, F., Gouy-Pailler, C., Morvan, A., Sakr, N., Sarlos, T., and Atif, J. (2017). Structured adaptive and random spinners for fast machine learning computations. In AISTATS.

[Brockman et al., 2016] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.

[Chen et al., 2018] Chen, M., Pennington, J., and Schoenholz, S. S. (2018). Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. arXiv preprint arXiv:1806.05394.

[Choromanska et al., 2016] Choromanska, A., Choromanski, K., Bojarski, M., Jebara, T., Kumar, S., and LeCun, Y. (2016). Binary embeddings with structured hashed projections. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 344–353.

[Choromanski et al., 2018a] Choromanski, K., Downey, C., Boots, B., Holtmann-Rice, D., and Kumar, S. (2018a). Initialization matters: Orthogonal predictive state recurrent neural networks. In ICLR 2018.

[Choromanski et al., 2018b] Choromanski, K., Rowland, M., Sarlos, T., Sindhwani, V., Turner, R., and Weller, A. (2018b). The geometry of random features. In AISTATS 2018.

[Choromanski et al., 2018c] Choromanski, K., Rowland, M., Sindhwani, V., Turner, R. E., and Weller, A. (2018c). Structured evolution with compact architectures for scalable policy optimization. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 969–977.

[Choromanski and Sindhwani, 2016] Choromanski, K. and Sindhwani, V. (2016). Recycling randomness with structure for sublinear time kernel expansions. In ICML.

[Choromanski et al., 2017] Choromanski, K. M., Rowland, M., and Weller, A. (2017). The unreasonable effectiveness of structured random orthogonal embeddings. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 218–227.

[Dhariwal et al., 2017] Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. (2017). OpenAI baselines. https://github.com/openai/baselines.

[Gallier and Xu, 2003] Gallier, J. and Xu, D. (2003). Computing exponentials of skew-symmetric matrices and logarithms of orthogonal matrices. International Journal of Robotics and Automation, 18(1):10–20.

[Janvresse, 2001] Janvresse, E. (2001). Spectral gap for Kac's model of Boltzmann equation. The Annals of Probability, 29(1).

[Kac, 1954] Kac, M. (1954). Foundations of kinetic theory. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 3.

[Mania et al., 2018] Mania, H., Guy, A., and Recht, B. (2018). Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055.

[Mathieu and LeCun, 2014] Mathieu, M. and LeCun, Y. (2014). Fast approximation of rotations and Hessian matrices. CoRR, abs/1404.7195.

[Mischler and Mouhot, 2013] Mischler, S. and Mouhot, C. (2013). Kac's program in kinetic theory. Inventiones mathematicae, 193(1).

[Patrascu and Necoara, 2015] Patrascu, A. and Necoara, I. (2015). Efficient random coordinate descent algorithms for large-scale structured nonconvex optimization. Journal of Global Optimization, 61(1):19–46.

[Pennington et al., 2017] Pennington, J., Schoenholz, S., and Ganguli, S. (2017). Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems, pages 4785–4795.

[Pennington et al., 2018] Pennington, J., Schoenholz, S. S., and Ganguli, S. (2018). The emergence of spectral universality in deep networks. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, pages 1924–1932.

[Pennington and Worah, 2017] Pennington, J. and Worah, P. (2017). Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 2634–2643.

[Pillai and Smith, 2015] Pillai, N. and Smith, A. (2015). Kac's walk on the n-sphere mixes in n log(n) steps. arXiv preprint.

[Rahimi and Recht, 2007] Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In NIPS.

[Salimans et al., 2017] Salimans, T., Ho, J., Chen, X., and Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. CoRR, abs/1703.03864.

[Schulman et al., 2015] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.

[Schulman et al., 2017] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[Shalit and Chechik, 2014] Shalit, U. and Chechik, G. (2014). Coordinate-descent for learning orthogonal matrices through Givens rotations. In International Conference on Machine Learning, pages 548–556.

[Sindhwani et al., 2015] Sindhwani, V., Sainath, T. N., and Kumar, S. (2015). Structured transforms for small-footprint deep learning. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3088–3096.

[Todorov et al., 2012] Todorov, E., Erez, T., and Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE.

[Vybíral, 2011] Vybíral, J. (2011). A variant of the Johnson-Lindenstrauss lemma for circulant matrices. Journal of Functional Analysis, 260(4):1096–1105.

[Xiao et al., 2018] Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S. S., and Pennington, J. (2018). Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. arXiv preprint arXiv:1806.05393.

[Yu et al., 2016] Yu, F., Suresh, A., Choromanski, K., Holtmann-Rice, D., and Kumar, S. (2016). Orthogonal random features. In NIPS, pages 1975–1983.

[Zhang and Cheng, 2013] Zhang, H. and Cheng, L. (2013). New bounds for circulant Johnson-Lindenstrauss embeddings. CoRR, abs/1308.6339.