Efficient Automatic Differentiation of Matrix Functions

Peder A. Olsen, Steven J. Rennie, and Vaibhava Goel
IBM T.J. Watson Research Center

Automatic Differentiation 2012, Fort Collins, CO, July 25, 2012

Outline: Introduction · The Kronecker and Box Product · Matrix Differentiation · Optimization
Introduction

Numerical: Methods that do numeric differentiation by formulas like the finite difference

    f′(x) ≈ (f(x + h) − f(x)) / h.

These methods are simple to program, but lose half of all significant digits.

Symbolic: Symbolic differentiation gives the symbolic representation of the derivative without saying how best to implement it. In general, symbolic derivatives are as accurate as they come.

Automatic: Something between numeric and symbolic differentiation. The derivative implementation is computed from the code for f(x).
By using dual numbers a + bE, where E^2 = 0, we can further reduce the error. For second-order derivatives we can use hyper-dual numbers (J. Fike, 2012).
Complex number implementations come with most programming languages. For dual or hyper-dual numbers we need to download or implement code.
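A minimal operator-overloading sketch of dual-number forward AD (the class and function names are ours; only + and * are overloaded):

```python
class Dual:
    """Dual number a + b*E with E**2 = 0; b carries the derivative."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)
    __radd__ = __add__

    def __mul__(self, other):
        # (a1 + b1 E)(a2 + b2 E) = a1 a2 + (a1 b2 + b1 a2) E, since E^2 = 0
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)
    __rmul__ = __mul__

def f(x):
    return x * x * x + 2 * x       # f'(x) = 3x^2 + 2

y = f(Dual(1.5, 1.0))              # seed dx/dx = 1
print(y.a, y.b)                    # 6.375 8.75 -- exact, no truncation error
```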
"Algorithmic Differentiation (often referred to as Automatic Differentiation or just AD) uses the software representation of a function to obtain an efficient method for calculating its derivatives. These derivatives can be of arbitrary order and are analytic in nature (do not have any truncation error)." — B. Bell, author of CppAD
Use of dual or complex numbers is a form of automatic differentiation. More common, though, are operator-overloading implementations such as CppAD.
For the product f(y) = (y − t_1)(y − t_2)···(y − t_n), define the forward and reverse partial products f_i = (y − t_i) f_{i−1}, f_1 = y − t_1, and r_i = (y − t_i) r_{i+1}, r_n = y − t_n. Note that f(y) = f_n = r_1.
The cost of computing f(y) is n subtractions and n − 1 multiplies. At the overhead of storing f_1, …, f_n, the derivative f′(y) = Σ_i f_{i−1} r_{i+1} (with f_0 = r_{n+1} = 1) can be computed with an additional n − 1 additions and 2n − 1 multiplications.
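A minimal sketch of this scheme (function name ours): the forward pass stores the f_i, and the reverse pass accumulates f′(y) = Σ_i f_{i−1} r_{i+1} on the fly:

```python
def prod_and_grad(y, t):
    """f(y) = (y - t_1)...(y - t_n); returns (f(y), f'(y)) via forward/reverse partial products."""
    n = len(t)
    f = [1.0] * (n + 1)                  # f[i] = (y - t_1)...(y - t_i), f[0] = 1
    for i in range(1, n + 1):
        f[i] = (y - t[i - 1]) * f[i - 1]
    g, r = 0.0, 1.0                      # r accumulates the reverse products r_{i+1}
    for i in range(n, 0, -1):
        g += f[i - 1] * r                # term f_{i-1} r_{i+1} of f'(y)
        r *= (y - t[i - 1])
    return f[n], g

assert prod_and_grad(0.0, [1.0, 2.0]) == (2.0, -3.0)   # f = (y-1)(y-2), f'(0) = -3
```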
Under quite realistic assumptions the evaluation of agradient requires never more than five times the effort ofevaluating the underlying function by itself.
Reverse mode differentiation (1971) essentially applies the chain rule to a function in the reverse of the order in which the function is computed.
This is the same technique as applied in the backpropagationalgorithm (1969) for neural networks and the forward-backwardalgorithm for training HMMs (1970).
Reverse mode differentiation applies to very general classes offunctions.
Reverse mode differentiation was independently invented by
G. M. Ostrovskii (1971)
S. Linnainmaa (1976)
John Larson and Ahmed Sameh (1978)
Bert Speelpenning (1980)
Not surprisingly, the fame went to B. Speelpenning, who told the full story himself at AD 2012.
A. Sameh was on B. Speelpenning’s thesis committee.
The Speelpenning function was suggested as a motivation for Speelpenning's thesis by his Canadian interim advisor, Prof. Art Sedgwick, who passed away on July 5th.
The Kronecker and Box Product
Identity Box Products
The identity box product I_m ⊠ I_n is a permutation matrix with interesting properties. Let A ∈ R^{m1×n1}, B ∈ R^{m2×n2}.

Orthonormal: (I_m ⊠ I_n)^T (I_m ⊠ I_n) = I_{mn}

Transposition: (I_{m1} ⊠ I_{n1}) vec(A) = vec(A^T)

Connector: Converting a Kronecker product to a box product: (A ⊗ B)(I_{n1} ⊠ I_{n2}) = A ⊠ B. Converting a box product to a Kronecker product: (A ⊠ B)(I_{n2} ⊠ I_{n1}) = A ⊗ B.
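These properties are easy to check numerically; here is a minimal NumPy sketch assuming the column-stacking vec (the helper name box_identity is ours):

```python
import numpy as np

def box_identity(m, n):
    """I_m box I_n: the mn x mn permutation with (I_m box I_n) vec(A) = vec(A^T)."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            K[i * n + j, j * m + i] = 1.0   # routes A[i,j] to position (j,i) of A^T
    return K

m, n = 3, 4
A = np.random.randn(m, n)
K = box_identity(m, n)
vec = lambda M: M.flatten(order="F")            # column-stacking vec
assert np.allclose(K @ vec(A), vec(A.T))        # transposition property
assert np.allclose(K.T @ K, np.eye(m * n))      # orthonormality
```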
Box products are old things in a new wrapping

Although the notation for box products is new, I_m ⊠ I_n has long been known as T_{m,n} to physicists, or as the stride permutation L^{mn}_m to others. These objects are identical, but the box product allows us to express more complex identities more compactly; e.g. for A ∈ R^{m1×n1}, B ∈ R^{m2×n2}, the initial permutation matrix I_{m1} ⊠ I_{n1m2} ⊠ I_{n2} would have to be written

    I_{m1} ⊠ I_{n1m2} ⊠ I_{n2} = (I_{m1} ⊗ T_{n1m2,m1}) T_{m1,n1m1m2},

if the notation of box products weren't being used.
The Fast Fourier Transform

Direct matrix products are important because such matrices can be multiplied fast with each other and with vectors.
Let us show how the FFT can be done in terms of Kronecker and box products. Recall that the DFT matrix of order n is given by F_n = [e^{−2πikl/n}]_{0≤k,l<n} = [ω_n^{kl}]_{0≤k,l<n}, and therefore

    F_2 = [ 1   1 ]
          [ 1  −1 ]

    F_4 = [ 1   1   1   1 ]
          [ 1  −i  −1   i ]
          [ 1  −1   1  −1 ]
          [ 1   i  −1  −i ]
We can factor the matrix F_4 as follows:

    F_4 = [ 1  0  1  0 ] [ 1  0  0  0 ] [ 1  0  1  0 ]
          [ 0  1  0  1 ] [ 0  1  0  0 ] [ 1  0 −1  0 ]
          [ 1  0 −1  0 ] [ 0  0  1  0 ] [ 0  1  0  1 ]
          [ 0  1  0 −1 ] [ 0  0  0 −i ] [ 0  1  0 −1 ]

or more compactly

    F_4 = (F_2 ⊗ I_2) diag(vec([ 1  1 ; 1 −i ])) (I_2 ⊠ F_2).
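A quick numerical check of the compact form (assuming the column-stacking vec; I_2 ⊠ F_2 is built from the connector identity of the previous section):

```python
import numpy as np

F2 = np.array([[1, 1], [1, -1]], dtype=complex)
F4 = np.exp(-2j * np.pi * np.outer(np.arange(4), np.arange(4)) / 4)

D = np.diag([1, 1, 1, -1j])                # diag(vec([1 1; 1 -i])), columns stacked
K22 = np.eye(4)[[0, 2, 1, 3]]              # I_2 box I_2, the 4x4 commutation matrix
I2_box_F2 = np.kron(np.eye(2), F2) @ K22   # connector: (I_2 kron F_2)(I_2 box I_2)
assert np.allclose(np.kron(F2, np.eye(2)) @ D @ I2_box_F2, F4)
```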
In general, if we define the matrix

    V_{N,M}(α) = [ 1    1          1           …  1                  ]
                 [ 1    α          α^2         …  α^{M−1}            ]
                 [ 1    α^2        α^4         …  α^{2(M−1)}         ]
                 [ ⋮    ⋮          ⋮           ⋱  ⋮                  ]
                 [ 1    α^{N−1}    α^{2(N−1)}  …  α^{(N−1)(M−1)}     ],

then for N = km we have the following factorizations of the DFT matrix:

    F_N = (F_k ⊗ I_m) diag(vec(V_{m,k}(ω_N))) (I_k ⊠ F_m)
    F_N = (F_m ⊠ I_k) diag(vec(V_{m,k}(ω_N))) (F_k ⊗ I_m)
This allows FFT_N(x) = y = F_N x to be computed as

    F_N x = vec((V_{m,k}(ω_N) ∘ (F_m X^T)) F_k^T),

where x = vec(X) and ∘ denotes elementwise multiplication (.* in Matlab). The direct computation has a cost of O(k^2 m^2), whereas the formula above does the job in O(km(k + m)) operations. Repeated use of the identity for N = 2^n leads to the Cooley-Tukey FFT algorithm. (James Cooley was at T. J. Watson from 1962 to 1991.)
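A minimal NumPy sketch of one such split step (the function name fft_split is ours; we take X ∈ C^{k×m} with x = vec(X) column-stacked), checked against np.fft.fft, which uses the same e^{−2πikl/N} sign convention:

```python
import numpy as np

def fft_split(x, k, m):
    """One Cooley-Tukey split: F_N x = vec((V_{m,k}(w_N) o (F_m X^T)) F_k^T), N = k m."""
    N = k * m
    Fm = np.exp(-2j * np.pi * np.outer(np.arange(m), np.arange(m)) / m)
    Fk = np.exp(-2j * np.pi * np.outer(np.arange(k), np.arange(k)) / k)
    V = np.exp(-2j * np.pi / N) ** np.outer(np.arange(m), np.arange(k))  # twiddle factors
    X = x.reshape((k, m), order="F")   # x = vec(X), column-stacked
    return ((V * (Fm @ X.T)) @ Fk.T).flatten(order="F")

x = np.random.randn(12) + 1j * np.random.randn(12)
assert np.allclose(fft_split(x, 3, 4), np.fft.fft(x))
```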
The fastest FFT library in the world as of 2012 (Spiral) uses knowledge of several such factorizations to automatically optimize the FFT implementation for an arbitrary value of n on a given platform!
Matrix Differentiation

Here we show the role of the Kronecker and box products in the matrix–matrix differentiation framework.
Forward and Reverse Mode Differentiation

As an example, take f(X) = trace(F(X)) computed through the graph

    T1 = X + I,  T2 = T1^{-1},  T3 = X^T,  T4 = T2 T3,  T5 = X T4,  f = trace(T5),

so that F(X) = X (X + I)^{-1} X^T. The forward (Jacobian) rules for the nodes are

    T5′ = (I ⊗ X^T) T4′ + T4 ⊗ I
    T4′ = (I ⊗ T3^T) T2′ + (T2 ⊗ I) T3′
    T2′ = −(T2 ⊗ T2^T) T1′
    T3′ = I ⊠ I
    T1′ = 0 + I ⊗ I

Reverse mode starts from R0 = I at the root and pulls vec^T(R) back through one Kronecker or box factor at a time; each pull-back is an ordinary matrix product:

    vec^T(R0)(I ⊗ X^T)     = vec^T(X R0 I)       →  R1 = X R0
    vec^T(R0)(T4 ⊗ I)      = vec^T(I R0 T4)      →  R2 = R0 T4
    vec^T(R1)(I ⊗ T3^T)    = vec^T(T3 R1 I)      →  R3 = T3 R1
    vec^T(R1)(T2 ⊗ I)      = vec^T(I R1 T2)      →  R4 = R1 T2
    vec^T(R3)(−T2 ⊗ T2^T)  = vec^T(−T2 R3 T2)    →  R5 = −T2 R3 T2
    vec^T(R4)(I ⊠ I)       = vec^T(I R4^T I)     →  R6 = R4^T
    vec^T(R5) 0            = vec^T(0)            →  R7 = 0   (the constant I in T1)
    vec^T(R5)(I ⊗ I)       = vec^T(R5)           →  R8 = R5

Summing the adjoints of the three paths that reach X gives

    f′(X) = R2^T + R6^T + R8^T.
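As a sanity check, here is a minimal NumPy sketch of this reverse sweep (not from the slides; the function names are ours), verified against a central finite difference:

```python
import numpy as np

def f(X):
    I = np.eye(len(X))
    return np.trace(X @ np.linalg.inv(I + X) @ X.T)

def grad_f(X):
    I = np.eye(len(X))
    T2 = np.linalg.inv(X + I)      # T1 = X + I
    T3 = X.T
    T4 = T2 @ T3
    R0 = I                         # adjoint at the root, trace(T5)
    R1 = X @ R0                    # pull back through I kron X^T
    R2 = R0 @ T4                   # pull back through T4 kron I   (path into X)
    R3 = T3 @ R1                   # pull back through I kron T3^T
    R4 = R1 @ T2                   # pull back through T2 kron I
    R5 = -T2 @ R3 @ T2             # pull back through the inverse node
    R6 = R4.T                      # pull back through I box I (transpose, path into X)
    R8 = R5                        # pull back through T1 = X + I   (path into X)
    return R2.T + R6.T + R8.T

np.random.seed(0)
X = np.random.randn(4, 4)
E = np.zeros((4, 4)); E[1, 2] = 1e-6
fd = (f(X + E) - f(X - E)) / 2e-6
assert np.isclose(grad_f(X)[1, 2], fd)
```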
Optimization

Scalar–matrix objective functions are attracting growing interest in machine learning:
Probabilistic Graphical Models
Covariance Selection
Optimization in Graphs and Networks
Data-mining in social networks
Can we use this theory to help optimize such functions? We are onthe look-out for interesting problems.
The Anatomy of a Matrix-Matrix Function
Theorem
Let R(X) be a rational matrix–matrix function formed from constant matrices and K occurrences of X using arithmetic matrix operations (+, −, ∗ and (·)^{-1}) and transposition ((·)^T). Then the derivative of the matrix–matrix function is of the form

    Σ_{i=1}^{k1} A_i ⊗ B_i  +  Σ_{i=k1+1}^{K} A_i ⊠ B_i.

The matrices A_i and B_i are computed as parts of R(X).
Hessian Forms
The derivative of a function f of the form trace(R(X)) or log det(R(X)) is a rational function. Therefore, the Hessian is of the form

    Σ_{i=1}^{k1} A_i ⊗ B_i  +  Σ_{i=k1+1}^{K} A_i ⊠ B_i,

where K is the number of times X occurs in the expression for the derivative. If the A_i, B_i are d × d matrices, then multiplication by H can be done with O(Kd^3) operations. Generalizations of this result can be found in the paper.
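Under the column-stacking vec used here, (A ⊗ B)vec(V) = vec(B V A^T), and by the connector identity (A ⊠ B)vec(V) = vec(B V^T A^T); multiplication by H therefore reduces to 2K matrix products. A minimal sketch (function name ours):

```python
import numpy as np

def hess_vec(kron_terms, box_terms, V):
    """Return W with vec(W) = H vec(V) for H = sum A_i kron B_i + sum A_j box B_j.
    Each term costs O(d^3); the d^2 x d^2 matrix H is never formed."""
    W = np.zeros_like(V)
    for A, B in kron_terms:
        W += B @ V @ A.T          # (A kron B) vec(V) = vec(B V A^T)
    for A, B in box_terms:
        W += B @ V.T @ A.T        # (A box B) vec(V) = vec(B V^T A^T)
    return W

# check one Kronecker term against the explicit d^2 x d^2 matrix
d = 3
A, B, V = np.random.randn(3, d, d)
vec = lambda M: M.flatten(order="F")
assert np.allclose(np.kron(A, B) @ vec(V), vec(hess_vec([(A, B)], [], V)))
```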
Newton's Method

Let f : R^{d×d} → R with f(X) = trace(R(X)) or f(X) = log det(R(X)).

Derivative: G = f′(X) ∈ R^{d×d}; Hessian: H = f′′(X) ∈ R^{d^2×d^2}.

The Newton direction vec(V) = H^{-1} vec(G) can be computed efficiently if K ≪ d:

    K     Algorithm                          Complexity      Storage
    1     (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}     O(d^3)          O(Kd^2)
    2     Bartels-Stewart algorithm          O(d^3)          O(Kd^2)
    ≥ 3   Conjugate gradient                 O(Kd^5)         O(Kd^2)
    ≥ d   General matrix inversion           O(d^6 + Kd^4)   O(d^4)
The (generalized) Bartels-Stewart algorithm solves the Sylvester-like equation AX + X^T B = C.
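For the K = 1 row: H = A ⊗ B gives vec(V) = (A^{-1} ⊗ B^{-1})vec(G), i.e. V = B^{-1} G A^{-T}. A minimal sketch of just this case (function name ours):

```python
import numpy as np

def newton_dir_k1(A, B, G):
    """Solve (A kron B) vec(V) = vec(G) in O(d^3): V = B^{-1} G A^{-T}."""
    return np.linalg.solve(B, np.linalg.solve(A, G.T).T)

d = 4
A, B, G = np.random.randn(3, d, d)
V = newton_dir_k1(A, B, G)
vec = lambda M: M.flatten(order="F")
assert np.allclose(np.kron(A, B) @ vec(V), vec(G))
```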
What about optimizing functions of the form

    f(X) + ‖vec(X)‖_1 ?
Common approaches:

    Strategy       Method               Use Hessian structure?
    Newton-Lasso   Coordinate descent   ✓
    Newton-Lasso   FISTA                ✓
    Orthantwise    ℓ1 CG                ✓
    Orthantwise    L-BFGS               ✗
It is not obvious how to take advantage of the Hessian structure in all of these methods. In orthantwise CG, for example, the sub-problem requires the Newton direction for a sub-matrix of the Hessian.
Future Work

1. We have applied this methodology to the covariance selection problem; a paper is forthcoming.
2. We are digging deeper into the theory of matrix differentiation and the properties of box products.
3. We are on the look-out for more interesting matrix optimization problems. All suggestions are appreciated!