Herding Dynamical Weights
Max Welling
Bren School of Information and Computer Science, UC Irvine
Jan 01, 2016
Motivation
• $X_i = 1$ means that pin i will fall during a bowling round; $X_i = 0$ means that pin i will still stand.
• You are given pairwise probabilities $P(X_i, X_j)$.
• Task: predict the distribution $Q(n)$, $n = 0, \dots, 10$, of the total number of pins that will fall.
Stock market: $X_i = 1$ means that company i defaults. You are interested in the probability of n companies defaulting in your portfolio.
Sneak Preview
Newsgroups-small (collected by S. Roweis): 100 binary features, 16,242 instances (300 shown).
(Note: herding is a deterministic algorithm; no noise was added.)
Herding is a deterministic dynamical system that turns "moments" (average feature statistics) into "samples" which share the same moments.
Quiz: which is which [top/bottom]?
- data in random order
- herding sequence in order received
Traditional Approach: Hopfield Nets & Boltzmann Machines

$s_i$ is a state value (say 0/1), $w_{ij}$ is a weight.

Energy:
$$E(s, w) = -\sum_{ij} w_{ij} s_i s_j$$

Probability of a joint state:
$$P(s) = \frac{1}{Z(w)} \exp\Big( \sum_{ij} w_{ij} s_i s_j \Big)$$

Coordinate descent on energy:
$$S_i = \mathbb{I}\Big( \sum_j W_{ij} S_j > 0 \Big)$$
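As a concrete illustration of the coordinate-descent rule above, here is a minimal NumPy sketch (the function name and the zero-diagonal assumption on W are mine, not from the slides):

```python
import numpy as np

def hopfield_descend(W, S, sweeps=10):
    """Coordinate descent on E(s, w) = -sum_ij W_ij s_i s_j.
    W: symmetric weight matrix with zero diagonal; S: 0/1 state vector."""
    for _ in range(sweeps):
        for i in range(len(S)):
            # setting S_i = 1 lowers the energy iff sum_j W_ij S_j > 0
            S[i] = 1.0 if W[i] @ S > 0 else 0.0
    return S
```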
Traditional Learning Approach

$$P(X) \propto \exp\Big( \sum_{ij} W_{ij} X_i X_j + \sum_i \alpha_i X_i \Big)$$

Learn by gradient ascent on the log-likelihood:
$$W_{ij} \leftarrow W_{ij} + \eta\,\big( \langle X_i X_j \rangle_{data} - \langle X_i X_j \rangle_P \big)$$
$$\alpha_i \leftarrow \alpha_i + \eta\,\big( \langle X_i \rangle_{data} - \langle X_i \rangle_P \big)$$

Then estimate
$$Q(n) \approx \Big\langle \mathbb{I}\Big[ \sum_i S_i = n \Big] \Big\rangle_P, \quad n = 0, \dots, 10$$

Use CD instead!
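The "use CD instead" suggestion can be sketched as a CD-1 update for a fully visible Boltzmann machine: the intractable model average is replaced by statistics after one Gibbs sweep started at the data. This is my reading of the slide, with hypothetical names and a symmetric, zero-diagonal W:

```python
import numpy as np

def cd1_update(W, alpha, X, eta=0.01, rng=np.random.default_rng(0)):
    """One CD-1 step for P(X) proportional to exp(0.5 X^T W X + alpha^T X).
    X: N x d batch of 0/1 data vectors."""
    S = X.astype(float).copy()
    for i in range(S.shape[1]):                     # one Gibbs sweep over coordinates
        p = 1.0 / (1.0 + np.exp(-(alpha[i] + S @ W[i])))
        S[:, i] = (rng.random(len(S)) < p).astype(float)
    n = len(X)
    W += eta * (X.T @ X - S.T @ S) / n              # <XiXj>_data - <XiXj>_CD
    np.fill_diagonal(W, 0.0)
    alpha += eta * (X.mean(0) - S.mean(0))          # <Xi>_data - <Xi>_CD
    return W, alpha
```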
What's Wrong With This?
• $E[X_i]$ and $E[X_i X_j]$ are intractable to compute (and you need them at every iteration of gradient descent).
• Slow convergence & local minima (only w/ hidden variables).
• Sampling can get stuck in local modes (slow mixing).
Solution in a Nutshell

$$\langle X_i \rangle_{data},\ \langle X_i X_j \rangle_{data} \;\longrightarrow\; \text{Nonlinear Dynamical System} \;\longrightarrow\; \langle S_i \rangle = \langle X_i \rangle_{data},\ \langle S_i S_j \rangle = \langle X_i X_j \rangle_{data}$$

$$Q(n) \approx \Big\langle \mathbb{I}\Big[ \sum_i S_i = n \Big] \Big\rangle_S, \quad n = 0, \dots, 10$$

(sidestep learning + sampling)
Herding Dynamics

$$S_i = \mathbb{I}\Big( \sum_j W_{ij} S_j > 0 \Big)$$
$$W_{ij} \leftarrow W_{ij} + \langle X_i X_j \rangle_{data} - S_i S_j$$
$$\alpha_i \leftarrow \alpha_i + \langle X_i \rangle_{data} - S_i$$

• no stepsize
• no random numbers
• no exponentiation
• no point estimates

[figure: network of states $S_i$, $S_j$ coupled by weights $W_{ij}$]
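Putting the three update equations together, here is a minimal sketch of the herding loop for the fully visible pairwise model (function and variable names are mine; local maximization is done by a few coordinate-ascent sweeps):

```python
import numpy as np

def herd_pairwise(m1, m2, T=1000, sweeps=10):
    """m1[i] = <X_i>_data, m2[i,j] = <X_i X_j>_data (zero diagonal).
    Returns T herding pseudo-samples; no stepsize, no random numbers."""
    d = len(m1)
    alpha, W = m1.copy(), m2.copy()
    S = (m1 > 0.5).astype(float)
    samples = np.empty((T, d))
    for t in range(T):
        for _ in range(sweeps):        # S_i = I(alpha_i + sum_j W_ij S_j > 0)
            for i in range(d):
                S[i] = 1.0 if alpha[i] + W[i] @ S > 0 else 0.0
        samples[t] = S
        W += m2 - np.outer(S, S)       # W_ij += <XiXj>_data - S_i S_j
        np.fill_diagonal(W, 0.0)
        alpha += m1 - S                # alpha_i += <Xi>_data - S_i
    return samples
```

The moment-matching property can be checked by comparing `samples.mean(0)` against `m1` as T grows.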
Piston Analogy (weights = pistons)
Pistons move up at a constant rate (proportional to observed correlations).
When they get too high, the "fuel" combusts and the piston is pushed down (depression).
"Engine driven by observed correlations"
Herding Dynamics with General Features

$$S = \arg\max_S \sum_k w_k f_k(S)$$
$$w_k \leftarrow w_k + \langle f_k(X) \rangle_{data} - f_k(S)$$

• no stepsize
• no random numbers
• no exponentiation
• no point estimates
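For general features over a small, enumerable state space, the same loop reads as follows (a sketch under the assumption that the argmax can be done by brute force; all names are mine):

```python
import numpy as np

def herd_general(F, f_bar, T=1000):
    """F: |states| x K matrix whose rows are feature vectors f(s);
    f_bar: data feature averages <f_k(X)>_data.
    Deterministic herding: s* = argmax_s w.f(s); w += f_bar - f(s*)."""
    w = f_bar.copy()
    visited = []
    for t in range(T):
        s = int(np.argmax(F @ w))      # full maximization over states
        visited.append(s)
        w += f_bar - F[s]              # weight update, stepsize 1
    return visited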
Features as New Coordinates

[figure: weight trajectory $w_1, w_2, \dots, w_t, w_{t+1}$ plotted in the coordinates $f(S_1), \dots, f(S_5)$]

If $\langle f(X) \rangle_{data} \neq \frac{1}{N} \sum_{b=1}^{B} n_b f_b$ for all integer counts $(n_1, \dots, n_B)$, then the period is infinite.

(thanks to Romain Thibaux)
Example: $X \in [1:10]$, $f_1(X) = X$, $f_2(X) = \sin(X)$

weights initialized in a grid
red ball tracks 1 weight
convergence on a fractal attractor set with Hausdorff dimension 1.5
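This example can be reproduced in a few lines; here is a sketch assuming the reconstructed features $f_1(X) = X$, $f_2(X) = \sin(X)$ on $X \in \{1, \dots, 10\}$ with uniform data moments (the slide's exact setup may differ):

```python
import numpy as np

X = np.arange(1, 11)
F = np.stack([X, np.sin(X)], axis=1)     # rows are f(x) = (x, sin x)
f_bar = F.mean(axis=0)                   # moments of uniform "data"

# grid of initial weight vectors, one herding trajectory per row
g = np.linspace(-1.0, 1.0, 20)
W = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
for t in range(500):
    s = np.argmax(W @ F.T, axis=1)       # argmax state per weight vector
    W = W + f_bar - F[s]                 # herding update, stepsize 1
# after a transient, all rows of W lie on the fractal attractor set
```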
The Tipi Function

Herding is gradient descent on $G(w)$ with stepsize 1:
$$G(w) = \sum_k w_k \langle f_k \rangle_{data} - \max_S \sum_k w_k f_k(S)$$

This function is:
• Concave
• Piecewise linear
• Non-positive
• Scale free

$$S = \arg\max_S \sum_k w_k f_k(S)$$
$$w_k \leftarrow w_k + \langle f_k \rangle_{data} - f_k(S)$$

Coordinate ascent replaced with full maximization.
The scale-free property implies that the stepsize will not affect the state sequence S.
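The scale-free claim follows in one line: for any $\beta > 0$,
$$G(\beta w) = \sum_k \beta w_k \langle f_k \rangle_{data} - \max_S \sum_k \beta w_k f_k(S) = \beta\, G(w),$$
so $\arg\max_S \sum_k w_k f_k(S)$ is invariant under rescaling of $w$; running herding with stepsize $\eta$ from $w^0$ therefore visits exactly the same state sequence as running it with stepsize 1 from $w^0 / \eta$.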
Recurrence
Thm: If we can find the optimal state S, then the weights will stay within a compact region.
Empirical evidence: coordinate ascent is sufficient to guarantee recurrence.
Ergodicity

[figure: herding trajectory visiting states s = 1, ..., 6; sample sequence s = [1, 1, 2, 5, 2, ...]]

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} f_k(s_t) = \langle f_k \rangle_{data}$$

Thm: If the 2-norm of the weights grows more slowly than linearly, then feature averages over trajectories converge to data averages.
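The theorem is immediate from telescoping the weight update:
$$w_k^T = w_k^0 + \sum_{t=1}^{T} \big( \langle f_k \rangle_{data} - f_k(s_t) \big) \;\Longrightarrow\; \frac{1}{T} \sum_{t=1}^{T} f_k(s_t) = \langle f_k \rangle_{data} + \frac{w_k^0 - w_k^T}{T},$$
so the error term vanishes at rate $O(1/T)$ whenever $\|w^T\|_2$ grows more slowly than linearly in $T$.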
Relation to Maximum Entropy

Maximize $H[P]$ subject to: $\langle f \rangle_P = \langle f \rangle_{data}$

Dual:
$$\text{Maximize } L(W) = \sum_k W_k \langle f_k \rangle_{data} - \log \sum_x \exp\Big\{ \sum_k W_k f_k(x) \Big\}$$

Tipi function:
$$G(W) = \lim_{T \to 0} T\, L(W/T)$$

Herding dynamics satisfies the constraints but not maximal entropy.
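The connection to the Tipi function is the zero-temperature limit of the dual:
$$T\, L(W/T) = \sum_k W_k \langle f_k \rangle_{data} - T \log \sum_x \exp\Big\{ \tfrac{1}{T} \sum_k W_k f_k(x) \Big\} \;\xrightarrow{T \to 0}\; \sum_k W_k \langle f_k \rangle_{data} - \max_x \sum_k W_k f_k(x) = G(W),$$
since the log-sum-exp converges to the max as the temperature goes to zero.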
Advantages / Disadvantages
• Learning & inference have merged into one dynamical system.
• Fully tractable – although one should monitor whether local maximization is enough to keep the weights finite.
• Very fast: no exponentiation, no random number generation.
• No fudge factors (learning rates, momentum, weight decay, ...).
• Very efficient mixing over all "modes" (attractor set).

• Moments are preserved, but what is our "inductive bias"? (i.e., what happens to the remaining degrees of freedom?)
Back to Bowling
Data collected by P. Cotton. 10 pins, 298 bowling runs.
X = 1 means a pin has fallen in two subsequent bowls.
H.XX uses all pairwise probabilities; H.XXX uses all triplet probabilities.

[figure: P(total nr. of pins falling)]
More Results
Datasets:
- Bowling (n=298, d=10, k=2, Ntrain=150, Ntest=148)
- Abalone (n=4177, d=8, k=2, Ntrain=2000, Ntest=2177)
- Newsgroup-small (n=16,242, d=100, k=2, Ntrain=10,000, Ntest=6242)
- 8x8 Digits (n=2200 [3's and 5's], d=64, k=2, Ntrain=1600, Ntest=600)

Task: given only pairwise probabilities, compute the probability Q(n) of the total nr. of 1's in a data vector.
Solution: apply herding and compute Q(n) through sample averages.
Error: $KL[P_{data} \| P_{est}]$
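Estimating Q(n) from the herding sequence is then a one-liner over the pseudo-samples; a sketch (names are mine, `samples` as returned by the pairwise herding loop above):

```python
import numpy as np

def q_of_n(samples, d=10):
    """Q(n) = fraction of herding samples with exactly n ones, n = 0..d."""
    totals = samples.sum(axis=1).astype(int)
    return np.bincount(totals, minlength=d + 1) / len(samples)
```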
Task: given only pairwise probabilities, compute the classifier P(Y|X).
Solution: train a logistic regression (LR) classifier on the herding sequence.
Error: fraction of misclassified test cases.

LR is too simple; PL (pseudo-likelihood) on the herding sequence also gives 0.04. In higher dimensions herding loses its advantage in accuracy.
Conclusions
• Herding replaces point estimates with trajectories over attractor sets (which are not the Bayesian posterior) in a tractable manner.
• Model for "neural computation":
  – similar to dynamical synapses
  – quasi-random sampling of the state space (chaotic?)
  – local updates
  – efficient (no random numbers, no exponentiation)