Exponential Families in Feature Space

Alexander J. Smola
[email protected]
National ICT Australia, Machine Learning Program
RSISE, The Australian National University

Thanks to Yasemin Altun, Stephane Canu, Thomas Hofmann, and Vishy Vishwanathan
Solving the Problem
Expand θ as a linear combination of the φ(xi, yi).
Solve a convex problem in the expansion coefficients.
General Strategy
Choose a suitable sufficient statistic φ(x, y)
A conditionally multinomial distribution leads to the Gaussian Process multiclass estimator: we have a distribution over n classes which depends on x.
A conditionally Gaussian distribution leads to Gaussian Process regression: we have a normal distribution over a random variable which depends on the location. Note: we estimate mean and variance.
A conditionally Poisson distribution yields a spatial Poisson model.
Solve the optimization problem; this is typically convex.
The bottom line: instead of choosing k(x, x′), choose k((x, y), (x′, y′)).
Example: GP Classification
Sufficient Statistic
We pick φ(x, y) = φ(x) ⊗ ey, that is
k((x, y), (x′, y′)) = k(x, x′) δyy′ where y, y′ ∈ {1, . . . , n}

Kernel Expansion
By the representer theorem we get
θ = ∑_{i=1}^m ∑_y αiy φ(xi, y)

Optimization Problem
Big mess . . . but convex.
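To make the joint-kernel construction concrete, here is a minimal Python/NumPy sketch. The RBF base kernel and the coefficients αiy are illustrative assumptions; in practice α comes from the convex MAP problem above. It uses k((x, y), (x′, y′)) = k(x, x′)δyy′ and the representer expansion to compute p(y|x) ∝ exp(∑_i αiy k(xi, x)):

```python
import numpy as np

def rbf(x, xp, gamma=1.0):
    """Base kernel k(x, x'); the joint kernel multiplies this by delta_{y y'}."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def joint_kernel(x, y, xp, yp, gamma=1.0):
    """k((x, y), (x', y')) = k(x, x') * delta_{y y'}."""
    return rbf(x, xp, gamma) if y == yp else 0.0

def class_probabilities(x, X_train, alpha):
    """p(y|x) proportional to exp(<phi(x, y), theta>), where by the representer
    theorem <phi(x, y), theta> = sum_i alpha[i, y] k(x_i, x)."""
    K = np.array([rbf(xi, x) for xi in X_train])   # k(x_i, x), shape (m,)
    scores = K @ alpha                              # one score per class
    scores -= scores.max()                          # numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Toy usage with made-up coefficients (m = 3 points, n = 2 classes).
X_train = np.array([[0.0], [1.0], [2.0]])
alpha = np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]])
print(class_probabilities(np.array([0.2]), X_train, alpha))
```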
A Toy Example
Noisy Data
Example: GP Regression
Sufficient Statistic (Standard Model)
We pick φ(x, y) = (yφ(x), y²), that is
k((x, y), (x′, y′)) = k(x, x′) yy′ + y²y′² where y, y′ ∈ R
Traditionally the variance is fixed, that is θ2 = const.

Sufficient Statistic (Fancy Model)
We pick φ(x, y) = (yφ1(x), y²φ2(x)), that is
k((x, y), (x′, y′)) = k1(x, x′) yy′ + k2(x, x′) y²y′² where y, y′ ∈ R
We estimate mean and variance simultaneously.

Kernel Expansion
By the representer theorem (and more algebra) we get
θ = ( ∑_{i=1}^m αi1 φ1(xi), ∑_{i=1}^m αi2 φ2(xi) )
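As a quick illustration of the two joint kernels, here is a short Python sketch. The base kernels are assumed to be RBF kernels with made-up bandwidths; any positive definite kernels would do:

```python
import numpy as np

def rbf(x, xp, gamma=1.0):
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def k_standard(x, y, xp, yp):
    """Standard model: phi(x, y) = (y phi(x), y^2), so
    k((x, y), (x', y')) = k(x, x') y y' + y^2 y'^2."""
    return rbf(x, xp) * y * yp + y**2 * yp**2

def k_fancy(x, y, xp, yp):
    """Fancy model: phi(x, y) = (y phi1(x), y^2 phi2(x)), so
    k((x, y), (x', y')) = k1(x, x') y y' + k2(x, x') y^2 y'^2."""
    return rbf(x, xp, gamma=1.0) * y * yp + rbf(x, xp, gamma=0.1) * y**2 * yp**2
```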
Training Data
Mean k⃗(x)⊤ (K + σ²1)⁻¹ y
Variance k(x, x) + σ² − k⃗(x)⊤ (K + σ²1)⁻¹ k⃗(x)
Putting everything together . . .
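Those two formulas are the standard Gaussian Process predictive equations; a minimal NumPy sketch putting them together (the RBF kernel, its bandwidth, the noise level σ², and the toy data are all illustrative choices):

```python
import numpy as np

def rbf_matrix(A, B, gamma=1.0):
    """Kernel matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def gp_predict(X, y, x_star, sigma2=0.1):
    """Mean  k(x)^T (K + sigma^2 1)^{-1} y  and
    variance  k(x, x) + sigma^2 - k(x)^T (K + sigma^2 1)^{-1} k(x)."""
    K = rbf_matrix(X, X)
    k = rbf_matrix(X, x_star[None, :])[:, 0]             # the vector k(x)
    A = np.linalg.solve(K + sigma2 * np.eye(len(X)), np.c_[y, k])
    mean = k @ A[:, 0]
    var = 1.0 + sigma2 - k @ A[:, 1]                     # k(x, x) = 1 for an RBF
    return mean, var

X = np.linspace(-2, 2, 10)[:, None]
y = np.sin(X[:, 0]) + 0.1 * np.random.default_rng(0).normal(size=10)
print(gp_predict(X, y, np.array([0.5])))
```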
Another Example
Adaptive Variance Method
Optimization Problem:

minimize over α
∑_{i=1}^m [ −(1/4) (∑_{j=1}^m α1j k1(xi, xj))⊤ (∑_{j=1}^m α2j k2(xi, xj))⁻¹ (∑_{j=1}^m α1j k1(xi, xj))
− (1/2) log det(−2 ∑_{j=1}^m α2j k2(xi, xj))
− ∑_{j=1}^m ( yi⊤ α1j k1(xi, xj) + (yi⊤ α2j yi) k2(xi, xj) ) ]
+ (1/(2σ²)) ∑_{i,j} ( α1i⊤ α1j k1(xi, xj) + tr[α2i α2j⊤] k2(xi, xj) )

subject to 0 ≻ ∑_{i=1}^m α2i k2(xi, xj) for all j.
Properties of the problem:
The problem is convex.
The log-determinant from the normalization of the Gaussian acts as a barrier function.
We get a semidefinite program.
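For intuition, here is a hedged CVXPY sketch of the scalar-output (y ∈ R) instance of this program. The data, kernel matrices, and σ² are made up, and in the vector-valued case the elementwise sign constraint on θ2(x) becomes the semidefinite constraint above:

```python
import numpy as np
import cvxpy as cp

# Toy data and illustrative kernel matrices K1 (mean) and K2 (variance).
m = 20
x = np.linspace(-2, 2, m)
y = np.sin(x) + 0.2 * np.random.default_rng(0).normal(size=m)
K1 = np.exp(-np.subtract.outer(x, x) ** 2)
K2 = np.exp(-0.5 * np.subtract.outer(x, x) ** 2)
sigma2 = 1.0

a1, a2 = cp.Variable(m), cp.Variable(m)
t1 = K1 @ a1          # theta_1(x_i) = sum_j a1_j k1(x_i, x_j)
t2 = K2 @ a2          # theta_2(x_i), kept negative by the constraint below

# -(1/4) t1_i^2 / t2_i = (1/4) quad_over_lin(t1_i, -t2_i): convex while -t2_i > 0.
g = sum(0.25 * cp.quad_over_lin(t1[i], -t2[i]) for i in range(m))
g -= 0.5 * cp.sum(cp.log(-2 * t2))             # the log-det (here: log) barrier
fit = -(y @ t1 + (y ** 2) @ t2)                # -<phi(x_i, y_i), theta>
L1 = np.linalg.cholesky(K1 + 1e-8 * np.eye(m))   # a^T K a = ||L^T a||^2
L2 = np.linalg.cholesky(K2 + 1e-8 * np.eye(m))
reg = (cp.sum_squares(L1.T @ a1) + cp.sum_squares(L2.T @ a2)) / (2 * sigma2)

prob = cp.Problem(cp.Minimize(g + fit + reg), [t2 <= -1e-6])
prob.solve()
print(prob.status, prob.value)
```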
Heteroscedastic Regression
Natural Parameters
Structured Observations
Joint density and graphical models
Hammersley-Clifford Theorem
p(x) = (1/Z) exp( ∑_{c∈C} ψc(xc) )
Decomposition of any (strictly positive) p(x) into a product of potential functions on the maximal cliques.
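To see the normalization concretely, a tiny brute-force check in Python on a three-node binary chain (the potentials ψc are made up):

```python
import itertools, math

# Maximal cliques of a 3-node chain: (x0, x1) and (x1, x2).
psi = lambda a, b: 0.5 if a == b else -0.5     # arbitrary clique potential

def unnormalized(x):
    return math.exp(psi(x[0], x[1]) + psi(x[1], x[2]))

Z = sum(unnormalized(x) for x in itertools.product([0, 1], repeat=3))
p = {x: unnormalized(x) / Z for x in itertools.product([0, 1], repeat=3)}
print(Z, sum(p.values()))                       # probabilities sum to 1
```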
Application to Exponential Families
Hammersley-Clifford Corollary
Combining the Hammersley-Clifford theorem with exponential families,
p(x) = (1/Z) exp( ∑_{c∈C} ψc(xc) ) and p(x) = exp(〈φ(x), θ〉 − g(θ)),
we obtain a decomposition of φ(x) into clique terms:
p(x) = exp( ∑_{c∈C} 〈φc(xc), θc〉 − g(θ) )

Consequence for Kernels
k(x, x′) = ∑_{c∈C} kc(xc, x′c)
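A small sketch of this kernel decomposition for chain-structured observations, where the maximal cliques are the consecutive pairs (xt, xt+1); the RBF clique kernel is an arbitrary choice for illustration:

```python
import numpy as np

def k_clique(xc, xpc, gamma=1.0):
    """Base kernel on one clique (here: an RBF on the pair of values)."""
    return np.exp(-gamma * np.sum((np.asarray(xc) - np.asarray(xpc)) ** 2))

def k_chain(x, xp):
    """k(x, x') = sum over cliques c of k_c(x_c, x'_c); for a chain the
    maximal cliques are the consecutive pairs (x_t, x_{t+1})."""
    T = len(x)
    return sum(k_clique(x[t:t + 2], xp[t:t + 2]) for t in range(T - 1))

print(k_chain([0.0, 1.0, 2.0], [0.1, 0.9, 2.2]))
```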
Conditional Random Fields
Dependence structure between variables
[Diagram: a chain over times t−2, t−1, t, t+1, t+2, with a row of observation nodes x (X) and a row of label nodes y (Y).]
Key Points
We can drop cliques in x: they do not affect p(y|x, θ).
Compute g(θ|x) via dynamic programming, as sketched below.
Assume stationarity of the model, that is, θc does not depend on the position of the clique.
We only need the sufficient statistics φxy(xt, yt) and φyy(yt, yt+1).
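A minimal sketch of that dynamic program for a chain: g(θ|x) follows from a forward (logsumexp) recursion. The node and transition score arrays stand in for 〈φxy(xt, yt), θ〉 and 〈φyy(yt, yt+1), θ〉; the random scores are placeholders:

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(node_scores, trans_scores):
    """g(theta|x) = log sum over all label sequences y of
    exp( sum_t node[t, y_t] + sum_t trans[y_t, y_{t+1}] ),
    computed in O(T * K^2) by the forward recursion.

    node_scores:  (T, K) array of per-position label scores.
    trans_scores: (K, K) array of label-transition scores."""
    alpha = node_scores[0]
    for t in range(1, len(node_scores)):
        # alpha[y'] = node[t, y'] + log sum_y exp(alpha[y] + trans[y, y'])
        alpha = node_scores[t] + logsumexp(alpha[:, None] + trans_scores, axis=0)
    return logsumexp(alpha)

T, K = 5, 3
rng = np.random.default_rng(0)
print(log_partition(rng.normal(size=(T, K)), rng.normal(size=(K, K))))
```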
Computational Issues
Convergence
The perceptron algorithm converges in ‖θ‖² max_{i,y} ‖φ(xi, y)‖² updates.
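For reference, the perceptron loop that bound refers to, as a hedged sketch; `phi`, the label set, and the data are placeholders, and for structured y the argmax would be the dynamic program above:

```python
import numpy as np

def perceptron(data, phi, labels, dim, epochs=10):
    """data: list of (x, y) pairs; phi(x, y) -> feature vector of length dim.
    Update theta whenever the current argmax disagrees with the true label."""
    theta = np.zeros(dim)
    for _ in range(epochs):
        for x, y in data:
            y_hat = max(labels, key=lambda yc: theta @ phi(x, yc))
            if y_hat != y:
                theta += phi(x, y) - phi(x, y_hat)   # move toward the truth
    return theta
```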
Extension: Partial Labels
Semi-supervised learning
We have both the training set X, Y and the test patterns X′ available at estimation time. Can we take advantage of this additional information (aka “transduction”)?

Partially labeled data
Some observations may have uncertain labels, i.e., yi ∈ Yi ⊆ Y (such as yi ∈ {apple, orange} but yi ≠ pear). Can we use these observations and also infer the labels?

Clustering
Here we have no label information at all. The goal is to find a plausible assignment of yi such that similar observations tend to share the same label.

Key Idea
We maximize the likelihood p(y|θ, X) over both θ and y.
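One hedged way to read that idea as an algorithm is coordinate ascent, sketched below; `fit` and `predict_scores` are hypothetical stand-ins for the MAP estimator of the earlier slides and the resulting per-class scores:

```python
def alternating_labels(X, label_sets, fit, predict_scores, iters=10):
    """Maximize p(y | theta, X) over theta and y by coordinate ascent:
    fix y and fit theta, then fix theta and pick the best y_i in each Y_i.
    label_sets[i] is the admissible set Y_i for observation x_i."""
    y = [min(Y) for Y in label_sets]               # arbitrary initial labeling
    for _ in range(iters):
        theta = fit(X, y)                          # MAP estimate given labels
        y = [max(Y, key=lambda c: predict_scores(theta, x)[c])
             for x, Y in zip(X, label_sets)]       # best admissible label
    return theta, y
```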
Extension: Distributed Inference
Interacting Agents
We have a set of agents which only interact with their neighbors.

Junction Tree
We can use a distributed algorithm to find a junction tree based on the local neighborhood structure. This assumes a “nice” structure in the neighborhood graph.

Local Message Passing
Use the Generalized Distributive Law if the junction tree is thin enough. Messages are expectations of φc(xc).

Alternative
When the junction tree is too wide (or none is available), just use loopy belief propagation. And hope . . .
Summary
Sufficient statistic leads to kernel via
k(x, x′) = 〈φ(x), φ(x′)〉
Maximum a posteriori estimation is a convex problem.

Conditioning turns simple models into fancy nonparametric estimators, such as:
Normal distribution ⟹ regression
Multinomial distribution ⟹ multiclass classification
Structured statistic ⟹ CRF
Poisson distribution ⟹ spatial disease model
Latent category ⟹ clustering model
Shameless Plugs
We are hiring. For details, contact [email protected] (http://www.nicta.com.au).