
• Perturb-and-MAP [1] has been shown to be effective for pairwise MRFs, yet its application to other kinds of graphical models has been limited.

• We demonstrate that Perturb-and-MAP is effective at learning features using graphical models with complex dependencies between variables.

• We also propose a method of designing perturbations so that the distribution induced by Perturb-and-MAP better approximates the Gibbs distribution.

Abstract

Efficient Feature Learning Using Perturb-and-MAP
Ke Li, Kevin Swersky and Richard Zemel

• The cardinality restricted Boltzmann machine (CaRBM) enforces a sparsity constraint over hidden units:

$$P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \exp\left(\mathbf{h}^T W \mathbf{v} + \mathbf{b}^T \mathbf{h} + \mathbf{c}^T \mathbf{v}\right) \psi_k\!\left(\sum_i h_i\right)$$

where $\psi_k(x) = 1$ if $x \le k$ and $0$ otherwise.

• Training requires sampling from $P(\mathbf{h} \mid \mathbf{v})$, which is non-trivial because the cardinality potential means the hidden units are not conditionally independent of each other given the visible units.

• Swersky et al. [2] proposed a method to compute $P(\mathbf{h} \mid \mathbf{v})$ using message passing in $O(kF)$ time.
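The $O(kF)$ computation can be illustrated with a small dynamic program over the cardinality potential. This is only a sketch of the idea, not the poster's implementation; the parameter values and variable names are illustrative. It draws an exact sample from $P(\mathbf{h} \mid \mathbf{v})$ under the constraint $\sum_i h_i \le k$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-unit inputs a_i = (W v + b)_i for F hidden units, cap k
F, k = 10, 3
a = rng.normal(size=F)
p = 1.0 / (1.0 + np.exp(-a))  # unconstrained Bernoulli probabilities

# Forward pass: alpha[j, c] = unnormalized probability that units 1..j
# contain exactly c ones, with c capped at k -- a dynamic program over the
# cardinality potential that runs in O(kF) time.
alpha = np.zeros((F + 1, k + 1))
alpha[0, 0] = 1.0
for j in range(F):
    for c in range(min(j, k) + 1):
        alpha[j + 1, c] += alpha[j, c] * (1 - p[j])
        if c + 1 <= k:
            alpha[j + 1, c + 1] += alpha[j, c] * p[j]

# Backward pass: sample h_F, ..., h_1 exactly, conditioning on the counts
h = np.zeros(F)
c = rng.choice(k + 1, p=alpha[F] / alpha[F].sum())  # total number of ones
for j in range(F, 0, -1):
    w_off = alpha[j - 1, c] * (1 - p[j - 1])
    w_on = alpha[j - 1, c - 1] * p[j - 1] if c >= 1 else 0.0
    if rng.random() < w_on / (w_on + w_off):
        h[j - 1] = 1.0
        c -= 1

print(h, "sum =", int(h.sum()))
```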

• Using Perturb-and-MAP, if the input to each hidden unit is perturbed with Logistic(0,1) noise and MAP is performed using a selection algorithm, an approximate sample can be drawn in $O(F)$ time.

• We found that the features learned using Perturb-and-MAP have greater discriminative capability.
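The Perturb-and-MAP sampler above needs only one noisy MAP step. A minimal sketch, assuming Logistic(0,1) perturbations of each unit's input and using np.argpartition as the selection step (all parameter values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy CaRBM parameters: F hidden units, D visible units, sparsity cap k
F, D, k = 20, 10, 3
W = rng.normal(size=(F, D))
b = rng.normal(size=F)
v = rng.integers(0, 2, size=D).astype(float)

def perturb_and_map_hidden(W, b, v, k, rng):
    """Approximate sample from P(h | v) under the cardinality constraint.

    Each hidden unit's input is perturbed with Logistic(0,1) noise; MAP under
    the constraint sum(h) <= k activates the units with the k largest
    perturbed inputs, among those whose perturbed input is positive.
    np.argpartition plays the role of the O(F) selection algorithm.
    """
    a = W @ v + b + rng.logistic(0.0, 1.0, size=b.shape)  # perturbed inputs
    h = np.zeros_like(a)
    top = np.argpartition(a, -k)[-k:]  # indices of the k largest inputs
    on = top[a[top] > 0]               # only units that prefer h_i = 1
    h[on] = 1.0
    return h

h = perturb_and_map_hidden(W, b, v, k, rng)
print(int(h.sum()), "active hidden units (cap:", k, ")")
```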

Cardinality RBM

• Many tasks involve predicting the correct matching in a bipartite graph, like image stitching, stereo reconstruction and video tracking.

• Our aim is to learn a descriptor for image patches that is tailored to matching key points across images.

• Our bipartite matching model is characterized by:

$$P(M; \theta) = \frac{1}{Z} \exp\!\left( -\frac{1}{2N} \sum_{i,j} m_{ij} \left\| \phi(x_i; \theta) - \phi(x'_j; \theta) \right\|^2 \right) \prod_i \psi\!\Big(\sum_j m_{ij}\Big) \prod_j \psi\!\Big(\sum_i m_{ij}\Big)$$

where $\psi(x) = 1$ if $x = 1$ and $0$ otherwise, and $m_{ij} = 1$ if the $i$-th and $j$-th key points match and $0$ otherwise.

• Training requires estimating an expectation over $M$ using a sample from $P(M; \theta)$.

• As computing the partition function of $P(M; \theta)$ is #P-hard, sampling from $P(M; \theta)$ is challenging.

• If the model is perturbed with noise from the right distribution, approximate samples can be drawn in $O(N^3)$ time using the Hungarian algorithm.
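The sampling step can be sketched as follows. Brute-force enumeration of permutations stands in for the Hungarian algorithm (which solves the same MAP problem in $O(N^3)$ time), and the i.i.d. Gumbel(0,1) perturbation of each pairwise potential is a simple reduced-order choice, not necessarily the exact noise distribution the poster refers to; the descriptors are random stand-ins for $\phi(x_i; \theta)$:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical descriptors for N key points in each image
N, d = 4, 5
phi_x = rng.normal(size=(N, d))
phi_xp = rng.normal(size=(N, d))

# Pairwise negative energies: -(1/(2N)) * ||phi(x_i) - phi(x'_j)||^2
A = -np.sum((phi_x[:, None, :] - phi_xp[None, :, :]) ** 2, axis=2) / (2 * N)

def perturb_and_map_matching(A, rng):
    """Approximate sample from P(M) via Perturb-and-MAP.

    Perturb each pairwise potential with Gumbel(0,1) noise, then solve the
    resulting assignment problem. Brute force over permutations is used here
    for clarity; the Hungarian algorithm would do the same MAP in O(N^3).
    """
    Ap = A + rng.gumbel(0.0, 1.0, size=A.shape)
    best, best_score = None, -np.inf
    for perm in itertools.permutations(range(A.shape[0])):
        score = sum(Ap[i, j] for i, j in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    M = np.zeros_like(A)
    for i, j in enumerate(best):
        M[i, j] = 1.0
    return M

M = perturb_and_map_matching(A, rng)
print(M)
```

Each row and column of the returned M sums to one, so the constraint potentials $\psi$ are satisfied by construction.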

Bipartite Matching

• If the negative energy of each joint configuration is perturbed with i.i.d. Gumbel(0,1) noise, exact samples can be drawn from the Gibbs distribution using Perturb-and-MAP.

• In practice, reduced-order perturbation must be used to ensure tractability. As a result, negative perturbed energies across joint configurations are no longer independent or Gumbel-distributed. We propose a way of designing perturbations so that the latter property is preserved.

• The negative perturbed energy of each joint configuration is distributed according to the sum of the individual perturbations.

• We find a distribution $D(1)$ using numerical deconvolution that satisfies the following property: if $X \sim \mathrm{Gumbel}(0,1)$ and $Y \sim D(1)$ are independent, then $X + Y \sim \mathrm{Gumbel}(0,2)$.

• Define $D(s)$ as a scaled version of $D(1)$. Then if $X \sim \mathrm{Gumbel}(0, 2^{-(N-1)})$ and $Y_i \sim D(2^{-i})$ for all $i \in \{1, \dots, N-1\}$ are mutually independent, $X + \sum_{i=1}^{N-1} Y_i \sim \mathrm{Gumbel}(0,1)$. Thus, by perturbing the model with noise from the above distributions, the negative perturbed energy of each joint configuration is guaranteed to follow a Gumbel(0,1) distribution.

Designing Perturbations

Figure 1: The pdfs of Gumbel(0,1) and D(1)

Figure 2a: Comparison of reconstruction errors

Figure 2b: Comparison of prediction errors

Figure 4: Comparison of test error rates

Ongoing Research

• We are exploring ways of combining D-perturbations to obtain perturbations with equal entropy while ensuring the negative perturbed energies are approximately Gumbel-distributed.

• We are also investigating how closely the empirical marginals over configurations produced using different perturbation methods approximate the underlying Gibbs distribution.

Figure 3: Two frames and the ground-truth matching from the dataset

• Perturb-and-MAP is an approximate sampling method that leverages existing optimization algorithms for performing MAP inference.

• It works by perturbing the potentials with random noise, and then performing MAP inference on the model with the perturbed potentials.

• It relies on the following fact: if $\epsilon_1, \dots, \epsilon_n \sim$ i.i.d. $\mathrm{Gumbel}(0,1)$, then

$$P\!\left(a_k + \epsilon_k = \max_i \,(a_i + \epsilon_i)\right) = \frac{\exp(a_k)}{\sum_i \exp(a_i)}$$

• If the energy of each joint configuration is perturbed, Perturb-and-MAP yields an exact sample.

• In a pairwise MRF, perturbing unary and pairwise potentials has been shown empirically to produce similar results as perturbing each joint configuration.
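The Gumbel-max fact above is easy to verify empirically. The sketch below compares the argmax frequencies of perturbed scores against the softmax probabilities; the score vector and sample count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized log-potentials for n = 4 joint configurations
a = np.array([0.5, -1.0, 2.0, 0.0])
softmax = np.exp(a) / np.exp(a).sum()

# Perturb-and-MAP at the level of joint configurations: add i.i.d.
# Gumbel(0,1) noise to each score and take the argmax. Each argmax is an
# exact sample from the Gibbs distribution.
samples = 200_000
noise = rng.gumbel(0.0, 1.0, size=(samples, a.size))
winners = np.argmax(a + noise, axis=1)
freq = np.bincount(winners, minlength=a.size) / samples

print("softmax:  ", np.round(softmax, 3))
print("empirical:", np.round(freq, 3))
```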

Perturb-and-MAP

References
[1] George Papandreou and Alan L. Yuille (2011). Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. ICCV.
[2] Kevin Swersky, Danny Tarlow, Ilya Sutskever, Ruslan Salakhutdinov, Rich Zemel, and Ryan Adams (2012). Cardinality restricted Boltzmann machines. NIPS 25.

{keli,kswersky,zemel}@cs.toronto.edu
