Efficient Feature Learning Using Perturb-and-MAP
Ke Li, Kevin Swersky and Richard Zemel

Abstract

• Perturb-and-MAP [1] has been shown to be effective for pairwise MRFs, yet its application to other kinds of graphical models has been limited.

• We demonstrate that Perturb-and-MAP is effective at learning features using graphical models with complex dependencies between variables.

• We also propose a method of designing perturbations so that the distribution induced by Perturb-and-MAP better approximates the Gibbs distribution.

Cardinality RBM


• The cardinality restricted Boltzmann machine (CaRBM) enforces a sparsity constraint over hidden units:

$$P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \exp\left(\mathbf{h}^T W \mathbf{v} + \mathbf{b}^T \mathbf{h} + \mathbf{c}^T \mathbf{v}\right) \psi_k\!\left(\textstyle\sum_i h_i\right)$$

where $\psi_k(x) = 1$ if $x \le k$ and 0 otherwise.

• Training requires sampling from $P(\mathbf{h} \mid \mathbf{v})$, which is non-trivial because the cardinality constraint means the hidden units are not conditionally independent of each other given the visible units.

• Swersky et al. [2] proposed a method to compute $P(\mathbf{h} \mid \mathbf{v})$ using message passing in $O(kF)$ time, where $F$ is the number of hidden units.

• Using Perturb-and-MAP, if the input to each hidden unit is perturbed with Logistic(0,1) noise and MAP is performed using a selection algorithm, an approximate sample can be drawn in $O(F)$ time (see the sketch below).

• We found that the features learned using Perturb-and-MAP have greater discriminative capability.
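As a concrete illustration, here is a minimal NumPy sketch of this approximate sampler (the function name and argument conventions are ours, not from the poster): perturb each hidden unit's input with Logistic(0,1) noise, then solve the cardinality-constrained MAP problem, which reduces to an $O(F)$ selection of the largest perturbed inputs.

```python
import numpy as np

def sample_hidden_pam(W, b, v, k, rng=np.random.default_rng()):
    """Approximate CaRBM sample from P(h | v) via Perturb-and-MAP.

    Illustrative sketch: W is F x D, b has length F, v has length D, k >= 1.
    """
    inputs = W @ v + b                                   # hidden pre-activations
    perturbed = inputs + rng.logistic(0.0, 1.0, size=b.shape)
    # MAP under sum(h) <= k: keep the k largest perturbed inputs
    # (O(F) selection via introselect), turning a unit on only if its
    # perturbed input is positive.
    top_k = np.argpartition(-perturbed, k - 1)[:k]
    h = np.zeros_like(b)
    h[top_k] = (perturbed[top_k] > 0).astype(h.dtype)
    return h
```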

Bipartite Matching

• Many tasks involve predicting the correct matching in a bipartite graph, like image stitching, stereo reconstruction and video tracking.

• Our aim is to learn a descriptor for image patches that is tailored to matching key points across images.

• Our bipartite matching model is characterized by:

$$P(M; \theta) = \frac{1}{Z} \exp\left(-\frac{1}{2N} \sum_{i,j} m_{ij} \left\lVert \phi(x_i; \theta) - \phi(x'_j; \theta) \right\rVert_2^2\right) \prod_i \psi\!\left(\textstyle\sum_j m_{ij}\right) \prod_j \psi\!\left(\textstyle\sum_i m_{ij}\right)$$

where $\psi(x) = 1$ if $x = 1$ and 0 otherwise, and $m_{ij} = 1$ if the $i$-th and $j$-th key points match and 0 otherwise.

• Training requires estimating an expectation over $M$ using a sample from $P(M; \theta)$.

• As computing the partition function of $P(M; \theta)$ is #P-hard, sampling from $P(M; \theta)$ is challenging.

• If the model is perturbed with noise from the right distribution, approximate samples can be drawn in $O(N^3)$ time using the Hungarian algorithm (see the sketch below).
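A minimal sketch of this sampler, using i.i.d. Gumbel(0,1) entry-wise noise purely for illustration (the designed perturbations of the next section would be substituted here); the function name and array shapes are ours:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sample_matching_pam(phi_x, phi_xp, rng=np.random.default_rng()):
    """Approximate sample from P(M; theta) for an N x N bipartite matching.

    phi_x and phi_xp are N x d arrays of learned descriptors for the
    key points in the two images.
    """
    N = phi_x.shape[0]
    # negative energy of pairing key point i with key point j
    diffs = phi_x[:, None, :] - phi_xp[None, :, :]
    neg_energy = -np.sum(diffs ** 2, axis=2) / (2.0 * N)
    perturbed = neg_energy + rng.gumbel(0.0, 1.0, size=(N, N))
    # Hungarian algorithm, O(N^3); negate because it minimizes cost
    rows, cols = linear_sum_assignment(-perturbed)
    M = np.zeros((N, N), dtype=int)
    M[rows, cols] = 1
    return M
```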

Designing Perturbations

• If the negative energy of each joint configuration is perturbed with i.i.d. Gumbel(0,1) noise, exact samples can be drawn from the Gibbs distribution using Perturb-and-MAP.

• In practice, reduced-order perturbation must be used to ensure tractability. As a result, the negative perturbed energies across joint configurations are no longer independent or Gumbel-distributed. We propose a way of designing perturbations so that the latter property, Gumbel-distributed marginals, is preserved.

• The negative perturbed energy of each joint configuration is the sum of the individual perturbations, so its distribution is the convolution of their distributions.

• We find a distribution $D(1)$ using numerical deconvolution (sketched below) that satisfies the following property: if $X \sim \text{Gumbel}(0,1)$ and $Y \sim D(1)$ are independent, then $X + Y \sim \text{Gumbel}(0,2)$.

• Define $D(s)$ as a version of $D(1)$ scaled by $s$. Then if $X \sim \text{Gumbel}(0, 2^{-(N-1)})$ and $Y_i \sim D(2^{-i})$ for all $i \in \{1, \ldots, N-1\}$ are mutually independent, $X + \sum_{i=1}^{N-1} Y_i \sim \text{Gumbel}(0,1)$. Thus, by perturbing the model with noise from the above distributions, the negative energy of each joint configuration is guaranteed to follow a Gumbel(0,1) distribution.
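A minimal sketch of the deconvolution step, assuming an FFT-based method (the poster does not say which deconvolution algorithm was used; the grid bounds, resolution, and regularization threshold below are illustrative): since Gumbel(0,1) convolved with $D(1)$ must equal Gumbel(0,2), the Fourier transform of $D(1)$ is the ratio of the two densities' transforms.

```python
import numpy as np
from scipy.stats import gumbel_r

# Recover the density of D(1) from  Gumbel(0,1) * D(1) = Gumbel(0,2),
# where * denotes convolution: divide the Fourier transforms.
x = np.linspace(-20.0, 40.0, 2 ** 14)        # grid (illustrative bounds)
dx = x[1] - x[0]
p1 = gumbel_r.pdf(x, loc=0.0, scale=1.0)     # Gumbel(0,1) density
p2 = gumbel_r.pdf(x, loc=0.0, scale=2.0)     # Gumbel(0,2) density

F1, F2 = np.fft.fft(p1), np.fft.fft(p2)
i0 = int(round(-x[0] / dx))                  # grid index of x = 0
# phase factor accounts for the grid not starting at x = 0
shift = np.exp(-2j * np.pi * i0 * np.fft.fftfreq(x.size))
# regularized division to avoid blow-up where F1 is tiny
ratio = np.where(np.abs(F1) > 1e-8, F2 * shift / (dx * F1), 0.0)
d1 = np.real(np.fft.ifft(ratio))             # density of D(1) on the grid
```

Figure 1 plots the two densities; in practice the recovered density should be checked for normalization and residual tail oscillations.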

Figure 1: The pdfs of Gumbel(0,1) and D(1)
Figure 2a: Comparison of reconstruction errors
Figure 2b: Comparison of prediction errors
Figure 3: Two frames and ground truth matching from the dataset
Figure 4: Comparison of test error rates

Ongoing Research

• We are exploring ways of combining D-perturbations to obtain perturbations with equal entropy while ensuring the negative perturbed energies are approximately Gumbel-distributed.

• We are also investigating how closely the empirical marginals over configurations produced using different perturbation methods approximate the underlying Gibbs distribution.

Perturb-and-MAP

• Perturb-and-MAP is an approximate sampling method that leverages existing optimization algorithms for performing MAP inference.

• It works by perturbing the potentials with random noise and then performing MAP inference on the perturbed model.

• It relies on the following fact: if $\epsilon_1, \ldots, \epsilon_n \sim$ i.i.d. $\text{Gumbel}(0,1)$, then

$$P\left(a_k + \epsilon_k = \max_i\,(a_i + \epsilon_i)\right) = \frac{\exp(a_k)}{\sum_i \exp(a_i)}$$

• If the negative energy of each joint configuration is perturbed with i.i.d. Gumbel(0,1) noise, Perturb-and-MAP yields an exact sample (checked numerically in the sketch at the end of this section).

• In a pairwise MRF, perturbing the unary and pairwise potentials has been shown empirically to produce results similar to perturbing each joint configuration.
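The identity above is easy to verify numerically; in this quick check (arbitrary scores and sample count), the empirical argmax frequencies should match the softmax probabilities to about two decimal places.

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([1.0, 0.5, -0.3, 2.0])          # arbitrary scores a_1..a_n
softmax = np.exp(a) / np.exp(a).sum()        # target Gibbs probabilities

noise = rng.gumbel(0.0, 1.0, size=(100_000, a.size))
draws = np.argmax(a + noise, axis=1)         # Perturb-and-MAP samples
empirical = np.bincount(draws, minlength=a.size) / draws.size

print(softmax)    # the two rows should agree closely
print(empirical)
```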


References
[1] George Papandreou and Alan L. Yuille (2011). Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. ICCV.
[2] Kevin Swersky, Danny Tarlow, Ilya Sutskever, Ruslan Salakhutdinov, Rich Zemel, and Ryan Adams (2012). Cardinality restricted Boltzmann machines. NIPS 25.

{keli,kswersky,zemel}@cs.toronto.edu