MSc Artificial Intelligence
Master Thesis
Few-shot Classification by Learning Disentangled Representations
by
Emiel Hoogeboom
10831428
June, 2017
36 ECTS
January – June, 2017
Supervisor: Dr. E. Gavves
Daily Supervisor: Dr. E. Gavves
Assessor: Prof. Dr. M. Welling
Faculteit der Natuurkunde, Wiskunde en Informatica
Acknowledgements
I would like to thank Efstratios Gavves for his guidance and help over
the past half year. He could truly inspire me to approach a problem
differently. He managed to spend a lot of time with me, despite his
busy schedule.
I would also like to thank Jorn Peters, with whom I have had numerous
discussions that led to significant insights. Jorn may be one of the
smartest guys I know, and I predict that he will one day run his own
research lab.
My gratitude goes out to my committee, consisting of Max Welling and
Efstratios, who agreed to read my report on short notice.
Finally, I would like to thank my parents, and everyone else who helped
me with their support and encouragement.
Abstract
Machine learning has improved state-of-the-art performance in numerous
domains by using large amounts of data. In reality, labelled data is
often not available for the task of interest. A fundamental problem of
artificial intelligence is finding a representation that can generalize
to never-before-seen classes. In this research, the power of generative
models is combined with disentangled representations. The combination is
leveraged to learn a representation for content, which generalizes to
unseen classes. Potentially, disentangled representations can
drastically reduce the number of required training examples, and improve
understanding of different factors of variation.
This is achieved by starting with a known procedure to disentangle
representations. By exploring the structure of the content
representation, a loss function is composed such that the model learns a
few-shot class probability. A mathematical framework is defined that
includes a few-shot class probability. This probability ensures that a
disentangled representation is learned. A lower bound of the
log-likelihood is derived to obtain an objective function that optimizes
the log-likelihood conditioned on the support set. The presented method
has achieved state-of-the-art performance on the Omniglot dataset at the
time of writing.
Contents
Acknowledgements
Abstract
Contents
Introduction
1 Related Work
  1.1 Few-shot learning
  1.2 Generative Models
  1.3 Disentangling Representation
  1.4 Contribution
2 Preliminaries
  2.1 Variational Autoencoders
  2.2 Generative Adversarial Networks
  2.3 Kullback Leibler Divergence for Multivariate Normals
  2.4 Squared Euclidean distance between Multivariate Normals
  2.5 Disentangling Factors of Variation
3 Structure of Disentanglement
  3.1 Understanding Disentanglement
  3.2 Distance Penalty for Content
  3.3 Disentangling with Distance Loss Exclusively
4 Model: Generative Few-shot Learning
  4.1 Generative Model
  4.2 Class probability
    4.2.1 Embedding Distance in Literature
    4.2.2 Model Class Probability
  4.3 Support Set
    4.3.1 The Support-Conditional Log-likelihood
    4.3.2 Resolving the Posterior
  4.4 Collecting All Components
  4.5 Inference
  4.6 Implementation
5 Datasets
  5.1 Episodes
  5.2 MNIST
  5.3 Omniglot
  5.4 miniImageNet
  5.5 Quick, Draw
6 Experiments
  6.1 Setup
    6.1.1 Moving Average Batch Normalization
    6.1.2 Architecture
    6.1.3 Configuration
  6.2 Evaluation
    6.2.1 Omniglot
    6.2.2 miniImageNet
    6.2.3 Quick, Draw
  6.3 Expectation of Support Set
  6.4 Discussion
    6.4.1 Architecture
    6.4.2 Performance
    6.4.3 Model Framework
Conclusion
A Derivation Lowerbound for Batches
  A.1 Model definition
  A.2 Log-likelihood Conditioned on Support Content (SS)
  A.3 Lowerbound Conditioned on Support Examples (XS)
  A.4 Intermezzo: Factorizing the Support Set KL Term
  A.5 Collecting Terms
  A.6 Objective Function
Introduction
“Much learning does not teach understanding.”
Heraclitus of Ephesus
A deep learning model is a complex function approximator, based on a
simple principle that is applied repeatedly. Its complexity makes it
incredibly malleable, which allows it to perform tasks such as object
classification and detection at high performance levels. This
performance is possible given vast amounts of data from the test domain.
However, when inputs appear outside this domain, the words of Heraclitus
make deep learning models look foolish.
The human brain is remarkable at object recognition, especially because
of object constancy: under different types of illumination, pose or
other changes in viewpoint, an object is often easily recognized.
Moreover, unlike machine learning models, humans can easily generalize
from very few examples. A picture of an apple covered by snow is still
an apple. Most people have no problem with this decision, even if this
is the first time that an apple is observed in this exact condition.
“What I cannot create, I do not understand.”
Richard Feynman
A promising direction for these problems is generative modelling.
Generative modelling is elegantly motivated by the words of Feynman.
Deep learning models may simply be cheating by recognizing the sky when
they need to recognize birds. By learning to generate examples, a model
is forced to represent the whole image, including the bird. From a
practical viewpoint, modelling a generative process has the advantage of
being an unsupervised learning problem, and many unlabelled examples are
available. However, for a representation to be object constant, it needs
to be disentangled for variations in the object and other factors.
Disentangled representations are appealing, because a representation
suitable for distinguishing cars from trucks should be disentangled from
color. This concept is illustrated in Figure 1.
Figure 1: The concept of disentanglement: multiple attributes
characterize an object.
Status
In recent years, machine learning has improved state-of-the-art
performance in numerous domains. Notably, deep learning has shown
superhuman performance on multiple classification tasks, with extensive
amounts of data [1, 2]. In reality, labelled examples may be scarce,
which makes it difficult to learn a deep network directly. Moreover,
sometimes large quantities of labelled data are available, but not for
all classes of interest. The field that tries to classify images with
either one or a handful of exemplars is called one-shot or few-shot
learning.
In few-shot learning scenarios, a system is presented with only a few
examples per class with known labels. The collection of these examples
is called the support set. Another example is then presented to the
system, which has to be classified by comparing it with the support set.
Early attempts in the field of few-shot learning only inferred directly
from the support set. In more recent studies, the field has shifted
towards similarity metric learning. First a metric is learned from a
subset of classes, and then few-shot classification is tested on another
subset with different classes. The underlying assumption is that large
quantities of labelled data are available, but these are unavailable for
some classes. For the classes of interest, only a few examples are
labelled. The goal of few-shot learning is then to learn an embedding
that generalizes to unseen classes.
Deep generative models have been shown to improve classification
performance in semi-supervised learning settings. The aim is to have a
model for the data generation process, because capturing this process
means that the data was understood by the model to some degree. For
example, a discriminative model might classify a ship based on the
surrounding water, but a generative model will learn an actual
representation for a ship. The intuition is that learning the actual
representation allows generalization and yields better performance.
However, the application of generative models to few-shot learning has
thus far been limited.
Disentanglement
We define disentanglement as a separation in the representation of the
attributes of an object. Let us illustrate this concept with an example.
Imagine taking a photo of a car. In the camera the photo is represented
with a large number of pixels. These pixels are highly entangled, as
changing the color of the car would change a large number of the pixels.
A representation would be more disentangled when it generates the same
image, but changing a subset of variables changes only a subset of
attributes, for example the color of the car. Suppose that an object is
completely defined by a set of attributes. A valid representation of an
object should represent the complete set. Furthermore, a disentangled
representation is defined as a separable representation, such that each
part represents an exclusive subset of attributes, and the union of all
subsets is the complete set. The choice of attributes is arbitrary, and
can be chosen to match meaningful human intuitions.
Direction
In this thesis, a mathematical framework to combine generative models,
learning disentanglement, and few-shot learning is presented. The
framework is designed to learn a disentanglement between two subsets of
object attributes, content and style. The framework learns to represent
images in content and style variables, where the content variable is
used for few-shot classification and reconstruction, while the style
variable is only used for reconstruction. To enforce that content and
style represent attributes of an example exclusively, priors are placed
on these variables. These priors allow the framework to eliminate all
redundant information in representations during optimization. In this
formulation, the content is defined as all information that is helpful
in classifying the image. The style is defined as all other possible
sources of variation that are needed for reconstruction. The following
hypotheses shall be addressed:
• Learning disentangled representations can be combined with few-shot
classification.
• Few-shot classification accuracy is improved by using disentangled
representations.
1 Related Work
The related work is organized into three different sections on
sub-domains of deep learning. The domains few-shot learning, generative
models and disentangling representations are discussed. The last section
outlines what differentiates this thesis from existing literature.
1.1 Few-shot learning
Few-shot learning is a field where the number of examples is very
limited. A key insight by Fei-Fei et al. is that knowledge of previously
learned classes can be used, and hence learning does not start from
scratch [3]. The union of few-shot learning with deep learning has
shifted the field towards a metric learning approach, and this metric
(or embedding) has been learned in various manners.
Siamese networks [4] use the contrastive loss function to learn an
embedding on a data set. Another key insight was provided by Vinyals et
al., who showed that performance can significantly increase when the
training procedure is adapted to match the test procedure closely. In
their work, memory networks termed Matching Networks [5] are used to
augment the embedding. Ravi et al. improved the meta-learning approach
by proposing a recurrent meta-learner that models the updates for the
few-shot model [6]. Prototypical Networks [7] use basic components of
Matching Networks, showing that even higher performance can be attained
with a relatively simple architecture and procedure, without the need
for recurrent networks.
Instead of presuming a fixed distance metric to measure distance between
examples, the metric can be directly learned. Competitive results on
some datasets have been achieved by residual networks with skip
connections [8]. Learning the distance metric allows the model to choose
a suitable distance measure itself. This demonstrates that a distance
function with parameters can be a more suitable choice on some problem
instances.
1.2 Generative Models
An emerging field within machine learning is deep generative modelling.
A common assumption in generative modelling is that some
lower-dimensional representation exists. One can propose some
low-dimensional representation, and learn a transformation whose output
resembles samples from the data distribution.
Variational Auto-Encoders (VAEs) [9] are derived from a generative
process, by introducing a variational distribution that can be
recognized as an encoder in traditional auto-encoders. They learn to
reconstruct by encoding images into a lower-dimensional latent space,
and decoding a reconstruction from that latent space.
Generative Adversarial Networks (GANs) [10] learn to model the data by
defining two competing networks. The discriminator needs to classify
which images are real and which are fake, while the generator tries to
deceive the discriminator. The reconstruction loss of the VAE is defined
explicitly, and is often modelled by a pixel-wise error. That means that
a visually near-perfect reconstruction can still have a high error under
small perturbations (e.g. translation by one pixel). The loss of the
generator in a GAN is defined implicitly, as the ability to mislead the
discriminator. GANs tend to be able to reconstruct crisper and more
realistic images.
1.3 Disentangling Representation
In [11], a combination of VAEs and GANs learns to disentangle variation,
separating class information from style into two latent spaces. To
disentangle content from style, the labels are chosen to represent
content. The training procedure is then formulated such that all other
information will be encoded by the style variable. In other work, an
unsupervised disentangled representation is learned by maximizing the
mutual information between a subset of the latent variables and the
observation [12].
1.4 Contribution
This thesis combines few-shot learning, generative models and
disentangled representations. To the best knowledge of the author,
disentangled representations have never before been used for few-shot
classification.
2 Preliminaries
In this section, preliminary techniques are explained that will be used
in subsequent sections. The techniques discussed relate to general deep
learning models, mathematical derivations of distribution distances, and
an application where a disentanglement is learned.
2.1 Variational Autoencoders
Auto-encoders are artificial neural networks used for unsupervised
representation learning. The dimension of the input is equal to the
dimension of the output, and the purpose of the network is to
reconstruct the input. The representation that is learned at the
bottleneck of the network is called the code or latent space. An
auto-encoder can be divided into two distinct modules, the encoder and
the decoder. The encoder is a function that maps an input x into some
latent representation z, $\mathrm{Enc}: \mathcal{X} \to \mathcal{Z}$.
The decoder maps the latent representation back to the input space,
$\mathrm{Dec}: \mathcal{Z} \to \mathcal{X}$. The objective of the
auto-encoder is to minimize some distance loss as defined in Equation 1.
\[
\mathrm{Dec}^*, \mathrm{Enc}^* = \arg\min_{\mathrm{Dec},\mathrm{Enc}} \|x - \mathrm{Dec}(\mathrm{Enc}(x))\|^2 \tag{1}
\]
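As a minimal sketch of Equation 1, assume a hypothetical one-dimensional bottleneck in which the encoder projects x onto a unit vector w and the decoder maps the code back along w. The names `enc`, `dec` and `reconstruction_loss`, as well as the choice of a linear model, are illustrative assumptions and not part of the thesis implementation.

```python
import math


def enc(x, w):
    """Encoder Enc: X -> Z, here a projection of x onto the unit vector w."""
    return sum(wi * xi for wi, xi in zip(w, x))


def dec(z, w):
    """Decoder Dec: Z -> X, here the unit vector w scaled by the code z."""
    return [wi * z for wi in w]


def reconstruction_loss(x, w):
    """Squared Euclidean reconstruction error ||x - Dec(Enc(x))||^2."""
    x_hat = dec(enc(x, w), w)
    return sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))


w = [1 / math.sqrt(2), 1 / math.sqrt(2)]             # assumed unit-norm direction
loss_on_axis = reconstruction_loss([2.0, 2.0], w)    # x lies along w
loss_off_axis = reconstruction_loss([1.0, -1.0], w)  # x is orthogonal to w
```

An input that lies along w passes through the bottleneck without loss, while an input orthogonal to w is lost entirely, so its reconstruction error equals its squared norm.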
Variational Auto-Encoders (VAEs) [9] assume some generative process from
the latent space z to x (depicted in Figure 2). Note that the latent
variable z is treated as a random variable.
[Graphical model with nodes x and z, parameters θ and φ, and a plate of size N]
Figure 2: Generative process in a graphical model. This model is the
basis for the Variational Autoencoder.
By introducing a variational distribution $q_\phi(z|x)$, a lower bound
for $\log p_\theta(x)$ can be derived with Jensen's inequality
(Equation 2). In this equation, $D_{KL}$ represents the Kullback-Leibler
divergence, the generative distributions are parametrized by θ, and the
variational distribution is parametrized by φ. The decoder is now
defined as the conditional distribution
$\mathrm{Dec} := p_\theta(x|z)$. The encoder is defined as the
variational distribution $\mathrm{Enc} := q_\phi(z|x)$. A common
assumption is to let $q_\phi(z|x)$ be a multivariate normal distribution
with a diagonal covariance matrix. Thus, the encoder is defined as
$\mathrm{Enc} := \mathcal{N}(z|\mu_\phi(x), \mathrm{diag}(\sigma_\phi^2(x)))$.
The choice of prior is then often $p(z) = \mathcal{N}(z|0, I)$.
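\[
\begin{aligned}
\log p_\theta(x) &= \log \int p_\theta(x, z)\, dz \\
&= \log \int q_\phi(z|x) \frac{p_\theta(x, z)}{q_\phi(z|x)}\, dz \\
&\geq \int q_\phi(z|x) \log \frac{p_\theta(x, z)}{q_\phi(z|x)}\, dz \\
&= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z))
\end{aligned}
\tag{2}
\]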
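The bound in Equation 2 can be checked on a toy model where the marginal likelihood is known in closed form. The sketch below assumes a hypothetical one-dimensional model with prior p(z) = N(0, 1) and likelihood p(x|z) = N(z, 1), so that p(x) = N(0, 2) and the true posterior is N(x/2, 1/2). It evaluates the bound as the expected log-likelihood minus the KL divergence to the prior, using the closed-form KL divergence between a univariate normal and the standard normal. The function names are illustrative only.

```python
import math


def log_marginal(x):
    """Exact log p(x) for p(z) = N(0, 1) and p(x|z) = N(z, 1), so p(x) = N(0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x ** 2 / 4.0


def elbo(x, mu, sigma):
    """Lower bound of Equation 2 for q(z|x) = N(mu, sigma^2)."""
    # E_q[log p(x|z)], using E_q[(x - z)^2] = (x - mu)^2 + sigma^2
    expected_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - mu) ** 2 + sigma ** 2)
    # KL(N(mu, sigma^2) || N(0, 1)) in closed form
    kl = 0.5 * (-2.0 * math.log(sigma) + sigma ** 2 + mu ** 2 - 1.0)
    return expected_log_lik - kl


x = 1.0
tight = elbo(x, mu=x / 2.0, sigma=math.sqrt(0.5))  # q equals the true posterior
loose = elbo(x, mu=0.0, sigma=1.0)                 # q equals the prior
```

When q matches the true posterior the bound is tight and equals log p(x); any other choice of q, such as the prior itself, gives a strictly smaller value.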
2.2 Generative Adversarial Networks
Generative Adversarial Networks (GANs) [10] are a different type of
generative model. A GAN consists of two distinct modules: a generator
that maps some latent representation to an example,
$\mathrm{Gen}: \mathcal{Z} \to \mathcal{X}$, and a discriminator that
maps an example to a confidence that signifies how real an example
looks, $\mathrm{Disc}: \mathcal{X} \to [0, 1]$. The two networks are
trained as adversaries, in a zero-sum game setting. The value function
is depicted in Equation 3. The generator tries to minimize the value
function, while the discriminator tries to maximize it, as depicted in
Equation 4.
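\[
V(\mathrm{Gen}, \mathrm{Disc}) = \mathbb{E}_{x \sim P_{data}}\left[\log \mathrm{Disc}(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - \mathrm{Disc}(\mathrm{Gen}(z))\right)\right] \tag{3}
\]
\[
\mathrm{Gen}^*, \mathrm{Disc}^* = \arg\min_{\mathrm{Gen}} \arg\max_{\mathrm{Disc}} V(\mathrm{Gen}, \mathrm{Disc}) \tag{4}
\]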
Where an auto-encoder uses some defined distance metric to compare the
reconstruction to the original, a GAN uses the certainty prediction of a
discriminator. Loss functions for reconstruction of high dimensional
data such as images are difficult to define such that sharp images are
generated. Instead, a GAN architecture only implicitly defines a loss
function, via the discriminator.
Optimization of the value function leads to a problem for the generator,
since the strength of the gradient decreases when the discriminator is
certain. Practically, instead of optimizing Equation 3, the generator
optimizes a value function that has stronger gradients for a certain
discriminator, defined in Equation 5. When a discriminator is more
certain, i.e. $\mathrm{Disc}(\mathrm{Gen}(z)) \to 0$, the gradient will
be stronger, since
$\frac{d}{dx} \log f(x) = \frac{1}{f(x)} \frac{df(x)}{dx}$.
\[
V_G(\mathrm{Gen}, \mathrm{Disc}) = -\mathbb{E}_{z \sim p(z)}\left[\log \mathrm{Disc}(\mathrm{Gen}(z))\right] \tag{5}
\]
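The difference in gradient strength can be made concrete with a small numerical sketch. It assumes, as is common but not stated in the text, that the discriminator output is a sigmoid of a logit a, and compares the derivative with respect to a of the generator term in Equation 3 with that of Equation 5. The function names are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(a):
    """d/da log(1 - sigmoid(a)): the generator term of Equation 3."""
    return -sigmoid(a)


def grad_non_saturating(a):
    """d/da [-log sigmoid(a)]: the generator objective of Equation 5."""
    return -(1.0 - sigmoid(a))


a = -5.0  # logit of a confident discriminator, Disc(Gen(z)) is about 0.007
g_sat = grad_saturating(a)
g_non_sat = grad_non_saturating(a)
```

For a confident discriminator the gradient of the original generator term almost vanishes, while the gradient of the alternative objective has a magnitude close to one.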
2.3 Kullback Leibler Divergence for Multivariate Normals
The Kullback-Leibler divergence is a measure of dissimilarity between
probability distributions; it is not a true distance, since it is
asymmetric. In a previous section, we already saw that the variational
autoencoder optimizes a KL divergence between the variational
distribution and the prior. By our choice of parametrization, we will
only be faced with multivariate normal distributions. In Equation 6 the
analytical solution between two arbitrary normal distributions,
$q = \mathcal{N}(x|\mu, \Sigma)$ and $p = \mathcal{N}(x|m, L)$, is
derived, where D denotes the number of dimensions of x.
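\[
\begin{aligned}
D_{KL}(q\,\|\,p) &= \mathbb{E}_{\mathcal{N}(x|\mu,\Sigma)}\left[\log \mathcal{N}(x|\mu,\Sigma) - \log \mathcal{N}(x|m,L)\right] \\
&= \frac{1}{2}\log\frac{|L|}{|\Sigma|} + \mathbb{E}_{\mathcal{N}(x|\mu,\Sigma)}\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu) + \frac{1}{2}(x-m)^T L^{-1}(x-m)\right] \\
&= \frac{1}{2}\log\frac{|L|}{|\Sigma|} + \frac{1}{2}\mathbb{E}_{\mathcal{N}(x|\mu,\Sigma)}\left[-\mathrm{Tr}\left((xx^T + \mu\mu^T - 2x\mu^T)\Sigma^{-1}\right) + \mathrm{Tr}\left((xx^T + mm^T - 2xm^T)L^{-1}\right)\right] \\
&= \frac{1}{2}\log\frac{|L|}{|\Sigma|} + \frac{1}{2}\left[-\mathrm{Tr}\, I + \mathrm{Tr}\left((\mu\mu^T + \Sigma + mm^T - 2\mu m^T)L^{-1}\right)\right] \\
&= \frac{1}{2}\left[\log\frac{|L|}{|\Sigma|} - D + \mathrm{Tr}(\Sigma L^{-1}) + (m-\mu)^T L^{-1}(m-\mu)\right]
\end{aligned}
\tag{6}
\]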
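The closed form in the last line of Equation 6 can be evaluated directly. The sketch below is restricted to the two-dimensional case so that the determinant and inverse can be written out by hand without a linear algebra library; this restriction and the helper names `det2`, `inv2` and `kl_gauss_2d` are assumptions made for illustration.

```python
import math


def det2(a):
    """Determinant of a 2x2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def inv2(a):
    """Inverse of a 2x2 matrix."""
    d = det2(a)
    return [[a[1][1] / d, -a[0][1] / d], [-a[1][0] / d, a[0][0] / d]]


def kl_gauss_2d(mu, sigma, m, l):
    """KL(N(mu, sigma) || N(m, l)) for 2-D Gaussians, last line of Equation 6."""
    l_inv = inv2(l)
    # Tr(Sigma L^{-1}) = sum_ij Sigma_ij (L^{-1})_ji
    trace = sum(sigma[i][j] * l_inv[j][i] for i in range(2) for j in range(2))
    diff = [m[0] - mu[0], m[1] - mu[1]]
    # (m - mu)^T L^{-1} (m - mu)
    maha = sum(diff[i] * l_inv[i][j] * diff[j] for i in range(2) for j in range(2))
    return 0.5 * (math.log(det2(l) / det2(sigma)) - 2 + trace + maha)


identity = [[1.0, 0.0], [0.0, 1.0]]
kl_same = kl_gauss_2d([0.0, 0.0], identity, [0.0, 0.0], identity)
kl_shifted = kl_gauss_2d([0.0, 0.0], identity, [1.0, 0.0], identity)
```

Two identical distributions give a divergence of zero, and shifting the mean of p by one unit under identity covariances gives one half, which is half the squared distance between the means.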
If we parametrize multivariate normal distributions such that the
covariance matrix only has diagonal entries, the solution can be further
simplified. The normal distributions are redefined to
$q = \mathcal{N}(x|\mu, \mathrm{diag}(\sigma^2))$ and
$p = \mathcal{N}(x|m, \mathrm{diag}(l^2))$, where σ and l are vectors of
standard deviations. The corresponding KL divergence between q and p is
shown in Equation 7.
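\[
D_{KL}(q\,\|\,p) = \frac{1}{2}\left[\sum_{i=1}^{D}\left(2\log\frac{l_i}{\sigma_i} + \frac{\sigma_i^2}{l_i^2} + \frac{(m_i - \mu_i)^2}{l_i^2}\right) - D\right] \tag{7}
\]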
The equation can be further simplified if p is a prior with zero mean
and unit variance. The distribution p is redefined such that
$p = \mathcal{N}(x|0, I)$. The KL divergence between q and p is
presented in Equation 8.
\[
D_{KL}(q\,\|\,p) = \frac{1}{2}\left[\sum_{i=1}^{D}\left(-2\log\sigma_i + \sigma_i^2 + \mu_i^2\right) - D\right] \tag{8}
\]
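Equations 7 and 8 translate almost line by line into code. The following sketch uses plain Python lists for the mean and standard deviation vectors; the function names `kl_diag` and `kl_standard` are illustrative and not taken from the thesis implementation.

```python
import math


def kl_diag(mu, sigma, m, l):
    """Equation 7: KL(N(mu, diag(sigma^2)) || N(m, diag(l^2)))."""
    total = 0.0
    for mu_i, s_i, m_i, l_i in zip(mu, sigma, m, l):
        total += 2.0 * math.log(l_i / s_i) + s_i ** 2 / l_i ** 2 + (m_i - mu_i) ** 2 / l_i ** 2
    return 0.5 * (total - len(mu))


def kl_standard(mu, sigma):
    """Equation 8: KL(N(mu, diag(sigma^2)) || N(0, I))."""
    total = sum(-2.0 * math.log(s_i) + s_i ** 2 + mu_i ** 2 for mu_i, s_i in zip(mu, sigma))
    return 0.5 * (total - len(mu))


mu, sigma = [1.0, -0.5], [1.0, 2.0]
general = kl_diag(mu, sigma, m=[0.0, 0.0], l=[1.0, 1.0])
special = kl_standard(mu, sigma)
```

With a zero-mean, unit-variance p the two functions agree, which mirrors the step from Equation 7 to Equation 8.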
2.4 Squared Euclidean distance between Multivariate Normals
A straightforward measure of distance is the squared Euclidean distance.
We define two arbitrary multivariate normal distributions
$p = \mathcal{N}(x|\mu, \Sigma)$ and $q = \mathcal{N}(y|m, L)$, where x
and y are independent and have an equal number of dimensions. An
analytical solution for the expectation of the squared Euclidean
distance between two multivariate normal distributions is presented in
Equation 9.
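\[
\begin{aligned}
\mathbb{E}_{x \sim p, y \sim q}\left(\|x - y\|^2\right) &= \mathbb{E}_{x \sim p, y \sim q}\left(x^T x + y^T y - 2x^T y\right) \\
&= \mathbb{E}_{x \sim p, y \sim q}\,\mathrm{Tr}\left(xx^T + yy^T - 2xy^T\right) \\
&= \mathrm{Tr}\left(\mu\mu^T + \Sigma + mm^T + L - 2\mu m^T\right) \\
&= \mathrm{Tr}(\Sigma + L) + (\mu - m)^T(\mu - m)
\end{aligned}
\tag{9}
\]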
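Equation 9 can be verified numerically by comparing the closed form with a sample average. The sketch below assumes diagonal covariance matrices and independent samples, and passes variances as plain lists; the function names and the example values are hypothetical.

```python
import random


def expected_sq_distance(mu, var_p, m, var_q):
    """Equation 9 for diagonal covariances: Tr(Sigma + L) + ||mu - m||^2."""
    return sum(var_p) + sum(var_q) + sum((a - b) ** 2 for a, b in zip(mu, m))


def monte_carlo_sq_distance(mu, var_p, m, var_q, n_samples, rng):
    """Sample-based estimate of E||x - y||^2 with independent x ~ p and y ~ q."""
    total = 0.0
    for _ in range(n_samples):
        for a, va, b, vb in zip(mu, var_p, m, var_q):
            x_i = rng.gauss(a, va ** 0.5)
            y_i = rng.gauss(b, vb ** 0.5)
            total += (x_i - y_i) ** 2
    return total / n_samples


mu, var_p = [0.0, 1.0], [1.0, 0.5]
m, var_q = [1.0, -1.0], [0.25, 2.0]
analytic = expected_sq_distance(mu, var_p, m, var_q)
estimate = monte_carlo_sq_distance(mu, var_p, m, var_q, 100000, random.Random(0))
```

The analytic value splits into a variance part, Tr(Σ + L), and a mean part, the squared distance between the means; the sample estimate should approach it as the number of samples grows.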
2.5 Disentangling Factors of Variation
In 2016, Mathieu et al. learned a disentangled representation by
combining a variational auto-encoder with a generative adversarial
network [11]. They specify two latent variables, the content s and the
style z. The function of the content is to contain all class
information, and the style should contain any other information, such as
how slanted a letter is written. Together, s and z provide sufficient
information to reconstruct the original example x. An encoder
$(s, (\mu_z, \log\sigma_z)) = \mathrm{Enc}(x)$ and a decoder
$x = \mathrm{Dec}(s, z)$ are defined; the decoder also acts as the
generator Gen of the GAN. Furthermore, a discriminator
$\mathrm{Disc}(x, id) \in [0, 1]$ is trained to distinguish real and
fake examples. The variable id denotes the label of the example x. In
Equations 10 and 11 the losses for the VAE and the GAN are specified.
The complete loss can be formulated as in Equation 12, where λ is a
scaling factor. Note that the authors chose to include a KL
regularization on z, but s is not treated as a random variable.
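\[
L^{(VAE)} = -\mathbb{E}_{z \sim q(z|x,s)} \log p(x|z, s) + D_{KL}(q(z|x, s)\,\|\,p(z)) \tag{10}
\]
\[
L^{(GAN)} = \log \mathrm{Disc}(x, id) + \log\left(1 - \mathrm{Disc}(\mathrm{Gen}(z, s), id)\right) \tag{11}
\]
\[
L = L^{(VAE)} + \lambda L^{(GAN)} \tag{12}
\]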
The authors propose a training procedure with multiple steps that swaps
the latent variables. This training procedure, in combination with the
model, ensures that a disentangled representation is learned. A summary
of the procedure is described below; please refer to [11] for exact
details.
• Two samples from the same class, $x_1$ and $x_{1'}$, are drawn. The
VAE is trained to maximize $p(x_1|\mathrm{Dec}(s_1, z_1))$ and
$p(x_1|\mathrm{Dec}(s_{1'}, z_1))$. Note that both produce the same
reconstruction. This ensures that only content information may flow
through s.
• To prevent the network from ignoring s, a sample from a different
class, $x_2$, is drawn. The VAE is trained to minimize the generator GAN
loss $-\log \mathrm{Disc}(\mathrm{Gen}(z_2, s_1), id(x_1))$ (compare
Equation 5). This ensures that the content information must flow
through s, and may not flow through z.
• Again $x_1$, $x_{1'}$ and $x_2$ are sampled in a similar fashion. The
discriminator is trained to maximize
$\log \mathrm{Disc}(\mathrm{Gen}(z_1, s_1), id(x_1)) + \log\left(1 - \mathrm{Disc}(\mathrm{Gen}(z_2, s_1), id(x_1))\right)$.
Thus the discriminator is trained to detect whether a reconstruction
used a style z from an example with another class.
Since it is difficult to express disentanglement in numbers, we follow
the procedure of the original authors to display interpolations in
latent representations. Some of the reconstruction results are depicted
in Figure 3. In the left image, a slanted seven is interpolated to an
upright nine. Moving downwards from the top left, the seven gradually
appears more upright. Going upwards from the bottom right, the nine
becomes increasingly slanted. In the right image, a three is
interpolated with a seven.
Figure 3: Interpolation between content and style. Left and right follow
the same procedure with different examples. The top left image is a
reconstruction of an image in the dataset. The bottom right image is
also a reconstruction of an image in the dataset. Horizontally the
content s is linearly interpolated. Vertically the style z is
interpolated.
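The layout of such an interpolation figure can be sketched as a grid of latent pairs. The code below is an illustration under stated assumptions: the latent codes are short hypothetical vectors, and each resulting (s, z) pair would still have to be passed to a trained decoder, which is not shown here.

```python
def lerp(a, b, t):
    """Linear interpolation between vectors a and b for t in [0, 1]."""
    return [(1.0 - t) * ai + t * bi for ai, bi in zip(a, b)]


def interpolation_grid(s_a, z_a, s_b, z_b, steps):
    """Grid of (s, z) pairs: content varies over columns, style over rows."""
    grid = []
    for row in range(steps):
        z = lerp(z_a, z_b, row / (steps - 1))
        grid.append([(lerp(s_a, s_b, col / (steps - 1)), z) for col in range(steps)])
    return grid


# Hypothetical two-dimensional latent codes of two encoded examples.
grid = interpolation_grid([0.0, 0.0], [1.0, 1.0], [2.0, 4.0], [3.0, -1.0], steps=5)
top_left, bottom_right = grid[0][0], grid[-1][-1]
```

The top left cell holds the content and style of the first example and the bottom right cell those of the second, matching the two reconstructions described in the caption.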
3 Structure of Disentanglement
Disentangled representations are representations where specific
variables of the representation can be modified to change specific
components. The method of Mathieu et al. [11] learns a disentanglement
of content (class) and style (all other variations), but does not put
any constraint on the structure of the disentangled representation. In
this section, the structure of the representations is investigated, and
modified with additional constraints. Ultimately, the goal is to perform
inference for few-shot learning on the disentangled content
representation.
3.1 Understanding Disentanglement
The work of [11] is taken as a starting point: a combination of a VAE
and a GAN with the specified training procedure. With this model, a
disentangled representation is learned on MNIST, and all visualizations
are obtained with datapoints in the test set. The structure of the high
dimensional content and style representations is visualized with
t-distributed stochastic neighbour embedding (t-SNE). Note that z is a
distribution, and therefore only $\mu_z$ is visualized. In Figure 4
these embeddings are depicted. Notice that the content s is clustered
more strongly, and that clustering of the style z is less apparent. This
is expected, since content variables from the same class should contain
the same information, making clusters very distinct. In contrast, style
is often more continuous (how slanted or bold a digit is), and the same
style can be shared between different classes.
Figure 4: Visualization of the high dimensional latent variables of the
model in [11]. All points represent test data. Left: t-SNE plot of
content s. Right: t-SNE plot of style $\mu_z$ (z is a distribution).
Different colors represent different classes.
The structure of the content s is not suited for few-shot
classification, because multiple clusters exist for the same class. An
example that maps to the necessary cluster might be absent from the
support set. In [11] this was not necessarily a problem, since two
different points can be mapped to the same class by the decoder.
However, for few-shot classification, ideally the content embedding
would have one cluster for each class.
3.2 Distance Penalty for Content
In the previous section, visualizations showed that the clusters for
content s were scattered. It can be advantageous to group examples more
tightly when classification is based on the proximity of s.
Therefore, in addition to the VAE and GAN loss, a simple loss that is
based on the distance between content variables of the same class
(Equation 13) is used. In this equation, the subscript notation
corresponds to the previously described training procedure: $s_1$ and
$s_{1'}$ belong to the same class. In essence, optimizing the distance
penalty will attract the content s of examples with the same class. The
objective that is optimized is presented in Equation 14.
\[
L^{(penalty)} = \|s_1 - s_{1'}\|^2 \tag{13}
\]
\[
L = L^{(VAE)} + \lambda L^{(GAN)} + L^{(penalty)} \tag{14}
\]
The content s and style $\mu_z$ are visualized in Figure 5. Clearly,
classes are clustered more compactly in the embedding. Furthermore, no
class has multiple clusters. As a proof of concept for few-shot
learning, a single content s of each class is chosen as the support set.
The test set is classified using a nearest neighbour approach on the
support set. Classification based on a single example in the content
domain has about 99% accuracy. For comparison, the model without a
penalty evaluated with the same procedure has only about 90% accuracy.
Although the model is not classifying examples of an unseen class, this
illustrates two important points. Firstly, a disentangled representation
of content can be used for few-shot classification. Secondly, an
additional restriction (such as a distance penalty) is effective to
learn a useful few-shot embedding.
Figure 5: Visualization of the high dimensional latent variables of the
model that also optimises a penalty on the distance between same-class
content variables. All points represent test data. Left: t-SNE plot of
content s. Right: t-SNE plot of style $\mu_z$ (z is a distribution).
Different colors represent different classes.
3.3 Disentangling with Distance Loss Exclusively
Inspired by the results in the previous section, a new distance loss on
s is proposed. We formulate a classification probability for the correct
class, based on Euclidean distance. The probability is normalized
similarly to a softmax function (Equation 15). For examples from the
same class, the content s should lie close together. For other classes,
they should lie far apart. The first term of the loss contracts content
variables of the same class, and the second term expands the distance
between content variables of different classes.
Different from previous work, we also choose s to be a random variable,
and let the encoder output
$((\mu_s, \log\sigma_s), (\mu_z, \log\sigma_z)) = \mathrm{Enc}(x)$. The
objective function in previous models did not treat s as a random
variable, and therefore it was not regularized. Because the new
objective does constrain s, the variable is now modeled as a
distribution. Experiments showed that without this modification, s
encodes all information and z is ignored. The VAE loss is depicted in
Equation 16, which now includes s as a random variable. Note that both
latent variables are now regularized with their priors. In Equation 17
the objective to optimize is shown.
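\[
\begin{aligned}
L^{(distance)} &= -\log\left[\frac{\exp(-\|s_1 - s_{1'}\|^2)}{\sum_{i=1}^{C}\exp(-\|s_1 - s_{i'}\|^2)}\right] \\
&= \underbrace{\|s_1 - s_{1'}\|^2}_{\text{Contraction term}} + \underbrace{\log\sum_{i=1}^{C}\exp(-\|s_1 - s_{i'}\|^2)}_{\text{Expansion term}}
\end{aligned}
\tag{15}
\]
\[
L^{(VAE)} = -\mathbb{E}_{z \sim q(z|x), s \sim q(s|x)} \log p(x|z, s) + D_{KL}(q(z|x)\,\|\,p(z)) + D_{KL}(q(s|x)\,\|\,p(s)) \tag{16}
\]
\[
L = L^{(VAE)} + \lambda L^{(distance)} \tag{17}
\]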
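The two forms of the distance loss in Equation 15 can be compared numerically. The sketch below works on point values of the content rather than on distributions, and computes the expansion term with a shifted log-sum-exp for numerical stability; both choices, as well as the function names and toy vectors, are assumptions made for illustration.

```python
import math


def sq_distance(a, b):
    """Squared Euclidean distance between two content vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def distance_loss(s_query, s_others):
    """Equation 15. s_others[0] shares the class of s_query (s_1'), the
    remaining entries are s_2', ..., s_C' from other classes."""
    d = [sq_distance(s_query, s) for s in s_others]
    contraction = d[0]
    # log-sum-exp over -d, shifted by the maximum for numerical stability
    shift = max(-di for di in d)
    expansion = shift + math.log(sum(math.exp(-di - shift) for di in d))
    return contraction + expansion


s1 = [0.0, 0.0]
s_others = [[0.1, 0.0], [3.0, 0.0], [0.0, 3.0]]  # hypothetical content values
loss = distance_loss(s1, s_others)
```

The loss is small when the same-class content is nearby and the other classes are far away, and it grows when the same-class content is the distant one.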
Without the adversarial procedure, learning a disentanglement is less
explicitly enforced. However, the intuition is that the distance loss
will ensure that the classes will cluster in the embedding s. To create
a reconstruction, the decoder can obtain information through s and z.
The encoder needs to send information through the latent space, by
changing the distribution from the prior. Changing the distribution of
the latent space incurs a penalty via the KL divergence. If class
information is already available in s, the model will avoid putting the
same information in z, because doing so would incur another penalty.
The model is trained with the following procedure. Draw two
samples from the same class, x_1 and x_1'. Also draw samples from
the other classes: x_2', ..., x_C'. All gradients for the decoder
originate from L^(VAE) for x_1. The gradient signal for the
encoder comes from both L^(VAE) and L^(distance), with s_1 and s_1'
in the contraction term, and all s_1', ..., s_C' in the expansion
term.
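As an illustration of Equation 15, the following minimal sketch computes the contraction and expansion terms for one anchor content vector. It is not the thesis implementation: the function names, the use of plain Python lists instead of a deep learning framework, and the toy values are assumptions made for the example.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance ||a - b||^2 between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def distance_loss(s1, s_primes):
    """Equation 15 for one anchor s1 (hypothetical helper).

    s_primes[0] is the content of the same-class example (s_1');
    s_primes[1:] are the contents of the other classes (s_2' ... s_C').
    Returns (loss, contraction, expansion).
    """
    neg_dists = [-squared_distance(s1, sp) for sp in s_primes]
    contraction = -neg_dists[0]
    # log-sum-exp with the maximum subtracted for numerical stability
    top = max(neg_dists)
    expansion = top + math.log(sum(math.exp(v - top) for v in neg_dists))
    return contraction + expansion, contraction, expansion

# Toy values (made up): the same-class content is close, the others are far.
s1 = [0.0, 0.0]
s_primes = [[0.1, 0.0], [3.0, 0.0], [0.0, 4.0]]
loss, contraction, expansion = distance_loss(s1, s_primes)
```

Because the loss is the negative log of a normalized probability, it is never negative, and it approaches zero when the same-class content is much closer than all other contents.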
Interpolations of style and content are depicted in Figure 6, by
changing s and z linearly between two examples. Notice that content
information is conveyed via s and style via z. Note how a four
that is written slanted becomes upright and shaky, in the style of
the eight. Thus, the model is able to learn a disentanglement with a
Euclidean distance loss, instead of the adversarial procedure.
Figure 6: Interpolation between content and style,
reconstructions created with a VAE trained with distance loss. Left
and right follow the same procedure with different examples. The
top left image is a reconstruction of an image in the dataset. The
bottom right image is also a reconstruction of an image in the
dataset. Horizontally the content s is linearly interpolated.
Vertically the style z is interpolated.
In Figure 7, t-SNE visualizations of the content and style
variables of test examples are depicted. The content variables are
strongly clustered, and the style variables show less structure
based on class. Notice that content grouping has become tight, and
that style grouping has become less noticeable. Thus, merely by
constraining the distance of s for images of the same class, a
disentanglement can be learned. Furthermore, a continuous
representation for content is learned that is tightly clustered.
Figure 7: Visualization of the high dimensional latent variables
of the VAE with MNIST, trained with distance loss. All points
represent test data. Left: t-SNE plot of content µ_s. Right:
t-SNE plot of style µ_z. Different colors represent different
classes.
4 Model: Generative Few-shot Learning
The previous section described how a disentanglement can be
learned, and hinted at how a few-shot learning loss may actually
aid in learning a disentanglement. In this section, a generative
model for few-shot learning is formally defined, inspired
by disentangling representations.
Generative models in semi-supervised learning can have x
conditioned on some latent variable z and the class variable y.
However, in few-shot learning scenarios the number of classes is
large, and classes during test time have never been seen before.
Therefore, conditioning directly on y is impractical. Instead, the
example x is conditioned on content s and style z.
In this section a lower bound of the conditional log-likelihood
will be derived for a simplified use case. The actual derivation
involves a few more terms, which make it notation heavy.
Therefore, the complete derivation is presented in appendix A.
4.1 Generative Model
A generative model for an example x and its class y in few-shot
learning is defined. There are latent variables for content s ∼ p(s)
and style z ∼ p(z). The observed class y ∼ p(y|s) is conditionally
independent of x given s, and the observed example is conditioned on
both content and style, x ∼ p(x|s, z). The corresponding graphical
model is depicted in Figure 8. Class information is often
encoded in discrete variables, but this formulation allows the
content variables to be continuous, which makes generalization for
few-shot learning possible.
[Figure: graphical model with nodes y, s, x, z and plate label N]
Figure 8: Graphical model that shows how content and style
influence the example and its label. The example x is conditioned on
s and z, and the class y is only conditioned on s.
Analogous to the derivation of [9], a lower bound for log p(y, x)
can be obtained. As depicted in the graphical model, the priors for
the content and style are independent, thus p(s, z) = p(s)p(z). In
contrast, the posterior p(s, z|x) cannot be factorized. However, we
impose that the variational distribution can be factorized to
simplify the model, such that q(s, z|x) = q(s|x)q(z|x)
(Equation 18).
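$$
\begin{aligned}
\log p(y, x) &= \iint q(s, z|x) \log p(y, x) \, ds \, dz \\
&= \mathbb{E}_{s,z \sim q(s,z|x)} \left[ \log p(y, x) \right] \\
&= \mathbb{E}_{s,z \sim q(s,z|x)} \left[ \log p(y, x|s, z) - \log \frac{q(s, z|x)}{p(s)p(z)} + \log \frac{q(s, z|x)}{p(s, z|y, x)} \right] \\
&= \mathbb{E}_{s,z \sim q(s,z|x)} \left[ \log p(y, x|s, z) \right] - D_{KL}(q(s, z|x) \,\|\, p(s)p(z)) + D_{KL}(q(s, z|x) \,\|\, p(s, z|y, x)) \\
&\geq \mathbb{E}_{s,z \sim q(s,z|x)} \left[ \log p(y, x|s, z) \right] - D_{KL}(q(s, z|x) \,\|\, p(s)p(z)) \\
&= \mathbb{E}_{s \sim q(s|x),\, z \sim q(z|x)} \left[ \log p(y|s) p(x|s, z) \right] - D_{KL}(q(s|x) \,\|\, p(s)) - D_{KL}(q(z|x) \,\|\, p(z))
\end{aligned}
\tag{18}
$$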
The terms in this equation can be interpreted as an autoencoder
with a classification model. The term p(y|s) is a class probability,
p(x|s, z) is a reconstruction probability and q(s|x) and q(z|x) can
be interpreted as encoders. For now, the term
D_KL(q(s, z|x)||p(s, z|y, x)) is neglected, as it is non-negative.
4.2 Class probability
Thus far, we have obtained a lower bound to optimize log p(y, x).
Inside the lower bound, the term p(y|s) refers to the class
probability. Defining a class probability with some discriminator
can be problematic, since classes will be different when tested.
Instead, the class probability p(y|s) is defined relative to other
examples. This definition is inspired by few-shot learning
literature.
4.2.1 Embedding Distance in Literature
In the method proposed by Vinyals et al. [5], a modified softmax
equation is used to compute the classification prediction (Equation
19). This equation can be modified to output a probability
distribution p(y|x) = ∏_{c=1}^{C} ŷ_c^{y_c}, where C is the total
number of classes. Variable S denotes the support set, which
contains a few examples with labels. The function d can be an
arbitrary distance metric, either a basic function such as Euclidean
distance, or a complex function modeled by a deep network. f(x) is
an embedding of an input vector x. The embedding function can be
learned by a deep network. The equation is suitable for few-shot
learning, because it makes a prediction for y, and is defined
relative to the support set.
$$
\hat{y} = \sum_{(x', y') \in S} \frac{\exp(-d(f(x), f(x')))}{\sum_{(x'', y'') \in S} \exp(-d(f(x), f(x'')))} \, y' \tag{19}
$$
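A small sketch may make Equation 19 concrete: the prediction is a weighted average of the one-hot support labels, with weights given by a softmax over negative distances. The function names, the choice of squared Euclidean distance for d, and the toy embeddings are assumptions for illustration only; in the method of [5] the embeddings would come from a learned network f.

```python
import math

def predict(query_embedding, support, distance):
    """Equation 19: distance-weighted average of the one-hot support labels.

    support is a list of (embedding, one_hot_label) pairs.
    """
    weights = [math.exp(-distance(query_embedding, emb)) for emb, _ in support]
    total = sum(weights)
    n_classes = len(support[0][1])
    y_hat = [0.0] * n_classes
    for w, (_, label) in zip(weights, support):
        for c in range(n_classes):
            y_hat[c] += (w / total) * label[c]
    return y_hat

def squared_euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

# Support set of (embedding f(x'), one-hot label y') pairs; values are made up.
support = [([0.0, 0.0], [1, 0]), ([2.0, 2.0], [0, 1])]
y_hat = predict([0.2, 0.1], support, squared_euclidean)
```

The entries of `y_hat` sum to one, and the class whose support example lies closest to the query receives the largest weight.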
There are two distinct challenges when this method is applied to
the generative model. Firstly, instead of a point estimate, the
encoder predicts distributions. Not every distance metric may lead
to meaningful distances between distributions. For instance, in
variational autoencoders, the variational distribution is often
modeled by a multivariate normal distribution. In few-shot
literature, d is often modeled by cosine distance. However, two
equally likely samples (1, 1) and (-1, -1) from the normal
distribution N(0, I) have a cosine similarity of minus one, meaning
very dissimilar. Because the cosine distance changes in a curved
space, it does not correspond to the form of the normal
distribution. The effect of two commonly used distance metrics is
illustrated in Figure 9. Secondly, the classification probability
is actually conditioned on the support set S, which has thus far
been ignored in the graphical model. The formula p(y|s) needs to be
defined relative to the support set, and will therefore be
conditioned on the support set. As a result, the class probability
is redefined from p(y|s) to include the support set,
p(y|s, S_S, Y_S).
Figure 9: Left: Visualization of the probability landscape of a
multivariate normal distribution with mean at (1, 1) and diagonal
variances of one. Center: Visualization of the cosine distance
between a sample and the point (1, 1). Right: Visualization of the
negative squared Euclidean distance between a sample and the
point (1, 1).
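The mismatch between cosine distance and a normal distribution can be checked numerically. The sketch below is an illustration rather than part of the thesis: it verifies the (1, 1) versus (-1, -1) example from the text and, as an added assumption-level observation, shows that shrinking both samples towards the mean leaves the cosine similarity unchanged while the squared Euclidean distance shrinks. All function names are hypothetical.

```python
import math

def log_density_standard_normal(v):
    """Log density of N(0, I) evaluated at v."""
    return -0.5 * sum(vi * vi for vi in v) - 0.5 * len(v) * math.log(2 * math.pi)

def cosine_similarity(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

a, b = [1.0, 1.0], [-1.0, -1.0]
# Both samples are equally likely under N(0, I) ...
same_density = log_density_standard_normal(a) == log_density_standard_normal(b)
# ... yet their cosine similarity is (numerically) minus one.
cos_ab = cosine_similarity(a, b)
# Samples very close to the mean keep the same cosine similarity,
cos_small = cosine_similarity([0.01, 0.01], [-0.01, -0.01])
# while their squared Euclidean distance becomes small.
d_ab = squared_euclidean(a, b)
d_small = squared_euclidean([0.01, 0.01], [-0.01, -0.01])
```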
4.2.2 Model Class Probability
The content s is chosen as the embedding for x, as the content
variable should contain all necessary information to classify an
example. Thus, the conditional probability distribution over y in
the model is now defined as in Equation 20. The probability of a
class increases when the distance between the content of an example
and a support example of that class decreases. To avoid clutter, the
normalization constant is written as Z. The variable C denotes the
total number of classes in the support set. The subscript c is used
to select the c'th component of a vector.
$$
p(y|s, S_S, Y_S) = \prod_{c=1}^{C} \left[ \sum_{(s_s, y_s) \in (S_S, Y_S)} \frac{\exp(-d(s, s_s))}{Z} \, y_{s,c} \right]^{y_c} \tag{20}
$$

For the distance function d, a simple squared Euclidean distance is
chosen, d(a, b) = ||a − b||^2. There are two arguments to do so.
Firstly, an expected Euclidean distance for multivariate normals
corresponds to a distance that is intuitively coherent, as depicted
in Figure 9. More importantly, we previously showed that by using
Euclidean distances, an actual disentanglement can be learned.
In the special case when the support set has only one example per
class (i.e. 1-shot), the expectation of the numerator can be
evaluated analytically. Assume that the objective function will
include the form E_{s, S_S}[∑_i y_i log p(y_i|s, S_S, Y_S)]. Note
then that for the matching class, the numerator
E_{s∼N(µ,Σ), s'∼N(m,L)}[−log exp(−d(s, s'))] can be simplified into
E_{s∼N(µ,Σ), s'∼N(m,L)}[d(s, s')], where two arbitrary normal
distributions are assumed for s and s'.
Since d is the squared Euclidean distance, optimizing the numerator
in this special case will minimize the expected squared Euclidean
distance between two multivariate normal distributions,
Tr(Σ + L) + (µ − m)^T (µ − m) (section 2). This term is the
combination of the squared distance between means, and the sum of
all diagonal variances.
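The closed-form expression Tr(Σ + L) + (µ − m)^T (µ − m) can be sanity-checked with a Monte Carlo estimate. The sketch below assumes diagonal covariances (as used for the encoders in this work) and independent samples s and s'; the function names, the toy parameters and the sample count are illustrative assumptions.

```python
import random

def expected_sq_distance(mu, var, m, l_var):
    """Tr(Sigma + L) + (mu - m)^T (mu - m) for Sigma = diag(var), L = diag(l_var)."""
    trace = sum(var) + sum(l_var)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu, m))
    return trace + mean_term

def monte_carlo_sq_distance(mu, var, m, l_var, n_samples, rng):
    """Average squared Euclidean distance between independent samples s and s'."""
    total = 0.0
    for _ in range(n_samples):
        s = [rng.gauss(a, v ** 0.5) for a, v in zip(mu, var)]
        s_prime = [rng.gauss(b, v ** 0.5) for b, v in zip(m, l_var)]
        total += sum((x - y) ** 2 for x, y in zip(s, s_prime))
    return total / n_samples

rng = random.Random(0)
mu, var = [1.0, -1.0], [0.5, 2.0]      # mean and diagonal variances of s
m, l_var = [0.0, 0.5], [1.0, 0.25]     # mean and diagonal variances of s'
closed_form = expected_sq_distance(mu, var, m, l_var)
estimate = monte_carlo_sq_distance(mu, var, m, l_var, 200_000, rng)
```

For these toy parameters the trace term is 3.75 and the squared distance between the means is 3.25, so the closed form equals 7.0; the sampled estimate should agree up to Monte Carlo noise.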
4.3 Support Set
By specifying the class probability, a new concept was
introduced, the support set. To be accurate, the graphical model is
adapted to include the support set.
The model with support set is depicted in Figure 10. Every
training example is now connected to its own support set S_n. A
support set example has the same generative process as a normal
example. The only difference is that the support set is used
to classify the example. Note that to optimize the complete
likelihood log p(x, y, X_S, Y_S), all probabilities represented by
arrows in the graphical model would need to be defined. This
includes the class probability p(y_s|s_s), which was the
reason we introduced the support set in the first place.
Alternatively, the conditional distribution log p(x, y|X_S, Y_S) can
be optimized, which does not require the definition of p(y_s|s_s).
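[Figure: graphical model with support nodes x_s, y_s, s_s, z_s (plate label S_n) and example nodes y, s, x, z (plate label N)]
Figure 10: Graphical model that includes a support set S. Every
example in the dataset is connected to its support set, which
defines relatively what class the example belongs to.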
4.3.1 The Support-Conditional Log-likelihood
Instead of optimizing log p(x, y, X_S, Y_S), which would maximize
the likelihood of all observed data, it is possible to optimize
log p(x, y|X_S, Y_S), where the example is conditioned on the
support set. Intuitively, this corresponds with the few-shot
learning context, where an example is classified given a support
set. To integrate the support set with the previously derived model,
first the term p(y|s) from Equation 18 is redefined as
p(y|s, S_S, Y_S), which changes the left-hand side of the
log-likelihood to also condition on S_S and Y_S, as shown in
Equation 21.
$$
\log p(y, x|S_S, Y_S) \geq \mathbb{E}_{s \sim q(s|x),\, z \sim q(z|x)} \left[ \log p(y|s, S_S, Y_S) p(x|s, z) \right] - D_{KL}(q(s|x) \,\|\, p(s)) - D_{KL}(q(z|x) \,\|\, p(z)) \tag{21}
$$
This equation is conditioned on support content S_S. To obtain
the log-likelihood conditioned on support examples X_S,
p(y, x|S_S, Y_S) can be marginalized over support content S_S with
the posterior p(S_S|X_S, Y_S). Since the posterior is intractable, a
variational distribution is introduced and the log-likelihood is
formulated as in Equation 22. Note that the second term still
involves the intractable posterior distribution, which will be
resolved in the next section. Also note that the variational
distribution q(S_S|X_S) can be factorized as ∏_{x_s} q(s_s|x_s) and
shares the same parameters as for an ordinary example x.
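$$
\begin{aligned}
\log p(y, x|X_S, Y_S) &= \log \int p(S_S|X_S, Y_S) \, p(y, x|S_S, Y_S) \, dS_S \\
&= \log \int q(S_S|X_S) \frac{p(S_S|X_S, Y_S)}{q(S_S|X_S)} \, p(y, x|S_S, Y_S) \, dS_S \\
&\geq \mathbb{E}_{S_S \sim q(S_S|X_S)} \left[ \log p(y, x|S_S, Y_S) - \log \frac{q(S_S|X_S)}{p(S_S|X_S, Y_S)} \right] \\
&= \mathbb{E}_{S_S \sim q(S_S|X_S)} \left[ \log p(y, x|S_S, Y_S) \right] - \underbrace{D_{KL}(q(S_S|X_S) \,\|\, p(S_S|X_S, Y_S))}_{\text{intractable}}
\end{aligned}
\tag{22}
$$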
4.3.2 Resolving the Posterior
At first glance, the KL divergence with the posterior seems
problematic. Recall that, during the derivation of the probability
of an example, the term D_KL(q(s, z|x)||p(s, z|y, x)) was
neglected.
In a moment the neglected term will be reintroduced, to cancel
the intractable term. To match the neglected term, first the
posterior for the support set is redefined so that it includes the
style Z_S, p(S_S, Z_S|X_S, Y_S). Note that the first term of the
lower bound does not need to include an expectation over Z_S because
the term is independent of Z_S, leaving the first term unchanged
(Equation 23).
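$$
\begin{aligned}
\log p(y, x|X_S, Y_S) &= \log \iint p(S_S, Z_S|X_S, Y_S) \, p(y, x|S_S, Y_S) \, dS_S \, dZ_S \\
&\geq \mathbb{E}_{S_S \sim q(S_S|X_S)} \left[ \log p(y, x|S_S, Y_S) \right] - \underbrace{D_{KL}(q(S_S, Z_S|X_S) \,\|\, p(S_S, Z_S|X_S, Y_S))}_{\text{intractable}}
\end{aligned}
\tag{23}
$$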
The intractable term was encountered before in Equation 18,
albeit for an ordinary example. Note that x and X_S are
identically distributed, as they come from the same dataset. The
posteriors for an example and a support set should not differ, as
these are also identically distributed. Thus, the expected value
of the difference between the terms becomes zero, as portrayed in
Equation 24. For now, the support set has been assumed to consist
of only one example.
$$
\mathbb{E}_{(x, y), (x_s, y_s) \sim P_{data}} \Big[ \underbrace{D_{KL}(q(s, z|x) \,\|\, p(s, z|x, y))}_{\text{neglected term}} - \underbrace{D_{KL}(q(s_s, z_s|x_s) \,\|\, p(s_s, z_s|x_s, y_s))}_{\text{intractable term}} \Big] = 0 \tag{24}
$$

The expected value is zero when the support set has only one
example. In reality however, the support set always has multiple
examples. The training procedure of few-shot learning uses
batches of queries. Therefore, the expected value of the
difference will be greater than or equal to zero, under the
condition that |B| ≥ |S| (the size of the batch is greater than or
equal to the size of the support set). Because the procedure to
derive the lower bound is repetitive and notation heavy, the
complete derivation is shown in appendix A.
4.4 Collecting All Components
Collecting components from Equations 18, 23 and 24, the
formulation for the model is displayed in Equation 25. The
approximation is valid in the expectation over data when the number
of samples for support set and queries are balanced, or the number
of queries is greater. Although the final term is simplified for a
single support set example, the same principle applies for larger
support sets, as long as the batch is greater.
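$$
\begin{aligned}
\log p(x, y|X_S, Y_S) \geq{} & \mathbb{E}_{S_S \sim q(S_S|X_S)} \left[ \mathbb{E}_{s \sim q(s|x),\, z \sim q(z|x)} \log p(y|s, S_S, Y_S) p(x|s, z) \right] \\
& - D_{KL}(q(s|x) \,\|\, p(s)) - D_{KL}(q(z|x) \,\|\, p(z)) \\
& + D_{KL}(q(s, z|x) \,\|\, p(s, z|x, y)) - D_{KL}(q(S_S, Z_S|X_S) \,\|\, p(S_S, Z_S|X_S, Y_S)) \\
\mathbb{E}_{P_{data}} \left[ \log p(x, y|x_s, y_s) \right] \geq{} & \mathbb{E}_{P_{data}} \Big[ \mathbb{E}_{s_s \sim q(s_s|x_s),\, s \sim q(s|x),\, z \sim q(z|x)} \left[ \log p(y|s, s_s, y_s) p(x|s, z) \right] \\
& - D_{KL}(q(s|x) \,\|\, p(s)) - D_{KL}(q(z|x) \,\|\, p(z)) \Big]
\end{aligned}
\tag{25}
$$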
In summary, the log-likelihood started from a generative process
for x. To define class probability, a support set was included. The
log-likelihood was updated to condition on the support set, by
marginalizing over the posterior. By rewriting the posterior, the
expected difference between the two intractable terms becomes
non-negative. In the last step components were collected, and all
terms of the resulting equation can be computed.
4.5 Inference
During classification the term p(y|x, X_S, Y_S) is maximized.
Maximizing this term is equivalent to maximizing the joint
probability as depicted in Equation 26. Since all other terms in
the joint probability are independent of y, in practice only the
class probability term needs to be maximized. Technically, the
expected value should be computed. However, approximating samples
with the mean of the distribution did not affect performance
significantly.
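$$
\begin{aligned}
\arg\max_y \left[ \log p(y|x, X_S, Y_S) \right] &= \arg\max_y \left[ \log p(y, x|X_S, Y_S) - \log p(x|X_S, Y_S) \right] \\
&= \arg\max_y \left[ \log p(y, x|X_S, Y_S) \right] \\
&= \arg\max_y \mathbb{E}_{S_S \sim q(S_S|X_S),\, s \sim q(s|x)} \left[ \log p(y|s, S_S, Y_S) \right]
\end{aligned}
\tag{26}
$$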
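The inference rule of Equation 26, with samples replaced by the encoder means as described above, can be sketched as follows. This is a minimal illustration under stated assumptions: the class probability follows Equation 20 with squared Euclidean distance, labels are given as integer class indices rather than one-hot vectors, and the content means are made-up toy values instead of encoder outputs.

```python
import math

def class_log_probs(s_mean, support_means, support_labels, n_classes):
    """log p(y = c | s, S_S, Y_S) for every class c (Equation 20).

    support_means[i] is the content mean of support example i and
    support_labels[i] its integer class index (hypothetical layout).
    """
    scores = [-sum((a - b) ** 2 for a, b in zip(s_mean, m)) for m in support_means]
    top = max(scores)
    weights = [math.exp(v - top) for v in scores]  # shifted for numerical stability
    z = sum(weights)                               # normalization constant Z
    per_class = [0.0] * n_classes
    for w, label in zip(weights, support_labels):
        per_class[label] += w / z
    return [math.log(p) if p > 0.0 else float("-inf") for p in per_class]

def classify(s_mean, support_means, support_labels, n_classes):
    """argmax over classes of the class log-probability."""
    log_probs = class_log_probs(s_mean, support_means, support_labels, n_classes)
    return max(range(n_classes), key=lambda c: log_probs[c])

# 3-way 1-shot toy episode with made-up 2-D content means.
support_means = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
support_labels = [0, 1, 2]
log_probs = class_log_probs([3.5, 0.4], support_means, support_labels, 3)
prediction = classify([3.5, 0.4], support_means, support_labels, 3)
```

With more than one support example per class, the weights of all examples of a class are summed before taking the logarithm, which matches the inner sum of Equation 20.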
4.6 Implementation
A variational autoencoder is defined with two latent variables,
s and z, with n_latent values each. The content code s is an input
for the reconstruction and the class probability. The style code z
is only used to reconstruct an example. The term p(x|s, z) is
modeled by the decoder, and represents reconstruction error. This
term is modeled with the Bernoulli loss, such that
log p(x|s, z) = ∑_i x_i log x̂_i, where x̂ = Dec(s, z), and the
summation is over pixel values. The terms q(s|x) and q(z|x) are
modeled by the encoder with multivariate normal distributions that
have diagonal variances, such that s ∼ N(µ_s(x), I σ_s(x)) and
z ∼ N(µ_z(x), I σ_z(x)). The D_KL terms limit the divergence between
the variational and the prior distributions and can be obtained
analytically (section 2). Unless mentioned otherwise, expectations
for ordinary examples are approximated with a single sample.
Expectations for the support set are approximated with the
mean of the distribution.
In principle, the model is not constrained to learn two
completely separate embeddings. The model could use s only for
classification and z only for reconstruction. However, making use
of the latent variables incurs a small penalty via the
Kullback-Leibler divergence between the prior and the variational
distribution. The term can be seen as a regularizer on the latent
codes. Since the classification pushes content codes s apart, useful
information for reconstruction is also available in s. Although the
model can theoretically choose to put duplicate information in z,
this would incur an additional penalty on the KL divergence of z.
Thus, in optimal conditions, the model saves content information
in s and style information in z.
5 Datasets
Few-shot learning scenarios differ from conventional
classification tasks in machine learning. The concept of few-shot
learning is that a few examples with label information are
presented, known as the support set. Also, a different example is
presented without the label. The task is to classify the example by
using the support set. In general, the support set contains the same
number of examples per class, and the class of the example is
always in the support set. We define two variables that describe the
few-shot learning setting: n_way denotes the number of classes in
the support set, and n_shot denotes how many examples per class are
in the support set. For instance, when n_shot is 1 and n_way is 5,
this is a 5-way 1-shot classification problem. Importantly, during
evaluation the classes in the support set have never been seen
before.
To evaluate models related to few-shot classification, we use
four different datasets, differing in size and complexity. In this
section we will first define the few-shot learning episode.
Subsequent sections present the details of the four datasets.
5.1 Episodes
Suppose we have some pool of training examples D_train and test
examples D_test. Each training example belongs to a class. We choose
D_train and D_test such that a class is exclusively present in only
one of the sets.
Few-shot learning is composed of episodes: a few examples are
given with label information, and a new example needs to be
classified. We describe an episode following the procedure of [5].
For each episode, we take n_way different classes L ∼ D. For each
class, we sample n_shot examples as the support set S ∼ L. Also, we
sample a batch B ∼ L with n_queries examples per class, for the
n_way different classes. We make sure that S and B are disjoint,
i.e. they contain different samples. The task now is to classify B
with the information in S. An example of a session is shown in
Figure 11.
The classes in D_train are different from the classes in D_test.
Therefore, a successful model will have to effectively use the
limited information in the support set to make the correct
prediction.
[Figure: panel "Support set" on the left, panel "Example" on the right]
Figure 11: Configuration of a few-shot learning session. On the
left side the support set is shown. For every example the correct
label is known. The right side shows the image that needs to be
classified.
5.2 MNIST
The MNIST dataset is a well-known standard benchmark for machine
learning. The dataset is relatively easy to learn: models can easily
achieve 99% classification accuracy. This makes it a practical
dataset to test new algorithms and architectures. The training set
contains 60000 examples and the test set 10000 examples, with 10
different classes. Images are 28 by 28 pixels. In Figure 12 a random
sample of the dataset is shown.
Figure 12: Random samples from the MNIST dataset.
MNIST is not particularly well suited to evaluate few-shot
learning performance. There is only a limited number of classes,
with many examples per class. The MNIST dataset is mainly used to
show disentanglements and examine latent variables. Actual
few-shot learning evaluation results are presented on more
complicated and better suited datasets.
5.3 Omniglot
The Omniglot dataset was created by Lake et al. to test
algorithms while having only a handful of labelled examples [13].
The dataset contains 1623 different characters from 50 different
alphabets. For every character, there are only 20 different examples
available. In Figure 13, 20 samples from the dataset are presented.
To preprocess the data, the procedure from [5] is followed. Images
are resized to 28 by 28 pixels. The first 1200 characters are
training data that is augmented with rotations of 0, 90, 180 and
270 degrees. The remaining 423 characters are used for evaluation.
In contrast with [7], we do not augment test data unless specified
otherwise.
Figure 13: Random samples from the Omniglot dataset. Images are
resized to 28 by 28 pixels and colors are inverted.
5.4 miniImageNet
The miniImageNet dataset was created by Ravi et al. to have a
more difficult baseline for few-shot classification. Derived from
the original ImageNet, the dataset contains only 100 different
classes, with 600 examples per class. The dataset is split up in 64
train classes, 16 validation classes and 20 test classes. To
preprocess the data, pixel values are rescaled to the range [0, 1],
by dividing by 255. Images are resized to 84 by 84 pixels. The
train images are rotated by 0, 90, 180 and 270 degrees to create
more image classes. In Figure 14 a few random examples from the
miniImageNet test set are presented.
Figure 14: Random samples from the miniImageNet test
dataset.
5.5 Quick, Draw
The Quick, Draw dataset has been collected by Google Creative
Lab (see footnote 1). Users were asked to draw a concept, based on
a textual description, such as “airplane” or “Eiffel Tower”. This
can lead to very different drawings of the same concept. For
instance, the description “clock” led some users to draw an analogue
clock, and others a digital one.
While users were drawing a concept, a recurrent neural network
was guessing what the user tried to draw. Also, users were limited
to 20 seconds within which they had to draw the concept. A session
finished either after 20 seconds, or when the network guessed the
concept correctly. As a consequence, some images might be
incomplete. Both Omniglot and Quick, Draw use a process where users
are asked to draw a concept. The difference is that Omniglot users
were presented with a visual example. Quick, Draw users were
presented with a concept, which can be expressed in many different
ways. Therefore, the intra-class variation in Quick, Draw is not
only caused by drawing style, but also by the interpretation of the
concepts to draw.
In total there are 345 classes that we split in 275 train, 35
validation and 35 test classes. Each class contains numerous
images, but we use only the first 100 images per class. Each image
is 28 by 28 pixels in gray scale. A few samples from the dataset
are depicted in Figure 15.
Figure 15: Random samples from the Quick, Draw test dataset.
1 Quick, Draw Dataset: https://quickdraw.withgoogle.com/data
[Accessed in June 2017]
6 Experiments
In this section the experiments with the model are discussed.
The first part gives an overview of the techniques that were used.
The second part presents and analyzes the results.
6.1 Setup
In this section the experimental setup is detailed. First a
non-standard batch normalization layer is introduced, because
samples from the data are not identically distributed. Then the
network architectures and hyperparameter configurations are
discussed.
6.1.1 Moving Average Batch Normalization
Batch normalization [14] has significantly improved deep
learning optimization in some instances. However, using batch
normalization may be problematic when samples in a batch are not
independent and identically distributed. Since a batch is skewed
with only n_way different classes, we may experience high variance
for the first and second moments over different batches. Instead, we
propose a simple method resembling [15], where moving averages are
used at train and test time. In the pseudocode below the exact
mechanism is specified. Note that x is an input and y is the
corresponding output. Furthermore, β and γ are parameters trained
with backpropagation. The moment variables are not trained, but
updated as specified.
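    import numpy as np

    # init() is unspecified in the source; these start values are an assumption.
    # mu, sigma: moving moments (not trained); beta, gamma: trained parameters.
    mu, sigma, beta, gamma = 0.0, 1.0, 0.0, 1.0

    def moving_average_norm(x, is_training, decay):
        global mu, sigma
        # Normalize with the moving moments at both train and test time.
        y = (beta + (x - mu) / sigma) * gamma
        if is_training:
            # Batch moments are only used to update the moving averages.
            mu_b, sigma_b = np.mean(x, axis=0), np.std(x, axis=0)
            mu = mu + decay * (mu_b - mu)
            sigma = sigma + decay * (sigma_b - sigma)
        return y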
6.1.2 Architecture
The model can be separated in two distinct parts: an encoder for
content and style, and a decoder for reconstructions. For
computational efficiency, we use only one encoder that outputs both
content and style. This means that weights are shared for content
and style. The encoder architecture is largely inspired by [7],
because their network achieved state of the art performance at the
time of writing.
Different from [7], we choose to have only 3 max pooling layers.
Furthermore, we pad feature maps during pooling so that no
information is discarded. The last layer is a fully connected layer
that outputs µ_s, log σ_s, µ_z and log σ_z, corresponding to the
distributions s ∼ N(s|µ_s, I σ_s^2) and z ∼ N(z|µ_z, I σ_z^2).
Details are presented in Table 1.
Table 1: Encoder architecture. The input is an image with 1 channel
and 28 by 28 pixels; the outputs are µ_s, log σ_s, µ_z, log σ_z.

Name        Feature maps     Output size
input       1                28, 28
conv1       1 · n_filters    28, 28
max pool    1 · n_filters    14, 14
conv2       2 · n_filters    14, 14
max pool    1 · n_filters    7, 7
conv3       4 · n_filters    7, 7
max pool    1 · n_filters    4, 4
conv4       8 · n_filters    4, 4
fc          4 · n_latent     1, 1
The decoder takes s and z as inputs and transforms them with a
fully connected layer to a feature map with shape
(4, 4, n_filters · 8). In subsequent layers, the resolution is
increased by a resizing and then a convolution operation. We choose
to increase the resolution with factors of 2, and therefore the
final 32 by 32 output needs to be cropped such that we have a 28 by
28 output. In Table 2 the specifics of the decoder are presented.
Table 2: Decoder architecture. The inputs are s and z; the output is
an image with 28 by 28 pixels.

Name        Feature maps     Output size
input       2 · n_latent     1, 1
fc          8 · n_filters    4, 4
upsample    8 · n_filters    8, 8
conv1       8 · n_filters    8, 8
upsample    4 · n_filters    16, 16
conv2       4 · n_filters    16, 16
upsample    2 · n_filters    32, 32
conv3       2 · n_filters    32, 32
conv4       1                32, 32
slice       1                28, 28
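The spatial sizes in Tables 1 and 2 can be reproduced with a short calculation. The sketch assumes that each padded max pooling halves the resolution and rounds up, and that each upsampling step doubles it; these rules are inferred from the table values, and the function names are hypothetical.

```python
import math

def encoder_sizes(size, n_pools):
    """Spatial size after each padded max pooling (half the size, rounded up)."""
    sizes = [size]
    for _ in range(n_pools):
        sizes.append(math.ceil(sizes[-1] / 2))
    return sizes

def decoder_sizes(size, n_upsamples, target):
    """Spatial size after each 2x upsampling, followed by a crop to target."""
    sizes = [size]
    for _ in range(n_upsamples):
        sizes.append(sizes[-1] * 2)
    assert sizes[-1] >= target, "cannot crop to a larger size"
    return sizes + [target]

enc = encoder_sizes(28, 3)       # [28, 14, 7, 4]
dec = decoder_sizes(4, 3, 28)    # [4, 8, 16, 32, 28]
```

Rounding up at the 7 to 4 step is what makes the decoder overshoot to 32 by 32, which explains the final slice back to 28 by 28.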
6.1.3 Configuration
During training, Adam [16] is used to optimize the network. The
model is trained for 50000 iterations with a learning rate of 1e-4.
Since the reconstruction loss heavily outweighs the few-shot loss,
the few-shot term log p(y|s, S_S, Y_S) is magnified. To match
training procedures from literature closely, all other terms in the
loss are divided by a factor λ, instead of multiplying the few-shot
term with λ. The factor λ was set to 1000 experimentally.
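The weighting scheme can be written as a small function. This is a sketch under the assumption that the loss is the negative lower bound, so that it consists of a few-shot negative log-likelihood, a reconstruction negative log-likelihood and two KL terms; the function and argument names are hypothetical and the numbers are made up.

```python
def total_loss(few_shot_nll, reconstruction_nll, kl_s, kl_z, lam=1000.0):
    """Few-shot term at full weight; all other terms divided by lambda."""
    return few_shot_nll + (reconstruction_nll + kl_s + kl_z) / lam

# Made-up magnitudes: the reconstruction term dominates before weighting.
loss = total_loss(few_shot_nll=1.0, reconstruction_nll=500.0, kl_s=20.0, kl_z=30.0)
```

Dividing the other terms by λ gives the same gradient direction as multiplying the few-shot term by λ, but keeps the scale of the few-shot loss comparable to that of models trained with a few-shot loss only.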
Every iteration, a support set of n_way · n_shot samples is drawn.
Also, n_way · n_queries samples are drawn to be classified. The
authors of [7] generally observed increased performance when
training with a higher n_way. Therefore, n_way is set to 30. An
overview of all additional parameters can be found in Table 3.
Table 3: Configuration for experiments during training

Name             Value
n_filters        128
n_latent         32
learning rate    1e-4
λ                1000
n_way            30
n_shot           1
n_queries        15
6.2 Evaluation
The model will be evaluated on the Omniglot, miniImageNet and
Quick, Draw datasets. This section will test two hypotheses that
were introduced in the introduction: (1) The model actually learns a
disentanglement. This can be tested by visualizing the
reconstructions with perturbations of the content and style. (2) If
the model successfully learns a disentanglement, few-shot
performance will improve. The effect of disentanglement is tested
by comparing the model to a deep network with the same architecture,
but without the generative loss.
6.2.1 Omniglot
The model is trained on the Omniglot dataset with the
hyperparameter settings as described in section 6.1. The
disentanglement of the model is visualized as in previous sections:
indirectly, by reconstructing images of perturbed content and style
variables, and directly, by visualizing the structure of the
high-dimensional content and style variables with stochastic
neighbourhood embedding.
Interpolations of the content and style variables are depicted
in Figure 16. These pictures show to what extent content and style
have been disentangled. The interpolation results show that the
network has successfully learned to encode the stylistic
translation and scale attributes in z, since these tend to change
vertically. Also, the content of an image tends to change
horizontally, which confirms that content is encoded in s.
Figure 16: Interpolations between the latent variables of two
images. The upper left and the bottom right are reconstructions from
the test set. All other images are linear interpolations over
content s and style z. The content variable s changes horizontally,
the style z changes vertically.
Reconstructions of examples with interchanged variables are
shown in Figure 17. Note that all characters in a column still have
the general shape of the original example, which again demonstrates
that content is modeled by s. In addition, note that the actual
location, size and roundness are modeled by z, as one would expect
style to be modeled.
The pictures also show the limitations of the model: the
reconstructions can be blurry, and sometimes lack certain strokes.
Recall that Omniglot only has 20 examples per class, and the model
has never actually seen the classes in the test set before. Another
reason for blurriness is the reconstruction loss, which is
formulated pixel-wise. This penalizes the model heavily for
reconstructions that have been translated slightly.
Figure 17: Interchanging the content and style of examples.
Reconstructions are generated by taking s from the column and z from
the row. Images from the dataset that are used as input are depicted
at the sides.
The structure of the representation is visualized in Figure 18.
The latent variables for content and style are embedded into a 2D
manifold. Each test example is represented by a dot, where the color
denotes the class. Since the examples are ordered by alphabet,
similar colors often correspond to letters in the same alphabet. For
every class, precisely one example is shown as an image. Some
cluster annotations are provided to make the diagrams more
understandable. Note that these are subjective and are not
necessarily complete.
The top embedding shows the structure of s, representing
content. Images with classes that are similar lie close together.
For instance, a group of ‘o’-shaped characters is grouped together.
Furthermore, a few box-shaped characters are visibly clustering. In
general, characters from the same alphabet are grouped together more
than others. There seems to be no real pattern for other factors.
For instance, the characters in the ‘o’-group have very different
sizes, and still their s variables lie close together.
The bottom embedding depicts the style z. In contrast with the
previous embedding, now grouping based on location, scale and other
factors is expected. Some clusters are annotated intuitively to
demonstrate the different styles. For instance, the top-group
contains images that are drawn relatively high in the image. Thus,
the diagrams illustrate that content is indeed modeled by s, and
style is modeled by z.
Figure 18: Visualization of the structure of the
high-dimensional content and style variables. For every class of the
test set, exactly one image is depicted. Top: Embedding of content
s. Bottom: Embedding of style z.
The classification performance is presented in Table 4. Since
the authors of [7] evaluated on augmented test data, performance is
reported on both normal and augmented test data. It is worth
mentioning that augmenting the test data increases accuracy, but may
not be a realistic problem setting. The disentangled VAE
outperforms the other models consistently in every setting. The last
row denotes the performance of the same architecture without the
generative loss, and therefore without disentangled representations.
Its accuracy is lower in every setting, by 0.2 to 1.1 percentage
points. Thus, learning disentangled representations improves
few-shot classification on Omniglot.
Table 4: 20-way classification accuracy on the Omniglot
dataset. The ‘+’ sign denotes that performance is measured on an
augmented test dataset (90 degree rotations).

Model                        1-shot   1-shot+   5-shot   5-shot+
Matching Networks [5]        93.8%    -         98.7%    -
Prototypical Networks [7]    -        96.0%     -        98.9%
Disentangled VAE             95.9%    97.0%     98.8%    99.1%
Only few-shot loss           94.8%    96.0%     98.4%    98.9%
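How an accuracy such as those in Table 4 is obtained for one episode can be sketched as follows. This is a minimal illustration that assumes classification by the nearest class mean under squared Euclidean distance, as in Prototypical Networks [7]; the class probability of the thesis model is defined in section 4 and is not reproduced here. All function names and the toy vectors are hypothetical.

```python
from collections import defaultdict


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def class_means(support):
    """Average the content vectors of each class in the support set."""
    grouped = defaultdict(list)
    for vector, label in support:
        grouped[label].append(vector)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(query, means):
    """Assumed rule: pick the class whose mean is closest to the query."""
    return min(means, key=lambda label: squared_distance(query, means[label]))


def episode_accuracy(queries, support):
    means = class_means(support)
    correct = sum(classify(vector, means) == label for vector, label in queries)
    return correct / len(queries)


# 2-way 2-shot toy episode; Table 4 reports 20-way 1-shot and 5-shot episodes.
support = [([0.0, 0.0], "A"), ([0.0, 2.0], "A"), ([4.0, 4.0], "B"), ([6.0, 4.0], "B")]
queries = [([0.5, 1.0], "A"), ([5.0, 3.0], "B"), ([2.0, 2.0], "B")]
accuracy = episode_accuracy(queries, support)
```

In this toy episode the third query lies closer to the mean of class A than to the mean of its own class B, so two of the three queries are classified correctly.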
6.2.2 miniImageNet
The miniImageNet images were resized to 84 by 84 pixels, matching
[5, 7]. To handle the increased resolution, an additional max
pooling layer is added after conv4 in the encoder. The
reconstruction task is simplified by resizing the target to 32
by 32 pixels, so the final slicing layer in the decoder is
removed.
Experiments showed that the model performed worse than the baseline
models. In limited data settings, the network relies on heavy
regularization. With Omniglot, this problem did not really occur.
On a more complex dataset such as miniImageNet, however, the network
quickly overfits because of the fully connected layer. At the same
time, tests on Omniglot revealed that removing
the fully connected layer significantly degrades the
quality of the learned disentanglement. Visualizations of the
reconstructions also reveal that the network does not disentangle
anything on miniImageNet, as the decoder only relies on z (Figure
19). Furthermore, the network learns to create vague
reconstructions, showing that the generative model itself is
limited.
Figure 19: Interchanging the content and style of examples.
Reconstructions are generated by taking s from the column and z from
the row. Images from the dataset that are used as input are depicted
at the sides.
The miniImageNet dataset is complex compared to Omniglot, and
its classes exhibit large variations within a class. The capacity
of the VAE is limited when it comes to ImageNet, and the
architecture changes made for disentangling remove important
regularization. To achieve better performance, the model needs to be
revised such that these problems are addressed.
6.2.3 Quick, Draw
The Quick, Draw dataset is in the same format as Omniglot, and
thus the same settings as described in section 6.1 are used. The
few-shot classification performance on Quick, Draw is difficult to
report, since the validation accuracy of both the baselines and the
disentangled VAE did not converge. Even when training had
converged, the validation error fluctuated rapidly. Overall, the
performance of the baseline is better, because the same problem as
with miniImageNet occurs: the network overfits because of the
fully connected layer.
Interestingly, although somewhat vague, a disentanglement is
actually learned. Analogous to the previous visualizations, the
interpolation and interchanging of s and z are depicted in Figure
20. The pictures show that stylistic attributes such as
rotation are modeled by z, while the general shape of an object is
modeled by s.
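The interpolation grid can be sketched as follows, assuming plain linear interpolation between the latent codes of two images, with s varying along the columns and z along the rows. The function names are hypothetical, and decoding each (s, z) pair with the trained decoder is omitted.

```python
def lerp(a, b, t):
    """Linear interpolation between two vectors; t=0 gives a, t=1 gives b."""
    return [(1 - t) * ai + t * bi for ai, bi in zip(a, b)]


def interpolation_grid(s_a, z_a, s_b, z_b, steps):
    """grid[row][col] = (s, z): content changes per column, style per row."""
    ts = [i / (steps - 1) for i in range(steps)]
    return [
        [(lerp(s_a, s_b, t_col), lerp(z_a, z_b, t_row)) for t_col in ts]
        for t_row in ts
    ]


# Hypothetical latent codes of two test images.
s_a, z_a = [0.0, 2.0], [1.0, 1.0]
s_b, z_b = [4.0, -2.0], [3.0, 5.0]
grid = interpolation_grid(s_a, z_a, s_b, z_b, steps=5)
```

The upper-left and bottom-right cells hold the unchanged codes of the two images, while the upper-right cell combines the content of the second image with the style of the first.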
Quick, Draw resembles Omniglot in certain ways, but there are
also important differences. In Omniglot, the variation is caused by
the drawing style of the user, while the images of Quick, Draw
also vary because users interpret a class differently. For instance,
disentangling the variation in clocks would require the model to
encode “analogue” or “digital” in the style z. However, for another
class such as airplane, this attribute might not have any meaning.
Thus, to some degree, disentangled representations are learned on
Quick, Draw, but they may be impeded because the stylistic
variations are often limited to a single class.
Figure 20: Left: Interpolations between the latent variables of
two images. The upper left and the bottom right are reconstructions
from the test set. All other images are linear interpolations over
content s and style z. The content variable s changes
horizontally, the style z changes vertically. Right: Interchanging
the content and style of examples. Reconstructions are generated by
taking s from the column and z from the row.
6.3 Expectation of Support Set
The loss function that is optimized, defined in section 4,
consists of multiple expectations: one for the example that needs to
be classified (the query), and one for the support set. Since
optimization is performed in batches, the expectation for each query
is approximated with a single sample. For the support set we test
two options:
1. $\mathbb{E}_{S_S \sim q(S_S \mid X_S)}[\cdot]$ is approximated with a single sample for
each support set item, for each query.
2. $\mathbb{E}_{S_S \sim q(S_S \mid X_S)}[\cdot]$ is approximated by the overconfident
estimate $\mu_s$.
In the end, no significant difference in classification
performance or learning behavior was observed, and the learned
disentanglement did not look visually different. All presented
results are reported on models trained with the second method, as it
is the most straightforward to implement.
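The two options can be contrasted in a short sketch. It assumes that the encoder outputs a mean and a log-variance per support item and that samples are drawn with the usual reparameterization; these parameter names and the toy values are assumptions made for illustration.

```python
import math
import random


def sample_content(mu, log_var, rng):
    """Option 1: one reparameterized sample s = mu + sigma * eps, eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def mean_content(mu, log_var):
    """Option 2: ignore the variance and use the mean as a point estimate."""
    return list(mu)


def support_content(encoded_support, rng=None):
    """Return one content vector per support item; sample only if rng is given."""
    if rng is None:
        return [mean_content(mu, lv) for mu, lv in encoded_support]
    return [sample_content(mu, lv, rng) for mu, lv in encoded_support]


# Hypothetical encoder outputs (mu, log-variance) for two support items.
encoded_support = [([0.0, 1.0], [-2.0, -2.0]), ([3.0, -1.0], [-2.0, -2.0])]
rng = random.Random(0)
sampled_for_query_1 = support_content(encoded_support, rng)  # option 1
sampled_for_query_2 = support_content(encoded_support, rng)  # fresh sample per query
point_estimate = support_content(encoded_support)            # option 2
```

Option 2 needs no random numbers and yields the same support content for every query, which is one way to see why it is the simpler of the two to implement.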
6.4 Discussion
In this section, the results and conclusions of the experiments
are summarized. Furthermore, some intuitive insights are provided
from empirical observations. The first part discusses results
based on architecture changes. The second part discusses
performance on different datasets. The third part discusses the
model framework.
6.4.1 Architecture
The architecture for the encoder is designed to match
Prototypical Networks [7] closely, since they achieved
state-of-the-art performance at that time on both Omniglot and
miniImageNet. However, to make the model suitable for disentangled
representation learning, some aspects have to be modified.
Experiments with different network architectures on Omniglot
showed that disentangled representations are learned when a fully
connected layer is used in the encoder, between the last
convolutional layer and the latent representation. A downside
of fully connected layers is that they can overfit easily, which
caused the model to perform worse on more complex datasets.
Prototypical Networks used max-pooling layers to reduce the
resolution. Max-pooling layers are insensitive to small translation
perturbations, which improves regularization but reduces
precision. With generative modelling, ideally the model
would be more sensitive to these perturbations. However, using
strided convolutions instead impeded performance drastically.
Instead, the fourth max-pooling layer is removed to retain some
sensitivity. In contrast with Prototypical Networks, the max-pooling
operations are padded, because this allows them to retain more
information.
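The effect of these two choices on the spatial resolution can be sketched with simple arithmetic. The example assumes a 28 by 28 input, 2 by 2 pooling with stride 2, and that "padded" means odd sizes are padded so that the output size is rounded up; these are assumptions for illustration and not the exact configuration of the thesis model.

```python
import math


def pooled_size(size, padded):
    """Spatial size after one 2x2, stride-2 max pooling layer."""
    return math.ceil(size / 2) if padded else size // 2


def sizes_after_pooling(size, layers, padded):
    """Spatial size after each of `layers` consecutive pooling layers."""
    sizes = []
    for _ in range(layers):
        size = pooled_size(size, padded)
        sizes.append(size)
    return sizes


# Four unpadded pooling layers versus three padded ones (fourth layer removed).
unpadded_four = sizes_after_pooling(28, layers=4, padded=False)
padded_three = sizes_after_pooling(28, layers=3, padded=True)
```

Under these assumptions the final feature map keeps 4 by 4 positions instead of a single position, so more spatial information reaches the layers that follow.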
6.4.2 Performance
The Disentangled VAE outperformed existing methods on the
Omniglot dataset (Table 4). However, it had more
difficulty with miniImageNet and Quick, Draw.
The hypothesis that learning disentangled representations can be
combined with few-shot classification is supported by the direct
and indirect visualizations of s and z. Visualizations confirm that
s models content and z models style on Omniglot. Moreover, few-shot
classification performance is improved on Omniglot by utilizing
disentangled representations, which supports the second
hypothesis.
The miniImageNet and Quick, Draw datasets are inherently more
difficult. The reconstructions from a VAE tend to be vague and
lack detail. Generative modelling combined with deep
learning is a relatively new area of research, and future
developments could play a crucial role in making this method work on
more complex datasets. In addition, disentangling representations
might be less helpful when style attributes are not shared between
classes. Being able to disentangle the property analogue or digital
does not improve airplane classification.
6.4.3 Model Framework
The mathematically derived model optimizes a lower bound of
the conditional log-likelihood. Furthermore, the variational
distribution is restricted to be a multivariate normal distribution
with diagonal covariance. Nonetheless, the model has learned to
disentangle representations, reconstruct images, and perform
few-shot classification. Potentially, less restrictive
approximations can improve the quality of reconstructions and
classification performance.
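One practical consequence of the diagonal normal restriction is that the KL terms of the objective have a closed form. The sketch below assumes a standard normal prior and a log-variance parameterization, both of which are assumptions made for illustration; the function name is hypothetical.

```python
import math


def kl_diag_normal_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )


kl_same = kl_diag_normal_to_standard_normal([0.0, 0.0], [0.0, 0.0])
kl_shifted = kl_diag_normal_to_standard_normal([1.0, 0.0], [0.0, 0.0])
kl_wide = kl_diag_normal_to_standard_normal([0.0], [math.log(4.0)])
```

The divergence is zero only when the variational distribution equals the prior, and it grows when the mean moves away from zero or the variance moves away from one.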
Conclusion
The human brain is remarkably good at disentangling illumination,
pose, and other changes in viewpoint from the actual
object of interest. We propose that disentangling representations
is key to learning an interpretation that is suitable for
generalization. Few-shot classification is a field where such a
suitable representation has to be learned. The first hypothesis is
that learning disentangled representations can be combined with
few-shot classification. The second hypothesis is that few-shot
classification accuracy is improved by using disentangled
representations.
In an exploratory study, we demonstrated that the structure of
learned disentangled representations can be shaped into a suitable
structure for few-shot classification. Furthermore, experiments show
that an adversarial network is not necessary to learn a suitable
disentangled representation.
Inspired by intuitions gained from the exploratory study, theory
for generative few-shot learning is developed. A graphical model is
defined for a single example. The process assumes two latent
variables s and z that represent the content and style of an
example x. The graphical model is extended to include the support
set, and a lower bound of the conditional log-likelihood is
derived.
The framework is trained on three different datasets. Two
datasets reveal opportunities for the model to improve. Experiments
with the Omniglot dataset confirm that learning disentangled
representations can be combined with few-shot classification.
Moreover, state-of-the-art performance is achieved at the time of
writing, showing that classification accuracy improves by using
disentangled representations. Analysis of the datasets indicates
that the framework works particularly well when the stylistic
attributes are shared between classes, which effectively gives the
generative model more samples to learn stylistic attributes.
In summary, we combine learning disentangled representations,
few-shot learning, and generative modelling. By combining a few-shot
loss with a variational autoencoder, a disentanglement is learned
naturally. Experiments demonstrate that disentangled
representations can improve few-shot classification
performance.
A few suggestions for interesting further research include:
• Defining the distance function as a deep learning model
$d_\theta(s, s')$, preferably such that
$d_\theta(s, s') = d_\theta(s', s)$. This may allow the model to
learn a more suitable metric.
• Exploiting siamese adversarial networks to disentangle
representations. This may improve the quality of generated data, and
allow disentanglement of more complicated data.
• Incorporating unsupervised disentangling methods for
semi-supervised few-shot classification.
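For the first suggestion, one possible way to obtain the symmetry property is to symmetrize an arbitrary learned score by averaging over both argument orders. The sketch below only illustrates that idea and is not a method from the thesis; `toy_asymmetric_score` is a hypothetical stand-in for a learned network.

```python
def symmetrize(score):
    """Wrap a two-argument score so that d(s, t) == d(t, s) by construction."""
    def distance(s, t):
        return 0.5 * (score(s, t) + score(t, s))
    return distance


def toy_asymmetric_score(s, t):
    """Stand-in for a learned network; deliberately not symmetric."""
    return sum((2.0 * a - b) ** 2 for a, b in zip(s, t))


d = symmetrize(toy_asymmetric_score)
s, s_prime = [1.0, 0.0], [0.0, 3.0]
forward = d(s, s_prime)
backward = d(s_prime, s)
```

Averaging over both argument orders guarantees symmetry without restricting the form of the underlying model, at the cost of two forward passes per pair.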
References
[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Identity mappings in deep residual networks. In European Conference
on Computer Vision, pages 630–645. Springer, 2016.
[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.
ImageNet classification with deep convolutional neural networks. In
Advances in Neural Information Processing Systems, pages 1097–1105,
2012.
[3] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning
of object categories. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 28(4):594–611, 2006.
[4] Gregory Koch. Siamese neural networks for one-shot image
recognition. PhD thesis, University of Toronto, 2015.
[5] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan
Wierstra, et al. Matching networks for one shot learning. In
Advances in Neural Information Processing Systems, pages
3630–3638, 2016.
[6] Sachin Ravi and Hugo Larochelle. Optimization as a model for
few-shot learning. In Fifth International Conference on Learning
Representations, ICLR, 2017.
[7] Jake Snell, Kevin Swersky, and Richard S Zemel. Prototypical
networks for few-shot learning. arXiv preprint arXiv:1703.05175,
2017.
[8] Akshay Mehrotra and Ambedkar Dukkipati. Generative
adversarial residual pairwise networks for one shot learning.
arXiv preprint arXiv:1703.08033, 2017.
[9] Diederik P Kingma and Max Welling. Stochastic gradient VB
and the variational auto-encoder. In Second International
Conference on Learning Representations, ICLR, 2014.
[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu,
David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua
Bengio. Generative adversarial nets. In Advances in Neural
Information Processing Systems, pages 2672–2680, 2014.
[11] Michael F Mathieu, Junbo Jake Zhao, Aditya
Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of
variation in deep representation using adversarial training. In
Advances in Neural Information Processing Systems, pages 5041–5049,
2016.
[12] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya
Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation
learning by information maximizing generative adversarial nets. In
Advances in Neural Information Processing Systems, 2016.
[13] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B
Tenenbaum. Human-level concept learning through probabilistic
program induction. Science, 350(6266):1332–1338, 2015.
[14] Sergey Ioffe and Christian Szegedy. Batch normalization:
Accelerating deep network training by reducing internal covariate
shift. In Proceedings of the 32nd International Conference on
Machine Learning (ICML-15), pages 448–456, 2015.
[15] Sergey Ioffe. Batch renormalization: Towards reducing
minibatch dependence in batch-normalized models. arXiv preprint
arXiv:1702.03275, 2017.
[16] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic
optimization. In Third International Conference on Learning
Representations, ICLR, 2015.
A Derivation of the Lower Bound for Batches
In this section, a lower bound for the conditional log-likelihood
is derived. The model is adapted such that an intractable
posterior term can be cancelled against another posterior
term.
A.1 Model definition
During optimization, the queries are sampled in batches.
This ensures that a lower bound is optimized. The graphical model
that corresponds to this perspective is illustrated in
Figure 21.
Figure 21: Graphical model that includes a support set S and the
batch B. All examples in the batch are classified with the same
support set.
A.2 Log-likelihood Conditioned on Support Content (SS)
In this section, the log-likelihood for a batch is derived,
conditioned on the support content $S_S$. The derivation is based on
the graphical model in Figure 21 and results in Equation 27.
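\begin{align*}
\log p(Y_B, X_B \mid S_S, Y_S)
&= \log \prod_{(x_b, y_b)} p(y_b, x_b \mid S_S, Y_S) \\
&= \sum_{(x_b, y_b)} \log p(y_b, x_b \mid S_S, Y_S) \\
&= \sum_{(x_b, y_b)} \iint q(s_b, z_b \mid x_b) \log p(y_b, x_b \mid S_S, Y_S) \, ds_b \, dz_b \\
&= \sum_{(x_b, y_b)} \mathbb{E}_{s_b, z_b \sim q(s_b, z_b \mid x_b)} \left[ \log p(y_b, x_b \mid S_S, Y_S) \right] \\
&= \sum_{(x_b, y_b)} \mathbb{E}_{s_b, z_b \sim q(s_b, z_b \mid x_b)} \left[ \log p(y_b, x_b \mid s_b, z_b, S_S, Y_S)
   - \log \frac{q(s_b, z_b \mid x_b)}{p(s_b)\, p(z_b)}
   + \log \frac{q(s_b, z_b \mid x_b)}{p(s_b, z_b \mid x_b, y_b)} \right] \\
&= \sum_{(x_b, y_b)} \Big( \mathbb{E}_{s_b, z_b \sim q(s_b, z_b \mid x_b)} \left[ \log p(y_b \mid s_b, S_S, Y_S)\, p(x_b \mid s_b, z_b) \right] \\
&\qquad - D_{KL}\big(q(s_b, z_b \mid x_b) \,\|\, p(s_b)\, p(z_b)\big)
   + D_{KL}\big(q(s_b, z_b \mid x_b) \,\|\, p(s_b, z_b \mid x_b, y_b)\big) \Big) \\
&= \sum_{(x_b, y_b)} \Big( \mathbb{E}_{s_b, z_b \sim q(s_b, z_b \mid x_b)} \left[ \log p(y_b \mid s_b, S_S, Y_S)\, p(x_b \mid s_b, z_b) \right] \\
&\qquad - D_{KL}\big(q(s_b \mid x_b) \,\|\, p(s_b)\big)
   - D_{KL}\big(q(z_b \mid x_b) \,\|\, p(z_b)\big)
   + D_{KL}\big(q(s_b, z_b \mid x_b) \,\|\, p(s_b, z_b \mid x_b, y_b)\big) \Big) \tag{27}
\end{align*}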
A.3 Lower Bound Conditioned on Support Examples (XS)
In this section, the lower bound of the log-likelihood for a
batch is derived, conditioned on the support examples $X_S$. The
likelihood $p(X_B, Y_B \mid S_S, Y_S)$, whose logarithm was derived in
Equation 27, is marginalized over $p(S_S, Z_S \mid X_S, Y_S)$.
Jensen's inequality then gives the lower bound in Equation 28.
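\begin{align*}
\log p(Y_B, X_B \mid X_S, Y_S)
&= \log \iint p(S_S, Z_S \mid X_S, Y_S) \prod_{(x_b, y_b)} p(y_b, x_b \mid S_S, Y_S) \, dS_S \, dZ_S \\
&= \log \iint q(S_S, Z_S \mid X_S) \, \frac{p(S_S, Z_S \mid X_S, Y_S)}{q(S_S, Z_S \mid X_S)} \prod_{(x_b, y_b)} p(y_b, x_b \mid S_S, Y_S) \, dS_S \, dZ_S \\
&\geq \mathbb{E}_{S_S, Z_S \sim q(S_S, Z_S \mid X_S)} \left[ \sum_{(x_b, y_b)} \log p(x_b, y_b \mid S_S, Y_S)
   - \log \frac{q(S_S, Z_S \mid X_S)}{p(S_S, Z_S \mid X_S, Y_S)} \right] \\
&= \mathbb{E}_{S_S \sim q(S_S, Z_S \mid X_S)} \left[ \sum_{(x_b, y_b)} \log p(x_b, y_b \mid S_S, Y_S) \right]
   - \underbrace{D_{KL}\big(q(S_S, Z_S \mid X_S) \,\|\, p(S_S, Z_S \mid X_S, Y_S)\big)}_{\text{intractable}} \tag{28}
\end{align*}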
A.4 Intermezzo: Factorizing the Support Set KL Term
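Note that the distributions of the support set factorize as
follows. The variational distribution factorizes such that
$q(S_S, Z_S \mid X_S) = \prod_{x_s} q(s_s, z_s \mid x_s)$, and the
posterior distribution factorizes as
$p(S_S, Z_S \mid X_S, Y_S) = \prod_{(x_s, y_s)} p(s_s, z_s \mid x_s, y_s)$.
The KL term therefore decomposes into a sum over the support
examples (Equation 29).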
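\begin{align*}
D_{KL}\big(q(S_S, Z_S \mid X_S) \,\|\, p(S_S, Z_S \mid X_S, Y_S)\big)
&= \iint q(S_S, Z_S \mid X_S) \log \frac{q(S_S, Z_S \mid X_S)}{p(S_S, Z_S \mid X_S, Y_S)} \, dS_S \, dZ_S \\
&= \int \prod_{x_s} q(s_s, z_s \mid x_s) \, \log \frac{\prod_{x_s} q(s_s, z_s \mid x_s)}{\prod_{(x_s, y_s)} p(s_s, z_s \mid x_s, y_s)} \, \prod_{s_s} ds_s \prod_{z_s} dz_s \\
&= \int \prod_{x_s} q(s_s, z_s \mid x_s) \sum_{(x_s, y_s)} \log \frac{q(s_s, z_s \mid x_s)}{p(s_s, z_s \mid x_s, y_s)} \, \prod_{s_s} ds_s \prod_{z_s} dz_s \\
&= \sum_{(x_s, y_s)} \iint q(s_s, z_s \mid x_s) \log \frac{q(s_s, z_s \mid x_s)}{p(s_s, z_s \mid x_s, y_s)} \, ds_s \, dz_s \\
&= \sum_{(x_s, y_s)} D_{KL}\big(q(s_s, z_s \mid x_s) \,\|\, p(s_s, z_s \mid x_s, y_s)\big) \tag{29}
\end{align*}
The fourth equality holds because, for each term of the sum, every
factor of the variational distribution that belongs to a different
support example integrates to one.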
A.5 Collecting Terms
In this section we combine terms from Equations 27-29.
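\begin{align*}
\log p(Y_B, X_B \mid X_S, Y_S)
&\geq \mathbb{E}_{S_S \sim q(S_S, Z_S \mid X_S)} \left[ \sum_{(x_b, y_b)} \log p(x_b, y_b \mid S_S, Y_S) \right]
   - D_{KL}\big(q(S_S, Z_S \mid X_S) \,\|\, p(S_S, Z_S \mid X_S, Y_S)\big) \\
&= \mathbb{E}_{S_S \sim q(S_S, Z_S \mid X_S)} \Bigg[ \sum_{(x_b, y_b)} \Big(
   \mathbb{E}_{s_b, z_b \sim q(s_b, z_b \mid x_b)} \left[ \log p(y_b \mid s_b, S_S, Y_S)\, p(x_b \mid s_b, z_b) \right] \\
&\qquad - D_{KL}\big(q(s_b \mid x_b) \,\|\, p(s_b)\big) - D_{KL}\big(q(z_b \mid x_b) \,\|\, p(z_b)\big)
   + D_{KL}\big(q(s_b, z_b \mid x_b) \,\|\, p(s_b, z_b \mid x_b, y_b)\big) \Big) \Bigg] \\
&\qquad - \sum_{(x_s, y_s)} D_{KL}\big(q(s_s, z_s \mid x_s) \,\|\, p(s_s, z_s \mid x_s, y_s)\big) \\
&= \mathbb{E}_{S_S \sim q(S_S, Z_S \mid X_S)} \left[ \sum_{(x_b, y_b)}
   \mathbb{E}_{s_b, z_b \sim q(s_b, z_b \mid x_b)} \left[ \log p(y_b \mid s_b, S_S, Y_S)\, p(x_b \mid s_b, z_b) \right] \right] \\
&\qquad + \sum_{(x_b, y_b)} \Big( - D_{KL}\big(q(s_b \mid x_b) \,\|\, p(s_b)\big) - D_{KL}\big(q(z_b \mid x_b) \,\|\, p(z_b)\big) \Big) \\
&\qquad + \sum_{(x_b, y_b)} D_{KL}\big(q(s_b, z_b \mid x_b) \,\|\, p(s_b, z_b \mid x_b, y_b)\big)
   - \sum_{(x_s, y_s)} D_{KL}\big(q(s_s, z_s \mid x_s) \,\|\, p(s_s, z_s \mid x_s, y_s)\big) \tag{30}
\end{align*}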
A.6 Objective Function
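The last two terms in Equation 30 can be ignored. Both are sums of
KL terms between the variational distribution and the posterior of a
single example, and batch and support examples are drawn from the
same data distribution. The expectation of the difference between
the two sums is therefore greater than or equal to zero, under
the condition that |B| ≥ |S| (Equation 31).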
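\begin{equation*}
\mathbb{E}_{(X_B, Y_B), (X_S, Y_S) \sim P_{data}} \left[
  \sum_{(x_b, y_b)} D_{KL}\big(q(s_b, z_b \mid x_b) \,\|\, p(s_b, z_b \mid x_b, y_b)\big)
  - \sum_{(x_s, y_s)} D_{KL}\big(q(s_s, z_s \mid x_s) \,\|\, p(s_s, z_s \mid x_s, y_s)\big)
\right] \geq 0 \tag{31}
\end{equation*}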
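The reasoning behind this inequality can be spelled out in one line. The sketch below introduces the symbol $K$ for the expected per-example KL term, which is notation added here for illustration, and it assumes that batch and support examples are identically distributed so that every per-example term has the same expectation.

```latex
\begin{align*}
K &:= \mathbb{E}_{(x, y) \sim P_{data}} \left[ D_{KL}\big(q(s, z \mid x) \,\|\, p(s, z \mid x, y)\big) \right] \geq 0, \\
\mathbb{E}\left[ \sum_{(x_b, y_b)} D_{KL}(\cdot) - \sum_{(x_s, y_s)} D_{KL}(\cdot) \right]
  &= |B| \, K - |S| \, K = (|B| - |S|) \, K \geq 0
  \quad \text{whenever } |B| \geq |S|.
\end{align*}
```

The non-negativity of $K$ follows from the non-negativity of the KL divergence.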
In conclusion, the objective function can be written as a
simplification of Equation 30, in which the last two terms are
ignored. The function is presented in Equation 32. The inequality
originates from both the lower bound obtained when marginalizing
over the posterior of the support set content, and the inequality
formulated in Equation 31.
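\begin{align*}
&\mathbb{E}_{(X_B, Y_B), (X_S, Y_S) \sim P_{data}} \left[ \log p(Y_B, X_B \mid X_S, Y_S) \right] \\
&\quad \geq \mathbb{E}_{P_{data}} \Bigg[ \mathbb{E}_{S_S \sim q(S_S, Z_S \mid X_S)} \left[ \sum_{(x_b, y_b)}
   \mathbb{E}_{s_b, z_b \sim q(s_b, z_b \mid x_b)} \left[ \log p(y_b \mid s_b, S_S, Y_S)\, p(x_b \mid s_b, z_b) \right] \right] \\
&\qquad + \sum_{(x_b, y_b)} \Big( - D_{KL}\big(q(s_b \mid x_b) \,\|\, p(s_b)\big) - D_{KL}\big(q(z_b \mid x_b) \,\|\, p(z_b)\big) \Big) \Bigg] \tag{32}
\end{align*}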
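A single-sample estimate of this objective for one batch can be sketched as follows. The example assumes that the class probability is a softmax over per-class scores (for instance negative distances to the support content), that pixels follow independent Bernoulli distributions, and that the two KL terms are computed elsewhere in closed form. These choices and all function names are assumptions made for illustration, not the exact implementation of the thesis.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(v - top) for v in scores))
    return [v - log_norm for v in scores]


def bernoulli_log_likelihood(pixels, probabilities, eps=1e-7):
    """log p(x | s, z) for binary pixels under independent Bernoulli outputs."""
    total = 0.0
    for x, p in zip(pixels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += x * math.log(p) + (1.0 - x) * math.log(1.0 - p)
    return total


def query_term(class_scores, label_index, pixels, probabilities, kl_s, kl_z):
    """One summand of the bound: log p(y|s,S,Y) + log p(x|s,z) - KL_s - KL_z."""
    log_p_y = log_softmax(class_scores)[label_index]
    log_p_x = bernoulli_log_likelihood(pixels, probabilities)
    return log_p_y + log_p_x - kl_s - kl_z


def batch_objective(query_terms):
    """Single-sample estimate of the bound for one batch (to be maximized)."""
    return sum(query_terms)


# Toy query: two equally likely classes, two pixels predicted with p = 0.5.
term = query_term([0.0, 0.0], 0, [1.0, 0.0], [0.5, 0.5], kl_s=0.0, kl_z=0.0)
objective = batch_objective([term, term])
```

In practice the negative of this quantity would be minimized with a gradient-based optimizer, with the class scores computed from the sampled query content and the support content.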