Domain Adaptive Computational Models for Computer Vision
by
Hemanth Kumar Demakethepalli Venkateswara
A Dissertation Presented in Partial Fulfillment of the Requirements for the Degree
Doctor of Philosophy
Approved March 2017 by the Graduate Supervisory Committee:
Sethuraman Panchanathan, Chair
Baoxin Li
Hasan Davulcu
Jieping Ye
Shayok Chakraborty
ARIZONA STATE UNIVERSITY
May 2017
ABSTRACT
The widespread adoption of computer vision models is often constrained by the issue
of domain mismatch. Models trained with data belonging to one distribution perform
poorly when tested with data from a different distribution. Variations in
vision based data can be attributed to the following reasons, viz., differences in image
quality (resolution, brightness, occlusion and color), changes in camera perspective,
dissimilar backgrounds and an inherent diversity of the samples themselves. Machine
learning techniques like transfer learning are employed to adapt computational models
across distributions. Domain adaptation is a special case of transfer learning, where
knowledge from a source domain is transferred to a target domain in the form of
learned models and efficient feature representations.
The dissertation outlines novel domain adaptation approaches across different feature spaces: (i) a linear Support Vector Machine model for domain alignment; (ii) a nonlinear kernel-based approach that embeds domain-aligned data for enhanced classification; (iii) a hierarchical model, implemented using deep learning, that estimates domain-aligned hash values for the source and target data; and (iv) a proposal for a feature selection technique to reduce cross-domain disparity. These adaptation
procedures are tested and validated across a range of computer vision applications
like object classification, facial expression recognition, digit recognition, and activity
recognition. The dissertation also provides a unique perspective of domain adaptation
literature from the point-of-view of linear, nonlinear and hierarchical feature spaces.
The dissertation concludes with a discussion on the future directions for research that
highlight the role of domain adaptation in an era of rapid advancements in artificial
intelligence.
ACKNOWLEDGEMENTS
You gave me the opportunity to grow and the freedom to explore
and when success came my way there was none who cheered more;
‘Dr. Panch is my advisor’ is a badge I will proudly wear,
it's my privilege and honor; you're guru, guide and chair.
I am indebted to Dr. Ye, for guidance in machine learning,
He showed me the ropes when he took me under his wing;
To the marquee team of Drs. Chakraborty, Davulcu, Li and Ye,
it means laurels to me to have you on my committee;
SCIDSE and ASU, your support has been relentless;
For seeing in me a TA - Navabi, Mutsumi and Calliss;
Christina from advising and Kathy et al. from Fulton;
Pam, Monica, Teresa and Brint - I thank you a ton.
Exploring uncharted waters upon the merry CUbiC boat,
a riot of swashbucklers kept my dreams and spirits afloat;
We conquered the horizon and raised the ASU flag high,
ably led by Cap’n Troy, we strived and aimed for the sky.
To CUbiC champions Morris, Terri, Rita and my mentor Vineeth;
To friends who cheered me on - Sai, Indu, Ganesh and Prasanth;
To buddies Mike, Corey, Scott, Brian, Arash, Ramesh, Ramin,
Meredith, Bijan, and partners Jose, Hiranmayi, Ragav and Binbin;
You’ve made this possible and I give many thanks from deep within.
The Sai Center in Mesa is an oasis in the desert;
Its people nourished my body and their music my soul;
On dreary days when I felt broken and wanted to quit,
I drank from its cool spring and rejuvenated my spirit.
Your love and blessings have guided me through and through;
Swami, Tata, Ajji, Amma and Naanna, I offer this to you.
learning, zero-shot learning and domain adaptation.
1. Multitask Learning (MTL): In this setting, labeled training data is available
for a set of K tasks T = {T_1, T_2, . . . , T_K}, where each task is associated with a
different domain, D = {D_1, D_2, . . . , D_K}. Given the k-th task, it is not possible
to reliably estimate the empirical joint distribution P_k(X, Y) with data from
the k-th domain alone, D_k = {x_i^k, y_i^k}_{i=1}^{n_k}, with x_i^k ∈ X_k and y_i^k ∈ Y_k. A good
approximation for P_k(X, Y) is instead learned by exploiting the training data
from all the domains D = {D_1, D_2, . . . , D_K} and learning all the tasks
simultaneously (Bruzzone and Marconcini (2010)). The tasks are distinct even when
their domains are identical, and in terms of availability of labels, all the domains
usually have labels. An introduction and a survey of
multitask learning procedures are provided in Caruana (1997); Thrun and Pratt (2012).
As an example, consider a problem with K tasks, where each task is represented
by X_k ∈ R^{n_k×d}, with n_k the number of samples and d the dimension.
The labels are represented by Y_k ∈ R^{n_k×1}. The goal is to estimate a simple
linear model W = [W_1, W_2, . . . , W_K] such that Y_k = X_k W_k. A standard way
to model task relatedness is to assume that the W_k are close to one another.
Regularized Multitask Learning by Evgeniou and Pontil (2004) is a landmark
work that incorporates this assumption. The optimization problem seeks to minimize,
min_W (1/K) [ Σ_{k=1}^{K} Loss(X_k, W_k, Y_k) + λ Σ_{k=1}^{K} || W_k − (1/K) Σ_{k'=1}^{K} W_{k'} ||_2^2 ]    (2.2)
The first term is a standard loss term. The second term captures the inter-task
relationship: assuming the tasks are closely related, it minimizes the distance of
each task from the mean task. Although the model is elegant, real-world tasks need
not be so closely related to one another. Improvements upon
world tasks need not be so closely related to one another. Improvements upon
this very basic model, along with other procedures to perform multitask learning
are outlined in the following works: Bakker and Heskes (2003); Evgeniou and
Pontil (2007); Collobert and Weston (2008); Weinberger et al. (2009); Kang
et al. (2011); Kumar and Daume III (2012) and Gong et al. (2012b).
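The mean-regularized objective in Equation (2.2) can be sketched in a few lines of NumPy. This is a minimal illustration with a squared loss; the function name, learning rate, and the simplification of treating the mean task as fixed within each gradient step are assumptions of this sketch, not part of Evgeniou and Pontil's formulation.

```python
import numpy as np

def mean_regularized_mtl(Xs, Ys, lam=0.1, lr=0.01, iters=500):
    """Gradient descent on a squared-loss version of Equation (2.2):
    per-task loss plus a penalty pulling each W_k toward the mean task."""
    K, d = len(Xs), Xs[0].shape[1]
    W = np.zeros((K, d))
    for _ in range(iters):
        W_mean = W.mean(axis=0)                      # current mean task
        G = np.zeros_like(W)
        for k in range(K):
            resid = Xs[k] @ W[k] - Ys[k]
            G[k] = 2.0 * Xs[k].T @ resid / len(Ys[k])   # squared-loss gradient
            G[k] += 2.0 * lam * (W[k] - W_mean)         # distance-to-mean penalty
        W -= lr * G / K                                 # the 1/K factor from Eq. (2.2)
    return W
```

With λ → 0 the tasks decouple into independent regressions; with large λ all the W_k collapse toward a single shared model.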
2. Self-taught Learning : This learning paradigm was introduced in Raina et al.
(2007). The concept of learning is based on how humans learn in an unsuper-
vised manner from unlabeled data. In this paradigm, the transfer of knowledge
is from unrelated domains in the form of learned representations. Given unlabeled
data {x_u^1, . . . , x_u^k}, where x_u^i ∈ R^d, the self-taught learning framework
estimates a set of K basis vectors that are later used as a basis to represent the
Figure 2.1: Different learning paradigms with labeled data in orange border. Supervised learning uses labeled examples, semi-supervised learning uses additional unlabeled examples, transfer learning uses additional labeled examples from a different domain, and self-taught learning uses unlabeled data to learn. (Image based on Raina et al. (2007)).
target data. Specifically,
min Σ_i || x_u^i − Σ_{j=1}^{K} a_i^j b_j ||^2 + β || a_i ||_1

s.t. || b_j || ≤ 1, ∀ j ∈ 1, . . . , K    (2.3)

where {b_1, . . . , b_K} is the set of basis vectors learned from the unlabeled data,
with b_j ∈ R^d. For an input data point x_u^i, the corresponding sparse representation
is a_i = {a_i^1, . . . , a_i^K}, with a_i^j corresponding to the basis vector b_j. The transfer
of learning occurs when the same set of basis vectors {b_1, . . . , b_K} is used as
a basis to represent labeled target data. Figure (2.1) provides an overview of
the different learning paradigms compared with self-taught learning. Although
Figure (2.1) distinguishes transfer learning from self-taught learning, this
discussion treats the latter as a special case of transfer learning. The unlabeled dataset
(outdoor scene images in Figure (2.1)) can be considered as the source data set
and the labeled dataset (elephants and rhinos in Figure (2.1)) can be treated
as the target dataset. Some of the prominent machine learning and computer
vision techniques that incorporate self-taught learning are Yang et al. (2009);
Bengio (2009); Lee et al. (2009); Mairal et al. (2010).
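A minimal sketch of the optimization in Equation (2.3) can be written with ISTA (iterative soft-thresholding) for the sparse codes and alternating least squares for the basis. The alternating scheme, function names, and step counts are illustrative assumptions of this sketch, not the algorithm of Raina et al. (2007).

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code(X, B, beta, steps=200):
    """ISTA for min_A ||X - A B||_F^2 + beta ||A||_1; rows of A are the codes a_i."""
    L = 2.0 * np.linalg.norm(B @ B.T, 2) + 1e-12     # Lipschitz constant of the gradient
    A = np.zeros((X.shape[0], B.shape[0]))
    for _ in range(steps):
        grad = 2.0 * (A @ B - X) @ B.T
        A = soft_threshold(A - grad / L, beta / L)
    return A

def self_taught_basis(X_unlab, K, beta=0.05, outer=15, seed=0):
    """Alternate between l1-penalized coding and a least-squares basis
    update, renormalizing rows so that ||b_j|| <= 1 as in Eq. (2.3)."""
    rng = np.random.default_rng(seed)
    B = rng.normal(size=(K, X_unlab.shape[1]))
    B /= np.linalg.norm(B, axis=1, keepdims=True)
    for _ in range(outer):
        A = sparse_code(X_unlab, B, beta)
        B = np.linalg.lstsq(A, X_unlab, rcond=None)[0]
        B /= np.maximum(np.linalg.norm(B, axis=1, keepdims=True), 1.0)
    return B
```

The transfer step is then simply `sparse_code(X_target, B, beta)`: the basis learned from unlabeled data re-represents the labeled target data.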
3. Sample Selection Bias: The concept of sample selection bias was introduced in
economics by James Heckman (Heckman (1979)), in work that later won a Nobel prize.
When the distribution of sampled data does not reflect the true distribution of the
dataset it is sampled from, it is a case of sample selection bias.
For example, a financial bank intends to model the profile of a loan defaulter in
order to deny such defaulters a loan from the bank. It therefore builds a model
based on the loan defaulters it has in its records. However, these records are a
small, biased subset and do not faithfully represent the general public the bank
wants to profile but cannot access. The defaulter profile generated by the bank
is therefore offset by what is termed the sample selection bias.
In this learning scenario, a dataset X = {x_i, y_i}_{i=1}^{n} is made available. This
dataset is used to estimate the joint distribution P̂(X, Y), which is an approximation
of the true joint distribution P(X, Y). However, P̂(X, Y) ≠ P(X, Y), where
P̂(X, Y) is the estimated distribution and P(X, Y) is the true distribution
(Bruzzone and Marconcini (2010)). This could be because of very few data samples,
which could lead to a poor estimate of the prior distribution, P̂(X) ≠ P(X). In
other cases, the training data does not represent the target (test) data and
introduces a bias in the class prior (P̂(Y) ≠ P(Y)), which eventually leads to an
incorrect estimate of the conditional (P̂(Y|X) ≠ P(Y|X)). When both the marginals
(P̂(X) ≠ P(X)) and the conditionals differ (P̂(Y|X) ≠ P(Y|X)), the problem is
referred to as sample selection bias (Zadrozny (2004); Dudík et al. (2005); Huang
et al. (2006)). When only the marginals vary (P̂(X) ≠ P(X)) and the conditionals
are approximately equal (P̂(Y|X) ≈ P(Y|X)), the problem is termed covariate shift
(Shimodaira (2000); Quionero-Candela et al. (2009); Bickel et al. (2009); Gretton
et al. (2009)).
4. Lifelong Machine Learning (LML): The concept of lifelong learning was discussed
by Thrun in the seminal work Thrun (1996). The concept of transfer in lifelong
learning can be formulated as follows. A machine learning model trained for K
tasks {T_1, T_2, . . . , T_K} is updated by learning task T_{K+1} with data D_{K+1}.
The work examined whether learning the (K+1)-th task is easier than learning the
first task. The key characteristics of lifelong learning are: (i) a continuous
learning process, (ii) knowledge accumulation, and (iii) use of past knowledge
to assist in future learning (Fei et al. (2016)).
Lifelong machine learning differs from multitask learning because it retains
knowledge about previous tasks and applies that knowledge to learn new tasks.
It also differs from standard domain adaptation which transfers knowledge to
learn only one task (the target). This concept of lifelong learning is closely
related to the paradigm of incremental learning, where a model is updated with
new data to learn a new task. Lifelong machine learning can also be viewed as
lifelong incremental learning. However, some incremental learners depend on
data from previous tasks when learning a new task Mensink et al. (2013). Other
approaches learn succinct data representations (also termed as exemplars) to
model data from previous tasks and recall them when updating the classifier
for a new task Rebuffi et al. (2017). In an uncompromising form of incremental
learning, no data is used from previous tasks and the learner is updated using
only the data from the new task Li and Hoiem (2016).
5. One-shot Learning and Zero-shot Learning : These can be viewed as extreme
cases of transfer learning Goodfellow et al. (2016). Both these forms of transfer
seek to learn data categories from minimal data. The key motivation is the
The above derivation uses the covariate shift assumption (P_S(Y|X) ≈ P_T(Y|X)
and P_S(X) ≠ P_T(X)) and the concept of shared support, i.e., P_S(x) = 0 iff
P_T(x) = 0. The weight for each source data point is p_T(x)/p_S(x). Domain
adaptation approaches that estimate the weights for source data points are called
instance weighting techniques. Most such algorithms estimate this weight using
Kernel Mean Matching (Gretton et al. (2009)) or Kullback-Leibler divergence
(Sugiyama et al. (2008)). Domain adaptation can also be viewed as a case of
covariate shift or sample selection bias (Quionero-Candela et al. (2009)).
Another approach to domain adaptation is feature matching, where a shared
feature representation between the source and target is estimated. Examples of
this technique are discussed in Pan et al. (2008, 2011); Long et al. (2013).
Other procedures learn feature subspaces that are common to the source and
target datasets. They project both datasets into that subspace, train a classifier
on the projected source data, and expect it to perform well on the projected target
data (Fernando et al. (2013); Gong et al. (2012a); Long et al. (2014); Hoffman
et al. (2012)). Recently, there has been work modeling differences in the class prior
and conditional shift Zhang et al. (2013). All of the above procedures can be
viewed as fixed representation approaches. In a fixed representation approach,
the features are predetermined and fixed and domain adaptation is performed
using these pre-determined features. In recent years deep learning approaches
have outperformed fixed representation techniques in domain adaptation. Deep
learning based domain adaptation learns to extract transferable feature repre-
sentations using deep neural networks Tzeng et al. (2014); Long et al. (2015);
Ganin et al. (2016). The following chapter provides a classification of domain
adaptation procedures including recent approaches involving deep learning.
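As an illustrative sketch of instance weighting (not Kernel Mean Matching itself), the density ratio p_T(x)/p_S(x) can be estimated with a probabilistic source-vs-target classifier, since w(x) ∝ P(target|x)/P(source|x). The logistic-regression stand-in and all function names here are assumptions made for illustration.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, iters=2000):
    """Plain batch-gradient logistic regression (labels in {0, 1})."""
    Xb = np.hstack([X, np.ones((len(X), 1))])       # bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30)))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def importance_weights(X_src, X_tgt):
    """Estimate w(x) = p_T(x)/p_S(x) with a domain classifier: train
    source-vs-target, then take P(target|x)/P(source|x) on the source points."""
    X = np.vstack([X_src, X_tgt])
    y = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt))])
    w = fit_logistic(X, y)
    Xb = np.hstack([X_src, np.ones((len(X_src), 1))])
    p_tgt = 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30)))
    ratio = p_tgt / np.clip(1.0 - p_tgt, 1e-8, None)
    return ratio * len(X_src) / len(X_tgt)          # correct for sample-size imbalance
```

Source points that look like target points receive large weights, so a classifier trained with these weights emphasizes the region where the target density is high.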
In addition to these types of transfer learning there are a couple of other learning
approaches that have elements of knowledge transfer. The problem of concept drift
occurs when the data changes its distribution gradually over time. Models in this
case must adapt to the changing distribution while also transferring knowledge from
previously seen data. Multimodal learning can also be viewed as an example of
transfer learning where a relationship is captured between representations in multiple
modalities in order to enhance learning Srivastava and Salakhutdinov (2012).
2.1.3 Types of Domain Adaptation
This subsection outlines the different approaches to posing problems in domain
adaptation. A detailed discussion on different approaches to solving domain adap-
tation problems is described in the following chapter. This subsection merely lists
the different ways in which a domain adaptation problem is posed. There are two
standard domain adaptation problem statements.
1. Supervised or Semi-supervised Domain Adaptation: In this setting the source
domain consists of labeled data D_s = {x_i^s, y_i^s}_{i=1}^{n_s}, and the target
domain consists of labeled data along with unlabeled data,
D_t = {x_i^t, y_i^t}_{i=1}^{n_t} ∪ {x_i^t}_{i=n_t+1}^{n_t+n_u}. There are n_t labeled
target data points and n_u unlabeled target data points, with n_t ≪ n_u. However,
it is not possible to estimate the joint distribution P_T(X, Y) over the target
without the risk of overfitting, because of the limited number of target samples n_t.
The source dataset has more labeled samples than the target, with n_t ≪ n_s,
and it can be used to estimate the joint distribution PS(X, Y ). Therefore, the
source dataset Ds is used along with Dt to train a classifier for the target data
as in Daume III et al. (2010); Saenko et al. (2010); Hoffman et al. (2013) and
Venkateswara et al. (2015b). When ns and nt are of similar size, the problem
can also be viewed as a multi-task learning setup.
2. Unsupervised Domain Adaptation: This is by far the most standard and also
the most challenging approach to domain adaptation. In this setting the source
domain consists of labeled data D_s = {x_i^s, y_i^s}_{i=1}^{n_s} and the target
domain consists of only unlabeled data D_t = {x_i^t}_{i=1}^{n_t}. The task is to
learn a classifier for the
i=1. The task is to learn a classifier for the
target data using the source dataset Ds and the target dataset Dt. There is no
restriction on the number of source and target samples. The source data can be
used to estimate the joint distribution PS(X, Y ). But the source data is adapted
to the target using the unlabeled target data to approximate PT (X, Y ) as in
Gopalan et al. (2011); Gong et al. (2012a); Long et al. (2014) and Venkateswara
et al. (2016).
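The subspace-based procedures cited above can be illustrated with Subspace Alignment (Fernando et al. (2013)): learn PCA bases for the source and target, align them with M = P_s^T P_t, and compare data in the aligned coordinates. The centering step and function names are minor assumptions of this sketch.

```python
import numpy as np

def pca_basis(X, k):
    """Top-k principal directions (as columns) of centered X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T                                  # shape (d, k)

def subspace_align(X_src, X_tgt, k):
    """Subspace Alignment: M = P_s^T P_t maps source PCA coordinates into
    the target PCA frame; a classifier trained on the aligned source data
    is then applied to the target data projected onto its own subspace."""
    Ps, Pt = pca_basis(X_src, k), pca_basis(X_tgt, k)
    M = Ps.T @ Pt                                    # (k, k) alignment matrix
    S_aligned = (X_src - X_src.mean(0)) @ Ps @ M     # source in aligned coords
    T_proj = (X_tgt - X_tgt.mean(0)) @ Pt            # target in its own subspace
    return S_aligned, T_proj
```

Because M absorbs rotations and sign flips between the two PCA bases, a nearest-neighbor or linear classifier trained on `S_aligned` transfers to `T_proj` without target labels.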
Apart from these standard setups for domain adaptation, there is also multi-source
domain adaptation where, as the name indicates, there are multiple source domains
and one target domain, as in Mansour et al. (2009); Chattopadhyay et al. (2012).
The multiple source domains are jointly adapted to the target to estimate a
classifier for the target.
2.2 Performance Bounds for Domain Adaptation
Generalization bounds give the probability that a function chosen from a hypoth-
esis set achieves a certain error in a statistical learning model. Generalized learning
bounds have been applied to evaluate the consistency of Empirical Risk Minimization
(ERM) based learning methods Vapnik (2013). However, these learning bounds are
based on the assumption that the test set is drawn from the same distribution as
the training set. There have been attempts to adapt these generalization bounds for
domain adaptation, where the test (target) set is from a different distribution than
the training (source) set Ben-David et al. (2010); Mansour et al. (2009) and Zhang
et al. (2012).
A binary classification task is considered with input space X ⊆ Rd and label space
Y = {0, 1}. The source domain is denoted by DS = {(X × Y), PS(X, Y )}, where
PS(X, Y ) is the source joint distribution. Similarly, the target domain is denoted by
DT = {(X × Y), PT (X, Y )}, where PT (X, Y ) is the target joint distribution. The
source dataset consists of labeled data D_s = {x_i^s, y_i^s}_{i=1}^{n_s} and the
target dataset is D_t = {x_i^t}_{i=1}^{n_t}. The goal of domain adaptation is to
learn a classifier h : X → Y for the target data with minimum risk of prediction,

ε_T(h) = Pr_{(x^t, y^t) ∼ D_T} ( h(x^t) ≠ y^t )    (2.5)

where Pr(·) denotes probability.
2.2.1 Divergence Between Domains
The standard procedure bounds the target error in terms of the source error plus
a factor measuring the discrepancy between the source and the target. This
reasoning is based on the notion that the source error is a good substitute for the
target error when the distributions are similar. The distance between the marginal
distributions P_S^X and P_T^X is defined for a hypothesis class H and is termed
the H-divergence (Kifer et al. (2004)).
d_H(P_S^X, P_T^X) = 2 sup_{h ∈ H} | Pr_{x^s ∼ P_S^X} ( h(x^s) = 1 ) − Pr_{x^t ∼ P_T^X} ( h(x^t) = 1 ) |.    (2.6)
The H-divergence is based on the ability of the hypothesis class H to distinguish
between samples generated from P_S^X and P_T^X. An empirical divergence can also be
estimated based on samples D_s and D_t from the two domains (Ben-David et al. (2010)).
d̂_H(D_s, D_t) = 2 ( 1 − min_{h ∈ H} [ (1/n_s) Σ_{i=1}^{n_s} I[h(x_i^s) = 1] + (1/n_t) Σ_{i=1}^{n_t} I[h(x_i^t) = 0] ] )    (2.7)
where I[c] is an indicator function that is 1 when the condition c is true and 0
otherwise. The goal is to determine the classifier that best differentiates between
the two domains. The H-divergence is then expressed in terms of the error of that
classifier.
2.2.2 Proxy Divergence Measure
Determining the best classifier may be difficult in practice. Even in the space
of linear classifiers, minimizing Equation (2.7) may be intractable. Ben-David et al.
(2010) outline a procedure to determine a proxy distance as a substitute for the
H-divergence. The proxy distance is based on the same principle of learning a
classifier to distinguish between the two domains. The Proxy A-distance is given by,
d_A = 2(1 − 2ε)    (2.8)

where ε is the average error of a linear classifier trained to distinguish between
data points from the two domains. The proxy distance has been used to estimate the
distance between pairs of datasets in domain adaptation experiments (Glorot et al.
(2011); Long et al. (2015); Venkateswara et al. (2017b)).
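The proxy A-distance of Equation (2.8) can be computed directly by training a linear domain classifier. This sketch uses a small logistic regression and, as a simplification, the training error for ε rather than a proper held-out estimate; the function name is an assumption.

```python
import numpy as np

def proxy_a_distance(X_src, X_tgt, lr=0.1, iters=2000):
    """Proxy A-distance, d_A = 2(1 - 2*eps), where eps is the error of a
    linear (logistic) classifier separating source from target samples."""
    X = np.vstack([X_src, X_tgt])
    y = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt))])
    Xb = np.hstack([X, np.ones((len(X), 1))])        # bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30)))
        w -= lr * Xb.T @ (p - y) / len(y)
    eps = np.mean(((Xb @ w) > 0).astype(float) != y)  # domain-classification error
    return 2.0 * (1.0 - 2.0 * eps)
```

When the two samples come from the same distribution, ε approaches 0.5 and d_A approaches 0; when they are easily separable, ε approaches 0 and d_A approaches 2.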
2.2.3 Generalization Bound on Target Risk
To estimate a generalization bound for the target error, a few definitions are first
outlined. For a hypothesis function h : X → {0, 1} and a marginal distribution P_S^X,
the probability that the hypothesis disagrees with the labeling function f(·) is given
by,

ε_S(h, f) = E_{x ∼ P_S^X} [ |h(x) − f(x)| ]    (2.9)
If H is a hypothesis class and A_H is a set of subsets over X that form a support
for the hypothesis set H, i.e., ∀h ∈ H, {x : x ∈ X, h(x) = 1} ∈ A_H, then the
distance between two distributions P_S^X and P_T^X can be defined according to
Blitzer et al. (2008) as:

d_H(P_S^X, P_T^X) = 2 sup_{A ∈ A_H} | Pr_{P_S^X}[A] − Pr_{P_T^X}[A] |.    (2.10)
Following the definition of distance in Equation (2.10), a symmetric difference
hypothesis space is defined as,

H∆H = { h(x) ⊕ h′(x) : h, h′ ∈ H },    (2.11)

where the XOR operator ⊕ indicates that every g ∈ H∆H labels x as positive exactly
when ∃h, h′ ∈ H : h(x) ≠ h′(x). Similarly, A_{H∆H} is defined as the set of all
subsets A such that A = {x : x ∈ X, h(x) ≠ h′(x)} for some h, h′ ∈ H. Using the
definition of error probability in Equation (2.9), the definition of distance in
Equation (2.10) and the symmetric difference hypothesis space in Equation (2.11),
Ben-David et al. (2010) bound the difference in disagreement errors between the two
distributions as,

|ε_S(h, h′) − ε_T(h, h′)| ≤ (1/2) d_{H∆H}(P_S^X, P_T^X).    (2.12)
Blitzer et al. (2008) derive a target error bound for a pair of source and target
datasets D_s and D_t with n data samples each, based on a hypothesis space H with
VC-dimension d. With probability at least 1 − δ (over the choice of samples), for
every h ∈ H,

ε_T(h) ≤ ε_S(h) + (1/2) d_{H∆H}(D_s, D_t) + 4 sqrt( (2d log(2n) + log(4/δ)) / n ) + λ.    (2.13)

Here λ = min_{h ∈ H} [ε_T(h) + ε_S(h)], the sum of the source and target errors of
the minimum-error hypothesis. The bound guarantees that if the distance d_{H∆H}
between the domains is small (i.e., the domains are similar) and n is large, then
the target error is well approximated by the source error. If the combined error λ
of the ideal hypothesis is large, then no classifier trained on the source data will
be a good target hypothesis. If λ is small (as it usually is for domain adaptation),
then d_{H∆H} is the key factor in estimating the target error.
2.3 Domain Adaptation in Computer Vision
This section outlines how domain adaptation research is currently being pursued.
It brings to attention the drawbacks with current approaches and proposes changes
to the field.
2.3.1 Research in Domain Adaptation
While the problem of variations in data coming from different distributions remains
unsolved, the solutions provided by domain adaptation models have not been easily
applied to real-world applications in computer vision. This can be attributed
to the manner in which domain adaptation models are currently being developed
and evaluated in the research community. Most of the proposed solutions in domain
adaptation are based on models developed in the following environment: (i) Two
different datasets to represent the source and the target domains. (ii) Source dataset
with labeled data and target dataset with unlabeled data. (iii) The label space of
the source and target being exactly the same. These restrictions limit the adoption
of the proposed domain adaptation models to real world problems and confine them
to the research community.
While these self-imposed restrictions help to formulate a well-defined domain
adaptation problem, they do not reflect a real world setting. An environment for
domain adaptation that is closer to a real world setting ought to be: (i) Two well
defined domains (rather than datasets) to represent the source and the target. This
would entail that the domain shift between the two domains is modeled. (ii) Both the
source and target domains have labeled and unlabeled data. The target labeled data
is essential to evaluate the performance of the model. Currently, domain adaptation
models are evaluated using test data. (iii) There is no restriction on the label space
of the domains being exactly the same. One weak restriction could be an intersection
of the label spaces of the source and target.
The definition of a domain is rather ambiguous when dealing with images. Unlike
in audio signal processing and text processing (NLP), domains in computer vision
are defined by dataset. Images from two different datasets are likely to belong
to two different domains. This is due to the bias introduced by the data-capture
methods and the representation procedures for a dataset, and not necessarily the
data itself (Torralba and Efros (2011)). There has been some work in estimating domains by
segregating data from multiple datasets into clusters Gong et al. (2013b). However,
this has not been applied or extended by subsequent research in the area. A primary
focus area for research in domain adaptation would be to develop comprehensive
models for domain shift.
2.3.2 Computer Vision Datasets for Domain Adaptation
Figure 2.2: Sample images from the Office-Home dataset. The dataset consists of images of everyday objects organized into 4 domains; Art: paintings, sketches and/or artistic depictions, Clipart: clipart images, Product: images without background and Real-World: regular images captured with a camera. The figure displays examples from 16 of the 65 categories. Image based on Venkateswara et al. (2017b).
Evolution in datasets and the evolution of models for domain shift should com-
plement each other. Current datasets for domain adaptation are not based on any
models of domain shift. They are merely data samples coming from different sources
of data all with the same categories. The domain difference between these datasets is
attributed to the ‘bias’ between the datasets, without a specific model characterizing
the domain shift Torralba and Efros (2011). The domain adaptation procedures that
are developed using these datasets can therefore be considered to be very generic.
There is no guarantee on the performance of these procedures when applied to new
problems. For e.g., if a domain adaptation approach were to be developed using the
digit datasets (USPS and MNIST Jarrett et al. (2009)), there is no guarantee that
this procedure would work well for a domain adaptation problem with medical im-
ages. On the other hand, if a dataset were to be created based on a domain shift
model, then algorithms that are developed using this dataset can be applied to any
domain adaptation problem where the same domain shift is observed. This is one pri-
mary reason for introducing new datasets for domain adaptation based on modeling
domain shift.
Domain adaptation for vision based applications has generated great interest in
the computer vision community in recent years Patel et al. (2015). Given the richness
of their feature representations, deep learning based domain adaptation approaches
like Tzeng et al. (2015a); Long et al. (2015); Ganin et al. (2016) have outperformed
traditional shallow learning techniques Saenko et al. (2010); Pan et al. (2011); Gong
et al. (2012a); Shekhar et al. (2013); Long et al. (2013); Fernando et al. (2013); Sun
et al. (2015a). However, supervised deep learning models require a large volume
of labeled training data. Unfortunately, existing datasets for vision-based domain
adaptation are limited in their size and are not suitable for validating deep learn-
ing algorithms. In the absence of large datasets, domain adaptation algorithms are
evaluated on datasets with few images.
The standard datasets for vision based domain adaptation are, facial expression
datasets CKPlus (Lucey et al. (2010)) and MMI (Pantic et al. (2005)), digit datasets
SVHN (Netzer et al. (2011)), USPS and MNIST (Jarrett et al. (2009)), head pose
recognition datasets PIE (Long et al. (2013)), object recognition datasets COIL (Long
et al. (2013)), Office (Saenko et al. (2010)) and Office-Caltech (Gong et al. (2012a)).
These datasets were created before deep-learning became popular and are insufficient
for training and evaluating deep learning based domain adaptation approaches. For
instance, the object-recognition dataset Office has 4110 images across 31 categories
and Office-Caltech has 2533 images across 10 categories. A notable exception is the
recently released Cross-Modal Places dataset with nearly 1 million images for indoor
scene-recognition based on 5 different domains, viz., natural images, clip-art, sketches,
text and spatial-text Castrejon et al. (2016).
One of the most popular datasets for computer vision domain adaptation is the
Office dataset. Recent works have outlined some flaws with the dataset, such as
label noise and a lack of variation in object pose, and have used other datasets
to evaluate their algorithms (Bousmalis et al. (2017, 2016)). Although the dataset
lists 3 domains, viz., Amazon, DSLR, and Webcam, it effectively has only 2 domains
because the DSLR and Webcam domains are very similar, with negligible domain
discrepancy.
This also raises the question of what constitutes a domain in computer vision, for
which there is no clear answer. Variations in vision based data can be attributed to
the following reasons, viz., differences in image quality (resolution, brightness, occlu-
sion, color), changes in camera perspective, dissimilar backgrounds and an inherent
diversity of the samples themselves. Differences in any or all of these factors are
what makes data from two domains dissimilar. A thorough evaluation of domain
adaptation models can be done when testing with data that exhibits most of these
variations. The existing datasets in domain adaptation for computer vision are very
limited in the amount of variations in between the domains.
To address these limitations, a new dataset has been released as one of the con-
Table 2.1: Statistics for the Office-Home dataset. Min: # is the minimum number of images amongst all the categories, Min: Size and Max: Size are the minimum and maximum image sizes across all categories, and Acc. is the classification accuracy.
This set of linear methods is based on adapting a linear Support Vector Machine
(SVM) classifier from the source to the target. The domain adaptive SVM (DASVM)
by Bruzzone and Marconcini (2010) is a unique linear SVM-based solution for
unsupervised domain adaptation. The DASVM initially trains an SVM classifier using
the labeled source data. In successive iterations, the decision boundary of the SVM
is re-estimated as batches of unlabeled target data are added to the pool of labeled
data and labeled source data is removed in batches. The SVM classifier is used to
predict the labels (termed ‘semi-labels’) for the unlabeled target data. At every
iteration, a batch of unlabeled data points that lie closest to the SVM boundary,
within the SVM margins, is chosen to be added to the pool of labeled data. Any
semi-labeled data point that changes its label across iterations is removed from
the labeled set. Also, a batch of source data points that are far from the decision
boundary, and therefore do not affect it, is removed from the labeled set. The algorithm
converges when the number of unlabeled data points within the margin goes below
a threshold. The authors also outline a circular cross validation strategy to validate
model parameters for the SVM using unlabeled target data.
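The iterative semi-labeling loop described above can be sketched in code. The following is a simplified illustration, not the authors' implementation: it uses scikit-learn's `LinearSVC`, and the batch size, the margin threshold, the convergence test, and the omission of the semi-label consistency check and the circular cross validation are all assumptions made for brevity.

```python
import numpy as np
from sklearn.svm import LinearSVC

def dasvm_sketch(Xs, ys, Xt, batch=10, max_iter=20):
    """Simplified DASVM-style loop: progressively add semi-labeled target
    points near the SVM margin and drop source points far from it."""
    Xsrc, ysrc = Xs.copy(), ys.copy()              # shrinking labeled source pool
    Xsemi = np.empty((0, Xs.shape[1]))             # growing semi-labeled target pool
    ysemi = np.array([], dtype=ys.dtype)
    Xunl = Xt.copy()                               # unlabeled target pool
    clf = None
    for _ in range(max_iter):
        X_lab = np.vstack([Xsrc, Xsemi])
        y_lab = np.concatenate([ysrc, ysemi])
        clf = LinearSVC(C=1.0, max_iter=5000, random_state=0).fit(X_lab, y_lab)
        if len(Xunl) < batch:
            break
        margins = np.abs(clf.decision_function(Xunl))
        if (margins < 1.0).sum() < batch:          # few unlabeled points left in margin
            break
        # semi-label the batch of target points closest to the decision boundary
        idx = np.argsort(margins)[:batch]
        Xsemi = np.vstack([Xsemi, Xunl[idx]])
        ysemi = np.concatenate([ysemi, clf.predict(Xunl[idx])])
        Xunl = np.delete(Xunl, idx, axis=0)
        # drop a batch of source points farthest from the boundary
        if len(ysrc) > batch:
            far = np.argsort(-np.abs(clf.decision_function(Xsrc)))[:batch]
            Xsrc = np.delete(Xsrc, far, axis=0)
            ysrc = np.delete(ysrc, far)
    return clf
```

On well-separated synthetic data the classifier gradually shifts from being supported by source points to being supported by semi-labeled target points.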
The authors in Yang et al. (2007a), adapt a SVM classifier trained on the source
(auxiliary) data to the target (primary) data. This is a supervised domain adaptation
problem that has labeled data in the target domain. Let ws be the source data SVM
classifier, then the target data SVM classifier is modeled so that it is not very different
from the source SVM with the following optimization problem,
$$\min_{\mathbf{w}_t}\; \frac{1}{2}\|\mathbf{w}_t - \mathbf{w}_s\|^2 + C\sum_{i=1}^{n_t}\xi_i \quad \text{subject to} \quad \xi_i \geq 0,\; y_i^t\mathbf{w}_t^\top\mathbf{x}_i^t \geq 1 - \xi_i \;\; \forall i \tag{3.5}$$
The first term in the optimization framework is the regularization over the decision
boundaries. The constraint ensures that the target classifier wt is similar to the
source classifier ws. In a closely related work, Yang et al. (2007b), the authors train
a SVM classifier for the target data which is based on the source data SVM.
$$f_t(\mathbf{x}) = f_s(\mathbf{x}) + \Delta f(\mathbf{x}) = \mathbf{w}_s^\top\mathbf{x} + \mathbf{w}_t^\top\mathbf{x} \tag{3.6}$$
Both of the above works are also extended to the nonlinear setting by introducing
kernels. In Aytar and Zisserman (2011), the authors extend the Adapt-SVM to a
Projective Model Transfer SVM (PMT-SVM), by introducing a constant factor γ,
which controls the amount of transfer.
$$\min_{\mathbf{w}_t}\; \frac{1}{2}\|\mathbf{w}_t\|^2 + \gamma\|P\mathbf{w}_t\|^2 + C\sum_{i=1}^{n_t} l(\mathbf{x}_i^t, y_i^t; \mathbf{w}_t) \quad \text{subject to} \quad \mathbf{w}_t^\top\mathbf{w}_s \geq 0. \tag{3.7}$$
Here $P = I - \frac{\mathbf{w}_s\mathbf{w}_s^\top}{\mathbf{w}_s^\top\mathbf{w}_s}$ is the projection matrix and $l(\cdot)$ is the hinge loss. $\|P\mathbf{w}_t\|^2 = \|\mathbf{w}_t\|^2\sin^2\theta$, where $\theta$ is the angle between $\mathbf{w}_s$ and $\mathbf{w}_t$. The constraint confines $\mathbf{w}_t$ to the positive half-space of $\mathbf{w}_s$.
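The geometric identity behind the PMT-SVM penalty can be checked numerically. The sketch below (with arbitrary random vectors as an illustrative assumption) builds the projection matrix and verifies that the penalty measures the angle between the two classifiers.

```python
import numpy as np

# Verify the identity ||P w_t||^2 = ||w_t||^2 sin^2(theta), where
# P = I - w_s w_s^T / (w_s^T w_s) projects out the w_s direction.
rng = np.random.RandomState(0)
ws = rng.randn(5)
wt = rng.randn(5)

P = np.eye(5) - np.outer(ws, ws) / ws.dot(ws)
lhs = np.linalg.norm(P @ wt) ** 2

cos_theta = ws.dot(wt) / (np.linalg.norm(ws) * np.linalg.norm(wt))
rhs = np.linalg.norm(wt) ** 2 * (1 - cos_theta ** 2)   # ||wt||^2 sin^2(theta)

assert np.isclose(lhs, rhs)
```

So penalizing $\|P\mathbf{w}_t\|^2$ penalizes only the component of $\mathbf{w}_t$ orthogonal to $\mathbf{w}_s$, i.e., the angular deviation from the source classifier.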
A technique that combines both linear transformations and classifier adaptation
is presented in Hoffman et al. (2013). The Max Margin Domain Transform (MMDT)
algorithm learns a transformation matrix W , that projects the target data to a sub-
space that makes it similar to the source data. Simultaneously, the model learns a
SVM classifier {w, b}, to classify both the source and target data. The minimization
problem is given by,
$$\min_{W,\mathbf{w},b}\; \frac{1}{2}\|W\|_F^2 + \frac{1}{2}\|\mathbf{w}\|^2$$
$$\text{s.t.}\quad y_i^s\left(\begin{bmatrix}\mathbf{x}_i^s\\ 1\end{bmatrix}^\top\begin{bmatrix}\mathbf{w}\\ b\end{bmatrix}\right) \geq 1,\; i \in \{1,\dots,n_s\}$$
$$\phantom{\text{s.t.}\quad} y_i^t\left(\begin{bmatrix}\mathbf{x}_i^t\\ 1\end{bmatrix}^\top W^\top\begin{bmatrix}\mathbf{w}\\ b\end{bmatrix}\right) \geq 1,\; i \in \{1,\dots,n_t\} \tag{3.8}$$
The first and the second terms are the regularization on the transformation matrix
W and the decision boundary w, respectively. The constraints are the standard max-
margin constraints for SVM, with the target data being transformed by W . The
parameters W and {w, b} are estimated using alternating minimization strategies.
3.2.3 Linear Alignment of Moments
In this set of linear models, domain adaptation is achieved by aligning the moments of the source and target data. One such algorithm is the Feature LeArning with second Moment Matching (FLAMM) method in Jiang et al. (2015). In FLAMM, the
authors derive a bound for the expected error difference between the source data and
the target data for a common classifier h(x). This error difference can be minimized
by minimizing the difference between the second moments, given by $M_S = \frac{1}{n_s}X_S X_S^\top$ and $M_T = \frac{1}{n_t}X_T X_T^\top$. If $M = M_S - M_T$, the goal is to learn a transformation matrix $P$ such that the transformed data $P^\top X$ has the domains aligned. This condition can be represented as a minimization over $\mathrm{tr}(P^\top M^2 P)$. The minimization problem is given by,
$$\min_P\; \|X^\top - X^\top P\|_F^2 + \gamma_1\,\mathrm{tr}(P^\top \Lambda P) + \gamma_2\,\mathrm{tr}(P^\top M^2 P) \tag{3.9}$$
Here, $\Lambda$ is a diagonal matrix where $\Lambda_{ii} = X_i X_i^\top$, and $X_i$ is row $i$ of $X$. The first
term is to ensure the projected data is similar to the target so as to avoid trivial
solutions when the mismatch between source and target data is small. The solution
to P is iteratively refined by updating X ← P⊤X in every iteration and solving for
Equation (3.9).
The CORrelation ALignment (CORAL) algorithm in Sun et al. (2015a), is an-
other unsupervised domain adaptation method that seeks to match the second order
statistics of the source and the target. In CORAL, the authors learn a transformation
matrix to transform the source data so that the source and target covariance matrices
are nearly identical. If the source data is transformed by matrix $A$, i.e., $X_S \leftarrow AX_S$, the difference between the correlation matrices of the transformed source $\hat{C}_S$ and the target $C_T$, is given by,
$$\min_A \|\hat{C}_S - C_T\|_F^2 = \min_A \|A^\top C_S A - C_T\|_F^2. \tag{3.10}$$
Here $C_S$ is the correlation of the original source data. The optimal solution to Equation (3.10) is given by,
$$A^* = \big(U_S\Sigma_S^{+\frac{1}{2}}U_S^\top\big)\big(U_{T[1:r]}\Sigma_{T[1:r]}^{\frac{1}{2}}U_{T[1:r]}^\top\big), \tag{3.11}$$
where $C_S = U_S\Sigma_S U_S^\top$ is the singular value decomposition of $C_S$ and $C_T = U_T\Sigma_T U_T^\top$ is the singular value decomposition of $C_T$. The value $r$ is the minimum of the ranks of $C_S$ and $C_T$. The authors interpret Equation (3.11) in the following manner. The
first term within the parenthesis can be viewed as a source whitening matrix. The
second term can be viewed as a target coloring matrix. It implies that the source is
first whitened to remove source related covariance. It is then colored with the target
related covariance so that the covariance becomes similar to the target. The authors
also extend this work to a deep learning based CORAL model in Sun and Saenko
(2016).
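The whiten-then-color interpretation of CORAL can be sketched in a few lines of numpy. This is a minimal sketch, not the authors' released code; the regularizer `eps` added for invertibility is an assumption.

```python
import numpy as np

def coral(Xs, Xt, eps=1e-5):
    """CORAL-style sketch: whiten the source features, then re-color them
    with the target covariance so second-order statistics match."""
    def mat_pow(C, p):
        # power of a symmetric positive-definite matrix via eigen-decomposition
        vals, vecs = np.linalg.eigh(C)
        return (vecs * vals ** p) @ vecs.T
    d = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(d)   # regularized source covariance
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(d)   # regularized target covariance
    A = mat_pow(Cs, -0.5) @ mat_pow(Ct, 0.5)          # whiten, then color
    return Xs @ A
```

After the transformation, the covariance of the transformed source data matches the target covariance up to the small regularization term.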
3.3 Nonlinear Feature Spaces for Domain Adaptation
This section describes the literature for domain adaptation in nonlinear feature
spaces. Here, the domain disparity is reduced by projecting the data from the do-
mains into high-dimensional nonlinear feature spaces using kernels. A standard proce-
dure that is applied for nonlinear domain adaptation is Maximum Mean Discrepancy
(MMD). A brief introduction to MMD is provided at the beginning of the section
before proceeding to the different methods that apply MMD for domain adaptation.
Given data samples Ds and Dt from two distributions, the MMD measure esti-
mates the distance between the distributions. Most existing measures require the
assumption of certain parameters to estimate the distance between distributions, but
not the MMD. The MMD is a nonparametric distance estimate designed by embed-
ding the data into a Reproducing Kernel Hilbert Space (RKHS), Smola et al. (2007).
The data is mapped to a high-dimensional (possibly infinite-dimensional) space defined by $\Phi(X) = [\phi(\mathbf{x}_1), \dots, \phi(\mathbf{x}_n)]$, where $\phi : \mathbb{R}^d \rightarrow \mathcal{H}$ defines a mapping function and $\mathcal{H}$ is a RKHS. The dot product between the high-dimensional mapped vectors $\phi(\mathbf{x})$ and $\phi(\mathbf{y})$ is estimated by the kernel-trick. The dot product is given by the positive semi-definite (psd) kernel $k(\mathbf{x},\mathbf{y}) = \phi(\mathbf{x})^\top\phi(\mathbf{y})$. The kernel $k(\cdot)$ can be viewed as a similarity measure between $\mathbf{x}$ and $\mathbf{y}$. A standard kernel function is the Gaussian radial basis function (RBF), $k(\mathbf{x},\mathbf{y}) = \exp\big(-\frac{\|\mathbf{x}-\mathbf{y}\|^2}{2\sigma^2}\big)$. The similarity measure between all pairs of data points in $X$ can be represented using the kernel gram matrix, which is given by $K = \Phi(X)^\top\Phi(X) \in \mathbb{R}^{n\times n}$. Gretton et al. in Gretton et al. (2007),
introduced the MMD to estimate the distance between $\mathcal{D}_s$ and $\mathcal{D}_t$, which is given by,
$$\mathrm{MMD} = \left\|\frac{1}{n_s}\sum_{i=1}^{n_s}\phi(\mathbf{x}_i^s) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi(\mathbf{x}_j^t)\right\|_{\mathcal{H}}^2 \tag{3.12}$$
The distance between the two distributions is the distance between their means in
the RKHS. When the RKHS is universal, the MMD measure approaches 0 only when
the distributions are the same (Smola et al. (2007)). Many of the methods in the
following subsections apply the MMD in different ways to achieve nonlinear domain
adaptation.
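Using the kernel trick, the squared distance in Equation (3.12) expands into sums of kernel evaluations, which can be computed directly. The sketch below is a minimal biased (V-statistic) estimate with the RBF kernel; the choice of bandwidth is an assumption.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def mmd2(Xs, Xt, sigma=1.0):
    """Squared MMD between two samples: expand the RKHS mean difference
    into within-domain and cross-domain kernel averages."""
    Kss = rbf_kernel(Xs, Xs, sigma)
    Ktt = rbf_kernel(Xt, Xt, sigma)
    Kst = rbf_kernel(Xs, Xt, sigma)
    return Kss.mean() + Ktt.mean() - 2 * Kst.mean()

rng = np.random.RandomState(0)
same = mmd2(rng.randn(100, 2), rng.randn(100, 2))
diff = mmd2(rng.randn(100, 2), rng.randn(100, 2) + 3.0)
# samples from the same distribution give a much smaller MMD estimate
```

With a universal kernel such as the RBF, the estimate approaches zero only when the two samples come from the same distribution, which is exactly the property the adaptation methods below exploit.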
3.3.1 Max Margin Kernel Methods
In an early adoption of the MMD for domain adaptation, Duan et al. in Duan
et al. (2009), introduced a Domain Transfer SVM (DTSVM) model for video concept
detection. The MMD distance between the source and target domains is represented
as,
$$\mathrm{MMD} = \left\|\frac{1}{n_s}\sum_{i=1}^{n_s}\phi(\mathbf{x}_i^s) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi(\mathbf{x}_j^t)\right\|_{\mathcal{H}}^2 = \|\Phi(X)\mathbf{m}\|^2 = \mathrm{tr}(\Phi(X)^\top\Phi(X)M) = \mathrm{tr}(KM), \tag{3.13}$$
where, m ∈ Rn is a vector with its first ns values as 1/ns and the last nt values
as −1/nt. The matrix M ∈ Rn×n is given by M = mm⊤. K and Φ(X) have been
defined at the beginning of the section. Along with the SVM component, the DTSVM
is given by,
$$\min_{K \succeq 0}\; \max_{\boldsymbol{\alpha} \in \mathcal{A}}\; \frac{1}{2}\mathrm{tr}(KM)^2 + \lambda\left(\boldsymbol{\alpha}^\top\mathbf{1} - \frac{1}{2}(\boldsymbol{\alpha}\circ\mathbf{y})^\top K_L(\boldsymbol{\alpha}\circ\mathbf{y})\right) \tag{3.14}$$
The second term is the SVM dual formulation for the labeled data where KL is the
kernel over labeled data and α is the SVM vector of dual variables for the labeled
data. This is a semi-definite programming problem with K � 0. The SVM related
constraint on α is A = {α ∈ Rnl |C1 ≥ α ≥ 0,α⊤y = 0}, where nl is the number of
labeled data points from source and target and C is the SVM constant. The authors
adopt a multiple kernel formulation for $K$ where $k(\cdot) = \sum_{m=1}^{M} d_m k_m$, with $d_m \geq 0$, $\sum_{m=1}^{M} d_m = 1$, and the $k_m$, $m \in \{1,\dots,M\}$, are predefined positive semi-definite (psd) kernels. The problem now reduces to estimating coefficients $\mathbf{d} = [d_1,\dots,d_M]^\top$
and α. The min-max problem is solved to reach a saddle point through alternating
optimization for d and α. The same idea is elaborated using different loss functions
along with a study on complexity in Duan et al. (2012).
In the Selective Transfer Matching (STM) algorithm Chu et al. (2013), the au-
thors outline a novel SVM technique for domain adaptation for facial action unit
recognition. STM performs instance selection, where a set of source data points are
selected that have a distribution similar to the target and these are used to train a
SVM. The nonlinear SVM component in the primal form is given by,
$$\min_{\boldsymbol{\beta}}\; \frac{1}{2}\boldsymbol{\beta}^\top K_S\boldsymbol{\beta} + C\sum_{i=1}^{n_s} s_i\, l(y_i, K_{S_i}\boldsymbol{\beta}) \tag{3.15}$$
In the above equation, $K_S$ is the kernel matrix for the source (training) data, $\boldsymbol{\beta}$ are the coefficients for the SVM decision boundary, and the regularization over the
decision boundary $\|\mathbf{w}\|^2$ is given by the first term. This is the result of the representer theorem Kimeldorf and Wahba (1970), where the SVM solution can be expressed as $f(\mathbf{x}) = \sum_{i=1}^{n_s}\beta_i k(\mathbf{x}_i^s,\mathbf{x})$ with $\beta_i \neq 0$ for support vectors only. $l(\cdot)$ is the hinge loss with $l(y,\cdot) = \max(0, 1-y\,\cdot)$ and $K_{S_i}$ is the $i$th row of $K_S$. Each of the source data points
has an importance (weight) associated with it which is given by si. The loss value
from each of the source data points is weighted by si. The weights si, i = 1, . . . , ns,
are estimated using MMD in the following quadratic programming problem,
$$\left\|\frac{1}{n_s}\sum_{i=1}^{n_s} s_i\phi(\mathbf{x}_i^s) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi(\mathbf{x}_j^t)\right\|_{\mathcal{H}}^2. \tag{3.16}$$
In order to simplify the equation, $\kappa_i := \frac{n_s}{n_t}\sum_{j=1}^{n_t} k(\mathbf{x}_i^s,\mathbf{x}_j^t)$, $i = 1,\dots,n_s$, and $K_{S_{ij}} = k(\mathbf{x}_i^s,\mathbf{x}_j^s)$ are defined. The minimization can now be represented as a quadratic programming problem,
$$\min_{\mathbf{s}}\; \frac{1}{2}\mathbf{s}^\top K_S\mathbf{s} - \boldsymbol{\kappa}^\top\mathbf{s}, \quad \text{s.t.}\; s_i \in [0, B],\;\; \left|\sum_{i=1}^{n_s} s_i - n_s\right| \leq n_s\epsilon. \tag{3.17}$$
In the first constraint, the scope of discrepancy between source and target distri-
butions is limited, with B → 1, leading to an unweighted solution. The second
constraint ensures that w(x)PS(x), remains a probability distribution (Gretton et al.
(2009)). Equations (3.15) and (3.17) are biconvex (convex when either of s or β is
fixed). The solution is arrived at using alternate optimization methods.
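The weighting step of Equation (3.17) can be approximated without a dedicated QP solver. The sketch below uses projected gradient descent as a stand-in for the quadratic program; the learning rate, the rescaling used to keep the sum constraint approximately satisfied, and the bound `B` are assumptions.

```python
import numpy as np

def instance_weights(Xs, Xt, sigma=1.0, B=5.0, lr=0.01, steps=2000):
    """Estimate source-instance weights s by approximately minimizing
    (1/2) s^T Ks s - kappa^T s subject to box and sum constraints."""
    def rbf(X, Y):
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2))
    ns, nt = len(Xs), len(Xt)
    Ks = rbf(Xs, Xs)
    kappa = (ns / nt) * rbf(Xs, Xt).sum(axis=1)   # kappa_i from the text
    s = np.ones(ns)
    for _ in range(steps):
        grad = Ks @ s - kappa
        s = np.clip(s - lr * grad, 0.0, B)        # box constraint s_i in [0, B]
        s *= ns / max(s.sum(), 1e-12)             # keep sum(s) close to n_s
        s = np.clip(s, 0.0, B)
    return s
```

Source points whose neighborhood overlaps the target distribution receive large weights, while source outliers far from the target are driven toward zero, which is the instance-selection behavior STM relies on.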
3.3.2 MMD - Instance Weighting and Selection Methods
Instance selection in domain adaptation deals with sampling source data points
whose distribution is similar to the target. This approach is based on the reasoning
that although the source and target data are from different distributions, there could
be some overlap between the two distributions. MMD is applied to identify data
points in the source whose distribution is similar to the target data.
The Joint Optimization based Transfer and Active Learning (JO-TAL) framework in Chattopadhyay et al. (2013), performs transfer and active learning jointly by solving a single convex optimization problem. To implement transfer learning, the framework uses the MMD (Borgwardt et al. (2006)) measure to weight the source data points $\mathbf{x}^s \in \mathcal{D}_s$, such that their distribution is similar to the unlabeled target dataset $X_{T_u}$ (unlabeled data points from the target). The authors include active learning in the MMD measure by sampling a block of $b$ data points from $X_{T_u}$. These are the batch of unlabeled data points that will be queried for labels (active learning). The model also accounts for labeled target data (if any) with dataset $X_{T_l}$. The minimization problem is given by,
$$\min_{\boldsymbol{\alpha},\boldsymbol{\beta}} \left\|\frac{1}{n_s + n_l + b}\left(\sum_{i\in\mathcal{D}_s}\beta_i\phi(\mathbf{x}_i^s) + \sum_{j\in X_{T_l}}\phi(\mathbf{x}_j^t) + \sum_{k\in X_{T_u}}\alpha_k\phi(\mathbf{x}_k^t)\right) - \frac{1}{n_u - b}\sum_{k\in X_{T_u}}(1-\alpha_k)\phi(\mathbf{x}_k^t)\right\|_{\mathcal{H}}^2$$
$$\text{s.t.}\quad \alpha_i \in \{0,1\},\; \beta_i \in [0,1],\; \boldsymbol{\alpha}^\top\mathbf{1} = b \tag{3.18}$$
$n_s$ is the number of source data points in $\mathcal{D}_s$, $n_l$ is the number of labeled target data points in $X_{T_l}$, $b$ is the batch size, $n_u$ is the number of unlabeled target data points in $X_{T_u}$, $n_t = n_u + n_l$, and $\mathbf{1}$ is a vector of ones. The $\beta_i$ are the weights for the source data points and $\boldsymbol{\alpha}$ is a binary vector that samples $b$ unlabeled target points for query. All selected points have $\alpha_i = 1$. Equation (3.18) can be written as,
$$\min_{\boldsymbol{\alpha},\boldsymbol{\beta}}\; \frac{1}{2}\boldsymbol{\beta}^\top K^S\boldsymbol{\beta} + \frac{1}{2}\boldsymbol{\alpha}^\top K^U\boldsymbol{\alpha} + \boldsymbol{\beta}^\top K^{SU}\boldsymbol{\alpha} - \boldsymbol{\alpha}^\top K^U\mathbf{1} - \boldsymbol{\beta}^\top K^{SU}\mathbf{1} + \boldsymbol{\alpha}^\top K^{UL}\mathbf{1} + \boldsymbol{\beta}^\top K^{SL}\mathbf{1} + \text{const}$$
$$\text{s.t.}\quad \alpha_i \in \{0,1\},\; \beta_i \in [0,1],\; \boldsymbol{\alpha}^\top\mathbf{1} = b \tag{3.19}$$
The $K$ terms are the kernels over the source ($S$), the labeled target ($L$) and the unlabeled target ($U$). The last four terms excluding the constant are linear in $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$. Equation (3.19) can be expressed as a Quadratic Programming problem in a single
variable x = [α⊤,β⊤]⊤. The constraint on α makes the problem NP-hard and it can
be relaxed to αi ∈ [0, 1] and the b highest values are chosen as the active learning
batch. The formulation is very elegant and accounts for all the desirable properties
of transfer and active learning, namely, representativeness, diversity, and minimum
redundancy.
In another example of instance selection, Gong et al. in Gong et al. (2013a) iden-
tify landmarks from the source dataset using multiple kernels. The MMD formulation
is given by,
$$\min_{\boldsymbol{\alpha}} \left\|\frac{1}{\sum_{i}^{n_s}\alpha_i}\sum_{i}^{n_s}\alpha_i\phi(\mathbf{x}_i^s) - \frac{1}{n_t}\sum_{j}^{n_t}\phi(\mathbf{x}_j^t)\right\|_{\mathcal{H}}^2$$
$$\text{s.t.}\quad \frac{1}{\sum_i\alpha_i}\sum_i\alpha_i y_{ic}^s = \frac{1}{n_s}\sum_i y_{ic}^s, \quad \boldsymbol{\alpha} \in \{0,1\}^{n_s}, \tag{3.20}$$
where $y_{ic}^s$ indicates $y_i^s = c$. The nonzero values of $\boldsymbol{\alpha}$ are the indices of the sampled source data points. The first constraint ensures class balance in the instance selection. The second condition makes the problem a hard combinatorial problem. It is relaxed by introducing the variable $\beta_i = \alpha_i(\sum_i\alpha_i)^{-1}$. Unlike $\boldsymbol{\alpha}$, which is a binary variable, $\boldsymbol{\beta} \in [0,1]^{n_s}$ is continuous. The geodesic flow kernel is used to estimate the kernel with,
$$k(\mathbf{x}_i,\mathbf{x}_j) = \phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j) = \exp\left(-\frac{(\mathbf{x}_i-\mathbf{x}_j)^\top G(\mathbf{x}_i-\mathbf{x}_j)}{\sigma^2}\right), \tag{3.21}$$
where G, is determined using subspaces US and UT of the source and target that
are estimated using PCA (Gong et al. (2012a)). The geodesic flow kernel will be
discussed towards the end of the chapter. The authors select landmarks at multiple granularities by defining different kernels with $\{\sigma_q \in [\sigma_{\min},\sigma_{\max}]\}_{q=1}^{Q}$. The domain invariant mapping is represented as $\phi_q(\mathbf{x}) = \sqrt{G_q}\,\mathbf{x}$. These domain invariant vectors from multiple kernels are concatenated into a super vector $\mathbf{f}$ and cast into a multiple kernel SVM model in order to learn a domain adaptive classifier.
3.3.3 MMD - Spectral Methods
In this class of nonlinear methods, domain adaptation is achieved by nonlinear
projection of the data where the projection matrix is a solution to an eigen-value
problem. All of the methods in this subsection model the domain adaptation problem
using kernel-PCA. In kernel-PCA, a coefficient matrix $A \in \mathbb{R}^{n\times k}$ is determined and the nonlinear projection of $X$ is obtained using $A$. The projection matrix is obtained by solving,
$$\max_{A^\top A = I}\; \mathrm{tr}(A^\top K H K^\top A). \tag{3.22}$$
Here, $H$ is the $n\times n$ centering matrix given by $H = I - \frac{1}{n}\mathbf{1}$, where $I$ is an identity matrix and $\mathbf{1}$ is a $n\times n$ matrix of 1s. The nonlinear projected data is given by $Z = [\mathbf{z}_1,\dots,\mathbf{z}_n] = A^\top K \in \mathbb{R}^{k\times n}$. In order to account for
domain alignment after the projection, MMD can be incorporated as follows:
$$\min_A \left\|\frac{1}{n_s}\sum_{i=1}^{n_s} A^\top\mathbf{k}_i - \frac{1}{n_t}\sum_{j=n_s+1}^{n} A^\top\mathbf{k}_j\right\|_{\mathcal{H}}^2 = \mathrm{tr}(A^\top K M_0 K^\top A), \tag{3.23}$$
where $M_0$ is the MMD matrix, given by,
$$(M_0)_{ij} = \begin{cases} \frac{1}{n_s n_s}, & \mathbf{x}_i,\mathbf{x}_j \in \mathcal{D}_s \\ \frac{1}{n_t n_t}, & \mathbf{x}_i,\mathbf{x}_j \in \mathcal{D}_t \\ \frac{-1}{n_s n_t}, & \text{otherwise.} \end{cases} \tag{3.24}$$
Maximizing Equation (3.22) and minimizing Equation (3.23) is achieved with the following optimization problem,
$$\min_{A^\top K H K^\top A = I}\; \mathrm{tr}(A^\top K M_0 K^\top A) + \gamma\|A\|_F^2, \tag{3.25}$$
where, the final term is the regularization over A to ensure smoothness. The solution
to Equation (3.25) is a generalized eigen-value problem given by,
$$\left(K M_0 K^\top + \gamma I\right) A = K H K^\top A\Lambda, \tag{3.26}$$
where the eigen-values are the Lagrangian constants captured in the diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_k)$. The coefficient matrix $A$ is given by the $k$ smallest eigen-vectors of Equation (3.26). The domain-aligned data points are then given by $Z = A^\top K$.
This is the Transfer Component Analysis (TCA) model described in Pan et al. (2011)
although using a different derivation and notation.
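The TCA recipe described above (kernel, MMD matrix, centering matrix, generalized eigen-problem, projection) can be condensed into a short sketch. This is a minimal illustration rather than the published implementation; the RBF bandwidth and the small jitter added to keep the right-hand side matrix positive definite are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def tca(Xs, Xt, k=2, gamma=1.0, sigma=1.0):
    """Transfer Component Analysis sketch: solve the generalized
    eigen-problem and project the data with Z = A^T K."""
    X = np.vstack([Xs, Xt])
    ns, nt = len(Xs), len(Xt)
    n = ns + nt
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma ** 2))               # RBF kernel gram matrix
    m = np.r_[np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)]
    M0 = np.outer(m, m)                              # MMD matrix
    H = np.eye(n) - np.full((n, n), 1.0 / n)         # centering matrix
    # k smallest generalized eigen-vectors of (K M0 K + gamma I) A = K H K A Lambda
    lhs = K @ M0 @ K + gamma * np.eye(n)
    rhs = K @ H @ K + 1e-6 * np.eye(n)               # jitter keeps rhs positive definite
    vals, vecs = eigh(lhs, rhs)                      # eigenvalues in ascending order
    A = vecs[:, :k]
    return (A.T @ K).T                               # domain-aligned features, n x k
```

Classifiers are then trained on the first `ns` rows (labeled source) and applied to the remaining rows (target) of the returned matrix.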
The authors in Long et al. (2014), extend the TCA to incorporate instance sam-
pling with a sparse norm over the coefficient matrix A. The optimization problem
with these constraints is given by,
$$\min_{A^\top K H K^\top A = I}\; \mathrm{tr}(A^\top K M_0 K^\top A) + \gamma\left(\|A_s\|_{2,1} + \|A_t\|_F^2\right), \tag{3.27}$$
where $A_s := A_{1:n_s,:}$ is the transformation matrix for the source instances and $A_t := A_{n_s+1:n,:}$ is the transformation matrix for the target. The $\ell_{2,1}$-norm regularization ensures row-sparsity of $A_s$, effectively doing instance selection of the source data
points. Similar to the TCA, Equation (3.27) is transformed to a generalized eigen-
value problem and A is estimated.
The TCA and the TJM can be viewed as aligning the marginal probabilities
PS(X) and PT (X) of the source and the target data. To ensure the joint distributions
(PS(X, Y ) and PT (X, Y )) of the source and target are aligned, the Joint Distribution
Adaptation (JDA) (Long et al. (2013)), algorithm is adopted, which seeks to align
both the marginal and conditional probability distributions of the projected data. The
marginal distributions are aligned as in Equation (3.23). The conditional distribution
difference can also be minimized by introducing matrices Mc, with c = 1, . . . , C,
defined as,
$$(M_c)_{ij} = \begin{cases} \frac{1}{n_s^{(c)} n_s^{(c)}}, & \mathbf{x}_i,\mathbf{x}_j \in \mathcal{D}_s^{(c)} \\ \frac{1}{n_t^{(c)} n_t^{(c)}}, & \mathbf{x}_i,\mathbf{x}_j \in \mathcal{D}_t^{(c)} \\ \frac{-1}{n_s^{(c)} n_t^{(c)}}, & \mathbf{x}_i \in \mathcal{D}_s^{(c)},\,\mathbf{x}_j \in \mathcal{D}_t^{(c)} \;\text{or}\; \mathbf{x}_j \in \mathcal{D}_s^{(c)},\,\mathbf{x}_i \in \mathcal{D}_t^{(c)} \\ 0, & \text{otherwise.} \end{cases} \tag{3.28}$$
Here, $\mathcal{D}_s$ and $\mathcal{D}_t$ are the subsets of source and target data points respectively. $\mathcal{D}_s^{(c)}$ is the subset of source data points whose class label is $c$ and $n_s^{(c)} = |\mathcal{D}_s^{(c)}|$. Similarly, $\mathcal{D}_t^{(c)}$ is the subset of target data points whose class label is $c$ and $n_t^{(c)} = |\mathcal{D}_t^{(c)}|$. For
the target data, since the labels are not known, the predicted target labels are used
to determine D(c)t . The target data labels are initialized using a classifier trained on
the source data and refined over iterations. Incorporating both the conditional and
marginal distribution alignments, the JDA optimization problem is written as,
$$\min_{A^\top K H K^\top A = I}\; \mathrm{tr}\left(A^\top K\sum_{c=0}^{C} M_c K^\top A\right) + \gamma\|A\|_F^2. \tag{3.29}$$
Similar to the TCA and TJM, Equation (3.29) is transformed to a generalized eigen-
value problem and A is estimated. The domain-aligned data points are then given
by Z = A⊤K.
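The marginal and class-conditional MMD matrices used by JDA can be assembled directly from the labels. In the sketch below, the function name, the representation of pseudo-labels as a plain array, and class labels running from 1 to C are assumptions for illustration; each class term is built as an outer product, which reproduces the case structure of Equation (3.28).

```python
import numpy as np

def jda_mmd_matrix(ys, yt_pred, C):
    """Build M = sum over c of M_c: the marginal MMD matrix M_0 plus one
    conditional matrix per class, using pseudo-labels for the target."""
    ns, nt = len(ys), len(yt_pred)
    n = ns + nt
    m = np.r_[np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)]
    M = np.outer(m, m)                       # marginal term M_0
    for c in range(1, C + 1):
        e = np.zeros(n)
        src = np.where(ys == c)[0]
        tgt = ns + np.where(yt_pred == c)[0]
        if len(src) == 0 or len(tgt) == 0:   # class missing in a domain
            continue
        e[src] = 1.0 / len(src)
        e[tgt] = -1.0 / len(tgt)
        M += np.outer(e, e)                  # conditional term M_c
    return M
```

At each JDA iteration this matrix is recomputed with the latest target pseudo-labels and plugged into the eigen-problem of Equation (3.29).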
3.4 Hierarchical Feature Spaces for Domain Adaptation
In recent years, deep neural networks have revolutionized the field of machine
learning. Deep learning based domain adaptation has outperformed non-deep learning
algorithms because of the highly discriminatory nature of the features extracted using
deep neural networks. The features extracted using deep neural networks are termed
‘hierarchical’ due to the hierarchical nature of the model and the nonlinear multilayer
structure of the network. This section describes recent work in the last few years in
the area of deep learning based domain adaptation.
3.4.1 Naïve Deep Methods
Deep convolutional neural networks (CNN)s have been shown to be very good
feature extractors. Deep CNNs trained on millions of images are by themselves very
good feature extractors, not just for the dataset they are trained on, but for any
generic image. Razavian et al. in Razavian et al. (2014), have demonstrated how a
deep CNN trained on the ILSVRC 2013 ImageNet dataset (ImageNet (2013)) can be
used for extracting generic features for any image. Regular SVMs trained on these
generic features have shown astounding results across multiple applications like, scene
recognition, fine grained recognition, attribute recognition and image retrieval. A pre-
trained CNN can be used to extract generic features for the source and the target.
This can be termed as a naıve form of domain adaptation.
Pre-trained deep neural networks can also be fine tuned to the task at hand. It
is well documented that the lower layers of a CNN extract generic features that are
common across multiple tasks and the upper layers extract task specific features.
Features transition from general to specific by the last layer of the CNN. The work
by Yosinski et al. in Yosinski et al. (2014), captures the extent of generality and
specificity of neurons in each layer. Transferability has been shown to be negatively
affected by two issues: (i) the specificity of neurons (to the source task) in the higher
layers, adversely affects transfer to the target task, (ii) the fragile nature of depen-
dencies between layers that are task specific. Adding new layers to a pre-trained
(trained on source data) network and retraining it with target data, is another in-
tuitive method to transfer knowledge in a deep learning setting. When the entire
newly adapted network is fine tuned with target data, it can lead to a very efficient
adaptation. This form of adaptation has been explored in Oquab et al. (2014). The
authors demonstrate a procedure to reuse the layers trained on the ImageNet dataset
to compute mid-level representations for images. Despite the differences in image
statistics, these features lead to improved results for object and action classification
for different datasets.
The authors in Donahue et al. (2014), study the features extracted from the final
layers of a deep neural network for a fixed set of object classification and detection
tasks. The generic features from the 5th, 6th and 7th fully connected layers of an
AlexNet (Krizhevsky et al. (2012)), show remarkable adaptation properties and out-
perform state-of-the-art methods in classification and detection. Whereas Donahue
et al. (2014) studied adaptation using CNNs, Glorot et al. (2011) studied adaptation
of features extracted using stacked denoising autoencoders for text based sentiment
classification.
3.4.2 Adopted Shallow Methods
These set of deep learning methods adopt shallow (non-deep learning) domain
adaptation procedures in a deep neural network. In these approaches the features
extracted by the layers of the deep network are learned to be domain invariant. Do-
main alignment is achieved either through MMD, moment alignment or a loss function
that drives the source and target classifiers to be indistinguishable. In discussing these
methods, the central idea is outlined leaving out the details of network architecture,
optimization procedures, loss functions etc.
In Tzeng et al. (2014), the authors adapt an AlexNet (Krizhevsky et al. (2012))
to output domain invariant features in the final fully connected fc8 layer in the Deep
Domain Confusion (DDC) algorithm. The network has two loss components, (i)
softmax classification loss for the source data points and (ii) domain confusion loss.
The network minimizes a MMD loss over the source and target data outputs of the
fc8 layer in every mini-batch during training. This is termed as the domain confusion
loss. A related idea is studied in Tzeng et al. (2015b), where the network has a domain
confusion loss along with a domain classification loss. The domain classification loss
ensures the output feature representations of the source and target data are distinct.
This is in contrast to the goal of the domain confusion loss which tries to learn
domain invariant representations. The network is trained to alternately minimize
the two losses and reach an equilibrium. Long et al. introduce the Deep Adaptation
Networks (DAN) model (Long et al. (2015)), which extends the concept of domain
confusion by incorporating a MMD loss for all the fully connected layers (fc6, fc7
and fc8) of the AlexNet. The MMD loss is estimated for the feature representations
over every mini-batch during training. The work also introduces MMD estimation
computed with an efficient linear complexity based on Gretton et al. (2012). The linear MMD estimation is also unbiased because the MMD for the entire source and target data can be expressed as the sum of MMD across mini-batches.
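A sketch in the spirit of that linear-time estimator is shown below: samples are paired up so the cost is O(n) rather than O(n²), and the h-statistic contains no diagonal terms, which is what makes the estimate unbiased. The pairing scheme and bandwidth here are assumptions for illustration.

```python
import numpy as np

def linear_mmd2(Xs, Xt, sigma=1.0):
    """Linear-time MMD^2 estimate: average an h-statistic over disjoint
    pairs of samples instead of forming full kernel gram matrices."""
    def k(a, b):
        return np.exp(-((a - b) ** 2).sum(-1) / (2 * sigma ** 2))
    n = min(len(Xs), len(Xt)) // 2 * 2         # use an even number of samples
    x1, x2 = Xs[:n:2], Xs[1:n:2]               # disjoint source pairs
    y1, y2 = Xt[:n:2], Xt[1:n:2]               # disjoint target pairs
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)
    return h.mean()
```

Because every mini-batch contributes an independent average of the same h-statistic, summing the per-batch estimates recovers an unbiased estimate over the full data, which is the property the DAN training procedure exploits.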
An extension to DAN is achieved with the Residual Transfer Network (RTN) in
Long et al. (2016b), which implements a residual layer as the final layer of the network
in addition to the softmax loss. In the RTN, the feature adaptation is achieved with
MMD loss and the source and target classifier adaptation is implemented through the
residual layer (He et al. (2016)). The source classifier fS(x) is tightly coupled with
the target classifier fT (x) varying with only a slight perturbation ∆f(fT (x)) which
is learned by the network,
fS(x) = fT (x) + ∆f(fT (x)) (3.30)
In addition, the source classifier is constrained by the softmax loss over the source
data and the target classifier is constrained with an entropy loss over the unlabeled target data.
Compacting deep neural networks and reducing the number of parameters is es-
sential for creating smaller more manageable networks. These procedures usually
replace the larger convolutional layer kernels with kernels of size 1× 1 and 3× 3. Al-
though such procedures produce networks that maintain the classification accuracies,
the authors Wu et al. Wu et al. (2017), note that the adaptability of these networks
is adversely affected, resulting in low accuracies for domain adaptation. Wu et al.
propose a set of layers called Conv-M, which consists of multi-scale convolution and
deconvolution with kernels larger than 3 × 3 . The proposed compact network also
uses MMD to align the source and target features at multiple layers and produces
state-of-the-art results on the standard Office and Office-Caltech datasets. In addition, the network is guided with a reconstruction loss that tries to reconstruct images from the encoded feature representations. The Domain reconstruction and Classification
Network (DRCN) developed by Ghifary et al. Ghifary et al. (2016), is also guided
by a reconstruction loss that decodes the feature encoding along with a standard
classification loss.
While the MMD is a standard non-parametric measure used to align the features
of the domains, Koniusz et al. Koniusz et al. (2017) propose a technique to align
the higher order statistics of the features. The scatter statistics of samples belong-
ing to a class (within-class) are aligned across the two domains. These include the
means, scale/shear and the orientation measures of samples belonging to a single
class. The procedure also maintains good separation for between-class scatters to
enhance classification accuracies. Unlike the popular unsupervised setting, this deep
learning technique is trained using a few labeled data from the target domain.
In all the above deep domain adaptation approaches, the weights are shared be-
tween the source and the target network to ensure domain invariant features. The
authors in Rozantsev et al. (2016), argue that merely ensuring domain invariant fea-
tures may be detrimental to the discriminative power of the features. Their model is
a twin network (one for the source and another for the target) with a loss function
over the weights for every source target layer pair. The loss term ensures the weights
of the source and the target are closely related (but not identical). The source net-
work is trained with a softmax loss over the source data and both the networks also
minimize the MMD loss to extract domain invariant features.
3.4.3 Adversarial Methods
In recent years, one of the most significant advances to deep learning has been the
introduction of Generative Adversarial Networks (GAN) by Goodfellow et al. (2014).
GANs are generative networks that generate data (text, images, audio, etc.), such that
the data follows a predetermined distribution P (X). A vanilla GAN implementation
has two deep networks, generator g(.) and discriminator f(.), competing against each
other. The generator network takes in a noise vector z ∈ Rd sampled from a uniform
or normal distribution and generates an output g(z). The discriminator takes in
x ∈ P (X) and g(z) and tries to discriminate between the two. The generator network
tries to fool the discriminator network by generating data that appears to belong to
P (X) and the discriminator tries to distinguish between real images and fake images.
The equilibrium is a saddle point in the network parameter space.
Apart from GANs, there are other generative models like the Variational Autoen-
coders (VAE) in Kingma and Welling (2013) and PixelRNN in Oord et al. (2016)
and they all have their pros and cons 1. GANs provide the sharpest images but are
very difficult to optimize due to unstable training dynamics. VAEs provide complex
Bayesian inference models but the generated images can be blurry. PixelRNNs have
1https://openai.com/blog/generative-models/
a simple and stable training process, but their sampling strategy is inefficient. GANs
have had tremendous success in multiple applications. Some of these include, image
super-resolution (Ledig et al. (2016)), text-to-image generation (Reed et al. (2016);
Zhang et al. (2016)), image-to-image generation (Isola et al. (2016)) and conditional
image generation (Nguyen et al. (2016); Chen et al. (2016)). In this section, some of
the successful adversarial based domain adaptation techniques will be discussed.
In the Domain Adversarial Neural Network (DaNN) in Ganin et al. (2016), the
authors train a deep neural network in a domain adversarial manner for image clas-
sification based domain adaptation. The bottom layers of the network act as feature
extractors. The features from the bottom layers are fed into two branches of the
network. The first branch is a softmax classifier trained with the labeled source data.
The second branch is a domain classifier trained to distinguish between the features
of the source and the target. The key to the DaNN is the gradient reversal layer
connecting the bottom feature extraction layers and the domain classifier. During
backpropagation, the gradient from the domain classifier is reversed when learning
the feature extractor weights. In this way, the feature extractor is trained to extract
domain invariant features. A closely related work is also presented in Ajakan et al.
(2014).
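The gradient reversal layer at the heart of the DaNN can be illustrated in miniature. The sketch below is framework-agnostic numpy (an assumption; the original is a layer inside a deep learning framework's autodiff graph): the layer is the identity in the forward pass and multiplies the incoming gradient by −λ in the backward pass.

```python
import numpy as np

# Gradient reversal in miniature: identity forward, negated gradient backward.
# The feature extractor therefore ascends the domain classifier's loss
# (confusing the domains) while still descending the label classifier's loss.
def grl_forward(features):
    return features                      # identity on the forward pass

def grl_backward(grad, lam=1.0):
    return -lam * grad                   # reversed, scaled gradient on the backward pass

g = np.array([0.5, -0.2, 1.0])
assert np.allclose(grl_forward(g), g)
assert np.allclose(grl_backward(g), [-0.5, 0.2, -1.0])
```

Placed between the feature extractor and the domain classifier, this single sign flip turns standard backpropagation into the adversarial min-max training described above.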
Liu and Tuzel implement a Coupled Generative Adversarial Network model (Co-
GAN) in Liu and Tuzel (2016). The CoGAN trains a coupled network which shares
weights at different layers of the GAN. The CoGAN is setup so that the lower layers
of the generator and the upper layers of the discriminator share weights. A common
noise vector z, is fed into the two generators g1(.) and g2(.) to generate outputs g1(z)
and g2(z). These outputs are fed into two discriminators f1(.) and f2(.). The discrim-
inator f1(.) is trained to discriminate between g1(z) and the source xs. Likewise,
the discriminator f2(.) is trained to discriminate between g2(z) and the target xt.
In addition, the source discriminator has an additional softmax layer to classify the
source data points xs. The CoGAN was tested with MNIST and USPS data to yield
impressive unsupervised domain adaptation results. It remains to be seen if these
results can be replicated with object classification datasets.
The Pixel-GAN in Bousmalis et al. (2017), is a straightforward extension of the
GAN for unsupervised domain adaptation. In this model, instead of a noise vector
input z, the generator is input the source images and trained to convert them into
target images. The discriminator on the other hand, is trained to distinguish between
real target images and generated target images (fake target images generated from
the source). In addition, a separate network is trained to classify the generated target
images. There have been many recent works applying adversarial training for domain
adaptation. A few of the most recent procedures are, Kamnitsas et al. (2016), Tzeng
et al. (2017), Sankaranarayanan et al. (2017) and Peng and Saenko (2017).
3.4.4 Sundry Deep Methods
One of the earliest procedures for deep learning domain adaptation was proposed
by Chopra et al. Chopra et al. (2013). The Deep Learning for domain adaptation by
Interpolation between Domains (DLID) learns a cross-domain representation by inter-
polating the path between the source and target domains along the lines of Gopalan
et al. (2011). Multiple datasets with varying ratios of source and target data points
are sampled to create intermediate representations between the two domains. The
final cross-domain feature is a concatenation of all the intermediate feature represen-
tations.
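The interpolation-by-sampling step can be sketched as follows. This is a schematic of the idea only, with made-up toy data and sizes: datasets are drawn with a source fraction that slides from 1 to 0, giving the intermediate datasets from which intermediate representations would then be learned.

```python
import numpy as np

rng = np.random.default_rng(4)
Xs = rng.normal(0.0, 1.0, (100, 5))   # toy source domain
Xt = rng.normal(2.0, 1.0, (80, 5))    # toy target domain (shifted mean)

def interpolated_dataset(source_frac, n=60):
    # Mix source and target points in the given ratio, yielding one
    # dataset on the DLID-style path between the two domains.
    n_s = int(round(source_frac * n))
    idx_s = rng.choice(len(Xs), n_s, replace=False)
    idx_t = rng.choice(len(Xt), n - n_s, replace=False)
    return np.vstack([Xs[idx_s], Xt[idx_t]])

# Slide the source fraction from 1 to 0 to walk from source to target.
path = [interpolated_dataset(f) for f in (1.0, 0.75, 0.5, 0.25, 0.0)]
assert all(D.shape == (60, 5) for D in path)
```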
Hu et al. (Hu et al. (2015)), develop a metric learning method for supervised
transfer learning using clustering. The Deep Transfer Metric Learning (DTML) model
trains a deep neural network to minimize intra-class distances and increase inter-class
distances. In addition, the features of the last layer of the network are learned to
be domain invariant by minimizing the MMD between the source and target feature
outputs. Further along clustering methods, Sener et al. in Sener et al. (2016), train
a deep neural network to estimate the labels of target data in a transductive setting.
The algorithm learns an asymmetric similarity metric relating the source data with
the target data. The deep network predicts the labels so as to minimize intra-class
distances and maximize inter-class distances.
Sun et al. Sun et al. (2015b) develop a domain transfer method called the local-
ized action frame (LAF) for fine-grained action localization in temporally untrimmed
videos. The LAF motivates domain transfer between weakly labeled web images and
videos. The domain transfer works in both directions; the video frames are used to
select web images that are relevant and drop non-action web images and in turn the
web images are used to select action-like frames and drop non-action frames in the
video. After the relevant frames and images are selected, a long short-term memory
(LSTM) network is used to train a fine-grained action detector that models the
temporal evolution of actions and classifies the action in the frames. The work also
released a dataset of sports videos with over 130,000 videos from 240 categories.
Bousmalis et al. (Bousmalis et al. (2016)), train Domain Separation Networks
(DSN) to extract domain-invariant feature representations and domain-specific rep-
resentations of source and target data. A shared encoder network E_c(x) is trained to
extract domain-invariant feature representations for the source and the target data.
Private encoder networks E_p^s(x) and E_p^t(x) for the source and target respectively, are
trained to extract feature representations that are distinct from the domain-invariant
representations that are the outputs of E_c(x). A shared decoder network is trained to
reconstruct the original input data based on the outputs from E_c(x), E_p^s(x) and E_p^t(x).
A classifier is trained with the source outputs of E_c(x). The feature representations
that are the outputs of E_c(x) can be declared to be domain-invariant.
3.5 Miscellaneous Methods for Domain Adaptation
This section outlines an assortment of methods that could not be classified into
the above categories.
3.5.1 Manifold based Methods
Manifold methods for domain adaptation treat the subspaces of the source and
target domains as points on a manifold. Since it is a manifold of subspaces, such a
manifold is termed a Grassmann manifold. The task of domain adaptation is then
to learn a transformation from one domain to the other, which can be
represented as a curve on the manifold. In Gopalan et al. (2011) and Gopalan et al.
(2014), the authors sample intermediate subspaces along the manifold curve connect-
ing the source domain to the target domain. These transformations are applied to
the source data to gradually transfer it to the target subspace. The authors in Gong
et al. (2012a), sample an infinite number of such subspaces along the source-target-
curve. The effect of applying a sequence of infinite such subspace transformations is
captured with the Geodesic Flow Kernel.
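The subspace-sampling step can be made concrete with a small sketch. The following uses the standard Grassmann geodesic formula as a generic construction consistent with the idea of sampling intermediate subspaces; it is not the exact procedure of the cited works and assumes the two subspaces are not orthogonal to each other. Setting t = 0 recovers the source subspace, t = 1 the target subspace, and intermediate t values give the in-between subspaces.

```python
import numpy as np

def grassmann_geodesic(Us, Ut, t):
    """Point at position t on the Grassmann geodesic between the subspaces
    spanned by the orthonormal bases Us and Ut (standard formula; assumes
    the subspaces are not orthogonal to each other)."""
    M = Us.T @ Ut                          # overlap between the two bases
    Q = (Ut - Us @ M) @ np.linalg.inv(M)   # component outside span(Us)
    R, sig, Vt = np.linalg.svd(Q, full_matrices=False)
    theta = np.arctan(sig)                 # principal angles
    V = Vt.T
    return Us @ (V * np.cos(t * theta)) + R * np.sin(t * theta)

rng = np.random.default_rng(5)
Us, _ = np.linalg.qr(rng.normal(size=(6, 2)))  # toy source subspace basis
Ut, _ = np.linalg.qr(rng.normal(size=(6, 2)))  # toy target subspace basis
proj = lambda U: U @ U.T                       # projector onto span(U)
assert np.allclose(proj(grassmann_geodesic(Us, Ut, 0.0)), proj(Us), atol=1e-6)
assert np.allclose(proj(grassmann_geodesic(Us, Ut, 1.0)), proj(Ut), atol=1e-6)
```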
3.5.2 Dictionary Based Methods
A sparse representation of signals and images has multiple applications. High
dimensional data can often be represented using a sparse combination of signals from
a specified dictionary. Dictionary based domain adaptation methods seek to learn
common dictionaries for the source and target and represent data from either domain
as sparse vectors encoding these signals. A framework for transforming a dictionary
learned on one domain to another, while maintaining a sparse domain-invariant signal
representation, is implemented in Qiu et al. (2012). In Shekhar et al. (2013), the
authors learn a common domain invariant dictionary for both the domains in a semi-
supervised setting. In order to ensure high correlation between the features from
different domains, the model projects the data to a common low-dimensional subspace
while also maintaining the data structure on the manifold. Ni et al. in Ni et al. (2013),
bring together manifold methods along with dictionary approaches by generating a
set of dictionaries connecting the source dictionary with the target dictionary along
the lines of Gopalan et al. (2011).
3.5.3 Feature Augmentation Methods
One of the less complex approaches to domain adaptation was developed by
Daume III in Daume III (2007). The procedure concatenates the vectors from the
domains to capture domain specific representations and domain invariant represen-
tations. Every data point in the source and target domain is represented as follows:
φ_s(x^s) = [x^{s⊤}, x^{s⊤}, 0^⊤]^⊤ and φ_t(x^t) = [x^{t⊤}, 0^⊤, x^{t⊤}]^⊤. It was demonstrated that
for a linear classifier trained with these features, the first component captured the
domain invariant attributes of the data and the next two components captured the
domain specific attributes. Daume III also introduced a kernelized version of the
algorithm in the same work. In Daume III et al. (2010), the authors extend the work
to semi-supervised domain adaptation.
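The augmentation map is simple to state in code. A minimal sketch with illustrative function names:

```python
import numpy as np

def augment_source(x):
    # [shared copy, source-specific copy, zeros]
    return np.concatenate([x, x, np.zeros_like(x)])

def augment_target(x):
    # [shared copy, zeros, target-specific copy]
    return np.concatenate([x, np.zeros_like(x), x])

x = np.array([1.0, 2.0])
print(augment_source(x))  # [1. 2. 1. 2. 0. 0.]
print(augment_target(x))  # [1. 2. 0. 0. 1. 2.]
```

A linear classifier w = [w0, w_src, w_tgt] on the augmented space scores a source point as (w0 + w_src)^⊤x, so the first block w0 ends up carrying what is common to both domains.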
The authors in Li et al. (2014), extend feature augmentation by including feature
transformation. The new augmented and transformed features are given by,
φ_s(x^s) = [(W_1x^s)^⊤, x^{s⊤}, 0_{d_T}^⊤]^⊤ and φ_t(x^t) = [(W_2x^t)^⊤, 0_{d_S}^⊤, x^{t⊤}]^⊤ (3.31)

The method is applicable for heterogeneous domain adaptation where the source
and target data have different dimensions. W_1 ∈ R^{k×d_S} and W_2 ∈ R^{k×d_T} are the
transformation matrices with d_S and d_T being the source and target data dimensions
respectively. The transformation matrices W_1 and W_2 map the data to a common
subspace, enabling domain adaptation.
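As a concrete illustration, a toy version of this mapping can be written as follows. The dimensions and weights here are made-up placeholders; in the actual method W_1 and W_2 are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, d_t, k = 4, 3, 2            # toy source dim, target dim, common dim
W1 = rng.normal(size=(k, d_s))   # stand-in for the learned source projection
W2 = rng.normal(size=(k, d_t))   # stand-in for the learned target projection

def phi_s(x):
    # Equation (3.31): [W1 x, x, 0_{d_t}]
    return np.concatenate([W1 @ x, x, np.zeros(d_t)])

def phi_t(x):
    # Equation (3.31): [W2 x, 0_{d_s}, x]
    return np.concatenate([W2 @ x, np.zeros(d_s), x])

xs, xt = rng.normal(size=d_s), rng.normal(size=d_t)
# Despite different input dimensions, both augmented points live in the
# same (k + d_s + d_t)-dimensional space.
assert phi_s(xs).shape == phi_t(xt).shape == (k + d_s + d_t,)
```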
This concludes the survey on domain adaptation methods for computer vision.
The survey gives a new perspective on the gamut of research in this area and will
hopefully provide new insights to researchers interested in domain adaptation.
Chapter 4
LINEAR FEATURE SPACES FOR DOMAIN ADAPTATION
There are various models that attempt to achieve domain adaptation between the
source and the target. Chapter (3) provides a classification of different approaches
to achieving domain adaptation. This chapter describes an example from one such
class of methods - namely linear approaches. Linear models for domain adaptation
are built using some form of linear transformation of the source and/or the target. A
standard approach is to learn a projection matrix that transforms data from one do-
main, so that cross domain disparity is minimized as in Saenko et al. (2010). Another
standard linear approach is to match the higher order moments of the two domains,
and this reduces domain discrepancy as discussed in Sun et al. (2015a). The most
common approach for linear domain adaptation is based on classifier adaptation using
the Support Vector Machine (SVM). The linear SVM learns a linear decision bound-
ary, which can be expressed as a linear combination of the training data points. In an
adaptive SVM, the decision boundary learned with the source training data, is modi-
fied to act as a decision boundary for the target data, as in Bruzzone and Marconcini
(2010); Aytar and Zisserman (2011).
In this chapter, a linear SVM for domain adaptation is outlined. In this case
the domain adaptive model is a couple of SVM decision boundaries - one for the
source and the other for the target. Section (4.1) introduces the linear model for
domain adaptation which is then developed in Section (4.2). This section derives the
two decision boundary SVM and reduces it to a standard SVM form. Section (4.3)
outlines the experiments that were conducted to test the proposed model. The final
section summarizes the contributions along with proposals for extensions.
4.1 A Linear Model for Domain Adaptation
This work considers the problem of semi-supervised domain adaptation, where
labeled samples from the source domain are used along with a limited number of
labeled samples from the target domain, to train a linear classifier for the target domain.
A linear Support Vector Machine model called the Coupled-SVM is outlined. The
Coupled-SVM simultaneously estimates linear SVM decision boundaries ws and wt,
for the source and target training data respectively.
Using a technique termed instance matching, researchers sample source data
points such that the difference between the means of the sampled source and target
data is minimized, Duan et al. (2012); Long et al. (2014). The intuition behind
the Coupled-SVM is along similar lines, where the difference between ws and wt is
penalized. Since the SVM decision boundaries are a linear combination of the data
points, penalizing the difference between ws and wt, can be viewed as penalizing the
difference between the weighted means of the source and target data points. Figure
(4.1a), illustrates standard SVM based domain adaptation, where ws is first learned
for the source and it is perturbed to obtain the target (wt). The perturbed SVM wt
could be vastly different from ws and can over-fit the target training data. Figure
(4.1b), depicts the Coupled-SVM, where ws and wt, are learned simultaneously. The
source SVM ws, provides an anchor for the target SVM wt. The difference between
ws and wt is modeled based on the difference between the source and target domains.
In addition, the Coupled-SVM trades training error for improved generalization, as
illustrated in Figure (4.1c).
The following sections describe the Coupled-SVM model. Although the model
outputs two decision boundaries, the Coupled-SVM can be reduced to a standard
SVM formulation. The model is compared with other SVM based domain adaptation
models using different datasets.
Figure 4.1: (a) Domain adaptation with a SVM. The SVM for the source ws is modified to get the target SVM wt. (b) In the Coupled-SVM, both ws and wt are learned together. In this setting, the training error can be high. (c) The Coupled-SVM does not over-fit. Red is source and green is target data. Filled objects are train data and unfilled objects are test data. Image based on Venkateswara et al. (2015b).
4.2 The Coupled Support Vector Machine
A linear classifier is learned with the coupled SVM model for the source and the
target simultaneously. The idea is to modify the source classifier (ws) to learn a
target classifier (wt). The difference ||ws −wt||2, between the source and the target
classifier is penalized by a weight.
4.2.1 Coupled-SVM Notation
The source data is denoted as {(x^s_i, y^s_i)}_{i=1}^{ns} ⊂ S, where S is the source domain,
x^s_i ∈ R^d are data points and y^s_i ∈ {−1,+1} are their labels. The following discussion
considers a binary category classification model which can be extended to a multi-
category scenario. Similarly, {(x^t_i, y^t_i)}_{i=1}^{nt} ⊂ T, where T is the target domain,
x^t_i ∈ R^d are data points and y^t_i ∈ {−1,+1} are their labels. The goal is to develop a
classifier ft(.) that can predict the labels y^t = sgn(ft(x^t)), ∀x^t ∈ {x^t | (x^t, y^t) ∈ T},
where sgn(.) is the sign function. The constraint here is that the number of target
samples, nt, is limited to a few samples from each category, i.e. nt ≪ ns. Training
the classifier ft(.) with only nt samples could lead to over-fitting. The classifier ft(.)
should generalize to a larger subset of T and must not over-fit the training data
{(x^t_i, y^t_i)}_{i=1}^{nt}. To this end, the source data {(x^s_i, y^s_i)}_{i=1}^{ns} is also used to train the
target classifier.
4.2.2 Coupled-SVM Model
The linear domain adaptation model is outlined as follows:
min_{fs,ft} λD(fs, ft) + R(fs, ft) + Cs Σ_{i}^{ns} L(y^s_i, x^s_i, fs) + Ct Σ_{i}^{nt} L(y^t_i, x^t_i, ft) (4.1)
The aim is to estimate classifiers {fs, ft} such that Equation (4.1) is minimized. The
first term is a measure of the difference between the source classifier fs and target
classifier ft. λ controls the importance of this difference. The hypothesis is that
the source and target classifiers are related. The similarity(dissimilarity) between
the datasets is expected to be captured by the similarity(dissimilarity) between the
classifiers. The second term penalizes the complexity of the classifiers fs and ft. The
third and fourth terms are a measure of the classification error over the source data
and the target data respectively. Cs and Ct, control the importance of the loss terms
for the source and target data respectively. A linear classifier such as a Linear SVM
is considered for the discussion, with f(x) = w⊤x + b, where, w ∈ Rd is the SVM
decision boundary, b is the SVM bias, which is a scalar and w⊤ represents transpose
of w. The square of the L2 norm, ||·||_2^2, is chosen as the regularizer R(.). The loss
function is set to the standard hinge loss, L(y, x, f) = max(0, 1 − y·f(x)). The loss
function can be rewritten in the form of constraints for a ‘soft’-margin SVM model
(see Figure (4.2)).
SVM Model: This box outlines a brief review of the constrained SVM model
and its equivalence to the unconstrained SVM model. The linearly separable case
for SVM requires solving the constrained optimization problem,
min_w ||w||_2^2   s.t. y_i(w^⊤x_i + b) ≥ 1 for i ∈ {1, . . . , n} (4.2)
Here, {xi, yi}ni=1 is the labeled training data and xi ∈ Rd and yi ∈ {−1,+1}. The
‘soft’-margin constrained SVM model addresses the case of linear inseparability
and is given by,
min_w ||w||_2^2 + C Σ_{i=1}^{n} ξ_i   s.t. y_i(w^⊤x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for i ∈ {1, . . . , n} (4.3)
The slack variables ξi ease the strict constraint on separability by allowing data
points to lie within the margin and also on the incorrect side of the SVM margin.
The constraint y_i(w^⊤x_i + b) ≥ 1 − ξ_i can be rephrased as y_i f(x_i) ≥ 1 − ξ_i, where
f(x) = w^⊤x + b. Along with ξ_i ≥ 0, the constraint can be written as,

ξ_i = max(0, 1 − y_i f(x_i)) (4.4)
Equation (4.3) can then be expressed in an unconstrained form as,
min_w ||w||_2^2 + C Σ_{i=1}^{n} max(0, 1 − y_i f(x_i)) (4.5)
The second term in Equation (4.5) is the loss and in this case, it is called the
hinge loss.
Figure 4.2: SVM Model: Constrained and unconstrained formulations.
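The slack expression in Equation (4.4) is easy to verify numerically. A small sketch with made-up toy scores:

```python
import numpy as np

def hinge_loss(y, fx):
    # Slack of Equation (4.4): zero for points beyond the margin,
    # growing linearly for points inside it or misclassified.
    return np.maximum(0.0, 1.0 - y * fx)

y = np.array([1, 1, -1, -1])
fx = np.array([2.0, 0.5, -0.2, 1.0])  # toy values of f(x) = w.x + b
print(hinge_loss(y, fx))  # [0.  0.5 0.8 2. ]
```

The first point is correctly classified beyond the margin (zero loss), the second is inside the margin, the third is correct but within the margin, and the fourth is misclassified.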
The source classifier is defined by fs(x) = ws^⊤x + bs. Similarly, the target classifier
is ft(x) = wt^⊤x + bt. The source and target SVM decision boundaries are ws and
wt, respectively. The source and target biases are bs and bt, respectively. To simplify
the notation, bias values are incorporated into the decision boundaries. The decision
boundaries are redefined as ws ← [ws^⊤, bs]^⊤ and wt ← [wt^⊤, bt]^⊤. To account for
the bias, the data variables are augmented with 1. The re-defined data variables are
x^s_i ← [x^{s⊤}_i, 1]^⊤ and x^t_i ← [x^{t⊤}_i, 1]^⊤. Including these definitions, Equation (4.1) can be
redefined as follows,
{ws*, wt*} = argmin_{ws,wt} (λ/2)||ws − wt||_2^2 + (1/2)||ws||_2^2 + (1/2)||wt||_2^2 + Cs Σ_{i}^{ns} ξ^s_i + Ct Σ_{i}^{nt} ξ^t_i

s.t. y^s_i(ws^⊤x^s_i) ≥ 1 − ξ^s_i, ξ^s_i ≥ 0, i ∈ [1, . . . , ns]
     y^t_i(wt^⊤x^t_i) ≥ 1 − ξ^t_i, ξ^t_i ≥ 0, i ∈ [1, . . . , nt] (4.6)
The terms in Equation (4.6) are rearranged for formatting reasons. The dissimilarity
measure D(.), is defined as the square of the Euclidean distance between ws and wt.
The rest of the terms make up a set of two SVM models, one for the source data
and the other for the target data. Equation (4.6) is a modification of the standard
Linear SVM having two decision boundaries along with an additional term capturing
the relation between the two decision boundaries. In the following subsection, Equation (4.6)
is re-formulated as a standard SVM.
4.2.3 Coupled-SVM Solution
In order to reduce Equation (4.6) to a standard SVM problem, a new set of vari-
ables is defined based on the existing variables. The two SVM decision boundaries
are concatenated into a single variable w ∈ R^{2(d+1)}, defined as,

w ← [ws^⊤, wt^⊤]^⊤ (4.7)
The individual SVMs ws and wt can be extracted from w using permutation matrices
Is ∈ R^{(d+1)×2(d+1)} and It ∈ R^{(d+1)×2(d+1)}, where Is and It are binary matrices such
that,

Is w = ws and It w = wt (4.8)

For example, let v1 = [a, b]^⊤, v2 = [c, d]^⊤ and v ← [v1^⊤, v2^⊤]^⊤. Then the permutation
matrices I_{v1} and I_{v2} such that I_{v1}v = v1 and I_{v2}v = v2, are given by,

I_{v1} = [1 0 0 0; 0 1 0 0],  I_{v2} = [0 0 1 0; 0 0 0 1].
New variables (x_i, y_i, c_i) ∀i ∈ {1, . . . , (ns + nt)} are introduced, where x_i ∈ R^{2(d+1)}
and y_i ∈ {−1,+1} are the new data points and c_i ∈ {Cs, Ct} is the importance of the
classification error for the i-th data point. x_i is defined as,

x_i ← [x^{s⊤}_i, 0^⊤]^⊤ for 1 ≤ i ≤ ns, and x_i ← [0^⊤, x^{t⊤}_{i−ns}]^⊤ for (ns + 1) ≤ i ≤ (ns + nt) (4.9)

where 0 ∈ R^{d+1} is a vector of zeros. Similarly, {y_i, c_i} are defined as,

{y_i, c_i} ← {y^s_i, Cs} for 1 ≤ i ≤ ns, and {y_i, c_i} ← {y^t_{i−ns}, Ct} for (ns + 1) ≤ i ≤ (ns + nt) (4.10)
With the definitions of w and x_i, it can be noted that

w^⊤x_i = ws^⊤x^s_i for 1 ≤ i ≤ ns, and w^⊤x_i = wt^⊤x^t_{i−ns} for (ns + 1) ≤ i ≤ (ns + nt) (4.11)
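The identity in Equation (4.11) follows directly from the block structure of w and x_i, and can be checked numerically. A small sketch with toy dimensions (all values are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
d, ns, nt = 3, 4, 2
Xs = rng.normal(size=(ns, d + 1))  # toy stand-ins for bias-augmented source points
Xt = rng.normal(size=(nt, d + 1))  # toy stand-ins for bias-augmented target points
ws, wt = rng.normal(size=d + 1), rng.normal(size=d + 1)
w = np.concatenate([ws, wt])       # Equation (4.7)

def stack(x, is_source):
    # Equation (4.9): zero-pad each point into R^{2(d+1)}.
    z = np.zeros(d + 1)
    return np.concatenate([x, z]) if is_source else np.concatenate([z, x])

# Equation (4.11): the stacked inner product picks out the right classifier.
for x in Xs:
    assert np.isclose(w @ stack(x, True), ws @ x)
for x in Xt:
    assert np.isclose(w @ stack(x, False), wt @ x)
```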
For the remaining derivation, it is assumed the data is linearly separable, i.e. slack
variables ξ^s_i = 0 and ξ^t_i = 0, ∀i. The slack variables will be re-introduced towards the
end. The minimization problem in Equation (4.6) can now be re-formulated as,

w* = argmin_w (λ/2)||Ist w||_2^2 + (1/2)||w||_2^2   s.t. y_i(w^⊤x_i) ≥ 1, i ∈ [1, . . . , (ns + nt)] (4.12)

where Ist ← (Is − It), with Is w = ws and It w = wt for the first term. The second term
is obtained with (1/2)||ws||_2^2 + (1/2)||wt||_2^2 = (1/2)||w||_2^2. Since this is a linearly separable case,
the constraints adhere to a margin 1. In order to solve the optimization problem,
Lagrangian variables {α_i} are introduced for each of the constraints y_i(w^⊤x_i) ≥ 1.
The Lagrangian is defined as,
L(w, α) = (λ/2)||Ist w||_2^2 + (1/2)||w||_2^2 − Σ_{i}^{ns+nt} α_i (y_i w^⊤x_i − 1) (4.13)
The Lagrangian needs to be minimized w.r.t. w and maximized w.r.t. α. Opti-
mization is carried out first w.r.t. w by setting the derivative ∂L(w,α)/∂w = 0:

∂L/∂w = λ Ist^⊤Ist w + w − Σ_{i}^{ns+nt} α_i y_i x_i = 0 (4.14)

⟹ w = D Σ_{i}^{ns+nt} α_i y_i x_i (4.15)

where I is an identity matrix and,

D = (I + λ Ist^⊤Ist)^{−1} (4.16)
Because of the nature of the permutation matrices Is and It, (I + λIst^⊤Ist) is full rank
and therefore D exists. Following the definition v ← Σ_{i}^{ns+nt} α_i y_i x_i and a sub-
stitution for w in Equation (4.13), the optimization problem in terms of the Lagrangian
variable α is,

{α*} = argmax_α (λ/2)v^⊤D^⊤Ist^⊤Ist Dv + (1/2)v^⊤D^⊤Dv − v^⊤Dv + 1^⊤α
     = argmax_α v^⊤((1/2)D^⊤D + (λ/2)D^⊤Ist^⊤Ist D − D)v + 1^⊤α
     = argmax_α v^⊤((1/2)D^⊤(I + λIst^⊤Ist)D − D)v + 1^⊤α
     = argmax_α v^⊤((1/2)D^⊤ − D)v + 1^⊤α   (Equation (4.16))
     = argmin_α (1/2)v^⊤Dv − 1^⊤α   (D is symmetric)
     = argmin_α (1/2)α^⊤Qα − 1^⊤α. (4.17)
Equation (4.17) can be viewed as the standard SVM dual, where Q_{ij} = y_i y_j x_i^⊤D x_j
and 1 is a vector of 1s. By defining x̄_i = D^{1/2}x_i, the matrix Q can be expressed in a
standard SVM dot product form, Q_{ij} = y_i y_j x̄_i^⊤x̄_j, which reduces Equation (4.17) to
a standard quadratic minimization problem. Any of the standard SVM libraries can
then be used to arrive at a solution for α.
In the space of x̄_i, the decision boundary is given by w̄ = Σ_i α_i y_i x̄_i. In the space
of x_i, the decision boundary is given by w = D Σ_i α_i y_i x_i. Therefore, w = D^{1/2}w̄.
For the experiments in this chapter, the LIBLINEAR package Fan et al. (2008) was
adapted. It has an implementation of a weighted SVM where weights (importance)
can be specified for each training data point. The slack variables ξ^s_i and ξ^t_i will now
be re-introduced through the weight term c_i. The c_i is introduced as a constraint on
α_i. In solving Equation (4.17), the constraint on α_i is usually given by 0 ≤ α_i ≤ ∞.
This constraint is used for a linearly separable SVM solution. The linearly inseparable
SVM ('soft'-margin) can be modeled by modifying the constraint to 0 ≤ α_i ≤ c_i.
This is equivalent to solving a 'soft'-margin SVM by introducing slack variables ξ^s_i
and ξ^t_i.
Once w is known, ws and wt can be estimated using Equation (4.8). The
Coupled-SVM can be easily extended to the multi-class setting using one-vs-one or
one-vs-all configurations, and SVM packages like LIBLINEAR provide multi-class
solvers. Using LIBLINEAR, the decision boundaries of the multi-class SVM
with P categories will be a matrix W ∈ R^{P×2(d+1)}, where each row of W is a decision
boundary. The first (d + 1) columns of each row correspond to ws and the second
(d + 1) columns correspond to wt for a particular category. The Coupled-SVM algo-
rithm for the binary case is outlined in Algorithm (1).
Algorithm 1 Coupled-SVM Domain Adaptation
Input: {(x^s_i, y^s_i)}_{i=1}^{ns}, {(x^t_i, y^t_i)}_{i=1}^{nt}, λ, Cs, Ct
Output: ws*, wt* as in Equation (4.6)
1: D ← (I + λIst^⊤Ist)^{−1} (Equation (4.16))
2: for i ← 1 to (ns + nt) do
3:   Define x_i as in Equation (4.9)
4:   x̄_i ← D^{1/2}x_i
5:   Define {y_i, c_i} as in Equation (4.10)
6: w̄ ← Linear Weighted SVM {x̄_i, y_i, c_i}_{i=1}^{(ns+nt)}
7: w ← D^{1/2}w̄
8: ws* ← Is w and wt* ← It w (Equation (4.8))
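Steps 1 and 4 of Algorithm (1) can be sketched directly in code. A toy illustration with made-up dimensions; the eigendecomposition route to D^{1/2} is one standard choice, since D is symmetric positive definite.

```python
import numpy as np

d1 = 3                    # d + 1 with a toy d = 2
lam = 0.5
I2 = np.eye(2 * d1)
Is = np.hstack([np.eye(d1), np.zeros((d1, d1))])  # selectors of Equation (4.8)
It = np.hstack([np.zeros((d1, d1)), np.eye(d1)])
Ist = Is - It
D = np.linalg.inv(I2 + lam * Ist.T @ Ist)          # Equation (4.16), step 1

# Matrix square root of the symmetric positive-definite D (step 4).
evals, evecs = np.linalg.eigh(D)
D_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
assert np.allclose(D_half @ D_half, D)
```

With D_half in hand, the transformed points x̄_i = D^{1/2}x_i can be handed to any off-the-shelf weighted linear SVM solver.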
4.3 Experimental Analysis for the Coupled-SVM
This section discusses the extensive experiments that were conducted to study
the Coupled-SVM model. The different datasets and their domains are first outlined
followed by a brief introduction to the baselines that are used for comparison. This
is followed by the experimental layout and a discussion of the results.
4.3.1 Experimental Setup
The Coupled-SVM was evaluated with multiple datasets from different applica-
tions and with multiple types of features. For all the experiments (except Office-
Caltech) the following setting has been used. For the training data, 20 examples are
sampled from the source domain and 3 examples from the target domain from every
category. The remaining examples in the target domain that are not used for training
are considered as test data. A few sampled images from the datasets are depicted in
Figure (4.3).
Figure 4.3: Top row: Images of objects from the datasets Amazon, Dslr, Webcam and Caltech. Second row: Facial expression images from CKPlus and MMI. Third row: Digit images from MNIST and USPS, followed by snapshots from HMDB51 and UCF50.
MNIST-USPS datasets: The digit datasets MNIST and USPS consist of images
of individual digits from 0 to 9. They are benchmark datasets for handwritten digit
recognition. In the following experiments, a subset of these datasets (2000 images
from MNIST and 1800 images from USPS) based on Long et al. (2014) has been
used. The images are represented as vectors of length 256 after resizing the images
to 16 × 16 pixels. These domains are referred to as MNIST and USPS respectively.
Office-Caltech datasets: The experimental setup for this dataset has been adapted
from Gong et al. (2012a). The Office dataset is made up of three domains, Amazon,
Dslr and Webcam. An additional domain called Caltech is included based on the
Caltech256 dataset. All of these domains are made up of images from a set of common
categories viz., {back-pack, bike, calculator, headphones, computer-keyboard, laptop,
monitor, computer-mouse, coffee-mug, video-projector}. For the experiments, the
800 dimension SURF-Bag-of-Words (SURF-BoW) features provided by Gong et al.
(2012a) are used. In creating the training data, 8 examples are sampled from the
source domain (for Amazon 20 are used) and 3 examples are sampled from the target
domain.
CKPlus-MMI dataset: The CKPlus Lucey et al. (2010) and MMI Pantic et al.
(2005) are popular datasets for facial expression recognition. From these datasets, six
categories were selected viz., {anger, disgust, fear, happy, sad, and surprise}, from
video frames with the most intense expression (peak frames) for every facial expres-
sion video sequence. This yields around 1500 images for each dataset with around 250
images per category. These domains are referred to as CKPlus and MMI. A deep neural
network was used to extract feature vectors from the images. Features extracted us-
ing pre-trained convolutional neural networks (CNN) have shown astonishingly good
results across a wide array of applications Razavian et al. (2014). Therefore, the
deep CNN developed by Simonyan and Zisserman Simonyan and Zisserman (2014)
was deployed as an ‘off-the-shelf’ feature extractor. The outputs of the first fully
connected layer from the 16 weight layer model with dimension 4096 were used as
features. These were reduced to 100 dimensions using PCA.
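The PCA reduction step can be sketched generically. This is a stand-in illustration (toy data and sizes are assumptions; the dissertation reduces 4096-dimensional CNN features to 100 dimensions):

```python
import numpy as np

def pca_reduce(X, k):
    # Center the rows of X and project them onto the top-k principal
    # components, obtained from the SVD of the centered data matrix.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

X = np.random.default_rng(2).normal(size=(50, 8))  # toy feature matrix
Z = pca_reduce(X, 3)
assert Z.shape == (50, 3)
```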
HMDB51-UCF50 dataset: Eleven common categories of activity were gathered
from HMDB51 Kuehne et al. (2011) and UCF50 Reddy and Shah (2013). The cat-
egories from UCF50 are as follows with the corresponding categories from HMDB51
in parenthesis, {BaseballPitch(throw), Basketball(shoot ball), Biking(ride bike), Div-
PushUps(pushup), Punch(punch), WalkingWithDog(walk)}. These domains are re-
ferred to as HMDB51 and UCF50. State-of-the-art HOG, HOF, MBHx and MBHy
descriptors were extracted from the videos based on the work of Kantorov and Laptev
(2014). The descriptors were then pooled into a 1×1×1 grid, and Fisher Vectors were
estimated with K = 256 Gaussians. The resulting large-dimension Fisher Vectors
(202,752 dimensions) were reduced to 100 dimensions using PCA.
4.3.2 Baselines for Comparison
The Coupled-SVM is compared with the following semi-supervised domain adap-
tation techniques based on SVMs. SVM(T): This is a linear SVM trained on the
target labeled data. It can be expected to over-fit the training data and perform
poorly on the test data. SVM(S): This is a linear SVM with training data from
source domain. It can be expected to perform poorly when there is domain discrep-
ancy between the source and target domains. SVM(S+T): This is a linear SVM
trained with a union of source and target domain training data. In this case too,
domain discrepancy can lead to poor performance. The above three methods are
baselines that the Coupled-SVM is expected to outperform. MMDT: The Max-
Margin Domain Transform Hoffman et al. (2013), learns a single SVM model for the
source and the transformed target data. In the MMDT, the target data is transformed
by a transformation matrix to be similar to the source data and the transformation
is learned in an optimization framework along with the SVM classifier. AMKL: The
Adaptive Multiple Kernel Learning Duan et al. (2012), implements a multiple kernel
method where multiple base kernel classifiers are combined with a pre-learned average
classifier obtained from fusing multiple nonlinear SVMs. Widmer et al. (Widmer et al.
(2012)) take a similar approach to solve multitask problems, using graph Laplacians
to model task similarity; in the Coupled-SVM, by contrast, the similarity between
the source and target is learned by the model. The Coupled-SVM holds a unique position in
this wide array of SVM solutions for domain adaptation. The Coupled-SVM trains
a linear SVM for both the source and target domains simultaneously, thereby mini-
mizing the chances of over-fitting, especially when there are very few labeled samples
from the target domain. The Coupled-SVM is labeled C-SVM in the figures and
tables.
4.3.3 Results
Eighteen experiments were conducted using different pairs of datasets as source
and target. Table (4.1) depicts the target recognition accuracies for multiple algo-
rithms. The results were averaged across 20 splits of data for the Office-Caltech
dataset. For the remaining datasets, 100 splits were used. The domain difference is
highlighted with the results from the SVM(S) experiments. Although the datasets
comprise the same categories, a classifier trained on the source does not perform
as well on the target because of the domain disparity. This fact is further highlighted
by the success of SVM(T), which shows improved accuracies even when only a few of the
target training data points are labeled. The SVM(S+T) illustrates that the naïve
union of the source and target training data may be beneficial in some cases but not
always. The Coupled-SVM outperforms the remaining algorithms but is nearly on
par with the AMKL. Although there is little to choose between the two in terms of
recognition accuracies, the Coupled-SVM is faster and it is easier to implement it
Table 4.1: Target recognition accuracies (%) for the object, digit, facial expression and activity datasets across multiple algorithms. {Amazon(A), Webcam(W), Dslr(D), Caltech(C), MNIST(M), USPS(U), CKPlus(K), MMI(I), HMDB51(H), UCF50(F)}. A→W implies A is the source domain and W is the target domain. The best results are highlighted in bold.
compared to the AMKL which is a multiple kernel based method.
Leave-one-out cross validation was applied for the training target data in order
to determine the optimal values of the model parameters {Cs, Ct, λ}. To get a better
idea of the role of labeled data, the number of labeled data points was varied and
the recognition accuracies were studied. Since the Webcam and Dslr datasets have
fewer data points compared to the other domains, they were left out of this analysis.

Figure 4.4: Average target recognition accuracies (%) for the Coupled-SVM across different experiments by varying the number of labeled training examples in the source and target. (a) Source Training Number (Target = 3). (b) Target Training Number (Source = 20). (c) (Source, Target) Training Number. Curves shown for A→C, C→A, M→U, U→M, K→I, I→K, H→F and F→H. Images based on Venkateswara et al. (2015b).

The results of this study are along expected lines. Increasing the number of
source training data points has little effect on improving the target test accuracies,
as illustrated in Figure (4.4a). The reason is that the decision boundaries do not
vary significantly with additional data. To estimate the source decision boundary,
the SVM relies on support vectors. Only the support vectors affect the direction of
the decision boundary. Since most of the data is not made up of support vectors
and lies away from the boundary, additional source training data does not change the
decision boundaries. On the other hand, the effect of additional target training data
is comparatively more pronounced as seen in Figure (4.4b). Labeled target data is
limited, and adding any new labeled target data points will help reshape the decision
boundary. Both of the above conclusions are reinforced with the results in Figure
(4.4c). The effect of increasing the number of both the source and target labeled
data points is comparable to increasing the number of labeled target data alone. The
additional source labeled data does not alter the decision boundary as much as the
additional labeled target data does. In other words, additional source data does
not contribute to the target SVM after the decision boundary has stabilized.
4.4 Conclusions and Summary
The experimental analysis indicates that the Coupled-SVM is a promising alternative for domain adaptation. The AMKL is nearly as good as the Coupled-SVM, but the Coupled-SVM has the advantage of being simple. Being a linear model, the Coupled-SVM is fast and easy to implement using existing libraries. In its current form, the Coupled-SVM requires some labeled data in the target domain, which is not unreasonable. However, a possible extension to the Coupled-SVM could be an unsupervised version that needs no target data labels. The Coupled-SVM models the difference between the domains as the difference between their decision boundaries. In order to estimate the importance of this difference (λ), a target validation dataset is required, which is not available for unsupervised domain adaptation. Chapter (5) will outline a procedure to create a validation set based on source data.
In this chapter a linear model for semi-supervised domain adaptation was introduced. The Coupled-SVM model trains two SVM classifiers, one for the source and another for the target. The linear model is efficient, elegant and easy to implement, as it builds on existing SVM libraries. Experimental results indicate that the Coupled-SVM performs comparably against competitive linear methods.
Chapter 5
NONLINEAR FEATURE SPACES FOR DOMAIN ADAPTATION
A linear model for domain adaptation was introduced in Chapter (4). Sometimes,
a linear model may be overly simplistic and may not effectively ameliorate the dis-
crepancy between the distributions of the two domains. When linear approaches are
ineffective in modeling domain diversity, nonlinear transformations of data are the
next available option for achieving domain adaptation. This chapter introduces a
nonlinear model for unsupervised domain adaptation.
Chapter (3) provided a survey of nonlinear approaches for domain adaptation. A
nonlinear model for domain adaptation is usually a kernel method which projects the
data to a high-dimensional (possibly infinite dimensional) space and aligns the do-
mains in that space. A standard procedure for nonlinear domain alignment is using
the Maximum Mean Discrepancy (MMD), as described in Equation (3.12). Many
nonlinear methods for domain adaptation use some form of MMD for domain align-
ment. The standard nonlinear approaches include max-margin methods (nonlinear
SVMs) like Duan et al. (2009), instance selection based on MMD like Chattopadhyay
et al. (2013) and spectral methods like Kernel-PCA as in Long et al. (2014). This
chapter introduces a spectral based nonlinear method for unsupervised domain adap-
tation. In addition to domain adaptation, the model enhances classification using
manifold based embedding of the nonlinearly transformed data.
The chapter is organized as follows. Section (5.1) provides an introduction to the
proposed nonlinear model with an intuitive toy example. Section (5.2) outlines the
model by deriving the various components. This section also introduces a valida-
tion technique for unsupervised domain adaptation, in the absence of labeled target
data. Section (5.3) provides the results of various experiments with the proposed
model. 50 different domain adaptation experiments were conducted to compare the
proposed model with existing competitive procedures for unsupervised domain adap-
tation. The validation procedure was evaluated using 7 popular domain adaptation
image datasets, including object, face, facial expression and digit recognition datasets.
5.1 A Nonlinear Model for Domain Adaptation
Nonlinear techniques are deployed in situations where the source and target do-
mains cannot be aligned using linear transformations. These techniques apply nonlin-
ear transformations on the source and target data in order to align them. For example,
Maximum Mean Discrepancy (MMD) is applied to learn nonlinear representations,
where the difference between the source and target distributions is minimized, as in
Pan et al. (2011). Even though nonlinear transformations may align the domains, the
resulting data may not be conducive to classification. If, after domain alignment, the
data were to be clustered based on similarity, it could lead to effective classification.
A two domain binary classification toy problem demonstrates this intuition. Fig-
ure (5.1a), depicts the two domains of a two-moon dataset. Figure (5.1b), shows the
data after it has been transformed by Kernel-PCA. Kernel-PCA projects the data
along nonlinear directions (‘curves’) of maximum variance, however the data gets
dispersed and there is no domain alignment. In Figure (5.1c), the data is presented
after the domains have been aligned using Maximum Mean Discrepancy (MMD). The
domains are now more aligned, however the classification enhancement is not guar-
anteed. Figure (5.1d), displays the data after it has been embedded using similarity-
based embedding along with MMD based domain alignment. This makes the data
more classification friendly. A classifier trained on the source will now provide en-
hanced accuracies on the target data.
Figure 5.1: Two-class classification problem with source data in blue and target data in red. The target labels are unknown. (a) Plot of the original 2-dimensional data, (b) The data after Kernel-PCA, which projects the data along nonlinear directions of maximum variance, (c) The data after MMD based projection that aligns the two domains, (d) The data after MMD+Similarity-based Embedding that aligns the domains and also clusters the data to ensure easy classification. Images based on Venkateswara et al. (2017a).
This chapter introduces the Nonlinear Embedding Transform (NET), which per-
forms a nonlinear transformation to align the source and target domains and also
cluster the data based on label-similarity. The NET algorithm is a spectral (eigen)
technique that requires certain parameters (like number of eigen bases, etc.) to be
pre-determined. These parameters are often given random values which need not
be optimal as in Pan et al. (2011); Long et al. (2013) and Long et al. (2014). This
chapter also outlines a validation procedure to fine-tune model parameters with a val-
idation set created from the source data. The two major contributions of this work
are outlined as follows:
• Nonlinear embedding transform (NET) algorithm for unsupervised DA.
• Validation procedure to estimate optimal parameters for an unsupervised DA
algorithm.
The following sections outline the NET model and compare its performance with
other state-of-the-art domain adaptation techniques across multiple datasets.
5.2 Nonlinear Embedding Transformation Model
The NET algorithm for unsupervised domain adaptation is outlined in the follow-
ing section. It is followed by the description of a cross-validation procedure that can
be used to estimate the model parameters for the NET algorithm.
As is usually the case for domain adaptation, two domains are considered; a source domain $S$ and a target domain $T$. The source domain is represented by $\mathcal{D}_s = \{(\mathbf{x}^s_i, y^s_i)\}_{i=1}^{n_s} \subset S$ and the target domain is represented by $\mathcal{D}_t = \{(\mathbf{x}^t_i, y^t_i)\}_{i=1}^{n_t} \subset T$. In matrix notation, the source data is given by $\mathbf{X}_S = [\mathbf{x}^s_1, \ldots, \mathbf{x}^s_{n_s}] \in \mathbb{R}^{d \times n_s}$ and the target data by $\mathbf{X}_T = [\mathbf{x}^t_1, \ldots, \mathbf{x}^t_{n_t}] \in \mathbb{R}^{d \times n_t}$. Similarly, the corresponding labels are represented by $Y_S = [y^s_1, \ldots, y^s_{n_s}]$ and $Y_T = [y^t_1, \ldots, y^t_{n_t}]$ for the source and target data respectively. The source and target data have the same dimensionality, $\mathbf{x}^s_i, \mathbf{x}^t_i \in \mathbb{R}^d$, and the label space for the two domains is identical, i.e., $y^s_i, y^t_i \in \{1, \ldots, C\}$. Additional terms are introduced for later use, like $\mathbf{X} := [\mathbf{x}_1, \ldots, \mathbf{x}_n] = [\mathbf{X}_S, \mathbf{X}_T]$, with $n = n_s + n_t$. In the unsupervised domain adaptation setting, the target labels $Y_T$ are unknown and the joint distributions of the two domains are different, $P_S(X, Y) \neq P_T(X, Y)$.
In order to predict the target data labels, a classifier $f(\mathbf{x}) = p(y|\mathbf{x})$ is trained. The posterior $p(y|\mathbf{x})$ is the probability that data point $\mathbf{x}$ is assigned label $y$. The labels $Y_T = [y^t_1, \ldots, y^t_{n_t}]$ are the estimated target data labels corresponding to $\mathbf{X}_T$, obtained using a classifier trained on $\mathcal{D}_s$ and $\mathbf{X}_T$. A classifier trained with only the source data $\mathcal{D}_s$,
may not accurately predict the target data labels because the joint distributions of
the source and target data are different. The NET algorithm modifies the source and
target data by projecting them to a common subspace using nonlinear transforma-
tions. In this subspace, the cross domain disparity is reduced and the data points are
clustered together based on the similarity of their labels. A classifier is then trained
using the projected source data which is then used to predict the target data labels.
5.2.1 Nonlinear Domain Alignment
A standard procedure to project data to a common subspace is through Principal
Component Analysis (PCA). A common basis for the source and target data is esti-
mated and this basis is used to linearly transform the source and target data. This
is often a baseline procedure applied to reduce domain disparity and can be viewed
as a naïve form of domain adaptation. PCA maximizes the variance of the projected data by estimating a projection matrix $U = [\mathbf{u}_1, \ldots, \mathbf{u}_k] \in \mathbb{R}^{d \times k}$. The variance of the projected data is given by $\sum_{i=1}^{n} \|U^\top \mathbf{x}_i\|^2_2$, where $U$ is determined by solving,

$$\max_{U^\top U = I} \operatorname{tr}(U^\top \mathbf{X} H \mathbf{X}^\top U). \tag{5.1}$$

$H$ is the $n \times n$ centering matrix given by $H = I - \frac{1}{n}\mathbf{1}$, where $I$ is an identity matrix and $\mathbf{1}$ is an $n \times n$ matrix of 1s. The constraint ensures the principal components are orthonormal. The Lagrangian for Equation (5.1) is given by $L(U, \Lambda) = \operatorname{tr}(U^\top \mathbf{X} H \mathbf{X}^\top U) + \operatorname{tr}\big((I - U^\top U)\Lambda\big)$, where $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_k)$ contains the Lagrangian constants. Differentiating with respect to $U$ and setting $\frac{\partial L}{\partial U} = 0$ leads to the standard eigen-value problem $\mathbf{X} H \mathbf{X}^\top U = U \Lambda$. The solution to Equation (5.1) is therefore given by the eigen-vectors corresponding to the $k$-largest eigen-values $(\lambda_1, \ldots, \lambda_k)$. The PCA projected data in the $k$-dimensional subspace is given by $Z = [\mathbf{z}_1, \ldots, \mathbf{z}_n] = U^\top \mathbf{X} \in \mathbb{R}^{k \times n}$.
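A minimal NumPy sketch of Equation (5.1): the projection $U$ is recovered from the $k$ leading eigen-vectors of the centered scatter matrix $\mathbf{X}H\mathbf{X}^\top$. The synthetic data and dimensions below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 5, 200, 2
# synthetic data with decreasing variance per dimension (illustrative)
X = rng.normal(size=(d, n)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])[:, None]

# centering matrix H = I - (1/n) * 1, so XH subtracts the sample mean
H = np.eye(n) - np.ones((n, n)) / n

# max tr(U^T X H X^T U) s.t. U^T U = I  ->  top-k eigenvectors of XHX^T
S = X @ H @ X.T                       # d x d scatter matrix
eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
U = eigvecs[:, ::-1][:, :k]           # k leading eigenvectors

Z = U.T @ X                           # projected data, k x n
print(Z.shape)                        # (2, 200)
```

Since `numpy.linalg.eigh` returns eigenvalues in ascending order, the columns are reversed before the leading $k$ are selected.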
The PCA provides a linear projection of the source and target data to a com-
mon subspace. When linear projection does not provide a good basis for projection
(i.e. cross-domain disparity is not reduced), often, Kernel-PCA (KPCA) is applied to
estimate a nonlinear projection of the data. In this case, data is mapped to a high-
dimensional (possibly infinite-dimensional) space defined by Φ(X) = [φ(x1), . . . , φ(xn)].
φ : Rd → H defines a mapping function and H is a RKHS (Reproducing Kernel
Hilbert Space). The dot product between the high-dimensional mapped vectors φ(x)
and φ(y), is estimated by the kernel-trick. The dot product is given by the positive
semi-definite (psd) kernel, k(x,y) = φ(x)⊤φ(y). The kernel k(.) can be viewed as
a similarity measure between x and y. The similarity measure between all pairs
of data points in X, is represented using the kernel gram matrix and is given by,
K = Φ(X)⊤Φ(X) ∈ Rn×n. The high-dimensional mapped data Φ(X) is projected
onto a subspace of eigen-vectors (directions of maximum nonlinear variance in the
RKHS). The leading k eigen-vectors in the RKHS are expressed using the representer theorem as U = Φ(X)A, Kimeldorf and Wahba (1970). The matrices U and Φ(X) are never
estimated in practice since they are high dimensional or even infinite dimensional.
The coefficient matrix A is instead determined and the projected data is obtained
using A. The kernelized version of Equation (5.1), with A as the projection matrix, is obtained by solving,

$$\max_{A^\top A = I} \operatorname{tr}(A^\top K H K^\top A). \tag{5.2}$$

As before, $H = I - \frac{1}{n}\mathbf{1}$ is the $n \times n$ centering matrix. The matrix of coefficients is $A \in \mathbb{R}^{n \times k}$ and the nonlinearly projected data is given by $Z = [\mathbf{z}_1, \ldots, \mathbf{z}_n] = A^\top K \in \mathbb{R}^{k \times n}$.
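Equation (5.2) can be sketched the same way, now in terms of the coefficient matrix $A$: since $K$ and $H$ are symmetric, $A$ is given by the leading eigen-vectors of $KHK$. The Gaussian kernel, its width and the dimensions below are illustrative assumptions.

```python
import numpy as np

def rbf_gram(X, width):
    """Gaussian kernel gram matrix for column-wise data X (d x n)."""
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-sq / width)

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 100))          # d = 3, n = 100 (illustrative)
K = rbf_gram(X, width=2.0)             # n x n, symmetric psd

n, k = K.shape[0], 5
H = np.eye(n) - np.ones((n, n)) / n
# max tr(A^T K H K^T A) s.t. A^T A = I  ->  top-k eigenvectors of KHK
eigvals, eigvecs = np.linalg.eigh(K @ H @ K)
A = eigvecs[:, ::-1][:, :k]            # n x k coefficient matrix

Z = A.T @ K                            # nonlinearly projected data, k x n
print(Z.shape)                         # (5, 100)
```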
Implementing a nonlinear projection of data onto a common subspace may not
necessarily account for domain disparity. In such a situation, domain alignment is
achieved using Maximum Mean Discrepancy (MMD) Gretton et al. (2009), which is
a standard nonparametric measure to estimate domain disparity. To ensure the joint
distributions of the source and target are aligned, the Joint Distribution Adaptation
(JDA) algorithm of Long et al. (2013), which seeks to align both the marginal and
conditional probability distributions of the projected data, is adopted. The marginal
distributions are aligned by estimating the coefficient matrix A, which minimizes:
$$\min_{A} \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} A^\top \mathbf{k}_i - \frac{1}{n_t} \sum_{j=n_s+1}^{n} A^\top \mathbf{k}_j \right\|^2_{\mathcal{H}} = \operatorname{tr}(A^\top K M_0 K^\top A). \tag{5.3}$$

$M_0$ is the MMD matrix, which is given by,

$$(M_0)_{ij} = \begin{cases} \frac{1}{n_s n_s}, & \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_s \\ \frac{1}{n_t n_t}, & \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}_t \\ \frac{-1}{n_s n_t}, & \text{otherwise.} \end{cases} \tag{5.4}$$
Likewise, the conditional distribution difference can also be minimized by introducing matrices $M_c$, with $c = 1, \ldots, C$, defined as,

$$(M_c)_{ij} = \begin{cases} \frac{1}{n^{(c)}_s n^{(c)}_s}, & \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}^{(c)}_s \\ \frac{1}{n^{(c)}_t n^{(c)}_t}, & \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}^{(c)}_t \\ \frac{-1}{n^{(c)}_s n^{(c)}_t}, & \mathbf{x}_i \in \mathcal{D}^{(c)}_s, \mathbf{x}_j \in \mathcal{D}^{(c)}_t \;\text{ or }\; \mathbf{x}_j \in \mathcal{D}^{(c)}_s, \mathbf{x}_i \in \mathcal{D}^{(c)}_t \\ 0, & \text{otherwise.} \end{cases} \tag{5.5}$$

Here, $\mathcal{D}_s$ and $\mathcal{D}_t$ are the subsets of source and target data points respectively. $\mathcal{D}^{(c)}_s$ is the subset of source data points whose class label is $c$ and $n^{(c)}_s = |\mathcal{D}^{(c)}_s|$. Similarly, $\mathcal{D}^{(c)}_t$ is the subset of target data points whose class label is $c$ and $n^{(c)}_t = |\mathcal{D}^{(c)}_t|$. For the target data, since the labels are not known, the predicted target labels are used to determine $\mathcal{D}^{(c)}_t$. The target data labels are initialized using a classifier trained on the source data and refined over iterations. Incorporating both the conditional and marginal distribution alignments, the JDA model can be written as,

$$\min_{A} \sum_{c=0}^{C} \operatorname{tr}(A^\top K M_c K^\top A). \tag{5.6}$$
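A sketch of how the MMD matrices of Equations (5.4) and (5.5) might be assembled in NumPy; the toy sizes, source labels and predicted target labels are assumptions made for illustration. A useful sanity check is that the entries of each MMD matrix sum to zero.

```python
import numpy as np

def mmd_matrix(src_mask, tgt_mask):
    """MMD matrix of Equations (5.4)/(5.5) for the points selected by the
    boolean masks (all points for M_0, the class-c subsets for M_c)."""
    n = src_mask.size
    ns, nt = src_mask.sum(), tgt_mask.sum()
    M = np.zeros((n, n))
    if ns == 0 or nt == 0:              # class absent in a domain: all zeros
        return M
    M[np.ix_(src_mask, src_mask)] = 1.0 / (ns * ns)
    M[np.ix_(tgt_mask, tgt_mask)] = 1.0 / (nt * nt)
    M[np.ix_(src_mask, tgt_mask)] = -1.0 / (ns * nt)
    M[np.ix_(tgt_mask, src_mask)] = -1.0 / (ns * nt)
    return M

ns, nt, C = 6, 4, 2                     # toy sizes (assumed)
n = ns + nt
is_src = np.arange(n) < ns
ys = np.array([0, 0, 0, 1, 1, 1])       # source labels (assumed)
yt_hat = np.array([0, 1, 1, 0])         # predicted target labels (assumed)
labels = np.concatenate([ys, yt_hat])

M0 = mmd_matrix(is_src, ~is_src)        # marginal term, Equation (5.4)
Mc = [mmd_matrix(is_src & (labels == c), (~is_src) & (labels == c))
      for c in range(C)]                # conditional terms, Equation (5.5)

# entries of each MMD matrix sum to zero
print(np.isclose(M0.sum(), 0.0) and all(np.isclose(M.sum(), 0.0) for M in Mc))
```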
5.2.2 Similarity Based Embedding
Along with domain alignment, the NET algorithm seeks to project the data in a
classification friendly manner (easily classifiable). Laplacian eigenmaps are used in
order to cluster data points on the basis of class label similarity. For this purpose
an (n × n) adjacency matrix W is initialized. The entries in the adjacency matrix
capture the similarity between pairs of data points. These entries are used to weight
the distances between pairs of data points. W is defined as,
$$W_{ij} := \begin{cases} 1, & y^s_i = y^s_j \;\text{ or }\; i = j \\ 0, & y^s_i \neq y^s_j \;\text{ or labels unknown.} \end{cases} \tag{5.7}$$
The clustering is implemented by minimizing the sum of squared distances weighted
by the adjacency matrix. This is expressed as a minimization problem,
$$\min_{Z} \frac{1}{2} \sum_{ij} \left\| \frac{\mathbf{z}_i}{\sqrt{d_i}} - \frac{\mathbf{z}_j}{\sqrt{d_j}} \right\|^2 W_{ij} = \min_{A} \operatorname{tr}(A^\top K L K^\top A). \tag{5.8}$$

Here, $d_i = \sum_k W_{ik}$ and $d_j = \sum_k W_{jk}$; they constitute the entries of the $(n \times n)$ diagonal matrix $D$. $\|\mathbf{z}_i/\sqrt{d_i} - \mathbf{z}_j/\sqrt{d_j}\|^2$ is the squared normalized distance between the
projected data points zi and zj, which get clustered together when Wij = 1, (as they
belong to the same category). Since the target labels are unknown, distances involving
target data points are not weighted. The proposition to use predicted target labels
refined over multiple iterations did not provide satisfactory results. The normalized
distance provides a more robust clustering measure when compared to the standard
Euclidean distance $\|\mathbf{z}_i - \mathbf{z}_j\|^2$, Chung (1997). By substituting $Z = A^\top K$, Equation (5.8) can be expressed in terms of $A$, where $L$ denotes the symmetric positive semi-definite graph Laplacian matrix, $L := I - D^{-1/2} W D^{-1/2}$, with $I$ being an identity matrix.
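The adjacency matrix of Equation (5.7) and the normalized graph Laplacian $L = I - D^{-1/2}WD^{-1/2}$ can be sketched as follows (toy labels assumed); by construction $L$ is symmetric positive semi-definite and annihilates the vector $\sqrt{d}$.

```python
import numpy as np

ys = np.array([0, 0, 1, 1, 2])      # source labels (assumed)
nt = 3                              # unlabeled target points
n = ys.size + nt

# adjacency of Equation (5.7): 1 for same-label source pairs and i = j
W = np.eye(n)
W[:ys.size, :ys.size] = (ys[:, None] == ys[None, :]).astype(float)

d = W.sum(axis=1)                   # degrees
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized graph Laplacian

# L is symmetric, psd, and annihilates the vector sqrt(d)
print(np.allclose(L, L.T), np.allclose(L @ np.sqrt(d), 0.0))
```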
5.2.3 Optimization Problem
The different components, namely, nonlinear projection in Equation (5.2), joint
distribution alignment in Equation (5.6) and similarity based embedding in Equation
(5.8) are now included to design the domain adaptation model. Maximizing Equation
(5.2) and minimizing Equations (5.6) and (5.8) is equivalent to maintaining Equation
(5.2) constant and minimizing Equations (5.6) and (5.8), according to the generalized
Rayleigh quotient. However, minimizing the similarity embedding in Equation (5.8)
can result in a trivial solution with the projected vectors being embedded in a low
dimensional subspace. A new constraint is introduced in place of A⊤KHK⊤A = I,
in order to enforce subspace dimensionality. The NET is introduced as an optimiza-
tion problem obtained by minimizing Equations (5.6) and (5.8). The goal is still to
determine the (n × k) projection matrix, A. By including Frobenius norm based
regularization for a smooth solution, along with the dimensionality constraint, the
NET optimization problem is given by,
$$\min_{A^\top K D K^\top A = I} \; \alpha \operatorname{tr}\Big(A^\top K \sum_{c=0}^{C} M_c K^\top A\Big) + \beta \operatorname{tr}(A^\top K L K^\top A) + \gamma \|A\|^2_F. \tag{5.9}$$
The first term controls the domain alignment and its importance is given by α. The
second term weighted by β captures similarity based embedding. The third term
is the standard regularization (Frobenius norm) that ensures a smooth projection
matrix A and its importance is denoted by γ. As mentioned earlier, the constraint
on A (in place of A⊤KHK⊤A = I), prevents a trivial solution where the projection
could collapse onto a subspace with dimensionality less than k, Belkin and Niyogi
(2003). In order to solve Equation (5.9) the Lagrangian is introduced,
$$L(A, \Lambda) = \alpha \operatorname{tr}\Big(A^\top K \sum_{c=0}^{C} M_c K^\top A\Big) + \beta \operatorname{tr}(A^\top K L K^\top A) + \gamma \|A\|^2_F + \operatorname{tr}\big((I - A^\top K D K^\top A)\Lambda\big), \tag{5.10}$$

where the Lagrangian constants are captured in the diagonal matrix $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_k)$. With the derivative set to zero, $\frac{\partial L}{\partial A} = 0$, a generalized eigen-value problem is obtained,

$$\Big(\alpha K \sum_{c=0}^{C} M_c K^\top + \beta K L K^\top + \gamma I\Big) A = K D K^\top A \Lambda. \tag{5.11}$$
The coefficient matrix A in Equation (5.9) is given by the k-smallest eigen-vectors of
Equation (5.11). The domain-aligned and embedded data points are then given by
Z = A⊤K. The NET algorithm is outlined in Algorithm 2.
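A sketch of solving the generalized eigen-value problem of Equation (5.11). The matrices below are random symmetric stand-ins (in the NET, the real $K$, $M_c$, $L$ and $D$ come from the preceding equations, and a small ridge is an added assumption to keep the right-hand side positive definite). The problem is reduced to a standard symmetric one via a Cholesky factor; `scipy.linalg.eigh(lhs, rhs)` would do the same in one call.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 4
alpha, beta, gamma = 1.0, 0.01, 1.0

# random symmetric psd stand-ins; in the NET these come from (5.4)-(5.8)
def psd(n):
    G = rng.normal(size=(n, n))
    return G @ G.T / n

K, Msum, L = psd(n), psd(n), psd(n)     # kernel, summed MMD matrices, Laplacian
D = np.diag(rng.uniform(1.0, 2.0, n))   # degree matrix

lhs = alpha * K @ Msum @ K.T + beta * K @ L @ K.T + gamma * np.eye(n)
rhs = K @ D @ K.T + 1e-3 * np.eye(n)    # ridge keeps rhs positive definite

# reduce lhs A = rhs A Lambda to a standard symmetric problem via Cholesky
Lc = np.linalg.cholesky(rhs)
Linv = np.linalg.inv(Lc)
lam, V = np.linalg.eigh(Linv @ lhs @ Linv.T)   # ascending eigenvalues
A = Linv.T @ V[:, :k]                          # k smallest -> columns of A
Z = A.T @ K                                    # projected data, k x n

residual = lhs @ A - rhs @ A @ np.diag(lam[:k])
print(np.allclose(residual, 0.0, atol=1e-8))   # True
```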
5.2.4 Model Selection
In a supervised learning setting, validation data (subset of the training data with
labels) is used to estimate the optimal value of the model parameters. In unsupervised
domain adaptation the target labels are treated as unknown. When estimating the
optimum value for the model parameters, current domain adaptation methods appear
to inherently assume the availability of target labels Long et al. (2013), Long et al.
(2014) for validation. However, in the case of real world applications, when target
labels may not be available, it becomes challenging to select optimal model parameters.
In the case of the NET model, there are 4 parameters (k, α, β, γ), that need to be
pre-determined. Since the source data is available and it contains labels, a subset of
the source data can be used for validation purposes. A technique using Kernel Mean
Matching (KMM) is introduced to sample the source data to create a validation set.
Algorithm 2 Nonlinear Embedding Transform
Input: X, Y_S, constants α, β, regularization γ and projection dimension k. Number of iterations to converge, T.
Output: Projection matrix A, projected data Z.
1: Compute the kernel matrix K for a predefined kernel k(., .)
2: Define the adjacency matrix W (Equation (5.7))
3: Compute D = diag(d_1, . . . , d_n), where d_i = Σ_j W_ij
4: Compute the normalized graph Laplacian L = I − D^(−1/2) W D^(−1/2)
5: Train a classifier with the source data {[x^s_1, . . . , x^s_{n_s}], Y_S}
6: Estimate initial target labels y^t_i for i ∈ {1, . . . , n_t} with the source classifier
7: Construct M_c for c ∈ {0, 1, . . . , C} based on Equations (5.4) and (5.5)
8: for i = 1 to T do
9:   Solve Equation (5.11) and select the k smallest eigen-vectors as columns of A
10:  Estimate Z ← A⊤K
11:  Train a source classifier with the projected data {[z_1, . . . , z_{n_s}], Y_S}
12:  Re-estimate labels y^t_i for i ∈ {1, . . . , n_t} with the source classifier
13:  Re-construct M_c for c ∈ {0, 1, . . . , C} based on Equations (5.4) and (5.5)
14: end for
15: Solve Equation (5.11) and select the k smallest eigen-vectors as columns of A
16: Estimate Z ← A⊤K
17: Train a classifier with the projected data {[z_1, . . . , z_{n_s}], Y_S}

The KMM is based on the MMD and it is used to weight the source data points in order to reduce the domain disparity between the source and target data Fernando et al. (2013); Gong et al. (2013a). The KMM learns a weight for each of the
source data points. The source data points with large weights can be considered to
have a marginal distribution similar to that of the target data. These data points are
then chosen to constitute a subset that can be used for cross-validation purposes. The
weights $w_i$, $i = 1, \ldots, n_s$, are estimated by minimizing,

$$\left\| \frac{1}{n_s} \sum_{i=1}^{n_s} w_i \phi(\mathbf{x}^s_i) - \frac{1}{n_t} \sum_{j=1}^{n_t} \phi(\mathbf{x}^t_j) \right\|^2_{\mathcal{H}}. \tag{5.12}$$
In order to simplify the equation, $\kappa_i := \frac{n_s}{n_t} \sum_{j=1}^{n_t} k(\mathbf{x}^s_i, \mathbf{x}^t_j)$, $i = 1, \ldots, n_s$, and $(K_S)_{ij} = k(\mathbf{x}^s_i, \mathbf{x}^s_j)$ are defined. The minimization can now be represented as a quadratic programming problem,

$$\min_{\mathbf{w}} \; \frac{1}{2} \mathbf{w}^\top K_S \mathbf{w} - \boldsymbol{\kappa}^\top \mathbf{w}, \quad \text{s.t. } w_i \in [0, B], \; \Big| \sum_{i=1}^{n_s} w_i - n_s \Big| \leq n_s \epsilon. \tag{5.13}$$
The first constraint limits the scope of discrepancy between the source and target distributions, with $B \to 1$ leading to an unweighted solution. The second constraint ensures that $w(\mathbf{x}) P_S(\mathbf{x})$ remains a probability distribution Gretton et al. (2009).
In conducting the experiments, 10% of the source data with the largest weights was
chosen to form the validation subset. The optimal values of (α, β, γ, k), were estimated
using the validation set. For fixed values of (α, β, γ, k), the NET model is trained
using the source data (without the validation set) and the target data. The model is
tested on the validation subset since it has labels. A grid search for the parameters is
conducted to estimate the optimal values yielding the highest validation accuracies.
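A rough sketch of the KMM weighting of Equation (5.13). Instead of a full QP solver, the box-constrained objective is minimized here by projected gradient descent and the sum constraint is approximated by rescaling the weights at the end; the toy data, kernel width and step size are illustrative assumptions.

```python
import numpy as np

def kmm_weights(Xs, Xt, width, B=10.0, steps=500, lr=0.01):
    """Approximate KMM weights (Equation (5.13)) via projected gradient
    descent on 0.5*w'Ks w - kappa'w with the box constraint w in [0, B].
    The |sum(w) - ns| <= ns*eps constraint is approximated by rescaling."""
    def gram(A, C):
        sq = (A ** 2).sum(1)[:, None] + (C ** 2).sum(1)[None, :] - 2 * A @ C.T
        return np.exp(-sq / width)
    ns = len(Xs)
    Ks = gram(Xs, Xs)
    kappa = (ns / len(Xt)) * gram(Xs, Xt).sum(axis=1)
    w = np.ones(ns)
    for _ in range(steps):
        w = np.clip(w - lr * (Ks @ w - kappa), 0.0, B)
    return w * ns / w.sum()

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, (100, 2))     # source (assumed toy data)
Xt = rng.normal(1.0, 1.0, (60, 2))      # shifted target
w = kmm_weights(Xs, Xt, width=2.0)

# the top 10% of source points by weight form the validation subset
val_idx = np.argsort(w)[-len(Xs) // 10:]
print(w.shape, np.isclose(w.sum(), 100.0))   # (100,) True
```

Source points whose neighborhood resembles the target distribution receive the largest weights, and the top 10% by weight then serve as the source-based validation subset.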
5.3 Experimental Analysis of the NET Model
In the Experiments section, the NET algorithm is compared with other nonlin-
ear domain adaptation procedures. The NET algorithm is evaluated across various
datasets: MNIST, USPS, CKPlus, MMI, COIL20, PIE and Office-Caltech.
5.3.1 Experimental Setup
The NET model was evaluated extensively with 7 different datasets. The details
of the datasets are outlined in Table (5.1).
Table 5.1: Statistics for the benchmark datasets
Dataset Type #Samples #Features #Classes Subsets
MNIST Digit 2,000 256 10 MNIST
USPS Digit 1,800 256 10 USPS
CKPlus Face Exp. 1,496 4096 6 CKPlus
MMI Face Exp. 1,565 4096 6 MMI
COIL20 Object 1,440 1,024 20 COIL1, COIL2
PIE Face 11,554 1,024 68 P05, ..., P29
Ofc-Cal SURF Object 2,533 800 10 A, C, W, D
Ofc-Cal Deep Object 2,505 4096 10 A, C, W, D
MNIST-USPS datasets: This dataset has already been discussed in the previous
chapter. It is presented here again for the sake of completeness. The digit datasets
MNIST and USPS consist of images of individual digits from 0 to 9. They are
benchmark datasets for handwritten digit recognition. In the following experiments,
a subset of these datasets (2,000 images from MNIST and 1,800 images from USPS)
based on Long et al. (2014) has been used. The images are represented as vectors of
length 256 after resizing the images to 16× 16 pixels. These domains are referred to
as MNIST and USPS respectively.
CKPlus-MMI dataset: This dataset has already been discussed in the previous
chapter. It is presented here again for the sake of completeness. The CKPlus Lucey
et al. (2010) and MMI Pantic et al. (2005) are popular datasets for facial expression
recognition. From these datasets, six categories were selected viz., {anger, disgust,
fear, happy, sad, and surprise}, from video frames with the most intense expression
(peak frames) for every facial expression video sequence. This yields around 1500
images for each dataset with around 250 images per category. These domains are
referred to as CKPlus and MMI. A deep neural network was used to extract feature
vectors from the images. Features extracted using pre-trained convolutional neural
networks (CNN) have shown astonishingly good results across a wide array of ap-
plications Razavian et al. (2014). Therefore, the deep CNN developed by Simonyan
and Zisserman Simonyan and Zisserman (2014) was deployed as an ‘off-the-shelf’ fea-
ture extractor. The outputs of the first fully connected layer from the 16 weight
layer model with dimension 4096 were used as features. These were reduced to 500
dimensions using PCA.
COIL20 dataset: It is an object recognition dataset which consists of 20 categories
with data belonging to two domains, COIL1 and COIL2. The domains consist of images
of objects captured from views that are 5 degrees apart. The images are 32×32 pixels
with gray scale values, Long et al. (2013), that are vectorized to 1024 dimensions.
PIE dataset: The “Pose, Illumination and Expression” (PIE) dataset consists of
face images (32×32 pixels) of 68 individuals. The images were captured with different head-pose, illumination and expression. Along the lines of Long et al. (2013), 5 subsets were selected with differing head-pose to create 5 domains, namely, P05 (C05), P07 (C07), P09 (C09), P27 (C27) and P29 (C29).
Office-Caltech dataset: This dataset has already been discussed in the previous
chapter. It is presented here again for the sake of completeness. This is currently
the most popular benchmark dataset for object recognition in the domain adaptation
computer vision community. The dataset consists of images of everyday objects. It
consists of 4 domains; Amazon, Dslr and Webcam from the Office dataset and Caltech
domain from the Caltech-256 dataset. The Amazon domain has images downloaded
from the www.amazon.com website. The Dslr and Webcam domains have images
captured using a DSLR camera and a webcam respectively. The Caltech domain is
a subset of the Caltech-256 dataset that was created by selecting categories common
with the Office dataset. The Office-Caltech dataset has 10 categories of objects and
a total of 2533 images (data points). Two feature types were used for evaluation with
the Office-Caltech dataset; (i) 800-dimensional SURF features Gong et al. (2012a),
(ii) Deep features. The deep features are extracted using a pre-trained network similar
to the CKPlus-MMI datasets.
5.3.2 Baselines for comparison
The NET algorithm is compared with the following baseline and state-of-the-art
methods. Similar to NET, the TCA, TJM and JDA are all spectral (eigen) methods.
Table 5.2: Baseline methods that are compared with the NET.
Method Reference
SA Subspace Alignment Fernando et al. (2013)
CA Correlation Alignment Sun et al. (2015a)
GFK Geodesic Flow Kernel Gong et al. (2012a)
TCA Transfer Component Analysis Pan et al. (2011)
TJM Transfer Joint Matching Long et al. (2014)
JDA Joint Distribution Adaptation Long et al. (2013)
The TCA, TJM, JDA and NET algorithms all apply MMD to align the source and target datasets; however, the NET in addition uses similarity-based nonlinear embedding to enhance classification.
In a setting similar to Equation (5.11), TCA, TJM and JDA, solve for A. But
unlike NET, they do not incorporate the similarity based embedding term. Also,
α = 1, is fixed for all the three algorithms. Therefore, these models only have 2 free
parameters (γ and k), that need to be pre-determined in contrast to NET, which
has 4 parameters, (α, β, γ, k). Since TCA, TJM and JDA, are all quite similar to
each other and they have the same model parameters, for the sake of brevity model
selection (estimating optimal model parameters) is evaluated using cross validation
for only JDA and NET. The other algorithms, SA, CA and GFK, do not have any
critical free model parameters that need to be pre-determined.
In the experiments, NETv is treated as a special case of the NET, with model
parameters (α, β, γ, k), being determined using a validation set derived from Equation
(5.13). Likewise, JDAv is a special case of JDA, where (γ, k), are determined using a
validation set derived from Equation (5.13). There is no theoretical guarantee of the
effectiveness of this validation procedure. However, its effectiveness can be measured
empirically by comparing it with a procedure that uses target data as a validation
set. The results obtained using target data as a validation subset are represented by
NET in the figures and tables. For the rest of the algorithms (SA, CA, GFK, TCA,
TJM and JDA), the parameter settings described in their respective works were used.
5.3.3 Experimental Details
The same experimental protocol as in Gong et al. (2012a); Long et al. (2014)
is followed to ensure a fair comparison with existing methods. 50 different domain
adaptation experiments were conducted with the previously mentioned datasets. In
each of these unsupervised domain adaptation experiments, there is one source do-
main (data points and labels) and one target domain (data points only). Since Mc
is refined over multiple iterations, 10 iterations were run to converge to the predicted
test/validation labels. For the kernel function k(.), a Gaussian kernel was used with
a standard width equal to the median of the squared distances over the dataset as
described in Gretton et al. (2009). For all the experiments, the projected target data
was tested using a 1-Nearest Neighbor (NN) classifier that was trained using the pro-
jected source data. Since it does not require tuning of cross-validation parameters, a
NN classifier was chosen as in Gong et al. (2012a); Long et al. (2014). The percentage
of correctly classified target data points is reported as the target recognition accuracy.
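The two evaluation ingredients just described, the median heuristic for the Gaussian kernel width and the 1-Nearest Neighbor classifier, can be sketched in a few lines of NumPy (the toy clusters are assumptions for illustration):

```python
import numpy as np

def median_width(X):
    """Median of the squared pairwise distances, used as the Gaussian
    kernel width as described in Gretton et al. (2009)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.median(sq[np.triu_indices_from(sq, k=1)])

def one_nn_predict(Zs, ys, Zt):
    """1-Nearest Neighbor labels for target points Zt given source (Zs, ys)."""
    d = ((Zt[:, None, :] - Zs[None, :, :]) ** 2).sum(-1)
    return ys[d.argmin(axis=1)]

rng = np.random.default_rng(0)
Zs = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
ys = np.array([0] * 20 + [1] * 20)      # toy projected source data (assumed)
Zt = np.array([[0.1, -0.1], [2.1, 1.9]])

width = median_width(Zs)                # bandwidth for the Gaussian kernel
print(width > 0, one_nn_predict(Zs, ys, Zt))   # True [0 1]
```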
Table 5.3: Target recognition accuracies (%) for domain adaptation experiments on the digit and face datasets. {MNIST(M), USPS(U), CKPlus(CK), MMI(MM), COIL1(C1) and COIL2(C2)}. M→U implies M is the source domain and U is the target domain. The best and second best results in every experiment (row) are highlighted in bold and italic respectively. The shaded columns indicate accuracies obtained using model selection.
Average 41.14 47.55 40.65 47.59 26.79 60.63 72.85 74.67 68.73
Table 5.4: Target recognition accuracies (%) for domain adaptation experiments on the Office-Caltech dataset with SURF and Deep features. {Amazon(A), Webcam(W), Dslr(D), Caltech(C)}. A→W implies A is the source and W is the target. The best and second best results in every experiment (row) are highlighted in bold and italic respectively. The shaded columns indicate accuracies obtained using model selection.
Average 84.55 85.30 85.44 83.42 87.09 89.12 84.63 90.70 88.71
5.3.4 Parameter Estimation Study
The procedure for model selection is evaluated below. There are 4 parameters
(k, α, β, γ), for the NET algorithm and 2 parameters (k, γ), for the JDA that need to
be pre-determined. To determine these parameters a validation subset is created from
the source data by weighting them using Equation (5.13) and selecting 10% of the
source data points with the largest weights. This validation subset has a distribution
similar to the target and it can be used to validate the optimal values for the model
parameters (α, β, γ, k) since the source data points have labels. A grid search is
conducted in the parameter space of k ∈ {10, 20, . . . , 100, 200} and α, β, γ from the
set {0, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10}. Although a unique set
of parameters can be evaluated for every domain adaptation experiment, for the sake
of brevity, one set of parameters is presented for every dataset. Once the optimum
parameters are evaluated using cross-validation, the NET model is applied on the
entire source data (data and labels) and the target data (data only) to estimate the
projected source and target data points. The target recognition accuracies obtained
are represented as shaded columns JDAv and NETv in Tables (5.3) and (5.4).
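The grid search itself is straightforward to sketch; `validation_accuracy` below is a hypothetical stand-in for training the NET with fixed parameters and scoring it on the source-based validation subset:

```python
import itertools

# hypothetical stand-in for training the NET with fixed (k, alpha, beta,
# gamma) and scoring it on the source-based validation subset
def validation_accuracy(k, alpha, beta, gamma):
    return 1.0 / (1.0 + (k - 20) ** 2 + (alpha - 1) ** 2
                  + (beta - 0.01) ** 2 + (gamma - 1) ** 2)

k_grid = [10, 20, 50, 100, 200]
val_grid = [0, 0.0001, 0.001, 0.01, 0.1, 1, 10]

best = max(itertools.product(k_grid, val_grid, val_grid, val_grid),
           key=lambda p: validation_accuracy(*p))
print(best)   # (20, 1, 0.01, 1)
```

Exhaustively scoring every combination is feasible here because, as in the text, the parameter grids are small; the combination with the highest validation accuracy is retained.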
In order to evaluate the proposed model selection method, the parameters were
also determined using the target data as a validation set. The NET column in Ta-
bles (5.3) and (5.4) depicts these results. The target recognition accuracies for the
NET columns can be considered as the best accuracies for the NET model. The
JDAv and NETv should be compared with the NET column to evaluate the proposed
cross-validation procedure. For the rest of the column values, SA, CA, GFK, TCA,
TJM and JDA, their model parameters were fixed based on their respective papers.
The target recognition accuracies for NETv are higher than those of the other domain adaptation methods and are nearly comparable to the NET. This is an empirical validation for the effectiveness of the cross-validation procedure. Interestingly, in Table
(5.3), the JDAv has better performance than the JDA. This highlights the fact that
a validation procedure helps select the optimum model parameters. Both these re-
sults demonstrate that the proposed model selection procedure is a valid technique
for evaluating an unsupervised domain adaptation algorithm in the absence of target
data labels. Figures (5.2a) to (5.2d) depict the variation of average validation set
accuracies for the model parameters in the NET model. Similarly, Figures (5.2e) and
(5.2f) depict the variation of average validation set accuracies for the model parameters
in the JDA model. The optimal parameters were chosen based on the highest
validation set accuracies for each of the datasets.
5.3.5 NET Algorithm Evaluation
The results of the NET algorithm are depicted under the NET column in Tables
(5.3) and (5.4). The parameters used to obtain these results are depicted in Table
(5.5). NET consistently outperforms non-spectral methods like SA, CA and GFK.
Also, the accuracies obtained with the NET algorithm are better than any of the other
spectral methods (TCA, TJM and JDA). These results confirm that embedding data
based on similarity along with domain alignment improves target data classification.
Table 5.5: Parameters Used for the NET Model

Dataset        α      β      γ      k
MNIST & USPS   1.0    0.01   1.0    20
MMI & CK+      0.01   0.01   1.0    20
COIL           1.0    1.0    1.0    60
PIE            10.0   0.001  0.005  200
Ofc-SURF       1.0    1.0    1.0    20
Ofc-Deep       1.0    1.0    1.0    20
Figure 5.2: NET and JDA cross-validation study. Panels: (a) # bases k, (b) MMD weight α, (c) embedding weight β, (d) regularization γ for the NET cross-validation study; (e) # bases k, (f) regularization γ for the JDA cross-validation study. Each figure depicts the recognition accuracies over the source-based validation set. When studying a parameter (say k), the remaining parameters (α, β, γ) are fixed at their optimum values. The legend is: Digit (Di), Coil (Cl), MMI & CK+ Face (Fc), PIE (Pi), Office-Caltech SURF (O-S) and Office-Caltech Deep (O-D). Images based on Venkateswara et al. (2017a)
5.4 Conclusions and Summary
The average accuracies obtained with JDA and NET using the validation set are
comparable to the best accuracies with JDA and NET. This empirically validates
the model selection proposition. However, there is no theoretical guarantee that the
parameters selected are the best. In the absence of theoretical validation, further
empirical analysis is advised when using the proposed technique for model selection.
This chapter introduced the Nonlinear Embedding Transform algorithm for unsu-
pervised domain adaptation. The NET algorithm implemented domain adaptation by
aligning the joint distributions of the source and the target using MMD. The aligned
distributions were embedded onto a manifold to ensure enhanced classification. The
chapter also introduced a validation procedure to estimate model parameters in the
absence of labeled target data. The experimental analysis demonstrated that the
NET performs favorably when compared with competitive visual domain adaptation
methods across multiple datasets.
Chapter 6
HIERARCHICAL FEATURE SPACES FOR DOMAIN ADAPTATION
Deep learning models extract hierarchical patterns from large amounts of data as
demonstrated in Krizhevsky et al. (2012). These feature representations have been
found to be highly discriminative in nature. Since a deep network has a hierarchical
set of layers with nonlinear functions at multiple levels, it can be viewed as a function
with very high nonlinearity. Being multilayer networks, deep networks are also termed
hierarchical methods.
Deep learning is a relatively new area in computer vision when compared to hand-
crafted (shallow) feature extraction methods like SIFT, HOG or SURF. Transfer
learning based on deep networks is still in its formative years, although there has been
a significant amount of work in the last few years. Deep learning has been applied to
domain adaptation and has shown remarkable results compared to non-deep learning
procedures. Chapter (3) discusses the different approaches to deep learning based
domain adaptation. These include a direct extension of shallow methods to deep
networks, as in Tzeng et al. (2014) and Long et al. (2015) and adversarial methods
like Ganin et al. (2016).
This chapter introduces a novel unsupervised domain adaptation procedure based
on a deep network that determines hash values. The deep network is trained with
labeled source data and unlabeled target data to learn hash values for the inputs.
The source data is trained using a supervised hash based loss and the target data is
trained with an unsupervised hash based entropy loss. The chapter is organized as
follows. Section (6.1) provides an introduction to the proposed hierarchical model for
unsupervised domain adaptation. It outlines the main contributions of the approach
and motivates the reason for a hashing based approach. Section (6.2) describes the
deep learning model, highlighting the different components and their role. This is fol-
lowed by a discussion on the experiments that were conducted to evaluate the model,
in Section (6.3). Two classes of experiments were conducted to evaluate all aspects
of the model: (i) unsupervised domain adaptation experiments and (ii) unsupervised
hashing experiments.
6.1 A Hierarchical Feature Model for Domain Adaptation
Conventional shallow transfer learning methods develop their models in two stages:
feature extraction followed by domain adaptation. The features are fixed and then
a model is trained to align the source and target domains, as in Duan et al. (2009)
and Saenko et al. (2010). On the other hand, deep transfer learning procedures
exploit the feature learning capabilities of deep networks to learn transferable feature
representations for domain adaptation and have demonstrated impressive empirical
performance. This chapter outlines a deep hashing network for unsupervised domain
adaptation. In unsupervised domain adaptation, there are no labels for the target
data. It is therefore difficult to train a deep network with target data in a supervised
manner. The proposed model overcomes this challenge by learning hash values for
the data inputs.
The explosive growth of digital data in the modern era has posed fundamental
challenges regarding their storage, retrieval and computational requirements. Against
this backdrop, hashing has emerged as one of the most popular and effective tech-
niques due to its fast query speed and low memory cost Wang et al. (2014). Hashing
techniques transform high dimensional data into compact binary codes and generate
similar binary codes for similar data items. Motivated by this fact, a deep neural
network is trained to output binary hash codes (instead of probability values), which
can be used for classification. There are two advantages to estimating a hash value
instead of a standard probability vector in the final layer of the network:
1. Hash values enable efficient storage and retrieval of data due to their fast query
speed and low memory costs.
2. During prediction, the hash code of a test sample can be compared against the
hash codes of the training samples to arrive at a more robust prediction.
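The second advantage can be illustrated with a small sketch of prediction by comparing a test sample's hash code against the training codes. The function names are illustrative; codes are assumed to lie in {−1, +1}^d, so the Hamming distance reduces to the dot-product identity used later in Section (6.2.2).

```python
import numpy as np

def hamming_dist(h_query, H_train):
    """dist_H(h_i, h_j) = (d - h_i . h_j) / 2 for codes in {-1, +1}^d."""
    d = H_train.shape[1]
    return (d - H_train @ h_query) / 2

def predict_by_hash(h_query, H_train, y_train, n_neighbors=3):
    """Majority vote over the training codes closest in Hamming distance."""
    dist = hamming_dist(h_query, H_train)
    nearest = np.argsort(dist)[:n_neighbors]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```

Because the vote is taken over several near-duplicate codes rather than a single softmax output, a few corrupted bits in the query code do not change the prediction.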
In this chapter, a novel deep learning framework called Domain Adaptive Hashing
(DAH) is proposed to learn informative hash codes that address the problem of
unsupervised domain adaptation. A unique loss function with the following components
is described to train the deep network:
1. A supervised hash loss for the labeled source data, which ensures that source
samples belonging to the same class have similar hash codes;

2. An unsupervised entropy loss for the unlabeled target data, which encourages each
target sample to align closely with exactly one of the source categories and be
distinct from the other categories;

3. A loss based on multi-kernel Maximum Mean Discrepancy (MK-MMD), which
seeks to learn transferable features within the layers of the network to minimize
the distribution difference between the source and target domains.
Figure (6.1) illustrates the different layers of the DAH and the components of the loss
function.
6.2 Domain Adaptation Through Hashing
In unsupervised domain adaptation, as in the previous chapters, two domains are
considered: source and target. The source consists of labeled data, D_s = {x_i^s, y_i^s}_{i=1}^{n_s},
and the target has only unlabeled data, D_t = {x_i^t}_{i=1}^{n_t}. The data points x_i belong
to X, where X is some input space. The corresponding labels are represented by
y_i ∈ Y := {1, . . . , C}. The paradigm of domain adaptive learning attempts to address
the problem of domain-shift in the data, where the data distributions of the source
and target are different, i.e., P_s(X, Y) ≠ P_t(X, Y) for random variables X ∈ X and
Y ∈ Y. The domain disparity notwithstanding, the goal is to train a deep neural
network classifier ψ(.) that can predict the labels {y_i^t}_{i=1}^{n_t} for the target data.
The neural network is implemented as a deep CNN which consists of 5 convolution
layers conv1 - conv5 and 3 fully connected layers fc6 - fc8, followed by a loss layer.
In this model, a hashing layer hash-fc8 is introduced in place of the standard fully
connected fc8 layer to output a binary code h_i ∈ {−1, +1}^d for every data point x_i.
Two loss functions direct the hash-fc8 layer: (i) a supervised hash loss for
the source data and (ii) an unsupervised entropy loss for the target data. The supervised
hash loss is meant to ensure the hash values are distinct and discriminatory, i.e., if
x_i and x_j belong to the same category, their hash values h_i and h_j are similar, and
different otherwise. The unsupervised entropy loss aligns the target hash values with
source hash values based on the similarity of their feature representations. The output
of the network is represented as ψ(x) ∈ R^d, which is converted to a hash
code h = sgn(ψ(x)), where sgn(.) is the sign function. Once the network has been
trained, the probability of x being assigned a label y is given by f(x) = p(y|h). The
network was trained using D_s and D_t and the target data labels {y_i^t} were predicted
using f(.).
In order to address the issue of domain-shift, the feature representations of the
target and the source need to be aligned. This is achieved by reducing the domain
discrepancy between the source and target feature representations at multiple layers
of the network. In the following subsections, the design of the domain adaptive hash
(DAH) network is discussed in detail.
Figure 6.1: The Domain Adaptive Hash (DAH) network that outputs hash codesfor the source and the target. The network is trained with a batch of source andtarget data. The convolution layers conv1 - conv5 and the fully connected layers fc6and fc7 are fine tuned from the VGG-F network. The MK-MMD loss trains the DAHto learn feature representations which align the source and the target. The hash-fc8layer is trained to output vectors of d dimensions. The supervised hash loss drivesthe DAH to estimate a unique hash value for each object category. The unsupervisedentropy loss aligns the target hash values to their corresponding source categories.Best viewed in color. Image based on Venkateswara et al. (2017b).
6.2.1 Addressing Domain Disparity
Deep learning methods have been very successful in domain adaptation with state-
of-the-art algorithms Ganin et al. (2016); Long et al. (2015, 2016b); Tzeng et al.
(2015a) in recent years. The feature representations transition from generic to task-
specific as one goes up the layers of a deep CNN Yosinski et al. (2014). The convo-
lution layers conv1 to conv5 have been shown to be generic feature extractors and
so, the extracted features are readily transferable, whereas the feature extractors in
the fully connected layers are more task-specific and need to be adapted before they
can be transferred. In the DAH algorithm, the MK-MMD loss is minimized to reduce
the domain difference between the source and target feature representations for fully
connected layers, F = {fc6, fc7, fc8}. Such a loss function has been used in previous
research Long et al. (2015, 2016b). The multi-layer MK-MMD loss is given by,

M(U_s, U_t) = Σ_{l ∈ F} d_k²(U_s^l, U_t^l),   (6.1)

where U_s^l = {u_i^{s,l}}_{i=1}^{n_s} and U_t^l = {u_i^{t,l}}_{i=1}^{n_t} are the sets of output representations for the
source and target data at layer l, and u_i^{*,l} is the output representation of x_i for the
l-th layer. The final layer outputs are denoted as U_s and U_t. The MK-MMD measure
d_k²(.) is the multi-kernel maximum mean discrepancy between the source and target
representations, Gretton et al. (2012). For a nonlinear mapping φ(.) associated with
a reproducing kernel Hilbert space H_k and kernel k(.), where k(x, y) = ⟨φ(x), φ(y)⟩,
the MMD is defined as,

d_k²(U_s^l, U_t^l) = || E[φ(u^{s,l})] − E[φ(u^{t,l})] ||²_{H_k}.   (6.2)
The characteristic kernel k(.) is determined as a convex combination of κ PSD kernels
{k_m}_{m=1}^κ, K := { k : k = Σ_{m=1}^κ β_m k_m, Σ_{m=1}^κ β_m = 1, β_m ≥ 0, ∀m }. Following
Long et al. (2016b), β_m is set to 1/κ, which works well in practice.
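The MK-MMD loss can be sketched in numpy as follows, under assumptions not fixed by the text: Gaussian (RBF) base kernels with hand-picked bandwidths and equal weights β_m = 1/κ, and the simple biased empirical estimate of Equation (6.2) rather than the unbiased estimator of Gretton et al. (2012).

```python
import numpy as np

def multi_kernel(X, Y, bandwidths=(0.5, 1.0, 2.0)):
    """Convex combination of RBF kernels with equal weights beta_m = 1/kappa."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return sum(np.exp(-sq / (2 * bw**2)) for bw in bandwidths) / len(bandwidths)

def mmd2(Us, Ut, bandwidths=(0.5, 1.0, 2.0)):
    """Biased empirical estimate of d_k^2(U_s, U_t) in Equation (6.2)."""
    Kss = multi_kernel(Us, Us, bandwidths).mean()
    Ktt = multi_kernel(Ut, Ut, bandwidths).mean()
    Kst = multi_kernel(Us, Ut, bandwidths).mean()
    return Kss + Ktt - 2 * Kst
```

Summing `mmd2` over the layer outputs of fc6, fc7 and fc8 would give the multi-layer loss M(U_s, U_t) of Equation (6.1).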
6.2.2 Supervised Hash Loss
The Hamming distance for a pair of hash values h_i and h_j has a unique relationship
with the dot product ⟨h_i, h_j⟩, given by dist_H(h_i, h_j) = ½(d − h_i^⊤h_j), where d is the
hash length. The dot product ⟨h_i, h_j⟩ can therefore be treated as a similarity measure for
the hash codes: the larger the dot product (high similarity), the smaller the
distance dist_H, and the smaller the dot product (low similarity), the larger the distance
dist_H. Let s_ij ∈ {0, 1} be the similarity between x_i and x_j. If x_i and x_j belong to
the same category, s_ij = 1, and s_ij = 0 otherwise. The probability of similarity between
x_i and x_j given the corresponding hash values h_i and h_j can be expressed as a
likelihood function, given by,

p(s_ij | h_i, h_j) = σ(h_i^⊤h_j) if s_ij = 1, and 1 − σ(h_i^⊤h_j) if s_ij = 0,   (6.3)

where σ(x) = 1/(1 + e^{−x}) is the sigmoid function. As the dot product ⟨h_i, h_j⟩ increases,
the probability p(s_ij = 1 | h_i, h_j) also increases, i.e., x_i and x_j belong to the same
category. As the dot product decreases, the probability p(s_ij = 1 | h_i, h_j) also decreases,
i.e., x_i and x_j belong to different categories. The (n_s × n_s) similarity matrix
S = {s_ij} is constructed for the source data from the provided labels. Let H = {h_i}_{i=1}^{n_s} be the
set of source data hash values. If the elements of H are assumed to be i.i.d., the
negative log-likelihood of the similarity matrix S given H can be written as,

min_H L(H) = −log p(S|H) = −Σ_{s_ij ∈ S} ( s_ij h_i^⊤h_j − log(1 + exp(h_i^⊤h_j)) ).   (6.4)
By minimizing Equation (6.4), hash values H consistent with the similarity matrix S
can be determined for the source data. This hash loss has been used in
previous research on supervised hashing Li et al. (2016); Zhu et al. (2016). Equation
(6.4) is a discrete optimization problem that is challenging to solve. A relaxation is
introduced on the discrete constraint h_i ∈ {−1, +1}^d by instead solving for u_i ∈ R^d,
where U_s = {u_i}_{i=1}^{n_s} is the output of the network and u_i = ψ(x_i) (the superscript
denoting the domain is dropped for ease of representation). However, the continuous
relaxation gives rise to (i) approximation error, when ⟨h_i, h_j⟩ is substituted with
⟨u_i, u_j⟩ and, (ii) quantization error, when the resulting real codes u_i are binarized
Zhu et al. (2016). The approximation error is accounted for by having tanh(.) as
the final activation layer of the neural network, so that the components of u_i are
bounded between −1 and +1. In addition, a quantization loss ||u_i − sgn(u_i)||²₂ is
introduced along the lines of Gong et al. (2013c), where sgn(.) is the sign function.
The continuous optimization problem for supervised hashing can now be outlined as,

min_{U_s} L(U_s) = −Σ_{s_ij ∈ S} ( s_ij u_i^⊤u_j − log(1 + exp(u_i^⊤u_j)) ) + Σ_{i=1}^{n_s} ||u_i − sgn(u_i)||²₂.   (6.5)
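Equation (6.5) can be computed directly from the network outputs. The sketch below assumes `U` is the (n_s × d) matrix of relaxed source outputs and `S` the binary similarity matrix; log(1 + exp(x)) is evaluated as logaddexp(0, x) for numerical stability, an implementation detail not discussed in the text.

```python
import numpy as np

def supervised_hash_loss(U, S):
    """Relaxed pairwise hash loss of Equation (6.5): negative log-likelihood
    of the similarity matrix S plus the quantization penalty."""
    theta = U @ U.T                                   # u_i . u_j for all pairs
    nll = -np.sum(S * theta - np.logaddexp(0.0, theta))
    quant = np.sum((U - np.sign(U))**2)               # ||u_i - sgn(u_i)||^2
    return nll + quant
```

Outputs that agree with S and sit close to {−1, +1}^d make both terms small, which is exactly what Equation (6.5) rewards.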
6.2.3 Unsupervised Entropy Loss
In the absence of target data labels, the similarity measure ⟨u_i, u_j⟩ is used to
guide the network to learn discriminative hash values for the target data. An ideal
target output u_i^t needs to be similar to many of the source outputs from the j-th
category, {u_k^{s_j}}_{k=1}^K. It is assumed, without loss of generality, that there exist K
source data points for every category j, where j ∈ {1, . . . , C} and u_k^{s_j} is the k-th
source output from category j. In addition, u_i^t must be dissimilar to most other
source outputs u_k^{s_l} belonging to a different category (l ≠ j). Enforcing similarity
with all the K data points makes for a more robust target data category assignment.
A probability measure to capture this intuition is outlined as follows. Let p_ij be the
probability that input target data point x_i is assigned to category j, where

p_ij = Σ_{k=1}^K exp(u_i^{t⊤} u_k^{s_j}) / ( Σ_{l=1}^C Σ_{k=1}^K exp(u_i^{t⊤} u_k^{s_l}) ).   (6.6)
The exp(.) is introduced for ease of differentiability and the denominator ensures
Σ_j p_ij = 1. When the target data point output is similar to one category only and
dissimilar to all the other categories, the probability vector p_i = [p_i1, . . . , p_iC]^⊤ tends
to be a one-hot vector. A one-hot vector can be viewed as a low entropy realization
of p_i. It can therefore be envisaged that all the p_i are one-hot vectors (low entropy
probability vectors), where the target data point outputs are similar to source data
point outputs in one and only one category. To this end, a loss is introduced to capture
the entropy of the target probability vectors. The entropy loss for the network outputs
is given by,

H(U_s, U_t) = −(1/n_t) Σ_{i=1}^{n_t} Σ_{j=1}^C p_ij log(p_ij).   (6.7)
Minimizing the entropy loss yields probability vectors p_i that tend to be one-hot
vectors, i.e., the target data point outputs are similar to source data point outputs from
any one category only. Enforcing similarity with K source data points from a category
guarantees that the hash values are determined based on a common similarity
between multiple source category data points and the target data point.
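Equations (6.6) and (6.7) can be sketched as follows, assuming purely for illustration that the K source outputs of each category are pre-grouped into an array of shape (C, K, d); a small constant is added inside the logarithm to avoid log(0), which the text does not specify.

```python
import numpy as np

def target_probs(Ut, Us_by_class):
    """p_ij of Equation (6.6): softmax over summed similarities to the K
    source outputs of each category. Us_by_class has shape (C, K, d)."""
    # sims[i, j, k] = u_i^t . u_k^{s_j}
    sims = np.einsum('id,jkd->ijk', Ut, Us_by_class)
    num = np.exp(sims).sum(axis=2)                    # sum over the K samples
    return num / num.sum(axis=1, keepdims=True)       # normalize over classes

def entropy_loss(P):
    """H(U_s, U_t) of Equation (6.7): mean Shannon entropy of the p_i."""
    return -np.mean(np.sum(P * np.log(P + 1e-12), axis=1))
```

A target output close to one category's source outputs produces a near one-hot row in `P`, which drives `entropy_loss` toward zero.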
6.2.4 The Domain Adaptive Hash (DAH) Network
A model for deep unsupervised domain adaptation is proposed based on hashing
that incorporates the unsupervised domain adaptation between the source and the
target in Equation (6.1), the supervised hashing for the source in Equation (6.5) and
the unsupervised hashing for the target in Equation (6.7) in a deep convolutional neural
network. The DAH network is trained to minimize

min_U J = L(U_s) + γ M(U_s, U_t) + η H(U_s, U_t),   (6.8)

where U := {U_s ∪ U_t} and (γ, η) control the importance of the domain adaptation term (6.1)
and the target entropy loss (6.7), respectively. The hash values H are obtained from the
output of the network using H = sgn(U). The loss terms in Equation (6.5) and
Equation (6.7) are determined in the final layer of the network with the network
output U. The MK-MMD loss in Equation (6.1) is determined between the layer outputs
{U_s^l, U_t^l} at each of the fully connected layers F = {fc6, fc7, fc8}, where the linear
time estimate for the unbiased MK-MMD was adopted, as described in Gretton et al.
(2012) and Long et al. (2015). The DAH is trained using standard back-propagation.
The detailed derivation of the derivative of Equation (6.8) w.r.t. U is provided in
Appendix B.
6.2.5 Network Architecture
Owing to the paucity of images in a domain adaptation setting, the need to train
a deep CNN with millions of images was circumvented by adapting the pre-trained
VGG-F Chatfield et al. (2014) network to the DAH. The VGG-F was trained on the
ImageNet 2012 dataset and it consists of 5 convolution layers (conv1 - conv5) and 3
fully connected layers (fc6, fc7, fc8). The hashing layer hash-fc8, which outputs
vectors in R^d, was introduced in the place of fc8. To account for the hashing
approximation, a tanh() layer was introduced. However, the issue of vanishing gradients
Hochreiter et al. (2001) was encountered when using tanh(), as it saturates with large
inputs. Therefore, the tanh() is prefaced with a batch normalization layer which
prevents the tanh() from saturating. In effect, the fc8 is replaced by
hash-fc8 := {fc8 → batch-norm → tanh()}.
The hash-fc8 provides greater stability when fine-tuning the learning rates than the
deep hashing networks Li et al. (2016); Zhu et al. (2016). Figure (6.1) illustrates the
proposed DAH network.
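The effect of the batch normalization layer can be illustrated with a small numpy forward pass. This is a sketch, not the MatConvnet implementation: the learnable scale and shift parameters of batch normalization are omitted, and batch statistics are used directly.

```python
import numpy as np

def hash_fc8(X, W, b, eps=1e-5):
    """hash-fc8 := fc8 -> batch-norm -> tanh. Batch normalization keeps the
    pre-activations near zero so the tanh does not saturate."""
    z = X @ W + b                                     # fc8 linear projection
    z = (z - z.mean(0)) / np.sqrt(z.var(0) + eps)     # batch normalization
    return np.tanh(z)                                 # bounded in (-1, +1)
```

Without the normalization step, large pre-activations would push almost every tanh output to ±1, leaving near-zero gradients; the hash code is then h = sgn of this output.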
6.3 Experimental Analysis of the DAH Model
This section discusses the experiments that were conducted to evaluate the DAH
algorithm. Since a domain adaptation technique based on hashing has been pro-
posed, the DAH is evaluated for object recognition accuracies under unsupervised
domain adaptation, and the discriminatory capabilities of the learned hash codes for
unsupervised domain adaptive hashing are also studied.
6.3.1 Experimental Datasets
Office Saenko et al. (2010): This is currently the most popular benchmark dataset
for object recognition in the domain adaptation computer vision community. The
dataset consists of images of everyday objects in an office environment. It has 3
domains: Amazon (A), Dslr (D) and Webcam (W). The Amazon domain has images
downloaded from amazon.com. The Dslr and Webcam domains have images captured
using a DSLR camera and a webcam respectively. The dataset has around 4,100
images with a majority of the images (2,816 images) in the Amazon domain. The
common evaluation protocol of different pairs of transfer tasks for this dataset Long
et al. (2015, 2016b) was adopted. Six transfer tasks were considered, covering all
combinations of source and target pairs of the 3 domains: A→D, D→A, A→W, W→A,
D→W and W→D, where A→D implies A is the source and D is the target.
Office-Home: This is a new dataset that was designed, developed and released
to the research community as part of this dissertation. It consists of 4 domains, Art
(Ar), Clipart (Cl), Product (Pr) and Real-World (Rw), and 12 transfer task pairs
were evaluated in a manner similar to the Office dataset. More details about the
dataset are provided in Chapter (2).
6.3.2 Implementation Details for the DAH
The DAH was implemented using the MatConvnet framework Vedaldi and Lenc
(2015). Since a pre-trained VGG-F was deployed, the weights of layers conv1-conv5,
fc6 and fc7 were fine-tuned. Their learning rates were set to 1/10th the learning rate
of hash-fc8. The learning rate was varied between 10^{−4} and 10^{−5} over 300 epochs with
a momentum of 0.9 and a weight decay of 5 × 10^{−4}. K = 5 was the number of samples
from each category in a batch.

The A-distance is a measure of the distance between two domains that can be viewed
as the discrepancy between the two domains. Although it is difficult to estimate its
exact value, an approximate distance measure is given by 2(1 − 2ε), where ε is the
generalization error for a binary classifier trained to distinguish between the two
domains. A LIBLINEAR SVM, Fan et al. (2008), classifier with 5-fold cross-validation
was applied to estimate ε. Figure (6.2a)
indicates that the DAH features have the least discrepancy between the source and
target compared to DAN and Deep features. This is also confirmed with the t-SNE
embeddings in Figures (6.2b-6.2d). The Deep features show very little overlap be-
tween the domains and the categories depict minimal clustering. Domain overlap and
Figure 6.2: Feature analysis of the fc7 layer. (a) A-distances for Deep, DAN and DAH features for the transfer tasks Ar→Cl, Ar→Pr and Ar→Rw; (b), (c) and (d) t-SNE embeddings of Deep, DAN and DAH features respectively, for 10 categories from the Art (•) and Clipart (+) domains. Best viewed in color. Images based on Venkateswara et al. (2017b).
clustering improves in DAN and DAH features, with DAH providing the best visual-
izations. This corroborates the efficacy of the DAH algorithm to exploit the feature
learning capabilities of deep neural networks to learn representative hash codes and
achieve domain adaptation.
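The proxy A-distance computation can be sketched as follows. For self-containment, a least-squares linear classifier stands in for the LIBLINEAR SVM used in the chapter, so the estimate of ε is only illustrative.

```python
import numpy as np

def proxy_a_distance(Xs, Xt, folds=5, seed=0):
    """Estimate d_A = 2(1 - 2*eps), where eps is the cross-validated error of
    a linear classifier separating source from target features (least-squares
    stand-in for the LIBLINEAR SVM used in the chapter)."""
    X = np.vstack([Xs, Xt])
    y = np.r_[-np.ones(len(Xs)), np.ones(len(Xt))]    # -1 = source, +1 = target
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    errs = []
    for f in range(folds):
        test = order[f::folds]
        train = np.setdiff1d(order, test)
        A = np.c_[X[train], np.ones(len(train))]      # append a bias column
        w, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.sign(np.c_[X[test], np.ones(len(test))] @ w)
        errs.append(np.mean(pred != y[test]))
    eps = np.mean(errs)
    return 2 * (1 - 2 * eps)
```

Well-aligned features make the domains hard to separate, pushing ε toward 0.5 and the proxy A-distance toward 0, which is the trend Figure (6.2a) reports for DAH.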
6.3.4 Unsupervised Domain Adaptive Hashing
This subsection demonstrates the performance of the DAH algorithm to generate
compact and efficient hash codes from the data, for classifying unseen test instances,
when no labels are available. This problem has been addressed in the literature, with
promising empirical results by Carreira-Perpinan and Raziperchikolaei (2015); Do
et al. (2016) and Gong and Lazebnik (2011). However, in a real-world setting, labels
may be available from a different, but related (source) domain; a strategy to utilize the
labeled data from the source domain, to learn representative hash codes for the target
domain, is therefore of immense practical importance. The following scenarios were
considered to address this real-world challenge: (i) No labels are available for a given
dataset and the hash codes need to be learned in a completely unsupervised manner.
DAH was evaluated against baseline unsupervised hashing methods (ITQ) by Gong
and Lazebnik (2011) and (KMeans) by He et al. (2013) and also state-of-the-art
Figure 6.3: Precision-Recall curves @64 bits for the Office-Home dataset: (a) Art, (b) Clipart, (c) Product, (d) Real-World. Comparison of hashing without domain adaptation (NoDA), shallow unsupervised hashing (ITQ, KMeans), state-of-the-art deep unsupervised hashing (BA, BDNN), unsupervised domain adaptive hashing (DAH) and supervised hashing (SuH). Best viewed in color. Images based on Venkateswara et al. (2017b).
methods for unsupervised hashing (BA) by Carreira-Perpinan and Raziperchikolaei
(2015) and (BDNN) by Do et al. (2016). (ii) Labeled data is available from a
different, but related source domain. A hashing model was trained on the labeled
source data and was used to learn hash codes for the target data. This method
is referred to as NoDA, as no domain adaptation is performed. The deep pairwise-
supervised hashing (DPSH) algorithm by Li et al. (2016), was deployed to train a deep
network with the source data and the network was applied to generate hash codes
for the target data. (iii) Labeled data is available from a different, but related source
domain and the DAH formulation is applied to learn hash codes for the target domain,
by reducing domain disparity. (iv) Labeled data is available in the target domain.
This method falls under supervised hashing (SuH) (as it uses labeled data in the
target domain to learn hash codes in the same domain) and denotes the upper bound
on the performance. It was included to compare the performance of unsupervised
hashing algorithms relative to the supervised algorithm. The DPSH algorithm by Li
et al. (2016), was used to train a deep network on the target data and generate hash
codes on a validation subset.
Results and Discussion: Precision-Recall curves and the mean average precision
Figure 6.4: Precision-Recall curves @64 bits for the Office dataset: (a) Amazon, (b) Webcam. Comparison of hashing without domain adaptation (NoDA), shallow unsupervised hashing (ITQ, KMeans), state-of-the-art deep unsupervised hashing (BA, BDNN), unsupervised domain adaptive hashing (DAH) and supervised hashing (SuH). Best viewed in color. Images based on Venkateswara et al. (2017b).
Table 6.3: Mean average precision @64 bits. For the NoDA and DAH results, Art isthe source domain for Clipart, Product and Real-World and Clipart is the sourcedomain for Art. Similarly, Amazon and Webcam are source target pairs.
(mAP) measures were used to evaluate the efficacy of the hashing methods, similar to
previous research in Carreira-Perpinan and Raziperchikolaei (2015); Do et al. (2016)
and Gong and Lazebnik (2011). The results are depicted in Figures (6.3) and (6.4)
(precision-recall curves) and Table (6.3) (mAP values), for hashing with code length
d = 64 bits. For the sake of brevity, the results with Dslr are dropped, as it is very
similar to Webcam, with little domain difference. It can be verified that the NoDA has
the poorest performance due to domain mismatch. This demonstrates the fact that
domain disparity needs to be considered before deploying a hashing network to
extract hash codes. The unsupervised hashing methods ITQ, KMeans, BA and BDNN
perform slightly better compared to NoDA. The proposed DAH algorithm encom-
passes hash code learning and domain adaptation in a single integrated framework.
It is thus able to leverage the labeled data in the source domain in a meaningful
manner in order to learn efficient hash codes for the target domain. This accounts
for its improved performance, as is evident in Figures (6.3) and (6.4) and Table (6.3).
The supervised hashing technique (SuH), uses labels from the target and therefore
depicts the best performance. The proposed DAH framework consistently delivers the
best performance relative to SuH, when compared with the other hashing procedures.
This demonstrates the merit of the DAH framework in learning representative hash
codes by utilizing labeled data from a different domain. Such a framework is bound
to be immensely useful in a real-world setting.
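The mAP evaluation over Hamming rankings can be sketched as below. This is a generic illustration, not the evaluation code used for the tables: relevance is defined as label agreement, and ties in Hamming distance are broken by stable sort order.

```python
import numpy as np

def average_precision(query_h, query_y, H, labels):
    """AP for one query: rank the database by Hamming distance to the query
    and average the precision at every position holding a same-label item."""
    d = H.shape[1]
    dist = (d - H @ query_h) / 2                      # Hamming via dot product
    order = np.argsort(dist, kind='stable')
    rel = (labels[order] == query_y).astype(float)
    if rel.sum() == 0:
        return 0.0
    prec_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float(np.sum(prec_at_k * rel) / rel.sum())

def mean_ap(Q_h, Q_y, H, labels):
    """mAP: mean of the per-query average precisions."""
    return float(np.mean([average_precision(q, y, H, labels)
                          for q, y in zip(Q_h, Q_y)]))
```

Sweeping a cutoff over the ranked list and recording precision and recall at each position yields the precision-recall curves of Figures (6.3) and (6.4).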
6.3.5 Effect of Batch-size for Linear-MMD
The domain alignment between the source and the target is achieved using multi-
kernel maximum mean discrepancy (MK-MMD). The outputs of the fully connected
layers F = {fc6, fc7, fc8} are aligned using a linear MK-MMD loss. The MMD loss
outlined in Section (3.3) is based on Gretton et al. (2007). The MMD gives a measure
of discrepancy between two i.i.d. datasets. However, the MMD measure described
in Gretton et al. (2007) requires all the samples from the source and the target
to estimate the discrepancy and it is also quadratic in complexity. The DAH is
trained using back-propagation, an iterative algorithm where only a batch of data
points are available at any given instant. A batch-wise MMD would be a biased
estimate of the MMD for the entire data and also expensive for back-propagation
(since it is quadratic). The DAH therefore uses an online linear version of the MMD.
The online linear version of the MMD outlined in Gretton et al. (2012), provides an
empirical estimate for the MMD with linear complexity and works in an online setting,
where not all the data samples are available at once. The linear-MMD estimates the
discrepancy between the source and the target using only a batch of data points at a
time unlike the MMD which needs all the data samples from the two domains. The
DAH applies the linear-MMD over every batch of data for the fully connected layers.
An experiment was conducted to study the effect of batch size on domain adaptation
using the linear-MMD. In the DAH the batch size is controlled by the value of K (the
number of source data points for every category). Varying the values of K is bound
to vary the target recognition accuracies since the number of source samples available
for supervised hashing will vary. In order to study the effect of batch size on the
recognition accuracies, K has to remain constant across all batch sizes. However, it is
not possible to vary the batch size without changing K. Since the goal is to study the
effect of batch size on domain alignment, hashing and entropy loss can be replaced
with regular cross-entropy loss. The resulting model is equivalent to the DAN in
Long et al. (2015), where domain alignment is achieved with MK-MMD and the final
layer has a standard cross-entropy loss. Table (6.4) outlines the target recognition
accuracies when different batch sizes are used for this setting. It can be observed
that varying the batch size has no effect on the recognition accuracies when using the
linear-MMD for domain alignment. It can be concluded that the linear-MMD is a
consistent online empirical estimate for MMD.
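To make the batch-wise statistic concrete, the linear-time MMD of Gretton et al. (2012) can be sketched as below. This is a minimal NumPy version with an assumed RBF kernel and illustrative function names; the DAH applies such an estimate inside back-propagation, which is not shown here.

```python
import numpy as np

def rbf(x, y, gamma=0.1):
    # Gaussian RBF kernel between two sample vectors
    d = x - y
    return np.exp(-gamma * d.dot(d))

def linear_mmd2(xs, xt, gamma=0.1):
    """Linear-time, unbiased estimate of MMD^2 (Gretton et al., 2012).

    Consecutive samples are paired up so each kernel evaluation is used
    once, giving O(n) cost, which makes it usable per batch.
    xs, xt: (n, d) arrays of source and target samples.
    """
    n = (min(len(xs), len(xt)) // 2) * 2  # need an even number of samples
    total = 0.0
    for i in range(0, n, 2):
        x1, x2 = xs[i], xs[i + 1]
        y1, y2 = xt[i], xt[i + 1]
        # h-statistic for one pair of sample pairs
        total += rbf(x1, x2, gamma) + rbf(y1, y2, gamma) \
               - rbf(x1, y2, gamma) - rbf(x2, y1, gamma)
    return total / (n // 2)
```

The estimate stays near zero when both batches come from the same distribution and grows with the discrepancy, which is consistent with the batch-size experiment above.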
6.3.6 Classification Experiments with Varying Hash Size
The previous subsections discussed the results for unsupervised domain adapta-
tion based object recognition with d = 64 bits. Here, the classification results with
d = 16 (DAH-16) and d = 128 (DAH-128) bits for the Office-Home dataset are out-
lined in Table (6.5). The (DAH-64), DAN and DANN results are also presented for
Table 6.4: Effect of batch size on domain alignment when using linear MMD. The values represent recognition accuracies (%) for domain adaptation experiments on the Office-Home dataset. {Art (Ar), Clipart (Cl), Product (Pr), Real-World (Rw)}. Ar→Cl implies Ar is source and Cl is target.
Batch Size Ar→Cl Ar→Pr Ar→Rw Cl→Ar Cl→Pr Cl→Rw Pr→Ar Pr→Cl Pr→Rw Rw→Ar Rw→Cl Rw→Pr Avg.
Table 6.5: Recognition accuracies (%) for domain adaptation experiments on the Office-Home dataset. {Art (Ar), Clipart (Cl), Product (Pr), Real-World (Rw)}. Ar→Cl implies Ar is source and Cl is target.
tropy H(X), characterizes the uncertainty about the random variable X. Mu-
tual Information (MI) between two random variables X and Y , is a measure of
information shared between them and is represented as I(X;Y ). It is symmetric
with I(X;Y ) = I(Y ;X). In terms of entropy, mutual information is defined as
I(X;Y ) = H(X)−H(X|Y ), where H(X|Y ) is the conditional entropy. Mutual
information between random variables X and Y can also be understood as the
reduction in entropy of X (or Y ) due to the presence of Y (or X). Conditional
Mutual Information (CMI) denoted as I(X;Y |Z), is the expected mutual infor-
mation of two random variables X and Y , given a third random variable Z. Areas
in entropy-based Venn diagrams do not always correspond to positive quantities.
The two-variable overlaps are non-negative, as in {e}, {d} or {f}, but the
three-variable overlap {g} need not be positive: I(X;Y) − I(X;Y|Z) can be less than 0.
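These identities are easy to verify numerically on small discrete distributions. The sketch below (illustrative helper names; log base 2, so entropies are in bits) computes I(X;Y) from a joint probability table via the equivalent form H(X) + H(Y) − H(X,Y):

```python
import numpy as np

def entropy(p):
    # Shannon entropy (bits) of a discrete distribution given as an array
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                      # 0 log 0 is taken as 0
    return float(-(p * np.log2(p)).sum())

def mutual_information(joint):
    # I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint pmf table P(X=i, Y=j)
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1)            # marginal of X
    py = joint.sum(axis=0)            # marginal of Y
    return entropy(px) + entropy(py) - entropy(joint)
```

For independent variables the joint factorizes and I(X;Y) ≈ 0; for X = Y with P = (0.5, 0.5), I(X;Y) = 1 bit; and the symmetry I(X;Y) = I(Y;X) follows from transposing the table.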
7.2 Experiments
This section outlines some of the experiments conducted to test the feature selec-
tion model. The experiments were conducted using MATLAB on an Intel Core-i7 2.3
GHz processor with 16GB of memory.
7.2.1 Feature Selectors: A Test of Scalability
Greedy feature selectors have time complexities of the order O(dk), which is neg-
ligible compared to the time complexities of the global feature selectors. Table (7.1)
lists the time complexities for the global algorithms. To study time complexities, mul-
tiple experiments (d, k), were conducted, where a CMI matrix Q was simulated by a
random positive symmetric matrix of size [d× d], and k features were selected. The
time complexity for experiment (d, k), is the average time of convergence over 10 runs.
The same set of random matrices was used for each of the algorithms in the exper-
iment. Figure (7.2) depicts the convergence times for the different experiments. The
Linear algorithm is the most efficient, followed closely by the Spectral and TPower
methods. The CVX (Grant and Boyd (2014)) implementation with the SDPT3 solver
(Toh et al. (1999)) was used for all the SDP experiments. The SDP solver has a huge
memory footprint, and with matrix sizes d ≥ 700 the computer ran into ‘Out of
Memory’ errors. For the LowRank method, the following parameters were used for
all the experiments: r = 3, ε = 0.1, δ = 0.1.
Table 7.1: Time complexities of the global approximate solutions for the BQP, in the number of features d.

Linear: $O(dk)$   Spectral: $O(d^2)$   SDP: $O(d^{4.5})$   LowRank*: $O(d^{r+1})$   TPower**: $O(td^2)$

* r is the approximation rank; ** t is the number of iterations
Figure 7.2: Average time in seconds for an algorithm to select k features from data containing d features in experiment (d, k). Image based on Venkateswara et al. (2015a).
7.2.2 BQP Methods: A Test of Approximation
Figure 7.3: The average percentage difference of the BQP objective values compared with the Linear BQP objective value. In experiment (d, k), d is the matrix dimension and k is the number of features selected. Image based on Venkateswara et al. (2015a).
For the next set of experiments, the degree of approximation of the global al-
gorithms was examined. Since the optimal solution of the BQP is unknown, the
methods are evaluated by their relative objective values. The binary feature vector x
was estimated after applying each of the methods and then the objective value x⊤Qx
was evaluated. The percentage difference of every algorithm’s objective value with
the Linear method’s objective was compared. Similar to the experimental evaluation
used for time complexity, random data was generated for each experiment (d, k) and
the values were averaged over 10 random runs. Figure (7.3) presents the results of
the experiment. TPower and LowRank displayed the largest percentage increase
over the Linear method. Since TPower and LowRank approximate the BQP better
than the other methods, they should therefore be better feature selectors than Linear,
Spectral and SDP. TPower is also very efficient in terms of execution time
and would therefore be the ideal feature selector when considering both speed and
accuracy.
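For intuition, a simplified truncated power iteration for the BQP max $x^\top Q x$ under a cardinality-$k$ constraint can be sketched as follows. The helper names and the planted-block test matrix are illustrative, not the dissertation's exact implementation, and the sketch assumes a symmetric, roughly positive $Q$ like the simulated CMI matrices above.

```python
import numpy as np

def tpower_select(Q, k, iters=100, seed=0):
    """Truncated power iteration sketch: repeatedly apply Q, keep only
    the k largest-magnitude entries, renormalize, and return the indices
    of the k selected features."""
    d = Q.shape[0]
    rng = np.random.default_rng(seed)
    x = rng.random(d)
    x /= np.linalg.norm(x)
    for _ in range(iters):
        y = Q @ x
        keep = np.argsort(np.abs(y))[-k:]     # truncate to the top-k entries
        x = np.zeros(d)
        x[keep] = y[keep]
        nrm = np.linalg.norm(x)
        if nrm == 0:
            break
        x /= nrm
    return np.sort(np.flatnonzero(x))
```

On a matrix with a planted high-value block, the iteration quickly locks onto the block's indices, which is the behavior that makes TPower a strong and cheap approximate BQP solver.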
7.2.3 Feature Selectors: A Test of Classification Error
In these experiments, the TPower and the LowRank are compared with other
algorithms in terms of classification accuracies. Thirteen publicly available datasets
were chosen that are widely used to study mutual information based feature selection,
as in Rodriguez-Lujan et al. (2010); Brown et al. (2012); Nguyen et al. (2014); Peng
et al. (2005). The details of the datasets are captured in Table (7.2). The feature
selection was performed for a set of k values, and classification performance was tested
across all values of k. Starting at k = 10, the number of selected features was
incremented in steps of 1 up to d or 100, whichever was smaller. The classifier
performance was evaluated using Leave-One-Out cross validation (if n ≤ 100) or
10-fold cross validation, and the cross validation errors (%) for each fold were obtained.
Since the average error across all values of k is not a good measure of classifier
performance, the paired t-test was applied across the cross validation folds, as also
done in Rodriguez-Lujan et al. (2010); Nguyen et al. (2014); Herman et al.
(2013). For a fixed dataset
and a fixed value of k, to compare TPower with, say, MaxRel, the one-sided paired
t-test was applied at 5% significance over the error (%) of the cross validation folds
for the two algorithms. The performance of TPower vs. MaxRel was set to win =
w, tie = t or loss = l, based on the largest number of t-test decisions across all
the k values. Along the lines of earlier studies in feature selection, a linear SVM
classifier was used. To estimate the conditional mutual information, the features were
discretized. The role of discretization is not unduly critical as long as it is consistent
across all the experiments. The features were discretized using the Class Attribute
Interdependence Maximization (CAIM) algorithm developed by Kurgan and Cios
(2004). Feature selection was performed on discretized data but the classification
(after feature selection) was performed on the original feature space.
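The win/tie/loss bookkeeping for one dataset and one value of k can be sketched as below, using SciPy's paired t-test (the `alternative` keyword requires a reasonably recent SciPy; the function and variable names are illustrative):

```python
import numpy as np
from scipy import stats

def compare_selectors(err_a, err_b, alpha=0.05):
    """One-sided paired t-test at significance alpha over per-fold cross
    validation errors (%). Returns 'win' if selector A's error is
    significantly lower than B's, 'loss' if significantly higher, and
    'tie' otherwise."""
    err_a, err_b = np.asarray(err_a), np.asarray(err_b)
    if stats.ttest_rel(err_a, err_b, alternative='less').pvalue < alpha:
        return 'win'
    if stats.ttest_rel(err_a, err_b, alternative='greater').pvalue < alpha:
        return 'loss'
    return 'tie'
```

The pairing matters: errors from the same fold are correlated across algorithms, and the paired test removes this shared fold-to-fold variation before assessing significance.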
Table 7.2: Dataset details: d is the number of features, n is the number of samples, c is the number of categories, Error is the average cross validation error (%) using all features.
Data d n c Error Ref.
Arrhythmia 258 420 2 31.1 Lichman (2013)
Colon 2000 62 2 37.0 Ding and Peng (2005)
Gisette 4995 6000 2 2.5 Lichman (2013)
Leukemia 7070 72 2 26.4 Ding and Peng (2005)
Lung 325 73 7 9.6 Ding and Peng (2005)
Lymphoma 4026 96 9 81.3 Ding and Peng (2005)
Madelon 500 2000 2 45.5 Lichman (2013)
Multi-Feat 649 2000 10 1.5 Lichman (2013)
Musk2 166 6598 2 4.6 Lichman (2013)
OptDigits 62 3823 10 3.3 Lichman (2013)
Promoter 57 106 2 26.0 Lichman (2013)
Spambase 57 4601 2 7.5 Lichman (2013)
Waveform 21 5000 3 13.1 Lichman (2013)
Using the above procedure, the performance of TPower and LowRank were com-
pared with all the other algorithms. Tables (7.3) and (7.4) display the results of the
experiment. The values in Table (7.3) (likewise Table (7.4)) correspond to the differ-
ence in the average of classification error(%) between TPower (likewise LowRank)
and all the other algorithms. From the results in these tables, it can be gathered that
TPower and LowRank outperform most of the other algorithms across all the datasets.
When compared against each other, LowRank outperforms TPower. For the sake
of brevity, the comparison between other pairs of algorithms has not been displayed.
The win/tie/loss numbers by themselves do not provide a complete picture of the com-
parison; the difference in the average error also needs to be taken into account to
assess the performance. A large percentage of negative values in the columns, and their
magnitudes, indicate the low classification error values for TPower and LowRank.
Figure (7.4) displays the average classification error (%) trends for varying values of k
for 3 datasets. Figures (7.4a) and (7.4d), for the Colon dataset, suggest that the ad-
dition of more features does not necessarily reduce classification error. Classification
error trends also help in validating the best value of k for a dataset. For a given
dataset, the error trends of the global and greedy procedures follow a similar
pattern. This perhaps indicates that nearly similar features are being selected by
both types of methods. It should be noted that for huge datasets with large values
of d, greedy methods may not be a bad choice for feature selection.
Table 7.3: Comparison of TPower with other algorithms. The table values measure the difference in average classification accuracies of TPower with other algorithms. w, t and l indicate one-sided paired t-test results. The last row displays the total number of Wins (W), Ties (T) and Losses (L). N/A indicates comparison data was unavailable for large datasets using SDP.
Data MaxRel MRMR JMI QPFS Spectral SDP LowRank
Arrhythmia -0.37 ± 1.4 t 0.32 ± 1.0 l 0.02 ± 1.0 l 0.20 ± 1.8 l -0.08 ± 1.1 t -0.18 ± 1.0 w -0.05 ± 0.8 l
Colon -7.28 ± 4.6 w -4.42 ± 4.2 w -2.47 ± 3.8 w -6.70 ± 4.6 w -0.60 ± 2.8 w N/A 4.03 ± 4.5 l
Gisette -1.32 ± 0.6 w 0.00 ± 0.7 w -1.12 ± 0.6 w -1.38 ± 0.7 w -1.26 ± 0.6 w N/A 0.33 ± 0.6 l
Leukemia 0.11 ± 1.4 w 1.40 ± 1.6 l 1.59 ± 1.8 l 0.41 ± 1.1 t -0.03 ± 0.6 w N/A 1.49 ± 1.3 l
Lung -9.43 ± 4.1 w -2.52 ± 4.2 w -3.83 ± 4.2 w 0.60 ± 2.8 l -0.88 ± 2.2 w -0.88 ± 2.1 w -1.59 ± 2.4 w
Lymphoma -2.76 ± 4.8 w 3.35 ± 4.7 l 2.93 ± 5.3 l 4.99 ± 3.3 l -1.86 ± 2.5 w N/A 3.29 ± 4.2 l
Madelon 0.32 ± 0.5 l 0.80 ± 0.9 l 0.01 ± 0.4 w -0.22 ± 0.7 w -0.01 ± 0.6 w 0.15 ± 0.6 l -0.11 ± 0.4 t
MultiFeatures 0.02 ± 0.3 w 0.24 ± 0.3 l 0.17 ± 0.3 l -0.42 ± 0.3 w 0.10 ± 0.3 w 0.11 ± 0.3 w 0.01 ± 0.3 l
Musk2 -0.45 ± 0.6 w -0.22 ± 0.7 w -0.18 ± 0.5 w -0.31 ± 0.6 w 0.06 ± 0.4 w 0.03 ± 0.5 w 0.05 ± 0.4 w
OptDigits -0.19 ± 0.5 w -0.01 ± 0.6 t 0.16 ± 0.6 l -0.65 ± 1.0 w 0.03 ± 0.3 l -2.53 ± 13.0 w 0.08 ± 0.4 l
Promoter 0.73 ± 3.0 l -0.04 ± 3.2 w 0.19 ± 3.0 l -1.29 ± 3.8 w -0.48 ± 2.8 w -0.56 ± 2.9 w -0.17 ± 3.1 w
Spambase -0.34 ± 0.3 w 0.06 ± 0.2 l -0.23 ± 0.3 w 0.03 ± 0.4 l -0.09 ± 0.3 w -0.10 ± 0.3 w 0.02 ± 0.1 l
Waveform -0.13 ± 0.3 w 0.06 ± 0.3 l -0.01 ± 0.0 t 0.04 ± 0.2 t 0.04 ± 0.1 t 0.00 ± 0.2 t -0.01 ± 0.2 t
Table 7.4: Comparison of LowRank with other algorithms. Table structure is similar to Table 7.3.
Data MaxRel MRMR JMI QPFS Spectral SDP TPower
Arrhythmia -0.32 ± 1.3 l 0.36 ± 1.0 l 0.07 ± 1.0 l 0.25 ± 1.7 l -0.03 ± 1.1 t -0.13 ± 1.0 w 0.05 ± 0.8 w
Colon -11.31 ± 4.7 w -8.45 ± 4.3 w -6.50 ± 3.9 w -10.73 ± 5.3 w -4.63 ± 5.0 w N/A -4.03 ± 4.5 w
Gisette -1.65 ± 0.5 w -0.32 ± 0.5 w -1.44 ± 0.7 w -1.70 ± 0.6 w -1.58 ± 0.6 w N/A -0.33 ± 0.6 w
Leukemia -1.39 ± 1.4 w -0.09 ± 1.5 t 0.09 ± 1.7 t -1.09 ± 1.3 w -1.52 ± 1.2 w N/A -1.49 ± 1.3 w
Lung -7.83 ± 4.1 w -0.92 ± 4.5 w -2.23 ± 4.5 w 2.19 ± 3.7 l 0.71 ± 2.5 l 0.71 ± 2.3 l 1.59 ± 2.4 l
Lymphoma -6.06 ± 3.5 w 0.06 ± 2.1 l -0.36 ± 2.1 l 1.70 ± 2.5 l -5.15 ± 3.8 w N/A -3.29 ± 4.2 w
Madelon 0.43 ± 0.5 l 0.91 ± 0.8 l 0.12 ± 0.5 l -0.11 ± 0.7 w 0.10 ± 0.6 l 0.26 ± 0.5 l 0.11 ± 0.4 t
MultiFeatures 0.01 ± 0.4 l 0.23 ± 0.3 l 0.16 ± 0.4 l -0.43 ± 0.3 w 0.10 ± 0.4 l 0.10 ± 0.4 l -0.01 ± 0.3 w
Musk2 -0.50 ± 0.4 w -0.27 ± 0.7 w -0.23 ± 0.5 w -0.36 ± 0.5 w 0.02 ± 0.3 l -0.02 ± 0.4 l -0.05 ± 0.4 l
OptDigits -0.26 ± 0.6 w -0.09 ± 0.4 w 0.08 ± 0.4 l -0.72 ± 1.2 w -0.04 ± 0.4 t -2.61 ± 13.1 w -0.08 ± 0.4 w
Promoter 0.90 ± 3.1 l 0.13 ± 2.7 w 0.35 ± 2.9 w -1.13 ± 3.7 w -0.31 ± 2.8 t -0.40 ± 2.5 t 0.17 ± 3.1 l
Spambase -0.36 ± 0.3 w 0.04 ± 0.2 l -0.24 ± 0.3 w 0.02 ± 0.4 t -0.11 ± 0.3 w -0.12 ± 0.3 w -0.02 ± 0.1 w
Waveform -0.12 ± 0.4 w 0.07 ± 0.1 l 0.00 ± 0.2 t 0.05 ± 0.1 t 0.05 ± 0.1 t 0.02 ± 0.1 t 0.01 ± 0.2 t
#W/T/L: 9/0/4 6/1/6 6/2/5 8/2/3 5/4/4 3/2/4 8/2/3
Figure 7.4: Average cross validation error (%) vs. number of features. First row: comparison of greedy methods (Max-Rel, MRMR, JMI) with TPower and LowRank on the (a) Colon, (b) Gisette and (c) Musk2 datasets. Second row: comparison of global methods (QPFS, Spectral, SDP) with TPower and LowRank on the (d) Colon, (e) Gisette and (f) Musk2 datasets. Images based on Venkateswara et al. (2015a).
7.3 Nonlinear Feature Selection for Domain Adaptation
The following subsections outline a domain adaptation model that reduces cross
domain disparity by selecting samples (instance selection) and also by selecting fea-
tures (feature selection). The hypothesis is that performing both instance and feature
selection can align the source and target domains. As in standard unsupervised do-
main adaptation, two domains are considered: a source and a target. The source consists
of labeled data, $\mathcal{D}_s = \{\mathbf{x}^s_i, y^s_i\}_{i=1}^{n_s}$, and the target has only unlabeled data, $\mathcal{D}_t = \{\mathbf{x}^t_i\}_{i=1}^{n_t}$.
The data points $\mathbf{x}^*_i$ belong to $\mathbb{R}^d$, where $d$ is the dimension of the feature vectors. The cor-
responding labels are represented by $y^*_i \in \mathcal{Y} := \{1, \ldots, c\}$. The main idea behind
feature selection based domain adaptation is to select $k < d$ features that reduce the
domain disparity between $\mathcal{D}_s$ and $\mathcal{D}_t$.
7.3.1 Instance Selection
The Maximum Mean Discrepancy (MMD) measure, proposed by Borgwardt
et al. (2006), is used to determine whether two distributions are similar.
The MMD is a non-parametric criterion that measures the distances between distri-
butions in terms of distances between their means in a Reproducing Kernel Hilbert
Space (RKHS). It has been discussed in detail in previous chapters. Researchers
have adapted the MMD measure to perform instance selection for domain adaptation
(Long et al. (2014)). Here, data points are sampled from the source domain such that
their distribution is similar to the target distribution. A classifier trained on these
sampled source data points can then be used for target data prediction. The MMD
for instance selection of source data points is formulated below,
$$\min_{\beta}\;\left\|\frac{1}{n_s}\sum_{i}^{n_s}\beta_i\,\phi(\mathbf{x}^s_i)\;-\;\frac{1}{n_t}\sum_{j}^{n_t}\phi(\mathbf{x}^t_j)\right\|^2_{\mathcal{H}}
= \min_{\beta}\;\frac{1}{n_s^2}\,\beta^\top K^s\beta \;-\; \frac{2}{n_s n_t}\,\beta^\top K^{st}\mathbf{1} \;+\; \frac{1}{n_t^2}\,\mathbf{1}^\top K^t\mathbf{1}$$

$$\text{s.t.}\quad \beta \in [0, B]^{n_s} \quad\text{and}\quad \left|\frac{1}{n_s}\sum_{i}^{n_s}\beta_i - 1\right| \le \epsilon \qquad (7.12)$$
In Equation (7.12), the weights $\beta$ indicate the importance of the source data points.
The kernel $K^s$ is the $n_s \times n_s$ source kernel matrix, $K^t$ is the $n_t \times n_t$ target
kernel matrix and $K^{st}$ is the $n_s \times n_t$ source-target kernel matrix. The kernel entries
correspond to a nonlinear mapping where $k(\mathbf{x}_i, \mathbf{x}_j) = \langle\phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\rangle$. $B$ bounds
the scope of the discrepancy between the source and target distributions; $B \to 1$
gives an unweighted solution. The second constraint ensures that $\beta_i\,p(\mathbf{x}^s_i)$ is still a
probability distribution. The indices of $\beta$ with the largest values are the indices of
the sampled data points from the source. A nonlinear classifier trained with these
source instances can be used to classify the target data.
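A small numerical sketch of this instance-selection step is given below. Instead of a full quadratic programming solver, it uses projected gradient descent on the objective of Equation (7.12), renormalizing β to mean 1 (i.e., taking ε = 0); the RBF kernel choice and all names are illustrative:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # Pairwise Gaussian RBF kernel matrix between the rows of X and Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kmm_weights(Ks, Kst, B=5.0, lr=1.0, iters=300):
    """Approximate the beta of Eq. (7.12) by projected gradient descent.

    Ks: (ns, ns) source kernel, Kst: (ns, nt) source-target kernel.
    beta is clipped to [0, B] and rescaled to mean 1 after each step."""
    ns, nt = Kst.shape
    beta = np.ones(ns)
    kst_sum = Kst @ np.ones(nt)           # constant part of the gradient
    for _ in range(iters):
        grad = (2.0 / ns**2) * (Ks @ beta) - (2.0 / (ns * nt)) * kst_sum
        beta = np.clip(beta - lr * grad, 0.0, B)
        beta *= ns / beta.sum()           # keep (1/ns) * sum(beta) = 1
        beta = np.clip(beta, 0.0, B)
    return beta
```

Source points that resemble the target receive larger weights, so a classifier trained with these weights emphasizes target-like source samples.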
7.3.2 Nonlinear Feature Selection
In Equation (7.12), instance sampling is performed by transforming the data
points into a high dimensional (possibly infinite dimensional) space. In addition,
feature selection must also be performed in a high dimensional space. Mutual infor-
mation based feature selection discussed in the previous sections is infeasible when
confronted with high dimensional (infinite dimensional) spaces. To perform feature
selection in infinite dimensional spaces, the data can be visualized in terms of indi-
vidual features. The source data can also be represented as $X^s = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_d]^\top$,
where $\mathbf{u}_p \in \mathbb{R}^{n_s}$ for $p \in \{1, \ldots, d\}$. In order to do feature selection, the source kernel
$K^s$ is redefined as the sum over individual feature kernels, $K^s = K^s_1 + K^s_2 + \cdots + K^s_d$,
where $K^s_p \in \mathbb{R}^{n_s \times n_s}$ is the kernel using just the $p$th feature of $X^s$. Feature selection
is implemented as selection over feature kernels, as outlined in Venkateswara et al.
(2013). The elements $[K^s_p]_{i,j} = k(u_{pi}, u_{pj})$, where $u_{pl}$ is the $l$th element of $\mathbf{u}_p$. Similarly,
$K^{st} = K^{st}_1 + K^{st}_2 + \cdots + K^{st}_d$ and $K^t = K^t_1 + K^t_2 + \cdots + K^t_d$. Nonlinear feature selection is
implemented by selecting the most important features, assigning weights to the
respective feature kernels $K^s_p$, $K^{st}_p$ and $K^t_p$. The optimization problem is outlined
below,
$$\min_{\beta,\gamma}\;\frac{1}{n_s^2}\,\beta^\top\!\Big(\sum_{p=1}^{d}\gamma_p K^s_p\Big)\beta \;-\; \frac{2}{n_s n_t}\,\beta^\top\!\Big(\sum_{p=1}^{d}\gamma_p K^{st}_p\Big)\mathbf{1} \;+\; \frac{1}{n_t^2}\,\mathbf{1}^\top\!\Big(\sum_{p=1}^{d}\gamma_p K^t_p\Big)\mathbf{1}$$

$$\text{s.t.}\quad \beta \in [0, B]^{n_s},\quad \left|\frac{1}{n_s}\sum_{i}^{n_s}\beta_i - 1\right| \le \epsilon,\quad \gamma \in [0, 1]^d,\quad \sum_{i}^{d}\gamma_i = k \qquad (7.13)$$
The above equation needs to be minimized in two variables β and γ. Equation (7.13)
is an example of a bi-convex problem. A function f : X × Y → R is bi-convex, if
f(x, y) is convex in y for a fixed x ∈ X, and f(x, y) is convex in x for a fixed y ∈ Y .
Equation (7.13) is quadratic in β when γ is a constant and is therefore convex in β.
Similarly, Equation (7.13) is linear in γ when β is constant and is therefore convex
in γ. Therefore, Equation (7.13) is a bi-convex problem. Under these conditions,
an alternating minimization approach is guaranteed to converge to a critical point
(Gorski et al. (2007)). Hence, an alternating optimization procedure is applied to estimate
β and γ. When γ is fixed, the solution for β is a standard MMD solution which
is solved by quadratic programming. When β is fixed, the solution for γ can be
expressed as a linear problem given by,
$$\min_{\alpha:\;0\le\alpha_p\le 1,\;\sum_p \alpha_p = k}\;\sum_{p=1}^{d}\alpha_p v_p \qquad (7.14)$$
where $v_p$ is given by:

$$v_p = \frac{1}{n_s^2}\,\beta^\top K^s_p\,\beta \;-\; \frac{2}{n_s n_t}\,\beta^\top K^{st}_p\mathbf{1} \;+\; \frac{1}{n_t^2}\,\mathbf{1}^\top K^t_p\mathbf{1}$$
The constraint $\sum_p \alpha_p = k$ ensures that at least $k$ features are selected. The $k$ features
to be chosen are given by the indices of the $k$ largest elements of $\mathbf{v} = [v_1, \ldots, v_d]$.
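Given a fixed β, this γ-step reduces to scoring each feature kernel with v_p and picking the top k, as sketched below (illustrative names; the per-feature kernels are passed as lists, and the coefficients follow the γ_p terms of Equation (7.13)):

```python
import numpy as np

def score_features(beta, Ks_list, Kst_list, Kt_list, k):
    """Compute v_p for every feature kernel and return the indices of the
    k largest scores, following the selection rule in the text."""
    ns = len(beta)
    nt = Kt_list[0].shape[0]
    one_t = np.ones(nt)
    v = np.array([
        (beta @ Ks @ beta) / ns**2              # source-source term
        - 2.0 / (ns * nt) * (beta @ Kst @ one_t)  # source-target term
        + (one_t @ Kt @ one_t) / nt**2          # target-target term
        for Ks, Kst, Kt in zip(Ks_list, Kst_list, Kt_list)
    ])
    return np.sort(np.argsort(v)[-k:])
```

Alternating this scoring step with the β update of the previous subsection yields the overall bi-convex procedure.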
7.4 Experiments
For the experiments, the digit datasets MNIST (LeCun et al. (1998)) and USPS
(Roweis (2000)) were used. Some examples from both datasets are
depicted in Figure (7.5). To the naked eye, there is a clear difference between the
images in the two datasets. 500 samples were chosen from each dataset. The
digit images were resized to 16 × 16 pixels, resulting in images of 256 dimensions.
Equation (7.13) was solved using alternating minimization.
Figure (7.6a) depicts the original datasets. The data points have been reduced
to 2 dimensions using the unsupervised tSNE (Van der Maaten and Hinton (2008)).
The two datasets cluster separately, showing the distinct difference between the data
points in the two domains. Equation (7.13) was solved with MNIST as the source and
USPS as the target dataset, and 100 features were selected out of 256 to reduce the
domain disparity. Figure (7.6b) depicts the tSNE embeddings of the two datasets using
the reduced features. The size of the source data points indicates their importance
Figure 7.5: Example digits from the MNIST and the USPS datasets. There is a visible distinction between the digits from the two datasets.
(β) for domain adaptation. Similarly, Equation (7.13) was solved with USPS as the
source dataset and MNIST as the target dataset and 100 features were selected out
of 256 to reduce the domain disparity. The quadratic problem (estimating β) was
solved using MATLAB’s quadprog function and the linear problem (estimating γ)
was solved using MATLAB’s linprog function. In both cases, the alternating
optimization converges in a few iterations. Figure (7.6c) depicts the tSNE embeddings
of the two datasets using the reduced features. In both Figures (7.6b) and (7.6c), the datasets
appear to have much more overlap compared to Figure (7.6a). Table (7.5) compares the
classification accuracies for unsupervised domain adaptation using the two datasets.
Table 7.5: Unsupervised domain adaptation experiments with digit data. In all cases, the classifier is trained with source data and tested on the target. MNIST→USPS means MNIST is the source and USPS is the target. SVM: regular linear SVM. SVM(β): source data points are weighted by β. SVM(γ): source and target features are selected with γ. SVM(β, γ): source data points are weighted by β and source and target features are selected with γ.
Expt. SVM SVM(β) SVM(γ) SVM(β, γ)
MNIST→ USPS 36.6% 36.2% 35.2% 41.8%
USPS→ MNIST 21.8% 26.6% 23.8% 24.0%
In the MNIST→USPS experiment, there is a marked improvement using both the
weighted source and feature selection. The results in Table (7.5) show that feature
selection combined with instance sampling gives better accuracies than feature selec-
tion or instance sampling alone in the case of MNIST→USPS. In the USPS→MNIST
experiment, there is not much improvement, and instance sampling alone
appears to perform better than feature selection combined with instance
sampling. In these experiments, the model estimation (β and γ) and classification are
carried out in two steps. A more rigorous implementation would be to include both
model estimation and classification in the same step, thereby learning an efficient
classifier for the source with weighted training data points and reduced dimensions.
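For intuition, the two-step use of the learned β and γ can be sketched with a toy weighted nearest-centroid classifier standing in for the SVM of the experiments (all names illustrative):

```python
import numpy as np

def weighted_centroid_predict(Xs, ys, Xt, beta, feat_idx):
    """Fit weighted class centroids on the selected source features and
    predict target labels by nearest centroid (a toy stand-in for the
    weighted SVM used in the experiments)."""
    Xs_, Xt_ = Xs[:, feat_idx], Xt[:, feat_idx]
    classes = np.unique(ys)
    centroids = np.array([
        (beta[ys == c, None] * Xs_[ys == c]).sum(0) / beta[ys == c].sum()
        for c in classes
    ])
    dists = ((Xt_[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return classes[dists.argmin(1)]
```

Here β shifts the centroids toward target-like source points and feat_idx plays the role of the γ-selected features; a joint formulation would instead fold both into the classifier's training objective.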
(a)
(b)
(c)
Figure 7.6: The tSNE embedded images depicting both datasets in two dimensions. (a) The two datasets are clearly different distributions; there appears to be little overlap between the clustered digits in the two datasets. (b) tSNE image after solving Equation (7.13) with MNIST as source and USPS as target and selecting 100 features. The importance of the source data points β is denoted by the size of the unfilled circles. There is more overlap between the source and target datasets when compared to (a). (c) tSNE image after solving Equation (7.13) with USPS as source and MNIST as target and selecting 100 features.
7.5 Conclusions and Summary
There is scope for improvement in the area of mutual information based feature
selection. Feature selection is an NP-hard problem, and newer methods to approximate
the solution will help drive research in this area. Currently, information gain based
feature selection relies on conditional independence assumptions that are similar to
Naïve Bayes models. While mutual information and conditional mutual informa-
tion are the existing criteria for determining the importance of 2 or 3 features at a
time, there is a need to derive measures that better approximate the importance of a
group of selected features. In the area of feature selection for domain adaptation,
the proposed model is a promising start. Although the preliminary results on feature
selection are encouraging, more experiments need to be conducted to firmly establish
the hypothesis that feature selection can reduce domain disparity.
In summary, there were two parts to this chapter. In the first part, two global pro-
cedures for feature selection were introduced, namely the Truncated Power Method
(TPower) and the Low Rank Bilinear Approximation (LowRank). The experiments
compared the proposed techniques with existing feature selectors and showed that
both TPower and LowRank perform better than existing global and iterative tech-
niques across most of the datasets. While LowRank slightly outperforms TPower,
it does not compare well in terms of execution time. The role of conditional mutual infor-
mation in feature selection was also demonstrated through the theorems. In the second
half of the chapter, a nonlinear feature selection and instance selection model for do-
main adaptation was introduced. Preliminary experiments were conducted with the digit
datasets MNIST and USPS. These results indicate that feature selection can be used
to increase domain overlap.
Chapter 8
DOMAIN ADAPTATION - FUTURE DIRECTIONS
This chapter proposes some directions for future research in domain adaptation.
8.1 Understanding Domain Shift
The concept of a domain has been defined vaguely in computer vision. Images
from different datasets are viewed as belonging to different domains. It is true that
datasets have an inherent bias and images from a dataset have certain properties that
can help identify the dataset (Torralba and Efros (2011)). However, there has been
limited effort in understanding what creates this bias and on modeling the domain
shift between datasets. The authors in Tommasi et al. (2016) attempt to identify the
domainness - a measure for domain specificity of an image.
The difficult problem of modeling domain shift in computer vision has been rarely
addressed. There has been work on identifying domains from a mixture of multiple
datasets and then studying transfer of knowledge between the domains (Gong et al.
(2013b)). Although this does not necessarily model a domain, it provides some di-
rection towards identifying a domain through analysis. The difficulties of modeling
domain shift in computer vision mostly arise due to variations in representation and
not merely variations in the data being represented. The very process of imaging
(camera perspective, occlusion), storage (resolution, size) and representation (color,
brightness, contrast) can lead to variations. Image background (context) is another
cause of variation. Finally, the diversity of the ‘real data’ itself could also lead to
variations in its images.
Most domain adaptation systems create adaptive models that perform generic
domain adaptation. The models are often guided by the datasets that are used. On
the other hand, it might be beneficial to tailor the adaptation model to a specific
variation in the data. This would however need a comprehensive understanding of
domain shift. It might also lead to task-specific domain adaptation models based on
the nature of domain-shift, leading to increased adoption of domain adaptation in
real world applications.
8.2 Datasets
The standard datasets for computer vision based domain adaptation are, facial
expression datasets CKPlus (Lucey et al. (2010)) and MMI (Pantic et al. (2005)),
digit datasets SVHN (Netzer et al. (2011)), USPS and MNIST (Jarrett et al. (2009)),
head pose recognition datasets PIE (Long et al. (2013)), object recognition datasets
COIL (Long et al. (2013)), Office (Saenko et al. (2010)) and Office-Caltech (Gong
et al. (2012a)).
The following reasons outline the need for new datasets.
1. The currently available datasets for domain adaptation are not suitable for deep
learning. These datasets were created before deep learning became popular
and are insufficient for training and evaluating deep learning based domain
adaptation approaches. These datasets are small with a limited number of
categories. For instance, the most popular object-recognition dataset, Office,
has 4,110 images across 31 categories, and the newly introduced Office-Home
has 15,000 images across 65 categories. A deep learning model with millions
of parameters requires millions of images for training. Current approaches fine-
tune pre-trained deep networks to avoid over-fitting issues. In order to train
large scale adaptive systems, deep learning is the preferred model, which in turn
demands large datasets.
2. The datasets were developed for evaluating generic domain adaptation. As also
discussed in the previous section, these datasets were created without modeling
domain shift. The domain boundaries are determined by dataset identity and
not by the nature of domain shift. When datasets are modeled on a specific
domain shift it will lead to the generation of task-specific domain adaptation
models leading to a rich family of domain adaptation models. This will in turn
lead to widespread adoption of domain adaptation for real world problems.
Modeling of domain-shift and creation of new datasets complement each other.
8.3 Generative Models
Generative models, like Generative Adversarial Networks (GANs) are currently
very popular in the research community. They have a wide range of applications,
including image super-resolution (Ledig et al. (2016)), text-to-image generation (Reed
et al. (2016); Zhang et al. (2016)), image-to-image generation (Isola et al. (2016))
and conditional image generation (Nguyen et al. (2016); Chen et al. (2016)). The
intuition behind the idea of generative models is based on a quote from physicist
Richard Feynman:

‘What I cannot create, I do not understand.’
It is common knowledge that when humans see an object for the first time, they form
a mental image of the object and call upon that generated image when referencing the
object at a later point in time. They are also able to understand the object based on
their previous learning of similar looking objects. For example, they can guess the texture
and weight of an object or ‘imagine’ how the object would appear when viewed from a
different perspective. All of this is perhaps due to the generative and transfer
learning models in the brain that have evolved and been perfected over millions of years.
Based on this common understanding of the human brain and encouraged by the
current success of generative adversarial models for domain adaptation, it would not
be farfetched to hypothesize that transfer learning and generative models are closely
related. In order to advance transfer learning and domain adaptation to the next
level, generative models appear to hold the most promise.
8.4 Aligning Joint Distributions
Current forms of adaptation merely align the marginal distributions of the source
and the target PS(X) and PT (X). The popular Maximum Mean Discrepancy (MMD)
measure from Borgwardt et al. (2006), is often applied to align the marginal distribu-
tions of the source and target data, as described in this instance selection approach
Gong et al. (2013a). The goal of domain adaptation is not merely aligning the do-
mains but also being able to use the models trained on the source upon the target.
In most cases, the domain adaptive models are created for classification. It would
therefore make more sense to align the joint distributions PS(X, Y ) with PT (X, Y )
rather than merely the marginal distributions. The alignment of joint distributions
will make a classifier trained on the source, an effective classifier for the target.
The challenge with this approach is that target labels are not available in unsu-
pervised domain adaptation. The workaround is to impute the target data labels and
refine them iteratively. There has been work in this regard as in Long et al. (2013),
where the joint distributions are aligned in a spectral method using Kernel-PCA,
by imputing the labels and refining them over multiple iterations. A deep learning
approach has also been attempted in this regard in Long et al. (2016a), using a trans-
ductive approach to learn the target labels while also minimizing the joint domain
discrepancy. Conditional generative models (discussed in Section 8.3), along with
joint distribution alignment could be the next wave of domain adaptation models.
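The impute-and-refine loop described above can be sketched as follows. A simple nearest-centroid classifier stands in for the kernel-PCA and deep models of Long et al., so this is only a schematic of the iteration, not their algorithm:

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    # Class centroids; assumes every class index appears in y.
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict(centroids, X):
    # Assign each sample to its nearest class centroid.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def refine_pseudo_labels(Xs, ys, Xt, n_classes, iters=5):
    # 1) impute target labels with a source-only classifier,
    # 2) re-fit on source + pseudo-labeled target, 3) repeat.
    yt = predict(fit_centroids(Xs, ys, n_classes), Xt)
    for _ in range(iters):
        X = np.vstack([Xs, Xt])
        y = np.concatenate([ys, yt])
        yt = predict(fit_centroids(X, y, n_classes), Xt)
    return yt
```

Each pass re-estimates the class statistics on the union of the two domains, so the imputed target labels and the class models improve together.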
8.5 Person-Centered Domain Adaptation
Computing is fast becoming pervasive. The environment is embedded with
computing devices, and the average person carries several smart devices
such as a phone, a watch and a wristband. Can this computing be adapted to every
user? Computing that adapts to a user's needs and idiosyncrasies is called
Person-Centered Computing (PCC) (Panchanathan et al. (2012)). Under this paradigm,
personal devices model their interaction and response based on the user's
needs, rather than following a one-size-fits-all approach in which users train
themselves to interact effectively with their devices. This paradigm, where the
user and the device adapt to each other, is termed co-adaptation.
These personalized devices will need core functional components that make them
applicable to a broad range of users. In addition, they must have co-adaptive
components that customize the device to the individual user. The device must
adapt to the user based on patterns gathered from the user's interaction with it.
The learning models for co-adaptation will be based on unsupervised domain
adaptation, gleaning patterns from unlabeled user interaction data. There has so
far been no work in the domain adaptation literature on person-centered device
adaptation; such person-centered adaptive models would make technology more
accessible, especially to individuals with disabilities.
Chapter 9
SUMMARY
The dissertation approaches the problem of domain adaptation through the lens of
feature spaces. The literature survey was presented along these lines and provides a
new perspective on domain adaptation approaches. The dissertation presented three
models for domain adaptation based on linear, nonlinear and hierarchical feature
spaces.
The linear model was a max-margin solution with a pair of linear decision boundaries
for the source and the target that were learned simultaneously. The nonlinear
model was a kernel-PCA solution based on domain alignment using maximum mean
discrepancy. The method also embedded data onto a manifold to ensure enhanced
classification. A highlight of this work was a validation procedure that can be used
to learn model parameters for unsupervised domain adaptation. The hierarchical
method was a deep learning model based on estimating domain-aligned hash values
for the source and target data. This work introduced an unsupervised hash loss for
the unlabeled target. The dissertation introduced the Office-Home dataset for
domain adaptation, consisting of 4 domains and around 15,000 images. It also proposed
a nonlinear method for domain adaptation based on feature selection.
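To make the hashing idea concrete, the following is a minimal sketch of a pairwise-similarity hash objective over relaxed ±1 codes. The linear projection and squared loss are simplifying assumptions for illustration; the dissertation's model learns the codes with a deep network and adds an unsupervised loss for the target:

```python
import numpy as np

def hash_codes(X, W):
    # Binary +/-1 codes from the sign of a linear projection
    # (np.sign maps exact zeros to 0; real-valued inputs rarely hit this).
    return np.sign(X @ W)

def pairwise_hash_loss(B, y):
    # Push same-class pairs toward identical codes and different-class
    # pairs toward opposite codes: s_ij is +1 or -1, compared against the
    # normalized code similarity (B @ B.T) / K, which lies in [-1, 1].
    K = B.shape[1]
    S = np.where(y[:, None] == y[None, :], 1.0, -1.0)
    sim = (B @ B.T) / K
    return ((S - sim) ** 2).mean()
```

A loss of zero means every same-class pair shares a code and every different-class pair has complementary codes, which is the property that makes Hamming-distance retrieval and classification work.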
The solutions for domain adaptation developed over the years can be viewed
as dataset or feature specific. Although the proposed models have been tested on
different datasets, the range of problems addressed has been narrow. The
gamut of applications that can be tackled by domain adaptation has not been fully
explored, because the lack of variety in datasets has restricted the range of domain
adaptation problems that have been studied. The chapter on future directions provides some
insights into promising areas of research in domain adaptation. The advent of deep
learning, and its unprecedented success in tackling domain adaptation, will eventually
lead to the introduction of new datasets and more generic problem statements in domain
adaptation.
BIBLIOGRAPHY
Ajakan, H., P. Germain, H. Larochelle, F. Laviolette and M. Marchand, “Domain-adversarial neural networks”, arXiv preprint arXiv:1412.4446 (2014).
Aytar, Y. and A. Zisserman, “Tabula rasa: Model transfer for object category detection”, in “Proceedings of the IEEE Intl. Conf. on Computer Vision (ICCV)”, pp. 2252–2259 (2011).
Bakker, B. and T. Heskes, “Task clustering and gating for Bayesian multitask learning”, J. Mach. Learn. Res. 4, 83–99 (2003).
Belkin, M. and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation”, Neural computation 15, 6, 1373–1396 (2003).
Ben-David, S., J. Blitzer, K. Crammer, A. Kulesza, F. Pereira and J. W. Vaughan, “A theory of learning from different domains”, Machine learning 79, 1-2, 151–175 (2010).
Bengio, Y., A. Courville and P. Vincent, “Representation learning: A review and new perspectives”, IEEE Trans. on Pattern Analysis and Machine Intelligence 35, 8, 1798–1828 (2013).
Bengio, Y. et al., “Deep learning of representations for unsupervised and transfer learning”, Workshops, Proceedings of the ACM Intl. Conf. on Machine Learning (ICML) 27, 17–36 (2012).
Bickel, S., M. Bruckner and T. Scheffer, “Discriminative learning under covariate shift”, J. Mach. Learn. Res. 10, 2137–2155 (2009).
Blitzer, J., K. Crammer, A. Kulesza, F. Pereira and J. Wortman, “Learning bounds for domain adaptation”, in “Advances in Neural Information Processing Systems (NIPS)”, pp. 129–136 (2008).
Borgwardt, K. M., A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Scholkopf and A. J. Smola, “Integrating structured biological data by kernel maximum mean discrepancy”, Bioinformatics 22, 14, e49–e57 (2006).
Bousmalis, K., N. Silberman, D. Dohan, D. Erhan and D. Krishnan, “Unsupervised pixel-level domain adaptation with generative adversarial networks”, in “accepted to the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, (2017).
Bousmalis, K., G. Trigeorgis, N. Silberman, D. Krishnan and D. Erhan, “Domain separation networks”, in “Advances in Neural Information Processing Systems (NIPS)”, pp. 343–351 (2016).
Brown, G., A. Pocock, M.-J. Zhao and M. Lujan, “Conditional likelihood maximisation: a unifying framework for information theoretic feature selection”, J. Mach. Learn. Res. 13, 1, 27–66 (2012).
Bruzzone, L. and M. Marconcini, “Domain adaptation problems: A DASVM classification technique and a circular validation strategy”, IEEE Trans. on Pattern Analysis and Machine Intelligence 32, 5, 770–787 (2010).
Carreira-Perpinan, M. A. and R. Raziperchikolaei, “Hashing with binary autoencoders”, in “Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, pp. 557–566 (2015).
Caruana, R., “Multitask learning”, Machine learning 28, 1, 41–75 (1997).
Castrejon, L., Y. Aytar, C. Vondrick, H. Pirsiavash and A. Torralba, “Learning aligned cross-modal representations from weakly aligned data”, in “Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, (2016).
Chapelle, O., B. Scholkopf and A. Zien, eds., Semi-Supervised Learning (MIT Press, Cambridge, 2006).
Chatfield, K., V. Lempitsky, A. Vedaldi and A. Zisserman, “The devil is in the details: an evaluation of recent feature encoding methods”, in “Proceedings of the British Machine Vision Conference (BMVC)”, (2011a).
Chatfield, K., V. S. Lempitsky, A. Vedaldi and A. Zisserman, “The devil is in the details: an evaluation of recent feature encoding methods”, in “British Machine Vision Conference (BMVC)”, p. 8 (2011b).
Chatfield, K., K. Simonyan, A. Vedaldi and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets”, in “British Machine Vision Conference (BMVC)”, (2014).
Chattopadhyay, R., W. Fan, I. Davidson, S. Panchanathan and J. Ye, “Joint transfer and batch-mode active learning”, in “Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)”, pp. 253–261 (2013).
Chattopadhyay, R., Q. Sun, W. Fan, I. Davidson, S. Panchanathan and J. Ye, “Multisource domain adaptation and its application to early detection of fatigue”, ACM Transactions on Knowledge Discovery from Data (TKDD) 6, 4, 18 (2012).
Chen, X., Y. Duan, R. Houthooft, J. Schulman, I. Sutskever and P. Abbeel, “InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets”, in “Advances in Neural Information Processing Systems (NIPS)”, pp. 2172–2180 (2016).
Chopra, S., S. Balakrishnan and R. Gopalan, “DLID: Deep learning for domain adaptation by interpolating between domains”, in “ICML Workshop on Challenges in Representation Learning”, (2013).
Chu, W.-S., F. De la Torre and J. F. Cohn, “Selective transfer machine for personalized facial action unit detection”, in “Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, pp. 3515–3522 (2013).
Chung, F. R., Spectral Graph Theory, vol. 92 (American Mathematical Soc., 1997).
Collobert, R. and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning”, in “Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)”, pp. 160–167 (2008).
Csurka, G., “Domain adaptation for visual applications: A comprehensive survey”, arXiv preprint arXiv:1702.05374 (2017).
Daume III, H., “Frustratingly easy domain adaptation”, in “Conference of the Association for Computational Linguistics (ACL)”, (2007).
Daume III, H., A. Kumar and A. Saha, “Frustratingly easy semi-supervised domain adaptation”, in “Workshop on Domain Adaptation for NLP”, (2010), URL http://hal3.name/docs/#daume10easyss.
Davis, J. V., B. Kulis, P. Jain, S. Sra and I. S. Dhillon, “Information-theoretic metric learning”, in “Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)”, pp. 209–216 (2007).
Deng, J., W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database”, in “Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, (2009).
Ding, C. and H. Peng, “Minimum redundancy feature selection from microarray gene expression data”, Journal of bioinformatics and computational biology 3, 02, 185–205 (2005).
Do, T.-T., A.-D. Doan and N.-M. Cheung, “Learning to hash with binary deep neural network”, in “European Conference on Computer Vision”, pp. 219–234 (Springer, 2016).
Donahue, J., Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng and T. Darrell, “DeCAF: A deep convolutional activation feature for generic visual recognition”, in “Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)”, vol. 32, pp. 647–655 (2014).
Duan, L., I. W. Tsang and D. Xu, “Domain transfer multiple kernel learning”, IEEE Trans. on Pattern Analysis and Machine Intelligence 34, 3, 465–479 (2012).
Duan, L., I. W. Tsang, D. Xu and S. J. Maybank, “Domain transfer SVM for video concept detection”, in “Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, pp. 1375–1381 (2009).
Dudik, M., S. J. Phillips and R. E. Schapire, “Correcting sample selection bias in maximum entropy density estimation”, in “Advances in Neural Information Processing Systems (NIPS)”, pp. 323–330 (2005).
Everingham, M., S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn and A. Zisserman, “The pascal visual object classes challenge: A retrospective”, International Journal of Computer Vision (IJCV) 111, 1, 98–136 (2015).
Evgeniou, A. and M. Pontil, “Multi-task feature learning”, Advances in Neural Information Processing Systems (NIPS) 19, 41 (2007).
Evgeniou, T. and M. Pontil, “Regularized multi-task learning”, in “Proceedings of the ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining”, pp. 109–117 (2004).
Fan, R.-E., K.-W. Chang, C.-J. Hsieh, X.-R. Wang and C.-J. Lin, “LIBLINEAR: A library for large linear classification”, J. Mach. Learn. Res. 9, 1871–1874 (2008).
Fei, G., S. Wang and B. Liu, “Learning cumulatively to become more knowledgeable”, in “Proceedings of the ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining”, pp. 1565–1574 (ACM, 2016).
Fei-Fei, L., R. Fergus and P. Perona, “One-shot learning of object categories”, IEEE Trans. on Pattern Analysis and Machine Intelligence 28, 4, 594–611 (2006).
Fernando, B., A. Habrard, M. Sebban and T. Tuytelaars, “Unsupervised visual domain adaptation using subspace alignment”, in “Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, pp. 2960–2967 (2013).
Ganin, Y., E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand and V. Lempitsky, “Domain-adversarial training of neural networks”, J. Mach. Learn. Res. 17, 59, 1–35 (2016).
Garey, M. R. and D. S. Johnson, Computers and Intractability; A Guide to the Theory of NP-Completeness (W. H. Freeman & Co., New York, NY, USA, 1990).
Ghifary, M., W. B. Kleijn, M. Zhang, D. Balduzzi and W. Li, “Deep reconstruction-classification networks for unsupervised domain adaptation”, in “Proceedings of the European Conf. on Computer Vision (ECCV)”, pp. 597–613 (2016).
Glorot, X., A. Bordes and Y. Bengio, “Domain adaptation for large-scale sentiment classification: A deep learning approach”, in “Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)”, pp. 513–520 (2011).
Goemans, M. X. and D. P. Williamson, “Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming”, Journal of the ACM (JACM) 42, 6, 1115–1145 (1995).
Gong, B., K. Grauman and F. Sha, “Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation”, in “Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)”, pp. 222–230 (2013a).
Gong, B., K. Grauman and F. Sha, “Reshaping visual datasets for domain adaptation”, in “Advances in Neural Information Processing Systems (NIPS)”, pp. 1286–1294 (2013b).
Gong, B., Y. Shi, F. Sha and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation”, in “Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, (2012a).
Gong, P., J. Ye and C. Zhang, “Robust multi-task feature learning”, in “Proceedings of the ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining”, pp. 895–903 (2012b).
Gong, Y. and S. Lazebnik, “Iterative quantization: A procrustean approach to learning binary codes”, in “Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, pp. 817–824 (2011).
Gong, Y., S. Lazebnik, A. Gordo and F. Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval”, IEEE Trans. on Pattern Analysis and Machine Intelligence 35, 12, 2916–2929 (2013c).
Goodfellow, I., Y. Bengio and A. Courville, Deep Learning (MIT Press, 2016), http://www.deeplearningbook.org.
Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio, “Generative adversarial nets”, in “Advances in Neural Information Processing Systems (NIPS)”, pp. 2672–2680 (2014).
Gopalan, R., R. Li and R. Chellappa, “Domain adaptation for object recognition: An unsupervised approach”, in “Proceedings of the IEEE Intl. Conf. on Computer Vision (ICCV)”, pp. 999–1006 (IEEE, 2011).
Gopalan, R., R. Li and R. Chellappa, “Unsupervised adaptation across domain shifts by generating intermediate data representations”, IEEE Trans. on Pattern Analysis and Machine Intelligence 36, 11, 2288–2302 (2014).
Gorski, J., F. Pfeuffer and K. Klamroth, “Biconvex sets and optimization with biconvex functions: a survey and extensions”, Mathematical Methods of Operations Research 66, 3, 373–407 (2007).
Grant, M. and S. Boyd, “CVX: Matlab software for disciplined convex programming, version 2.1”, http://cvxr.com/cvx (2014).
Gretton, A., K. M. Borgwardt, M. Rasch, B. Scholkopf, A. J. Smola et al., “A kernel method for the two-sample-problem”, Advances in Neural Information Processing Systems (NIPS) 19, 513 (2007).
Gretton, A., D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu and B. K. Sriperumbudur, “Optimal kernel choice for large-scale two-sample tests”, in “Advances in Neural Information Processing Systems (NIPS)”, pp. 1205–1213 (2012).
Gretton, A., A. Smola, J. Huang, M. Schmittfull, K. Borgwardt and B. Scholkopf, “Covariate shift by kernel mean matching”, Dataset shift in machine learning 3, 4, 5 (2009).
Griffin, G., A. Holub and P. Perona, “Caltech-256 object category dataset”, Tech. Rep. 7694, California Institute of Technology, URL http://authors.library.caltech.edu/7694 (2007).
He, K., F. Wen and J. Sun, “K-Means hashing: An affinity-preserving quantization method for learning binary compact codes”, in “Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, pp. 2938–2945 (2013).
He, K., X. Zhang, S. Ren and J. Sun, “Deep residual learning for image recognition”, in “Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, pp. 770–778 (2016).
Heckman, J. J., “Sample selection bias as a specification error”, Econometrica: Journal of the econometric society pp. 153–161 (1979).
Helleputte, T. and P. Dupont, “Feature selection by transfer learning with linear regularized models”, in “Machine Learning and Knowledge Discovery in Databases”, pp. 533–547 (Springer, 2009).
Herman, G., B. Zhang, Y. Wang, G. Ye and F. Chen, “Mutual information-based method for selecting informative feature sets”, Pattern Recognition 46, 12, 3315–3327 (2013).
Hochreiter, S., Y. Bengio, P. Frasconi and J. Schmidhuber, “Gradient flow in recurrent nets: the difficulty of learning long-term dependencies”, (2001).
Hoffman, J., B. Kulis, T. Darrell and K. Saenko, “Discovering latent domains for multisource domain adaptation”, in “Proceedings of the European Conf. on Computer Vision (ECCV)”, pp. 702–715 (2012).
Hoffman, J., E. Rodner, J. Donahue, K. Saenko and T. Darrell, “Efficient learning of domain-invariant image representations”, in “Intl. Conf. on Learning Representations (ICLR)”, (2013).
Hu, J., J. Lu and Y.-P. Tan, “Deep transfer metric learning”, in “Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, pp. 325–333 (2015).
Huang, J., A. Gretton, K. M. Borgwardt, B. Scholkopf and A. J. Smola, “Correcting sample selection bias by unlabeled data”, in “Advances in Neural Information Processing Systems (NIPS)”, pp. 601–608 (2006).
Isola, P., J.-Y. Zhu, T. Zhou and A. A. Efros, “Image-to-image translation with conditional adversarial networks”, arXiv preprint arXiv:1611.07004 (2016).
Jarrett, K., K. Kavukcuoglu, Y. Lecun et al., “What is the best multi-stage architecture for object recognition?”, in “Proceedings of the IEEE Intl. Conf. on Computer Vision (ICCV)”, pp. 2146–2153 (2009).
Jhuo, I.-H., D. Liu, D. Lee and S.-F. Chang, “Robust visual domain adaptation with low-rank reconstruction”, in “Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, pp. 2168–2175 (2012).
Jiang, W., F. Nie, F.-l. K. Chung and H. Huang, “Algorithm and theoretical analysis for domain adaptation feature learning with linear classifiers”, arXiv preprint arXiv:1509.01710 (2015).
Joachims, T., “Transductive inference for text classification using support vector machines”, in “Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)”, vol. 99, pp. 200–209 (1999).
Kamnitsas, K., C. Baumgartner, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, A. Nori, A. Criminisi, D. Rueckert et al., “Unsupervised domain adaptation in brain lesion segmentation with adversarial networks”, in “International Conference on Information Processing in Medical Imaging (IPMI) 2017”, (2016).
Kang, Z., K. Grauman and F. Sha, “Learning with whom to share in multi-task feature learning”, in “Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)”, pp. 521–528 (2011).
Kantorov, V. and I. Laptev, “Efficient feature extraction, encoding, and classification for action recognition”, in “Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, (2014).
Kifer, D., S. Ben-David and J. Gehrke, “Detecting change in data streams”, in “Proceedings of the Thirtieth International Conference on Very Large Data Bases, Volume 30”, pp. 180–191 (VLDB Endowment, 2004).
Kimeldorf, G. S. and G. Wahba, “A correspondence between Bayesian estimation on stochastic processes and smoothing by splines”, The Annals of Mathematical Statistics 41, 2, 495–502 (1970).
Kingma, D. P. and M. Welling, “Auto-encoding variational bayes”, arXiv preprint arXiv:1312.6114 (2013).
Koniusz, P., Y. Tas and F. Porikli, “Domain adaptation by mixture of alignments of second- or higher-order scatter tensors”, in “accepted to the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, (2017).
Krizhevsky, A., I. Sutskever and G. E. Hinton, “Imagenet classification with deep convolutional neural networks”, in “Advances in Neural Information Processing Systems (NIPS)”, pp. 1097–1105 (2012).
Kuehne, H., H. Jhuang, E. Garrote, T. Poggio and T. Serre, “HMDB: a large video database for human motion recognition”, in “Proceedings of the IEEE Intl. Conf. on Computer Vision (ICCV)”, (2011).
Kulis, B., K. Saenko and T. Darrell, “What you saw is not what you get: Domain adaptation using asymmetric kernel transforms”, in “Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, pp. 1785–1792 (2011).
Kumar, A. and H. Daume III, “Learning task grouping and overlap in multi-task learning”, in “Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)”, (2012), URL http://hal3.name/docs/#daume12gomtl.
Kurgan, L. A. and K. J. Cios, “CAIM discretization algorithm”, IEEE Transactions on Knowledge and Data Engineering 16, 2, 145–153 (2004).
Larochelle, H., D. Erhan and Y. Bengio, “Zero-data learning of new tasks”, in “Proceedings of the AAAI Conf. on Artificial Intelligence”, vol. 1, p. 3 (2008).
LeCun, Y., C. Cortes and C. J. Burges, “The MNIST database of handwritten digits”, (1998).
Ledig, C., L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network”, arXiv preprint arXiv:1609.04802 (2016).
Lee, H., R. Grosse, R. Ranganath and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations”, in “Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)”, pp. 609–616 (2009).
Lewis, D. D., “Feature selection and feature extraction for text categorization”, in “Proceedings of the workshop on Speech and Natural Language”, pp. 212–217 (Association for Computational Linguistics, 1992).
Li, W., L. Duan, D. Xu and I. W. Tsang, “Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation”, IEEE Trans. on Pattern Analysis and Machine Intelligence 36, 6, 1134–1148 (2014).
Li, W.-J., S. Wang and W.-C. Kang, “Feature learning based deep supervised hashing with pairwise labels”, in “Proceedings of the Intl. Joint Conf. on Artificial Intelligence”, (2016).
Li, Z. and D. Hoiem, “Learning without forgetting”, in “Proceedings of the European Conf. on Computer Vision (ECCV)”, pp. 614–629 (Springer, 2016).
Lichman, M., “UCI machine learning repository”, URL http://archive.ics.uci.edu/ml (2013).
Liu, J., S. Ji and J. Ye, “Multi-task feature learning via efficient l2,1-norm minimization”, in “Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence”, pp. 339–348 (AUAI Press, 2009).
Liu, M.-Y. and O. Tuzel, “Coupled generative adversarial networks”, in “Advances in Neural Information Processing Systems (NIPS)”, pp. 469–477 (2016).
Long, M., Y. Cao, J. Wang and M. Jordan, “Learning transferable features with deep adaptation networks”, in “Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)”, pp. 97–105 (2015).
Long, M., J. Wang, G. Ding, J. Sun and P. S. Yu, “Transfer feature learning with joint distribution adaptation”, in “Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)”, pp. 2200–2207 (2013).
Long, M., J. Wang, G. Ding, J. Sun and P. S. Yu, “Transfer joint matching for unsupervised domain adaptation”, in “Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, (2014).
Long, M., J. Wang and M. I. Jordan, “Deep transfer learning with joint adaptation networks”, arXiv preprint arXiv:1605.06636 (2016a).
Long, M., H. Zhu, J. Wang and M. I. Jordan, “Unsupervised domain adaptation with residual transfer networks”, in “Advances in Neural Information Processing Systems (NIPS)”, (2016b).
Lucey, P., J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar and I. Matthews, “The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression”, in “Workshops, Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPRW)”, (2010).
Mairal, J., F. Bach, J. Ponce and G. Sapiro, “Online learning for matrix factorization and sparse coding”, J. Mach. Learn. Res. 11, 19–60 (2010).
Mansour, Y., M. Mohri and A. Rostamizadeh, “Domain adaptation with multiple sources”, in “Advances in Neural Information Processing Systems (NIPS)”, pp. 1041–1048 (2009).
Mensink, T., J. Verbeek, F. Perronnin and G. Csurka, “Distance-based image classification: Generalizing to new classes at near-zero cost”, IEEE Trans. on Pattern Analysis and Machine Intelligence 35, 11, 2624–2637 (2013).
Meyer, P. E., C. Schretter and G. Bontempi, “Information-theoretic feature selection in microarray data using variable complementarity”, IEEE Journal of Selected Topics in Signal Processing 2, 3, 261–274 (2008).
Netzer, Y., T. Wang, A. Coates, A. Bissacco, B. Wu and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning”, in “Workshops, Advances in Neural Information Processing Systems (NIPS)”, (2011), URL http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf.
Ng, A., “Hiring your first chief AI officer”, Harvard Business Review, URL https://hbr.org/2016/11/hiring-your-first-chief-ai-officer (2016).
Nguyen, A., J. Yosinski, Y. Bengio, A. Dosovitskiy and J. Clune, “Plug & play generative networks: Conditional iterative generation of images in latent space”, arXiv preprint arXiv:1612.00005 (2016).
Nguyen, X. V., J. Chan, S. Romano and J. Bailey, “Effective global approaches for mutual information based feature selection”, in “Proceedings of the ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining”, pp. 512–521 (2014).
Ni, J., Q. Qiu and R. Chellappa, “Subspace interpolation via dictionary learning for unsupervised domain adaptation”, in “Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, pp. 692–699 (2013).
Nowozin, S. and C. H. Lampert, “Structured learning and prediction in computer vision”, Foundations and Trends in Computer Graphics and Vision 6, 3-4, 185–365, URL http://dx.doi.org/10.1561/0600000033 (2011).
Obozinski, G., B. Taskar and M. I. Jordan, “Joint covariate selection and joint subspace selection for multiple classification problems”, Statistics and Computing 20, 2, 231–252 (2010).
Oord, A. v. d., N. Kalchbrenner and K. Kavukcuoglu, “Pixel recurrent neural networks”, arXiv preprint arXiv:1601.06759 (2016).
Oquab, M., L. Bottou, I. Laptev and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks”, in “Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, pp. 1717–1724 (2014).
Palatucci, M., D. Pomerleau, G. E. Hinton and T. M. Mitchell, “Zero-shot learning with semantic output codes”, in “Advances in Neural Information Processing Systems (NIPS)”, pp. 1410–1418 (2009).
Pan, S. J., J. T. Kwok and Q. Yang, “Transfer learning via dimensionality reduction”, in “Proceedings of the AAAI Conf. on Artificial Intelligence”, vol. 8, pp. 677–682 (2008).
Pan, S. J., I. W. Tsang, J. T. Kwok and Q. Yang, “Domain adaptation via transfer component analysis”, IEEE Trans. on Neural Networks 22, 2, 199–210 (2011).
Pan, S. J. and Q. Yang, “A survey on transfer learning”, IEEE Trans. on Knowledge and Data Engineering 22, 10, 1345–1359 (2010).
Panchanathan, S., T. McDaniel and V. Balasubramanian, “Person-centered accessible technologies: Improved usability and adaptation through inspirations from disability research”, in “Proceedings of the 2012 ACM workshop on User experience in e-learning and augmented technologies in education”, pp. 1–6 (ACM, 2012).
Pantic, M., M. Valstar, R. Rademaker and L. Maat, “Web-based database for facial expression analysis”, in “Proceedings of the IEEE Conf. on Multimedia and Expo (ICME)”, (2005).
Papailiopoulos, D., I. Mitliagkas, A. Dimakis and C. Caramanis, “Finding dense subgraphs via low-rank bilinear optimization”, in “Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)”, pp. 1890–1898 (2014).
Papailiopoulos, D. S., A. G. Dimakis and S. Korokythakis, “Sparse PCA through low-rank approximations”, in “Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)”, pp. 747–755 (2013).
Patel, V. M., R. Gopalan, R. Li and R. Chellappa, “Visual domain adaptation: A survey of recent advances”, IEEE signal processing magazine 32, 3, 53–69 (2015).
Peng, H., F. Long and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy”, IEEE Trans. on Pattern Analysis and Machine Intelligence 27, 8, 1226–1238 (2005).
Peng, X. and K. Saenko, “Synthetic to real adaptation with deep generative correlation alignment networks”, arXiv preprint arXiv:1701.05524 (2017).
Qiu, Q., V. M. Patel, P. Turaga and R. Chellappa, “Domain adaptive dictionary learning”, in “Proceedings of the European Conf. on Computer Vision (ECCV)”, pp. 631–645 (2012).
Quionero-Candela, J., M. Sugiyama, A. Schwaighofer and N. D. Lawrence, Dataset Shift in Machine Learning (The MIT Press, 2009).
Raina, R., A. Battle, H. Lee, B. Packer and A. Y. Ng, “Self-taught learning: transfer learning from unlabeled data”, in “Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)”, pp. 759–766 (2007).
Razavian, A. S., H. Azizpour, J. Sullivan and S. Carlsson, “CNN features off-the-shelf: an astounding baseline for recognition”, in “Workshops, Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPRW)”, pp. 512–519 (2014).
Rebuffi, S.-A., A. Kolesnikov and C. H. Lampert, “iCaRL: Incremental classifier and representation learning”, in “accepted to the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)”, (2017).
Reddy, K. K. and M. Shah, “Recognizing 50 human action categories of web videos”, Machine Vision and Applications 24, 5, 971–981 (2013).
Reed, S., Z. Akata, X. Yan, L. Logeswaran, B. Schiele and H. Lee, “Generative adversarial text to image synthesis”, in “Proceedings of The 33rd International Conference on Machine Learning”, vol. 3 (2016).
Rodriguez-Lujan, I., R. Huerta, C. Elkan and C. S. Cruz, “Quadratic programming feature selection”, J. Mach. Learn. Res. 11, 1491–1516 (2010).
Roweis, S., "The USPS database of handwritten digits", http://www.cs.nyu.edu/~roweis/data.html (2000).
Rozantsev, A., M. Salzmann and P. Fua, "Beyond sharing weights for deep domain adaptation", arXiv preprint arXiv:1603.06432 (2016).
Saenko, K., B. Kulis, M. Fritz and T. Darrell, "Adapting visual category models to new domains", in "Proceedings of the European Conf. on Computer Vision (ECCV)", (2010).
Sankaranarayanan, S., Y. Balaji, C. D. Castillo and R. Chellappa, "Generate to adapt: Aligning domains using generative adversarial networks", arXiv preprint arXiv:1704.01705 (2017).
Sener, O., H. O. Song, A. Saxena and S. Savarese, "Unsupervised transductive domain adaptation", arXiv preprint arXiv:1602.03534 (2016).
Shekhar, S., V. M. Patel, H. V. Nguyen and R. Chellappa, "Generalized domain-adaptive dictionaries", in "Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)", pp. 361–368 (2013).
Shimodaira, H., "Improving predictive inference under covariate shift by weighting the log-likelihood function", Journal of Statistical Planning and Inference 90, 2, 227–244 (2000).
Simonyan, K. and A. Zisserman, "Very deep convolutional networks for large-scale image recognition", CoRR abs/1409.1556 (2014).
Smola, A., A. Gretton, L. Song and B. Scholkopf, "A Hilbert space embedding for distributions", in "International Conference on Algorithmic Learning Theory", pp. 13–31 (Springer, 2007).
Socher, R., M. Ganjoo, C. D. Manning and A. Ng, "Zero-shot learning through cross-modal transfer", in "Advances in Neural Information Processing Systems (NIPS)", pp. 935–943 (2013).
Srivastava, N. and R. R. Salakhutdinov, "Multimodal learning with deep Boltzmann machines", in "Advances in Neural Information Processing Systems (NIPS)", pp. 2222–2230 (2012).
Sugiyama, M., S. Nakajima, H. Kashima, P. V. Buenau and M. Kawanabe, "Direct importance estimation with model selection and its application to covariate shift adaptation", in "Advances in Neural Information Processing Systems (NIPS)", pp. 1433–1440 (2008).
Sun, B., J. Feng and K. Saenko, "Return of frustratingly easy domain adaptation", in "Workshops, Proceedings of the IEEE Intl. Conf. on Computer Vision (ICCV TASK-CV)", (2015a).
Sun, B. and K. Saenko, "Deep CORAL: Correlation alignment for deep domain adaptation", in "Workshops, Proceedings of the European Conf. on Computer Vision (ECCV)", pp. 443–450 (2016).
Sun, C., S. Shetty, R. Sukthankar and R. Nevatia, "Temporal localization of fine-grained actions in videos by domain transfer from web images", in "Proceedings of the ACM Intl. Conf. on Multimedia (ACM-MM)", pp. 371–380 (2015b).
Sun, Q., R. Chattopadhyay, S. Panchanathan and J. Ye, "A two-stage weighting framework for multi-source domain adaptation", in "Advances in Neural Information Processing Systems (NIPS)", pp. 505–513 (2011).
Thrun, S., "Is learning the n-th thing any easier than learning the first?", in "Advances in Neural Information Processing Systems", pp. 640–646 (Morgan Kaufmann Publishers, 1996).
Thrun, S. and L. Pratt, eds., Learning to Learn (Kluwer Academic Publishers, Norwell, MA, USA, 1998).
Thrun, S. and L. Pratt, Learning to Learn (Springer Science & Business Media, 2012).
Toh, K.-C., M. J. Todd and R. H. Tutuncu, "SDPT3 MATLAB software package for semidefinite programming, version 1.3", Optimization Methods and Software 11, 1-4, 545–581 (1999).
Tommasi, T., M. Lanzi, P. Russo and B. Caputo, "Learning the roots of visual domain shift", in "Workshops, Proceedings of the IEEE Intl. Conf. on Computer Vision (ICCV TASK-CV)", pp. 475–482 (Springer, 2016).
Torralba, A. and A. A. Efros, "Unbiased look at dataset bias", in "Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)", pp. 1521–1528 (2011).
Tzeng, E., J. Hoffman, T. Darrell and K. Saenko, "Simultaneous deep transfer across domains and tasks", in "Proceedings of the IEEE Intl. Conf. on Computer Vision (ICCV)", pp. 4068–4076 (2015a).
Tzeng, E., J. Hoffman, T. Darrell and K. Saenko, "Simultaneous deep transfer across domains and tasks", in "Proceedings of the IEEE Intl. Conf. on Computer Vision (ICCV)", (2015b).
Tzeng, E., J. Hoffman, K. Saenko and T. Darrell, "Adversarial discriminative domain adaptation", Technical Report (2017).
Tzeng, E., J. Hoffman, N. Zhang, K. Saenko and T. Darrell, "Deep domain confusion: Maximizing for domain invariance", arXiv preprint arXiv:1412.3474 (2014).
Uguroglu, S. and J. Carbonell, "Feature selection for transfer learning", in "Machine Learning and Knowledge Discovery in Databases", pp. 430–442 (Springer, 2011).
Van der Maaten, L. and G. Hinton, "Visualizing data using t-SNE", Journal of Machine Learning Research 9, 2579–2605 (2008).
Vapnik, V., The Nature of Statistical Learning Theory (Springer Science & Business Media, 2013).
Vedaldi, A. and K. Lenc, "MatConvNet – convolutional neural networks for MATLAB", in "Proceedings of the ACM Intl. Conf. on Multimedia (ACM-MM)", (2015).
Venkateswara, H., V. N. Balasubramanian, P. Lade and S. Panchanathan, "Multiresolution match kernels for gesture video classification", in "Proceedings of the IEEE Intl. Conf. on Multimedia and Expo Workshops (ICME)", pp. 1–4 (2013).
Venkateswara, H., S. Chakraborty, T. McDaniel and S. Panchanathan, "Model selection with nonlinear embedding for unsupervised domain adaptation", in "KnowPros Workshop - Proceedings of the AAAI Conf. on Artificial Intelligence", (2017a).
Venkateswara, H., S. Chakraborty and S. Panchanathan, "Nonlinear embedding transform for unsupervised domain adaptation", in "Workshops, Proceedings of the European Conf. on Computer Vision (ECCV)", pp. 451–457 (Springer, 2016).
Venkateswara, H., J. Eusebio, S. Chakraborty and S. Panchanathan, "Deep hashing network for unsupervised domain adaptation", in "accepted to the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)", (2017b).
Venkateswara, H., P. Lade, B. Lin, J. Ye and S. Panchanathan, "Efficient approximate solutions to mutual information based global feature selection", in "Proceedings of the IEEE Intl. Conf. on Data Mining (ICDM)", pp. 1009–1014 (2015a).
Venkateswara, H., P. Lade, J. Ye and S. Panchanathan, "Coupled support vector machines for supervised domain adaptation", in "Proceedings of the ACM Intl. Conf. on Multimedia (ACM-MM)", pp. 1295–1298 (2015b).
Wang, J., H. T. Shen, J. Song and J. Ji, "Hashing for similarity search: A survey", arXiv preprint arXiv:1408.2927 (2014).
Weinberger, K., A. Dasgupta, J. Langford, A. Smola and J. Attenberg, "Feature hashing for large scale multitask learning", in "Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)", pp. 1113–1120 (2009).
Widmer, C., M. Kloft, N. Gornitz, G. Ratsch, P. Flach, T. De Bie and N. Cristianini, "Efficient training of graph-regularized multitask SVMs", in "Proceedings of the European Conference on Machine Learning (ECML)", (2012).
Wu, C., W. Wen, T. Afzal, Y. Zhang, Y. Chen and H. Li, "A compact DNN: Approaching GoogLeNet-level accuracy of classification and domain adaptation", in "accepted to the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)", (2017).
Yang, H. H. and J. Moody, "Data visualization and feature selection: New algorithms for non-Gaussian data", Advances in Neural Information Processing Systems (NIPS) 12 (1999).
Yang, J., R. Yan and A. G. Hauptmann, "Adapting SVM classifiers to data with shifted distributions", in "Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE International Conference on", pp. 69–76 (2007a).
Yang, J., R. Yan and A. G. Hauptmann, "Cross-domain video concept detection using adaptive SVMs", in "Proceedings of the ACM Intl. Conf. on Multimedia (ACM-MM)", pp. 188–197 (2007b).
Yang, J., K. Yu, Y. Gong and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification", in "Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)", pp. 1794–1801 (2009).
Yao, T., Y. Pan, C.-W. Ngo, H. Li and T. Mei, "Semi-supervised domain adaptation with subspace learning for visual recognition", in "Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)", pp. 2142–2150 (2015).
Yosinski, J., J. Clune, Y. Bengio and H. Lipson, "How transferable are features in deep neural networks?", in "Advances in Neural Information Processing Systems (NIPS)", pp. 3320–3328 (2014).
Yuan, X.-T. and T. Zhang, "Truncated power method for sparse eigenvalue problems", J. Mach. Learn. Res. 14, 1, 899–925 (2013).
Zadrozny, B., "Learning and evaluating classifiers under sample selection bias", in "Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)", p. 114 (2004).
Zhang, C., L. Zhang and J. Ye, "Generalization bounds for domain adaptation", in "Advances in Neural Information Processing Systems (NIPS)", pp. 3320–3328 (2012).
Zhang, H., T. Xu, H. Li, S. Zhang, X. Huang, X. Wang and D. Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks", arXiv preprint arXiv:1612.03242 (2016).
Zhang, K., K. Muandet, Z. Wang et al., "Domain adaptation under target and conditional shift", in "Proceedings of the ACM Intl. Conf. on Machine Learning (ICML)", pp. 819–827 (2013).
Zhu, H., M. Long, J. Wang and Y. Cao, "Deep hashing network for efficient similarity retrieval", in "Proceedings of the Thirtieth Conference on the Association for the Advancement of Artificial Intelligence", (2016).
APPENDIX A
LOWER BOUND FOR BQP
This section estimates the goodness of the approximation to BQP provided by the solution to LP2. An equivalent problem to BQP is defined as,

Proposition A.0.1. \(q : \{0,1\}^n \to (-\infty, 0]\), where \(q(\mathbf{x}) = \mathrm{BQP} - ||\mathbf{Q}||_1\), is equivalent to BQP.

Proof. \(||\mathbf{Q}||_1\) is the sum of all the elements in \(\mathbf{Q}\). Since \(Q_{ij} \geq 0\), \(\mathrm{BQP} \leq ||\mathbf{Q}||_1\) for all \(\mathbf{x}\). Therefore, \(q(\mathbf{x}) \leq 0\) for all \(\mathbf{x}\). Since \(||\mathbf{Q}||_1\) is a constant for a given matrix, under the same set of constraints,

\[ \operatorname*{argmax}_{\mathbf{x}} \; \mathrm{BQP} \equiv \operatorname*{argmax}_{\mathbf{x}} \; q(\mathbf{x}). \qquad \square \]
A few new terms are defined before the derivation of the bound. Let \(\mathbf{x}^*\) be the solution of BQP and let \(\mathbf{x}\) be the solution of LP2. Since \(||\mathbf{Q}||_1 = \sum_{ij} Q_{ij}\), \(||\mathbf{Q}||_1\) can be expanded in terms of any binary vector \(\mathbf{x}\). Specifically, \(||\mathbf{Q}||_1\) is defined in terms of \(\mathbf{x}\),

Definition A.0.1.
\[ ||\mathbf{Q}||_1 = Q_0 + Q_1 + Q_2, \quad \text{where} \tag{A.1} \]
\[ Q_0 \leftarrow \sum_{i,j \,|\, x_i + x_j = 0} Q_{ij} \tag{A.2} \]
\[ Q_1 \leftarrow \sum_{i,j \,|\, x_i + x_j = 1} Q_{ij} \tag{A.3} \]
\[ Q_2 \leftarrow \sum_{i,j \,|\, x_i + x_j = 2} Q_{ij} \equiv \mathbf{x}^\top \mathbf{Q} \mathbf{x} \tag{A.4} \]

Lemma A.0.1. \(||\mathbf{Q}||_1 - \mathbf{x}^\top \mathbf{Q} \mathbf{x} \geq Q_1\)

Proof. From (A.1),
\[ ||\mathbf{Q}||_1 = Q_0 + Q_1 + Q_2 \geq Q_1 + Q_2, \]
and therefore, using (A.4),
\[ ||\mathbf{Q}||_1 - \mathbf{x}^\top \mathbf{Q} \mathbf{x} \geq Q_1. \qquad \square \]
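The decomposition in Definition A.0.1 and the inequality of Lemma A.0.1 can be checked numerically. The sketch below (NumPy; the random symmetric nonnegative Q and binary x are arbitrary illustrative choices, not from the dissertation) splits ||Q||_1 by the value of x_i + x_j:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
Q = rng.random((n, n))
Q = (Q + Q.T) / 2            # symmetric, nonnegative entries
x = rng.integers(0, 2, n)    # an arbitrary binary vector

# Split ||Q||_1 by the value of x_i + x_j, as in Definition A.0.1.
S = x[:, None] + x[None, :]
Q0 = Q[S == 0].sum()
Q1 = Q[S == 1].sum()
Q2 = Q[S == 2].sum()

assert np.isclose(Q0 + Q1 + Q2, Q.sum())          # (A.1)
assert np.isclose(Q2, x @ Q @ x)                  # (A.4)
assert Q.sum() - x @ Q @ x >= Q1 - 1e-12          # Lemma A.0.1
```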
Let \(Q^*\) denote the maximum value of BQP and let \(Q^*_{LP2}\) denote the maximum value of LP2. With \(\mathbf{x}^*\) the solution of BQP and \(\mathbf{x}\) the solution of LP2, the following result is obtained:

Lemma A.0.2. \(Q^*_{LP2} \geq Q^*\)

Proof.
\[ Q^*_{LP2} = \max ||\mathbf{Q}\mathbf{x}||_1 = ||\mathbf{Q}\mathbf{x}||_1 \geq ||\mathbf{Q}\mathbf{x}^*||_1 \geq \mathbf{x}^{*\top} \mathbf{Q} \mathbf{x}^* = Q^*. \qquad \square \]
The bound for LP2 can now be stated.

Theorem A.0.1.
\[ 2q(\mathbf{x}^*) \leq q(\mathbf{x}) \tag{A.5} \]

Proof. From Lemma (A.0.2):
\[ \mathbf{x}^{*\top}\mathbf{Q}\mathbf{x}^* \leq \frac{1}{2}\sum_{ij} Q_{ij}(x_i + x_j) \tag{A.6} \]
\[ \mathbf{x}^{*\top}\mathbf{Q}\mathbf{x}^* \leq \frac{1}{2}Q_1 + \mathbf{x}^\top\mathbf{Q}\mathbf{x} \quad \text{(A.3, A.4)} \tag{A.7} \]
\[ 2\mathbf{x}^{*\top}\mathbf{Q}\mathbf{x}^* \leq Q_1 + 2\mathbf{x}^\top\mathbf{Q}\mathbf{x} \tag{A.8} \]
\[ 2\mathbf{x}^{*\top}\mathbf{Q}\mathbf{x}^* \leq ||\mathbf{Q}||_1 + \mathbf{x}^\top\mathbf{Q}\mathbf{x} \quad \text{Lemma (A.0.1)} \tag{A.9} \]
\[ 2q(\mathbf{x}^*) \leq q(\mathbf{x}) \tag{A.10} \]
\(\square\)

The last statement (A.10) is arrived at by adding \(-2||\mathbf{Q}||_1\) to both sides of (A.9). Since \(q(\mathbf{x}) \leq 0\) and \(\mathbf{x}^*\) maximizes \(q\), the inequality \(2q(\mathbf{x}^*) \leq q(\mathbf{x})\) means that \(q(\mathbf{x})\) lies between \(2q(\mathbf{x}^*)\) and \(q(\mathbf{x}^*)\). This guarantees a lower bound for BQP by solving LP2, and Equation (A.10) provides the tightness of the bound.
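Theorem A.0.1 can be sanity-checked by brute force on a small instance. The sketch below assumes that BQP and LP2 are maximized over binary vectors with a fixed cardinality h (the exact feasible set is defined where LP2 is introduced in the dissertation; the value of h here is a hypothetical choice for illustration):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, h = 8, 3                 # problem size and (assumed) cardinality constraint
Q = rng.random((n, n))
Q = (Q + Q.T) / 2           # symmetric, Q_ij >= 0

# Enumerate all binary x with sum(x) == h.
feasible = [np.array(x) for x in itertools.product([0, 1], repeat=n)
            if sum(x) == h]

x_star = max(feasible, key=lambda x: x @ Q @ x)            # BQP optimum
x_hat = max(feasible, key=lambda x: np.abs(Q @ x).sum())   # LP2 optimum

q = lambda x: x @ Q @ x - Q.sum()        # q(x) = BQP objective - ||Q||_1
assert 2 * q(x_star) <= q(x_hat) + 1e-9  # Theorem A.0.1
```

Since q is nonpositive, the assertion says the LP2 solution is within a factor of two of the BQP optimum in the q objective.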
APPENDIX B
DERIVATIVES FOR THE DAH LOSS FUNCTION
In this section, the partial derivatives of Equation (6.8) required for the backpropagation algorithm are outlined;

\[ \min_{\mathcal{U}} \mathcal{J} = \mathcal{L}(\mathcal{U}_s) + \gamma \mathcal{M}(\mathcal{U}_s, \mathcal{U}_t) + \eta \mathcal{H}(\mathcal{U}_s, \mathcal{U}_t), \tag{6.8} \]

where \(\mathcal{U} := \{\mathcal{U}_s \cup \mathcal{U}_t\}\) and \((\gamma, \eta)\) control the importance of the domain adaptation loss (6.1) and the target entropy loss (6.7), respectively. In the following subsections, the partial derivative of each individual term w.r.t. the input \(\mathcal{U}\) is outlined.
B.1 Derivative for MK-MMD

\[ \mathcal{M}(\mathcal{U}_s, \mathcal{U}_t) = \sum_{l \in \mathcal{F}} d_k^2(\mathcal{U}_s^l, \mathcal{U}_t^l), \tag{6.1} \]
\[ d_k^2(\mathcal{U}_s^l, \mathcal{U}_t^l) = \Big|\Big| \mathbb{E}\big[\phi(\mathbf{u}^{s,l})\big] - \mathbb{E}\big[\phi(\mathbf{u}^{t,l})\big] \Big|\Big|_{\mathcal{H}_k}^2. \tag{6.2} \]

The linear MK-MMD loss is implemented according to Gretton et al. (2012). For this derivation, the loss at just one layer is considered; the derivative of the MK-MMD loss at every other layer can be derived in a similar manner. The output of the \(i\)th source data point at layer \(l\) is represented as \(\mathbf{u}_i\) and the output of the \(i\)th target data point as \(\mathbf{v}_i\). For ease of representation, the superscripts for the source (\(s\)), the target (\(t\)) and the layer (\(l\)) are dropped. Unlike the conventional MMD loss, which is \(O(n^2)\), the MK-MMD loss outlined in Gretton et al. (2012) is \(O(n)\) and can be estimated online (it does not require all the data). The loss is calculated over every batch of data points during back-propagation. Let \(n\) be the number of source data points \(\mathcal{U} := \{\mathbf{u}_i\}_{i=1}^n\) and the number of target data points \(\mathcal{V} := \{\mathbf{v}_i\}_{i=1}^n\) in the batch. It is assumed that there are equal numbers of source and target data points in a batch and that \(n\) is even. The MK-MMD is defined over sets of 4 data points \(\mathbf{w}_i = [\mathbf{u}_{2i-1}, \mathbf{u}_{2i}, \mathbf{v}_{2i-1}, \mathbf{v}_{2i}]\), \(\forall i \in \{1, 2, \ldots, n/2\}\), and is given by,

\[ \mathcal{M}(\mathcal{U}, \mathcal{V}) = \sum_{m=1}^{\kappa} \beta_m \frac{1}{n/2} \sum_{i=1}^{n/2} h_m(\mathbf{w}_i), \tag{B.1} \]

where \(\kappa\) is the number of kernels and \(\beta_m = 1/\kappa\) is the weight for each kernel. Following the linear-time estimator of Gretton et al. (2012), \(h_m(\mathbf{w}_i) = k_m(\mathbf{u}_{2i-1}, \mathbf{u}_{2i}) + k_m(\mathbf{v}_{2i-1}, \mathbf{v}_{2i}) - k_m(\mathbf{u}_{2i-1}, \mathbf{v}_{2i}) - k_m(\mathbf{u}_{2i}, \mathbf{v}_{2i-1})\). The partial derivatives of (B.1) w.r.t. the source outputs \(\mathbf{u}_p\) and the target outputs \(\mathbf{v}_q\) follow by differentiating the kernel terms in \(h_m(\cdot)\).
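As a rough illustration (not the dissertation's implementation), the linear-time batch estimator (B.1) might be sketched as follows. Gaussian kernels and the bandwidth list `gammas` are assumptions made here for concreteness:

```python
import numpy as np

def linear_mk_mmd(U, V, gammas):
    """Linear-time MK-MMD estimate (B.1) over a batch.

    U, V : (n, d) source/target batch outputs, n even.
    gammas : bandwidths of the assumed Gaussian kernels k_m; beta_m = 1/kappa.
    """
    n = U.shape[0]
    assert n % 2 == 0 and V.shape[0] == n
    # Quadruples w_i = [u_{2i-1}, u_{2i}, v_{2i-1}, v_{2i}]
    u1, u2 = U[0::2], U[1::2]
    v1, v2 = V[0::2], V[1::2]

    def k(a, b, g):  # Gaussian kernel, evaluated row-wise
        return np.exp(-g * ((a - b) ** 2).sum(axis=1))

    total = 0.0
    for g in gammas:
        h = k(u1, u2, g) + k(v1, v2, g) - k(u1, v2, g) - k(u2, v1, g)
        total += h.mean() / len(gammas)   # beta_m = 1/kappa
    return total

rng = np.random.default_rng(0)
U = rng.normal(size=(8, 4))
print(linear_mk_mmd(U, U.copy(), gammas=[0.5, 1.0, 2.0]))  # identical batches -> 0.0
```

Being an average over a batch, the estimate can be accumulated online, which is what makes it usable inside back-propagation.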
B.2 Derivative for the Supervised Hash Loss

The partial derivative of Equation (6.5) w.r.t. the source data output \(\mathbf{u}_q\) is given by,

\[ \frac{\partial \mathcal{L}}{\partial \mathbf{u}_q} = \sum_{s_{ij} \in \mathcal{S}} \Big[ \mathbb{I}\{i=q\} \big( \sigma(\mathbf{u}_i^\top \mathbf{u}_j) - s_{ij} \big) \mathbf{u}_j + \mathbb{I}\{j=q\} \big( \sigma(\mathbf{u}_i^\top \mathbf{u}_j) - s_{ij} \big) \mathbf{u}_i \Big] + 2\big( \mathbf{u}_q - \mathrm{sgn}(\mathbf{u}_q) \big), \tag{B.6} \]

where \(\sigma(x) = \frac{1}{1 + \exp(-x)}\) and \(\mathbb{I}\{\cdot\}\) is the indicator function, which is 1 if the condition within is true and 0 otherwise. It is assumed that \(\mathrm{sgn}(\cdot)\) is a constant, in order to avoid the differentiability issue of \(\mathrm{sgn}(\cdot)\) at 0. Since \(\mathcal{S}\) is symmetric, the partial derivative reduces to,

\[ \frac{\partial \mathcal{L}}{\partial \mathbf{u}_q} = \sum_{j=1}^{n_s} \Big[ 2 \big( \sigma(\mathbf{u}_q^\top \mathbf{u}_j) - s_{qj} \big) \mathbf{u}_j \Big] + 2 \big( \mathbf{u}_q - \mathrm{sgn}(\mathbf{u}_q) \big). \tag{B.7} \]
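Equation (6.5) itself is not reproduced in this appendix. Assuming the standard pairwise logistic hash loss \(\mathcal{L} = \sum_{ij} \big( \log(1 + \exp(\mathbf{u}_i^\top \mathbf{u}_j)) - s_{ij}\, \mathbf{u}_i^\top \mathbf{u}_j \big) + \sum_i ||\mathbf{u}_i - \mathrm{sgn}(\mathbf{u}_i)||^2\), whose gradient has exactly the form of (B.6), Equation (B.7) can be validated against finite differences (all names below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4
U = rng.normal(size=(n, d))
S = rng.integers(0, 2, (n, n))
S = np.triu(S) + np.triu(S, 1).T    # symmetric similarity matrix in {0, 1}
B = np.sign(U)                      # sgn(U), held constant during differentiation

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss(U):
    # Assumed form of (6.5): pairwise logistic loss + quantization penalty.
    G = U @ U.T                     # pairwise inner products u_i^T u_j
    return (np.log1p(np.exp(G)) - S * G).sum() + ((U - B) ** 2).sum()

def grad_bq(U, q):                  # Equation (B.7)
    G = U[q] @ U.T
    return 2 * ((sigmoid(G) - S[q])[:, None] * U).sum(axis=0) + 2 * (U[q] - B[q])

# Central finite-difference check of (B.7) for each u_q.
eps = 1e-6
for q in range(n):
    num = np.zeros(d)
    for r in range(d):
        Up, Um = U.copy(), U.copy()
        Up[q, r] += eps
        Um[q, r] -= eps
        num[r] = (loss(Up) - loss(Um)) / (2 * eps)
    assert np.allclose(num, grad_bq(U, q), atol=1e-4)
```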
B.3 Derivative for Unsupervised Entropy Loss

The partial derivative \(\frac{\partial \mathcal{H}}{\partial \mathcal{U}}\) is outlined in this section, where \(\mathcal{H}\) is defined as,

\[ \mathcal{H}(\mathcal{U}_s, \mathcal{U}_t) = -\frac{1}{n_t} \sum_{i=1}^{n_t} \sum_{j=1}^{C} p_{ij} \log(p_{ij}) \tag{6.7} \]

and \(p_{ij}\) is the probability of target data output \(\mathbf{u}_i^t\) belonging to category \(j\), given by,

\[ p_{ij} = \frac{\sum_{k=1}^{K} \exp(\mathbf{u}_i^{t\top} \mathbf{u}_{j_k}^{s})}{\sum_{l=1}^{C} \sum_{k'=1}^{K} \exp(\mathbf{u}_i^{t\top} \mathbf{u}_{l_{k'}}^{s})}. \tag{6.6} \]

For ease of representation, the target output \(\mathbf{u}_i^t\) is denoted as \(\mathbf{v}_i\), dropping the superscript \(t\). Similarly, the \(k\)th source data point in the \(j\)th category, \(\mathbf{u}_{j_k}^s\), is denoted as \(\mathbf{u}_{j_k}\) after dropping the domain superscript. The probability \(p_{ij}\) with the new terms is now,

\[ p_{ij} = \frac{\sum_{k=1}^{K} \exp(\mathbf{v}_i^\top \mathbf{u}_{j_k})}{\sum_{l=1}^{C} \sum_{k'=1}^{K} \exp(\mathbf{v}_i^\top \mathbf{u}_{l_{k'}})}. \tag{B.8} \]

Further simplification is achieved by replacing \(\exp(\mathbf{v}_i^\top \mathbf{u}_{j_k})\) with \(\exp(i, j_k)\). Equation (B.8) can now be represented as,

\[ p_{ij} = \frac{\sum_{k=1}^{K} \exp(i, j_k)}{\sum_{l=1}^{C} \sum_{k'=1}^{K} \exp(i, l_{k'})}. \tag{B.9} \]

The outer summations are dropped (along with the negative sign) and will be reintroduced later. The entropy loss can be re-phrased using \(\log(\frac{a}{b}) = \log(a) - \log(b)\) as,

\[ \mathcal{H}_{ij} = \frac{\sum_{k} \exp(i, j_k)}{\sum_{l,k'} \exp(i, l_{k'})} \log\Big( \sum_{k} \exp(i, j_k) \Big) \tag{B.10} \]
\[ \phantom{\mathcal{H}_{ij} =} - \frac{\sum_{k} \exp(i, j_k)}{\sum_{l,k'} \exp(i, l_{k'})} \log\Big( \sum_{l,k'} \exp(i, l_{k'}) \Big). \tag{B.11} \]

Both \(\frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{v}_i}\) for the target and \(\frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{u}_{p_q}}\) for the source need to be estimated, where \(\mathbf{u}_{p_q}\) refers to source data. The partial derivative \(\frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{u}_{p_q}}\) for Equation (B.10) is,

\[ \Big[ \frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{u}_{p_q}} \Big]_{B.10} = \frac{\mathbf{v}_i}{\sum_{l,k'} \exp(i, l_{k'})} \Big[ \sum_{k} \mathbb{I}\{j=p, k=q\} \exp(i, j_k) \log\big( \textstyle\sum_{k} \exp(i, j_k) \big) + \sum_{k} \mathbb{I}\{j=p, k=q\} \exp(i, j_k) - p_{ij} \exp(i, p_q) \log\big( \textstyle\sum_{k} \exp(i, j_k) \big) \Big], \tag{B.12} \]
where \(\mathbb{I}\{\cdot\}\) is an indicator function, which is 1 only when both the conditions within are true and 0 otherwise. The partial derivative \(\frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{u}_{p_q}}\) for Equation (B.11) is,

\[ \Big[ \frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{u}_{p_q}} \Big]_{B.11} = -\frac{\mathbf{v}_i}{\sum_{l,k'} \exp(i, l_{k'})} \Big[ \sum_{k} \mathbb{I}\{j=p, k=q\} \exp(i, j_k) \log\big( \textstyle\sum_{l,k'} \exp(i, l_{k'}) \big) + p_{ij} \exp(i, p_q) - p_{ij} \exp(i, p_q) \log\big( \textstyle\sum_{l,k'} \exp(i, l_{k'}) \big) \Big]. \tag{B.13} \]

Expressing \(\frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{u}_{p_q}} = \big[ \frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{u}_{p_q}} \big]_{B.10} + \big[ \frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{u}_{p_q}} \big]_{B.11}\), and defining \(p_{ij_k} = \frac{\exp(i, j_k)}{\sum_{l,k'} \exp(i, l_{k'})}\), the partial derivative w.r.t. the source is,

\[ \frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{u}_{p_q}} = \mathbf{v}_i \Big[ \sum_{k} \mathbb{I}\{j=p, k=q\} p_{ij_k} \log\big( \textstyle\sum_{k} \exp(i, j_k) \big) + \sum_{k} \mathbb{I}\{j=p, k=q\} p_{ij_k} - p_{ij}\, p_{ip_q} \log\big( \textstyle\sum_{k} \exp(i, j_k) \big) - \sum_{k} \mathbb{I}\{j=p, k=q\} p_{ij_k} \log\big( \textstyle\sum_{l,k'} \exp(i, l_{k'}) \big) - p_{ij}\, p_{ip_q} + p_{ij}\, p_{ip_q} \log\big( \textstyle\sum_{l,k'} \exp(i, l_{k'}) \big) \Big] \tag{B.14} \]
\[ = \mathbf{v}_i \Big[ \sum_{k} \mathbb{I}\{j=p, k=q\} p_{ij_k} \log(p_{ij}) - p_{ij}\, p_{ip_q} \log(p_{ij}) + \sum_{k} \mathbb{I}\{j=p, k=q\} p_{ij_k} - p_{ij}\, p_{ip_q} \Big] \]
\[ = \mathbf{v}_i \big( \log(p_{ij}) + 1 \big) \Big[ \sum_{k} \mathbb{I}\{j=p, k=q\} p_{ij_k} - p_{ij}\, p_{ip_q} \Big]. \tag{B.15} \]

The partial derivative of \(\mathcal{H}\) w.r.t. the source output \(\mathbf{u}_{p_q}\) is given by,

\[ \frac{\partial \mathcal{H}}{\partial \mathbf{u}_{p_q}} = -\frac{1}{n_t} \sum_{i=1}^{n_t} \sum_{j=1}^{C} \mathbf{v}_i \big( \log(p_{ij}) + 1 \big) \Big[ \sum_{k} \mathbb{I}\{j=p, k=q\} p_{ij_k} - p_{ij}\, p_{ip_q} \Big]. \tag{B.16} \]

The partial derivative \(\frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{v}_i}\) for Equation (B.10) is outlined as,

\[ \Big[ \frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{v}_i} \Big]_{B.10} = \frac{1}{\sum_{l,k'} \exp(i, l_{k'})} \Big[ \log\big( \textstyle\sum_{k} \exp(i, j_k) \big) \sum_{k} \exp(i, j_k) \mathbf{u}_{j_k} + \sum_{k} \exp(i, j_k) \mathbf{u}_{j_k} - \frac{1}{\sum_{l,k'} \exp(i, l_{k'})} \sum_{k} \exp(i, j_k) \log\big( \textstyle\sum_{k} \exp(i, j_k) \big) \sum_{l,k'} \exp(i, l_{k'}) \mathbf{u}_{l_{k'}} \Big], \tag{B.17} \]

and the partial derivative \(\frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{v}_i}\) for Equation (B.11) as,

\[ \Big[ \frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{v}_i} \Big]_{B.11} = -\frac{1}{\sum_{l,k'} \exp(i, l_{k'})} \Big[ \log\big( \textstyle\sum_{l,k'} \exp(i, l_{k'}) \big) \sum_{k} \exp(i, j_k) \mathbf{u}_{j_k} + \frac{\sum_{k} \exp(i, j_k)}{\sum_{l,k'} \exp(i, l_{k'})} \sum_{l,k'} \exp(i, l_{k'}) \mathbf{u}_{l_{k'}} - \frac{1}{\sum_{l,k'} \exp(i, l_{k'})} \sum_{k} \exp(i, j_k) \log\big( \textstyle\sum_{l,k'} \exp(i, l_{k'}) \big) \sum_{l,k'} \exp(i, l_{k'}) \mathbf{u}_{l_{k'}} \Big]. \tag{B.18} \]
Expressing \(\frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{v}_i} = \big[ \frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{v}_i} \big]_{B.10} + \big[ \frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{v}_i} \big]_{B.11}\),

\[ \frac{\partial \mathcal{H}_{ij}}{\partial \mathbf{v}_i} = \frac{1}{\sum_{l,k'} \exp(i, l_{k'})} \Big[ \log\big( \textstyle\sum_{k} \exp(i, j_k) \big) \sum_{k} \exp(i, j_k) \mathbf{u}_{j_k} - \log\big( \textstyle\sum_{l,k'} \exp(i, l_{k'}) \big) \sum_{k} \exp(i, j_k) \mathbf{u}_{j_k} + \sum_{k} \exp(i, j_k) \mathbf{u}_{j_k} - p_{ij} \sum_{l,k'} \exp(i, l_{k'}) \mathbf{u}_{l_{k'}} - p_{ij} \log\big( \textstyle\sum_{k} \exp(i, j_k) \big) \sum_{l,k'} \exp(i, l_{k'}) \mathbf{u}_{l_{k'}} + p_{ij} \log\big( \textstyle\sum_{l,k'} \exp(i, l_{k'}) \big) \sum_{l,k'} \exp(i, l_{k'}) \mathbf{u}_{l_{k'}} \Big] \tag{B.19} \]
\[ = \log\big( \textstyle\sum_{k} \exp(i, j_k) \big) \sum_{k} p_{ij_k} \mathbf{u}_{j_k} - \log\big( \textstyle\sum_{l,k'} \exp(i, l_{k'}) \big) \sum_{k} p_{ij_k} \mathbf{u}_{j_k} + \sum_{k} p_{ij_k} \mathbf{u}_{j_k} - p_{ij} \sum_{l,k'} p_{il_{k'}} \mathbf{u}_{l_{k'}} - p_{ij} \log\big( \textstyle\sum_{k} \exp(i, j_k) \big) \sum_{l,k'} p_{il_{k'}} \mathbf{u}_{l_{k'}} + p_{ij} \log\big( \textstyle\sum_{l,k'} \exp(i, l_{k'}) \big) \sum_{l,k'} p_{il_{k'}} \mathbf{u}_{l_{k'}} \]
\[ = \big( \log(p_{ij}) + 1 \big) \sum_{k} p_{ij_k} \mathbf{u}_{j_k} - \big( \log(p_{ij}) + 1 \big) p_{ij} \sum_{l,k'} p_{il_{k'}} \mathbf{u}_{l_{k'}} \tag{B.20} \]
\[ = \big( \log(p_{ij}) + 1 \big) \Big( \sum_{k} p_{ij_k} \mathbf{u}_{j_k} - p_{ij} \sum_{l,k'} p_{il_{k'}} \mathbf{u}_{l_{k'}} \Big). \tag{B.21} \]

The partial derivative of \(\mathcal{H}\) w.r.t. the target output \(\mathbf{v}_q\) is given by,

\[ \frac{\partial \mathcal{H}}{\partial \mathbf{v}_q} = -\frac{1}{n_t} \sum_{j=1}^{C} \big( \log(p_{qj}) + 1 \big) \Big( \sum_{k} p_{qj_k} \mathbf{u}_{j_k} - p_{qj} \sum_{l,k'} p_{ql_{k'}} \mathbf{u}_{l_{k'}} \Big). \tag{B.22} \]

The partial derivative of \(\mathcal{H}\) w.r.t. the source outputs is given by Equation (B.16) and w.r.t. the target outputs by Equation (B.22).
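Since the derivation of (B.22) is long, a finite-difference check is a useful safeguard. The sketch below (all names and sizes hypothetical) implements (6.6), (6.7) and (B.22) directly and compares the analytic gradient with a numerical one:

```python
import numpy as np

rng = np.random.default_rng(0)
C, K, d, nt = 3, 2, 4, 5                 # categories, samples/category, dim, targets
Us = rng.normal(size=(C, K, d))          # source outputs u_{j_k}
V = rng.normal(size=(nt, d))             # target outputs v_i

def probs(V):
    # Equation (6.6): p_ij = sum_k exp(v_i^T u_{j_k}) / sum_{l,k'} exp(v_i^T u_{l_k'})
    E = np.exp(np.einsum('id,jkd->ijk', V, Us))
    return E.sum(axis=2) / E.sum(axis=(1, 2))[:, None]

def entropy(V):
    # Equation (6.7)
    P = probs(V)
    return -(P * np.log(P)).sum() / nt

def grad_vq(V, q):
    # Equation (B.22)
    E = np.exp(np.einsum('d,jkd->jk', V[q], Us))
    p_jk = E / E.sum()                            # p_{qj_k}
    p_j = p_jk.sum(axis=1)                        # p_{qj}
    mean_u = np.einsum('jk,jkd->d', p_jk, Us)     # sum_{l,k'} p_{ql_k'} u_{l_k'}
    terms = np.einsum('jk,jkd->jd', p_jk, Us) - p_j[:, None] * mean_u
    return -((np.log(p_j) + 1)[:, None] * terms).sum(axis=0) / nt

# Central finite-difference check of (B.22) for each v_q.
eps = 1e-6
for q in range(nt):
    num = np.zeros(d)
    for r in range(d):
        Vp, Vm = V.copy(), V.copy()
        Vp[q, r] += eps
        Vm[q, r] -= eps
        num[r] = (entropy(Vp) - entropy(Vm)) / (2 * eps)
    assert np.allclose(num, grad_vq(V, q), atol=1e-4)
```

Equation (B.16) can be checked the same way by perturbing the entries of `Us` instead of `V`.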
APPENDIX C
PERMISSION STATEMENTS FROM CO-AUTHORS
Permission for including co-authored material in this dissertation was obtained from the co-authors: Prof. Sethuraman Panchanathan, Prof. Jieping Ye, Dr. Vineeth Balasubramanian, Dr. Troy McDaniel, Dr. Shayok Chakraborty, Dr. Binbin Lin, Dr. Prasanth Lade and Jose Eusebio.