Ganin et al, JMLR 2016, [link]
tags: domain adaptation - representation learning - adversarial - icml - jmlr - 2016
Several theoretical studies of the domain adaptation problem have proposed upper bounds on the target-domain risk that involve the source-domain risk and a notion of distance between the source and target distributions, $\mathcal{D}_S$ and $\mathcal{D}_T$. Here, the authors specifically consider the work of Ben-David et al. First, they define the $\mathcal{H}$-divergence:
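In the notation assumed in these notes (hypotheses $\eta \in \mathcal{H}$, source and target input distributions $\mathcal{D}_S$ and $\mathcal{D}_T$), the definition is usually stated as below. The symbol names are an assumption and may differ from the paper's.

```latex
d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)
  = 2 \sup_{\eta \in \mathcal{H}}
    \left| \Pr_{x \sim \mathcal{D}_S}\big[\eta(x) = 1\big]
         - \Pr_{x \sim \mathcal{D}_T}\big[\eta(x) = 1\big] \right|
  \tag{1}
```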
where $\mathcal{H}$ is a space of (here, binary) hypothesis functions. In the case where $\mathcal{H}$ is a symmetric hypothesis class (i.e., $\eta \in \mathcal{H}$ implies $1 - \eta \in \mathcal{H}$), one can reduce (1) to the empirical form:
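A sketch of the empirical form, assuming a source sample $S = \{x_i\}_{i=1}^{n}$, a target sample $T = \{x_i\}_{i=n+1}^{N}$ with $n' = N - n$, and the indicator function $I[\cdot]$ (all names are assumptions of these notes). Because $\mathcal{H}$ is symmetric, swapping the roles of the outputs 0 and 1 leaves the value unchanged.

```latex
\hat{d}_{\mathcal{H}}(S, T)
  = 2 \left( 1 - \min_{\eta \in \mathcal{H}}
      \left[ \frac{1}{n} \sum_{i=1}^{n} I\big[\eta(x_i) = 0\big]
           + \frac{1}{n'} \sum_{i=n+1}^{N} I\big[\eta(x_i) = 1\big] \right] \right)
  \tag{2}
```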
It is difficult to estimate the minimum over the hypothesis class $\mathcal{H}$. Instead, Ben-David et al. propose to approximate Equation (2) by training a classifier $\eta$ on source samples with label 0 and target samples with label 1, and replacing the minimum term by the empirical risk of $\eta$.
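The approximation can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name and the list-based inputs are hypothetical, and the domain classifier is assumed to be already trained, with its predictions (0 = source, 1 = target) given on held-out samples.

```python
def empirical_divergence_proxy(source_preds, target_preds):
    """Plug one trained domain classifier into the empirical divergence.

    source_preds / target_preds: predicted domain labels (0 = source,
    1 = target) for held-out source and target samples.
    """
    # Fraction of source samples wrongly assigned to the target domain.
    source_error = sum(1 for p in source_preds if p == 1) / len(source_preds)
    # Fraction of target samples wrongly assigned to the source domain.
    target_error = sum(1 for p in target_preds if p == 0) / len(target_preds)
    # The minimum over the hypothesis class is replaced by this classifier's risk.
    return 2.0 * (1.0 - (source_error + target_error))


# A classifier that separates the domains perfectly gives the maximal value,
# while one that guesses at chance level gives zero.
perfect = empirical_divergence_proxy([0, 0, 0, 0], [1, 1, 1, 1])
chance = empirical_divergence_proxy([0, 1], [0, 1])
```

A value near 2 means the two domains are easy to tell apart in the given representation; a value near 0 means the classifier cannot distinguish them.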
Given this definition of the $\mathcal{H}$-divergence, Ben-David et al. further derive an upper bound on the target risk, which in particular involves a trade-off between the empirical risk on the source domain, $R_S(\eta)$, and the divergence between the source and target distributions, $\hat{d}_{\mathcal{H}}(S, T)$.
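Schematically, and with constants omitted, a bound of this type holds with probability at least $1 - \delta$ for every $\eta \in \mathcal{H}$. The symbols $\delta$ (confidence level) and $\beta$ (the combined source and target risk of the best single hypothesis in $\mathcal{H}$) are names assumed here.

```latex
R_{\mathcal{D}_T}(\eta)
  \;\le\; R_S(\eta)
  + \hat{d}_{\mathcal{H}}(S, T)
  + \mathcal{O}\!\left( \sqrt{ \frac{1}{n} \left( d \log \frac{n}{d} + \log \frac{1}{\delta} \right) } \right)
  + \beta
```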
where $d$ designates the Vapnik–Chervonenkis dimension of $\mathcal{H}$ and $n$ the number of samples.
The rest of the paper directly stems from this intuition: in order to minimize the target risk, the proposed Domain Adversarial Neural Network (DANN) aims to build an “internal representation that contains no discriminative information about the origin of the input (source or target), while preserving a low risk on the source (labeled) examples”.
The goal of the model is to learn a classifier $\eta$, which can be decomposed as $\eta = G_y \circ G_f$, where $G_f$ is a feature extractor (parameters $\theta_f$) and $G_y$ a small classifier on top (parameters $\theta_y$) that outputs the target label. This architecture is trained with a standard classification objective $\mathcal{L}_y$ to minimize:
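Written out under the notation assumed in these notes ($n$ labeled source samples $(x_i, y_i)$, label loss $\mathcal{L}_y$ such as cross-entropy), the objective would read:

```latex
\min_{\theta_f, \theta_y} \;
  \frac{1}{n} \sum_{i=1}^{n}
  \mathcal{L}_y\Big( G_y\big( G_f(x_i; \theta_f); \theta_y \big),\, y_i \Big)
```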
DANN introduces a domain prediction branch $G_d$ (parameters $\theta_d$), which is another classifier on top of the feature representation and whose goal is to approximate the domain discrepancy as in (2). With $\mathcal{L}_d$ denoting the domain classification loss, this leads to the following training objective to maximize:
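One way to write this objective, assuming domain label 0 for the $n$ source samples and 1 for the $n'$ target samples, is as the negated domain classification loss, maximized over the domain classifier parameters:

```latex
\max_{\theta_d} \;
  \left[
    - \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_d\Big( G_d\big( G_f(x_i; \theta_f); \theta_d \big),\, 0 \Big)
    - \frac{1}{n'} \sum_{i=n+1}^{N} \mathcal{L}_d\Big( G_d\big( G_f(x_i; \theta_f); \theta_d \big),\, 1 \Big)
  \right]
```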
The final objective can thus be written as:
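A sketch of the combined objective, where $\lambda$ is an assumed name for the weight that trades off the two terms and $\mathcal{L}_y^i$, $\mathcal{L}_d^i$ denote the label and domain losses on sample $i$. The solution is a saddle point: the feature extractor and label predictor minimize $E$, while the domain classifier maximizes it.

```latex
E(\theta_f, \theta_y, \theta_d)
  = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_y^i(\theta_f, \theta_y)
  - \lambda \left(
      \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_d^i(\theta_f, \theta_d)
    + \frac{1}{n'} \sum_{i=n+1}^{N} \mathcal{L}_d^i(\theta_f, \theta_d)
    \right)

(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d),
\qquad
\hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d)
```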
Applying standard gradient descent, the DANN objective leads to the following gradient update rules:
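With an assumed learning rate $\mu$ and trade-off weight $\lambda$, stochastic updates on a sample $i$ would take the following form. Note the minus sign in front of the domain term in the first line, which is what distinguishes these rules from plain gradient descent on a sum of losses.

```latex
\theta_f \leftarrow \theta_f - \mu \left(
    \frac{\partial \mathcal{L}_y^i}{\partial \theta_f}
  - \lambda \frac{\partial \mathcal{L}_d^i}{\partial \theta_f} \right)

\theta_y \leftarrow \theta_y - \mu \, \frac{\partial \mathcal{L}_y^i}{\partial \theta_y}

\theta_d \leftarrow \theta_d - \mu \lambda \, \frac{\partial \mathcal{L}_d^i}{\partial \theta_d}
```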
In the case of neural networks, the gradients of the loss with respect to the parameters are obtained with the backpropagation algorithm. These update rules are very similar to the standard backpropagation scheme, except for the opposite sign in the derivative of the domain loss $\mathcal{L}_d$ with respect to the domain classifier parameters $\theta_d$ and the feature extractor parameters $\theta_f$. The authors introduce the gradient reversal layer (GRL) to evaluate both gradients in one standard backpropagation step.
The idea is that the output of the feature extractor $G_f$ is propagated normally to the domain classifier $G_d$; during backpropagation, however, its gradient is multiplied by a negative constant:
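As a pseudo-function $R_\lambda$ (an assumed name, with $I$ the identity matrix and $\lambda > 0$ the reversal strength), the layer can be described by its forward and backward behaviour:

```latex
R_\lambda(x) = x,
\qquad
\frac{\mathrm{d} R_\lambda}{\mathrm{d} x} = -\lambda I
```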
In other words, the gradients of the domain loss $\mathcal{L}_d$ with respect to the activations are computed normally (minimization), but they are then propagated with a minus sign into the feature extraction part of the network (maximization). Augmented with the gradient reversal layer, the final model is trained by minimizing the sum of losses $\mathcal{L}_y + \mathcal{L}_d$, which corresponds to the optimization problem in (1-3).
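The behaviour of the layer can be illustrated without any deep learning framework. The class below is a hypothetical, framework-free sketch (names such as `GradientReversal` and `lam` are made up for illustration). In an autodiff library the same idea is typically expressed as a custom operation whose backward pass is overridden.

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength (hypothetical parameter name)

    def forward(self, features):
        # Features reach the domain classifier unchanged.
        return list(features)

    def backward(self, grad_output):
        # The gradient sent back to the feature extractor has its sign flipped,
        # so a descent step on the total loss ascends the domain loss there.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=0.5)
features = [1.0, -2.0, 3.0]
forwarded = grl.forward(features)
reversed_grad = grl.backward([0.2, -0.4, 1.0])
```

Because the domain classifier sits after the layer, its own parameters still receive the unmodified gradient; only the path into the feature extractor is reversed.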
Figure: The proposed architecture includes a deep feature extractor and a deep label predictor. Unsupervised domain adaptation is achieved by adding a domain classifier connected to the feature extractor via a gradient reversal layer that multiplies the gradient by a certain negative constant during backpropagation.
The paper presents extensive results on the following settings:
The baseline is an NN model with the same architecture but without the GRL; in other words, the baseline directly minimizes both the task and domain classification losses.
Setting hyperparameters is a difficult problem, as we cannot directly evaluate the model on the target domain (no labeled data available). Instead of standard cross-validation, the authors use reverse validation, based on a technique introduced by Zhong et al. First, the (labeled) source set $S$ and (unlabeled) target set $T$ are each split into a training and validation set, $S'$ and $S_V$ (resp. $T'$ and $T_V$). Using these splits, a model $\eta$ is trained on $S'$ and $T'$. Then a second model $\eta_r$ is trained for the reverse direction on the set $\{(x, \eta(x)) : x \in T'\}$, i.e., the target training samples pseudo-labeled by $\eta$ act as the source, and the unlabeled inputs of $S'$ act as the target. This reverse classifier is finally evaluated on the labeled validation set $S_V$, and its accuracy is used as the validation score.
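The procedure can be sketched as follows. Everything here is illustrative: `train_threshold` is a toy stand-in for a domain-adaptive trainer (it ignores the unlabeled set, which a DANN-style trainer would use), and the function names and the `train(labeled, unlabeled)` signature are assumptions.

```python
def train_threshold(labeled, unlabeled):
    """Toy trainer: 1-D threshold halfway between the two class means.

    Assumes class 1 has the larger mean. `unlabeled` is ignored here.
    """
    class0 = [x for x, y in labeled if y == 0]
    class1 = [x for x, y in labeled if y == 1]
    threshold = (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2.0
    return lambda x: int(x > threshold)


def reverse_validation_accuracy(train, source_train, source_val, target_train):
    """Score one hyperparameter setting without using any target labels."""
    # 1. Forward model: labeled source training set + unlabeled target training set.
    forward_model = train(source_train, target_train)
    # 2. Pseudo-label the target training set with the forward model.
    pseudo_target = [(x, forward_model(x)) for x in target_train]
    # 3. Reverse model: pseudo-labeled target acts as the source,
    #    source inputs (labels dropped) act as the unlabeled target.
    reverse_model = train(pseudo_target, [x for x, _ in source_train])
    # 4. Accuracy of the reverse model on the held-out labeled source set.
    hits = sum(1 for x, y in source_val if reverse_model(x) == y)
    return hits / len(source_val)


source_train = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1)]
source_val = [(0.5, 0), (4.5, 1)]
target_train = [1.0, 2.0, 5.0, 6.0]  # shifted inputs, no labels
score = reverse_validation_accuracy(
    train_threshold, source_train, source_val, target_train
)
```

In a hyperparameter search, this score would be computed once per candidate setting and the setting with the highest score kept.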
In general, the proposed method seems to perform very well for aligning the source and target domains in an unsupervised domain adaptation framework. Its main advantage is its simplicity, both in terms of theoretical motivation and implementation. In fact, the GRL is easily implemented in standard deep learning frameworks and can be added to any architecture.
The main shortcomings of the method are that (i) all experiments deal with only two domains (one source and one target), and extensions to multiple domains might require some tweaks (e.g., considering the sum of pairwise discrepancies as an upper bound), and (ii) in practice, training can become unstable due to the adversarial training scheme. In particular, the experiment sections show that some stability tricks have to be used during training, such as using momentum or slowly increasing the contribution of the domain classification branch.
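A common way to increase the contribution of the domain branch slowly is a sigmoid-shaped ramp on its weight. The sketch below assumes a schedule of the form $2 / (1 + e^{-\gamma p}) - 1$, with $p$ the training progress in $[0, 1]$ and $\gamma = 10$. The function name and the default value should be read as assumptions rather than a prescription.

```python
import math


def domain_loss_weight(progress, gamma=10.0):
    """Ramp the domain-branch weight from 0 towards 1 as training progresses.

    progress: fraction of training completed, in [0, 1].
    gamma: steepness of the ramp (assumed default).
    """
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0


# Weight at the start, at each quarter, and at the end of training.
weights = [domain_loss_weight(step / 4) for step in range(5)]
```

The weight starts at exactly 0, so early updates are driven by the label loss alone while the domain classifier is still noisy, and it saturates just below 1 at the end of training.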
Figure: t-SNE projections of the embeddings for the source (MNIST) and target (SVHN) datasets without (left) and with (right) DANN adaptation.
Long et al, NeurIPS 2018, [link]
In this work, the authors propose a conditioning scheme for Domain Adversarial Networks. More specifically, the domain classifier is conditioned on the input’s class. However, since part of the samples are unlabeled, the conditioning uses the output of the target classifier branch as a proxy for the class information. Instead of simply concatenating the feature input with the condition, the authors consider a multilinear conditioning technique which relies on the cross-covariance operator. Another related paper is . It also uses the multi-class information of the input domain, although in a simpler way.
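To make the contrast concrete, the sketch below compares plain concatenation with one common reading of multilinear conditioning, namely the flattened outer product of the feature vector and the predicted class probabilities. That reading, and both function names, are assumptions of this example rather than details given in the notes.

```python
def concat_conditioning(features, class_probs):
    """Baseline: the domain classifier sees features and predictions side by side."""
    return list(features) + list(class_probs)


def multilinear_conditioning(features, class_probs):
    """Flattened outer product: every feature is paired with every class probability."""
    return [f * p for f in features for p in class_probs]


features = [1.0, 2.0]
class_probs = [0.25, 0.75]  # output of the label predictor, used as class proxy
joined = concat_conditioning(features, class_probs)
crossed = multilinear_conditioning(features, class_probs)
```

The concatenated input has `len(features) + len(class_probs)` entries, whereas the multilinear one has `len(features) * len(class_probs)`, so the domain classifier receives the interactions between features and predicted classes explicitly.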