VQ-VAE is a variant of the Variational Autoencoder (VAE) framework with a discrete latent space, using ideas from vector quantization. The two main motivations are (i) that discrete variables are potentially a better fit to capture the structure of data such as text and (ii) preventing the posterior collapse in VAEs, which leads to latent variables being ignored when the decoder is too powerful.
The model is based on VAE [1], where an image \(x\) is generated from a random latent variable \(z\) by a decoder \(p(x\ \vert\ z)\). The posterior (encoder) captures the latent variable distribution \(q_{\phi}(z\ \vert\ x)\) and is generally trained to match a certain distribution \(p(z)\) from which \(z\) is sampled at inference time.
Contrary to the standard framework, in this work the latent space is discrete, i.e., \(z \in \mathbb{R}^{K \times D}\) where \(K\) is the number of codes in the latent space and \(D\) their dimensionality. More precisely, the input image is first fed to the encoder \(z_e\), which outputs a continuous vector; this vector is then mapped to one of the latent codes in the discrete space via nearest-neighbor search.
Adapting the \(\mathcal{L}_{\text{ELBO}}\) to this formalism, the KL divergence term greatly simplifies and we obtain:
\[\begin{align} \mathcal{L}_{\text{ELBO}}(x) &= \text{KL}(q(z | x) \| p(z)) - \mathbb{E}_{z \sim q(\cdot | x)}(\log p(x | z))\\ &= - \log(p(z_k)) - \log p(x | z_k)\\ \mbox{where }& z_k = z_q(x) = \arg\min_z \| z_e(x) - z \|^2 \tag{1} \end{align}\]
In practice, the authors use a categorical uniform prior for the latent codes, meaning the KL divergence is constant and the objective reduces to the reconstruction loss.
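The nearest-neighbor lookup \(z_q(x) = \arg\min_z \| z_e(x) - z \|^2\) can be sketched in a few lines. This is only a toy illustration in plain Python: the function name `quantize` and the 3-entry codebook are made up for the example and do not come from the paper.

```python
def quantize(z_e, codebook):
    """Return (index, code) of the codebook entry closest to z_e in squared L2 distance."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    k = min(range(len(codebook)), key=lambda i: sq_dist(z_e, codebook[i]))
    return k, codebook[k]


# Hypothetical codebook with K = 3 codes of dimension D = 2.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
k, z_q = quantize([0.9, 1.2], codebook)  # the encoder output snaps to code 1
```

In the convolutional setting described above, the same lookup would simply be applied independently at every spatial position of the encoder feature map.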
Figure: A figure describing the VQ-VAE (left). Visualization of the embedding space (right). The output of the encoder z(x) is mapped to the nearest point. The gradient (in red) will push the encoder to change its output, which could alter the configuration, hence the code assignment, in the next forward pass.
As we mentioned previously, the \(\mathcal{L}_{\text{ELBO}}\) objective reduces to the reconstruction loss and is used to learn the encoder and decoder parameters. However, the mapping from \(z_e\) to \(z_q\) is not differentiable because of the \(\arg\min\) in Equation (1). To work around this, the authors use a straight-through estimator, meaning the gradients from the decoder input \(z_q(x)\) (quantized) are directly copied to the encoder output \(z_e(x)\) (continuous). However, this means that the latent codes involved in the mapping from \(z_e\) to \(z_q\) do not receive gradient updates that way.
Hence, in order to train the discrete embedding space, the authors propose to use Vector Quantization (VQ), a dictionary learning technique, which uses mean squared error to make the latent code closer to the continuous vector it was matched to:
\[\begin{align} \mathcal{L}(x) = - \log p(x\ \vert\ z_q(x)) + \| \overline{z_e(x)} - z_q(x) \|^2 + \beta \| z_e(x) - \overline{z_q(x)} \|^2 \end{align}\]
where \(x \mapsto \overline{x}\) denotes the stop gradient operator and \(\beta\) weights the last term. The first term is the reconstruction loss stemming from the ELBO, the second term is the vector quantization contribution. Finally, the last term is a commitment loss to control the volume of the latent space by forcing the encoder to “commit” to the latent code it matched with, and not grow its output space unbounded.
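To make explicit which parameters each of the three terms (reconstruction, vector quantization, commitment) actually updates, here is a hand-written gradient sketch for a single latent vector. It assumes squared-error forms for the last two terms and a commitment weight `beta`; the value 0.25 and the function name `vq_gradients` are illustrative choices, not something stated in these notes.

```python
def vq_gradients(z_e, e, grad_recon_wrt_zq, beta=0.25):
    """Manual gradients for one latent vector.

    z_e: encoder output, e: the code it was matched to,
    grad_recon_wrt_zq: gradient of the reconstruction loss w.r.t. the decoder input z_q.
    """
    # Straight-through: the decoder gradient is copied unchanged to the encoder output.
    # The commitment term beta * ||z_e - stop_grad(e)||^2 additionally pulls z_e toward e.
    grad_z_e = [g + 2.0 * beta * (a - b) for g, a, b in zip(grad_recon_wrt_zq, z_e, e)]
    # The code only receives a gradient from the VQ term ||stop_grad(z_e) - e||^2.
    grad_e = [2.0 * (b - a) for a, b in zip(z_e, e)]
    return grad_z_e, grad_e


grad_z_e, grad_e = vq_gradients([0.9, 1.2], [1.0, 1.0], [0.5, -0.5])
```

The stop gradient operator never changes the value of the loss; it only decides, as in the two comments above, which of `z_e` and `e` a given term is allowed to move.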
A second contribution of this work consists in learning the prior distribution. As mentioned, during the training phase, the prior \(p(z)\) is a uniform categorical distribution. After the training is done, we fit an autoregressive distribution over the space of latent codes. This is in particular enabled by the fact that the latent space is discrete.
Note: It is not clear to me if the autoregressive model is trained on latent codes sampled from the prior \(z \sim p(z)\) or from the encoder distribution \(x \sim \mathcal{D};\ z \sim q(z\ \vert\ x)\)
The proposed model is mostly compared to the standard continuous VAE framework. It seems to achieve similar log-likelihood and sample quality, while taking advantage of the discrete latent space.
In particular, for ImageNet, they consider \(K = 512\) latent codes of dimension \(1\). The output of the fully-convolutional encoder \(z_e\) is a feature map of size \(32 \times 32 \times 1\) which is then quantized pixel-wise. Interestingly, the model still performs well when using a powerful decoder (here, PixelCNN [2]), which seems to indicate it does not suffer from posterior collapse as strongly as the standard continuous VAE.
A second set of experiments tackles the problem of audio modeling. The performance of the model is once again satisfying. Furthermore, it does seem like the discrete latent space actually captures relevant characteristics of the input data structure, although this is a purely qualitative observation.
Several theoretical studies of the domain adaptation problem have proposed upper bounds of the risk on the target domain, involving the risk on the source domain and a notion of distance between the source and target distribution, \(\mathcal D_S\) and \(\mathcal D_T\). Here, the authors specifically consider the work of [1]. First, they define the \(\mathcal H\)-divergence:
\[\begin{align} d_{\mathcal H}(\mathcal D_S, \mathcal D_T) = 2 \sup_{h \in \mathcal H} \left| \mathbb{E}_{x\sim\mathcal{D}_S} (h(x) = 1) - \mathbb{E}_{x\sim\mathcal{D}_T} (h(x) = 1) \right| \tag{1} \end{align}\]
where \(\mathcal H\) is a space of (here, binary) hypothesis functions. In the case where \(\mathcal H\) is a symmetric hypothesis class (i.e., \(h \in \mathcal H \implies -h \in \mathcal H\)), one can reduce (1) to the empirical form:
\[\begin{align} d_{\mathcal H}(\mathcal D_S, \mathcal D_T) &\simeq 2 \sup_{h \in \mathcal H} \left|\frac{1}{|D_S|} \sum_{x \in D_S} [\!|h(x) = 1 |\!] - \frac{1}{|D_T|} \sum_{x \in D_T} [\!|h(x) = 1 |\!] \right|\\ &= 2 \sup_{h \in \mathcal H} \left|\frac{1}{|D_S|} \sum_{x \in D_S} 1 - [\!|h(x) = 0 |\!] - \frac{1}{|D_T|} \sum_{x \in D_T} [\!|h(x) = 1 |\!] \right|\\ &= 2 - 2 \min_{h \in \mathcal H} \left|\frac{1}{|D_S|} \sum_{x \in D_S} [\!|h(x) = 0 |\!] + \frac{1}{|D_T|} \sum_{x \in D_T} [\!|h(x) = 1 |\!] \right| \tag{2} \end{align}\]It is difficult to estimate the minimum over the hypothesis class \(\mathcal H\). Instead, [1] propose to approximate Equation (2) by training a classifier \(\hat{h}\) on samples \(\mathbf{x_S} \in \mathcal{D}_S\) with label 0 and \(\mathbf{x_T} \in \mathcal D_T\) with label 1, and replacing the minimum
term by the empirical risk of \(\hat h\).
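As a rough illustration of this proxy, the sketch below turns the predictions of an already trained domain classifier into a divergence estimate. I am reading "the empirical risk of \(\hat h\)" as the sum of the per-domain error rates, which is my interpretation rather than something spelled out here; the function name is also made up.

```python
def proxy_h_divergence(source_preds, target_preds):
    """Estimate the H-divergence from 0/1 domain predictions.

    Convention (as in the text): source samples carry label 0, target samples label 1.
    """
    err_source = sum(1 for p in source_preds if p != 0) / len(source_preds)
    err_target = sum(1 for p in target_preds if p != 1) / len(target_preds)
    # The min term of Equation (2) is replaced by the classifier's empirical risk.
    return 2.0 * (1.0 - (err_source + err_target))


separable = proxy_h_divergence([0, 0, 0, 0], [1, 1, 1, 1])   # domains easy to tell apart
confused = proxy_h_divergence([0, 1, 0, 1], [1, 0, 1, 0])    # classifier at chance level
```

Under this reading, a classifier that separates the domains perfectly gives the maximal value 2, while a classifier at chance level gives 0, i.e., the two domains look indistinguishable.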
Given this definition of the \(\mathcal H\)-divergence, [1] further derives an upper bound on the empirical risk on the target domain, which in particular involves a trade-off between the empirical risk on the source domain, \(\mathcal{R}_{D_S}(h)\), and the divergence between the source and target distributions, \(d_{\mathcal H}(D_S, D_T)\).
where \(\mbox{VC}\) designates the Vapnik–Chervonenkis dimension and \(n\) the number of samples.
The rest of the paper directly stems from this intuition: in order to minimize the target risk, the proposed Domain Adversarial Neural Network (DANN) aims to build an “internal representation that contains no discriminative information about the origin of the input (source or target), while preserving a low risk on the source (labeled) examples”.
The goal of the model is to learn a classifier \(\phi\), which can be decomposed as \(\phi = G_y \circ G_f\), where \(G_f\) is a feature extractor and \(G_y\) a small classifier on top that outputs the target label. This architecture is trained with a standard classification objective to minimize:
\[\begin{align} \mathcal{L}_y(\theta_f, \theta_y) = \frac{1}{N_s} \sum_{(x, y) \in D_s} \ell(G_y(G_f(x)), y) \end{align}\]
Additionally, DANN introduces a domain prediction branch, which is another classifier \(G_d\) on top of the feature representation \(G_f\) and whose goal is to approximate the domain discrepancy as in (2), which leads to the following training objective to maximize:
The final objective can thus be written as:
\[\begin{align} E(\theta_f, \theta_y, \theta_d) &= \mathcal{L}_y(\theta_f, \theta_y) - \lambda \mathcal{L}_d(\theta_f, \theta_d) \tag{1}\\ \theta_f^\ast, \theta_y^\ast &= \arg\min E(\theta_f, \theta_y, \theta_d) \tag{2}\\ \theta_d^\ast &= \arg\max E(\theta_f, \theta_y, \theta_d) \tag{3} \end{align}\]
Applying standard gradient descent, the DANN objective leads to the following gradient update rules:
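Written out from (1)–(3), and assuming a learning rate \(\mu\) (a symbol not used elsewhere in these notes), a plain gradient descent step on \(\theta_f, \theta_y\) and an ascent step on \(\theta_d\) would read:

\[\begin{align} \theta_f &\leftarrow \theta_f - \mu \left( \frac{\partial \mathcal{L}_y}{\partial \theta_f} - \lambda \frac{\partial \mathcal{L}_d}{\partial \theta_f} \right)\\ \theta_y &\leftarrow \theta_y - \mu \frac{\partial \mathcal{L}_y}{\partial \theta_y}\\ \theta_d &\leftarrow \theta_d - \mu \lambda \frac{\partial \mathcal{L}_d}{\partial \theta_d} \end{align}\]

The last line follows from \(\partial E / \partial \theta_d = - \lambda\, \partial \mathcal{L}_d / \partial \theta_d\): ascending \(E\) in \(\theta_d\) is the same as descending \(\mathcal{L}_d\), whereas \(\theta_f\) receives the \(\mathcal{L}_d\) gradient with the opposite sign.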
In the case of neural networks, the gradients of the loss with respect to parameters are obtained with the backpropagation algorithm. These update equations are very similar to the standard backpropagation scheme, except for the opposite sign in the derivative of \(\mathcal{L}_d\) with respect to \(\theta_d\) and \(\theta_f\). The authors introduce the gradient reversal layer (GRL) to evaluate both gradients in one standard backpropagation step.
The idea is that the output of \(G_f\) is propagated normally to \(G_d\) in the forward pass; during backpropagation, however, its gradient is multiplied by a negative constant:
\[\begin{align} \frac{\partial \mathcal L_d}{\partial \theta_f} = \frac{\bf{\color{red}{-}} \partial \mathcal L_d}{\partial G_f(x)} \frac{\partial G_f(x)}{\partial \theta_f} \end{align}\]
In other words, for the update of \(\theta_d\), the gradients of \(\mathcal L_d\) with respect to the activations are computed normally (minimization), but they are then propagated with a minus sign in the feature extraction part of the network (maximization). Augmented with the gradient reversal layer, the final model is trained by minimizing the sum of losses \(\mathcal L_d + \mathcal L_y\), which corresponds to the optimization problem in (1-3).
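The behaviour of the gradient reversal layer can be captured by a tiny framework-free sketch: identity in the forward pass, sign flip (scaled by a constant) in the backward pass. The class below is a hypothetical stand-alone illustration with explicit `forward` and `backward` methods; in a real deep learning framework one would instead hook into its automatic differentiation mechanism.

```python
class GradientReversal:
    """Identity on the way forward; multiplies incoming gradients by -lam on the way back."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # The domain classifier sees the features unchanged.
        return list(features)

    def backward(self, grad_output):
        # The feature extractor receives the reversed (and scaled) gradient.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=0.5)
out = grl.forward([1.0, -2.0])
grad_in = grl.backward([0.2, 0.4])
```

Because the layer sits between \(G_f\) and \(G_d\), the domain classifier itself is still trained with the unmodified gradient; only what flows further back into the feature extractor is reversed.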
Figure: The proposed architecture includes a deep feature extractor and a deep label predictor. Unsupervised domain adaptation is achieved by adding a domain classifier connected to the feature extractor via a gradient reversal layer that multiplies the gradient by a certain negative constant during backpropagation.
The paper presents extensive results on the following settings:
DANN to a NN model which has the same architecture but without the GRL: in other words, the baseline directly minimizes both the task and domain classification losses.
Setting hyperparameters is a difficult problem, as we cannot directly evaluate the model on the target domain (no labeled data available). Instead of standard cross-validation, the authors use reverse validation based on a technique introduced in [3]: First, the (labeled) source set \(S\) and (unlabeled) target set \(T\) are each split into a training and validation set, \(S'\) and \(S_V\) (resp. \(T'\) and \(T_V\)). Using these splits, a model \(\eta\) is trained on \(S'\rightarrow T'\). Then a second model \(\eta_r\) is trained for the reverse direction on the set \(\{ (x, \eta(x)),\ x \in T'\} \rightarrow S'\). This reverse classifier \(\eta_r\) is then finally evaluated on the labeled validation set \(S_V\), and this accuracy is used as a validation score.
In general, the proposed method seems to perform very well for aligning the source and target domains in an unsupervised domain adaptation framework. Its main advantage is its simplicity, both in terms of theoretical motivation and implementation. In fact, the GRL is easily implemented in standard Deep Learning frameworks and can be added to any architecture.
The main shortcomings of the method are that (i) all experiments deal with only two domains (one source and one target), and extensions to multiple domains might require some tweaks (e.g., considering the sum of pairwise discrepancies as an upper-bound) and (ii) in practice, training can become unstable due to the adversarial training scheme. In particular, the experiment sections show that some stability tricks have to be used during training, such as using momentum or slowly increasing the contribution of the domain classification branch.
Figure: t-SNE projections of the embeddings for the source (MNIST) and target (SVHN) datasets without (left) and with (right) DANN adaptation.
Long et al., NeurIPS 2018
In this work, the authors propose a conditioning scheme for Domain Adversarial Networks. More specifically, the domain classifier is conditioned on the input’s class. However, since part of the samples are unlabeled, the conditioning uses the output of the target classifier branch as a proxy for the class information. Instead of simply concatenating the feature input with the condition, the authors consider a multilinear conditioning technique which relies on the cross-covariance operator. Another related paper is [4]. It also uses the multi-class information of the input domain, although in a simpler way.
Given a random noise vector \(z\) and conditioned on an image \(x_0\), the goal of conditional image generation is to generate image \(x = f_{\theta}(z; x_0)\) (where the random nature of \(z\) provides a sampling strategy for \(x\)); for instance, the task of generating a high quality image \(x\) from its lower resolution counterpart \(x_0\).
In particular, this encompasses inverse tasks such as denoising, super-resolution and inpainting that act at the local pixel level. Such tasks can often be phrased with an objective of the following form:
\[\begin{align} \theta^{\ast} = \arg\min E(x, x_0) + R(x) \end{align}\]
where \(E\) is a cost function and \(R\) is a prior on the output space acting as a regularizer. \(R\) is often a hand-crafted prior, for instance a smoothness constraint like Total Variation [1], or, for more recent techniques, it can be implemented with adversarial training (e.g., GANs).
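As a concrete example of such a hand-crafted \(R\), here is a minimal sketch of an anisotropic Total Variation penalty on a grayscale image stored as a list of rows. The exact TV variant used in [1] may differ; this one simply sums absolute differences between vertically and horizontally adjacent pixels.

```python
def total_variation(img):
    """Anisotropic TV: sum of |differences| between each pixel and its lower/right neighbor."""
    height, width = len(img), len(img[0])
    tv = 0.0
    for i in range(height):
        for j in range(width):
            if i + 1 < height:
                tv += abs(img[i + 1][j] - img[i][j])
            if j + 1 < width:
                tv += abs(img[i][j + 1] - img[i][j])
    return tv


smooth = [[1.0, 1.0], [1.0, 1.0]]
checkerboard = [[0.0, 1.0], [1.0, 0.0]]
tv_smooth = total_variation(smooth)              # constant image: no penalty
tv_checkerboard = total_variation(checkerboard)  # every neighboring pair differs
```

A constant image has zero penalty while a noisy or checkerboard-like one is heavily penalized, which is exactly the smoothness preference that the implicit network prior discussed next is meant to replace.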
In this paper, the goal is to replace \(R\) by an implicit prior captured by the neural network, relative to the input noise \(z\). In other words:
\[\begin{align} R(x) &= 0\ \mbox{if}\ \exists \theta\ \mbox{s.t.}\ x = f_{\theta}(z)\\ R(x) &= + \infty,\ \mbox{otherwise} \end{align}\]
which results in the following workflow:
\[\begin{align} \theta^{\ast} = \arg\min_{\theta} E(f_{\theta}(z; x_0), x_0) \mbox{ and } x^{\ast} = f_{\theta^{\ast}}(z; x_0) \end{align}\]
One could wonder if this is a good choice for a prior at all. In fact, \(f\), being instantiated as a neural network, should be powerful enough that any image \(x\) can be generated from \(z\) for a certain choice of parameters \(\theta\), which means the prior should not be constraining.
However, the structure of the network itself affects how optimization algorithms such as gradient descent will browse the output space.
To quantify this effect, the authors perform a reconstruction experiment (i.e., \(E(x, x_0) = \| x - x_0 \|\)) for different choices of the input image \(x_0\) ((i) natural image, (ii) same image with small perturbations, (iii) with large perturbations, and (iv) white noise) using a U-Net [2] inspired architecture. Experimental results show that the network descends faster to natural-looking images (cases (i) and (ii)) than to random noise (cases (iii) and (iv)).
Figure: Learning curves for the reconstruction task using: a natural image, the same plus i.i.d. noise, the same but randomly scrambled, and white noise.
The experiments focus on three image analysis tasks:
The method seems to outperform most non-trained methods, when available (e.g. Bicubic upsampling for Super-Resolution), but is still often outperformed by learning-based ones. The inpainting results are particularly interesting, and I do not know of any other non-trained baselines for this task. The method obviously performs poorly when the obscured region requires highly semantic knowledge, but it seems to perform well on more reasonable benchmarks.
Additionally, the authors test the proposed prior for diagnosing neural networks by generating natural pre-images for neural activations of deep layers. Qualitative images look better than other handcrafted priors (total variation) and are not biased to specific datasets as are trained methods.
Figure: Example comparison between the proposed Deep Image Prior and various baselines for the task of Super-Resolution.
Heckel and Hand
This paper builds on Deep Image Prior but proposes a much simpler architecture which is under-parametrized and non-convolutional. In particular, there are fewer weight parameters than the dimensionality of the output image (in comparison, DIP was using a U-Net based architecture). This property implies that the weights of the network can additionally be used as a compressed representation of the image. In order to test for compression, the authors use their architecture to reconstruct image \(x\) for different compression ratios \(k\) (i.e., the number of network parameters \(N\) is \(k\) times smaller than the output dimension of the images).
The deep decoder architecture combines standard blocks, including linear combination of channels (convolutions), ReLU, batch-normalization and upscaling. Note that since here we have a special case of batch size 1, the Batch Norm operator essentially normalizes the activation channel-wise. In particular, the paper contains a nice theoretical justification for the denoising case, in which they show that the model can only fit a certain amount of noise, which explains why it would converge to more natural-looking images, although it only applies to small networks (1 layer? possibly generalizable to multi-layer and no batch-norm).
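The remark about batch size 1 can be made concrete: with a single sample, the batch statistics of each channel are computed over its spatial positions only. Below is a minimal sketch assuming each channel is given as a flat list of activations and leaving out the learnable scale and shift; the name `channel_norm` and the `eps` value are illustrative.

```python
import math


def channel_norm(x, eps=1e-5):
    """x: list of channels, each a flat list of spatial activations (single sample)."""
    out = []
    for channel in x:
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        # Each channel is normalized with its own mean and variance.
        out.append([(v - mean) / math.sqrt(var + eps) for v in channel])
    return out


normalized = channel_norm([[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]])
```

Each output channel ends up with (approximately) zero mean, and a constant channel is mapped to zeros thanks to the `eps` term in the denominator.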
CNN architectures with a notion of relational reasoning, particularly useful for tasks such as visual question answering, dynamics understanding, etc.
The main idea of Relation Networks (RN) is to constrain the functional form of convolutional neural networks so as to explicitly learn relations between entities, rather than hoping for this property to emerge in the representation during training. Formally, let \(O = \{o_1 \dots o_n\}\) be a set of objects of interest. The Relation Network is trained to learn a representation that considers all pairwise relations across the objects:
\[\begin{align} RN(O) = f_{\phi} \left( \sum_{i, j} g_{\theta}(o_i, o_j) \right) \end{align}\]
\(f_{\phi}\) and \(g_{\theta}\) are defined as Multi Layer Perceptrons. By definition, the Relation Network (i) has to consider all pairs of objects, (ii) operates directly on the set of objects hence is not constrained to a specific organization of the data, and (iii) is data-efficient in the sense that only one function, \(g_{\theta}\) is learned to capture all the possible relations: \(g\) and \(f\) are typically light modules and most of the overhead comes from the sum of pairwise components (\(n^2\)).
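Reading the description above as \(f_{\phi}\) applied to a sum of \(g_{\theta}\) over all pairs of objects, the forward pass can be sketched as follows. The two lambdas are hand-written toy stand-ins for the MLPs (an assumption for the sake of a runnable example), not learned functions.

```python
def relation_network(objects, g, f):
    """Apply g to every ordered pair of objects, sum the results, then apply f."""
    dim = len(g(objects[0], objects[0]))
    total = [0.0] * dim
    for o_i in objects:
        for o_j in objects:  # n^2 pairwise evaluations of the same function g
            relation = g(o_i, o_j)
            total = [t + r for t, r in zip(total, relation)]
    return f(total)


# Toy stand-ins for g_theta and f_phi.
g = lambda a, b: [abs(a[0] - b[0]), a[0] * b[0]]
f = lambda s: s[0] + s[1]

objs = [[1.0], [2.0], [4.0]]
out = relation_network(objs, g, f)
out_permuted = relation_network([objs[2], objs[0], objs[1]], g, f)
```

Since the pairwise terms are summed, the output does not depend on the order in which the objects are listed, which is property (ii) above; the double loop is where the \(n^2\) cost comes from.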
The objects are the basic elements of the relational process we want to model. They are defined with regard to the task at hand, for instance:
Attending relations between objects in an image: The image is first processed through a fully-convolutional network. Each of the resulting cells is taken as an object, which is a feature of dimension \(k\), additionally tagged with its position in the feature map.
Sequence of images: In that case, each image is first fed through a feature extractor and the resulting embedding is used as an object. The goal is to model relations between images across the sequence.
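For the image case, turning a feature map into a set of objects amounts to flattening its cells and appending their coordinates. A small sketch, assuming the feature map is a nested list `fmap[row][col]` of length-\(k\) feature vectors and that the position tag is simply the raw (row, column) pair (the actual encoding of the position is not specified here):

```python
def feature_map_to_objects(fmap):
    """Return one object per cell: its k features followed by its (row, col) position."""
    objects = []
    for i, row in enumerate(fmap):
        for j, cell in enumerate(row):
            objects.append(list(cell) + [float(i), float(j)])
    return objects


# Hypothetical 2 x 2 feature map with k = 1.
fmap = [[[0.1], [0.2]], [[0.3], [0.4]]]
objects = feature_map_to_objects(fmap)
```

The resulting list can be fed to the Relation Network as is, since the model does not rely on any particular ordering of the objects.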
Figure: Example of applying the Relation Network for Visual Question Answering. Questions are processed with an LSTM to produce a question embedding, and images are processed with a CNN to produce a set of objects for the RN.
The main evaluation is done on the CLEVR dataset [2]. The main message seems to be that the proposed module is very simple and yet often improves the model accuracy when added to various architectures (CNN, CNN + LSTM, etc.) introduced in [1]. The main baseline they compare to (and outperform) is Spatial Attention (SA), which is another simple method to integrate some form of relational reasoning in a neural architecture.
Palm et al.
This paper builds on the Relation Network architecture and proposes to explore more complex relational structures, defined as a graph, using a message passing approach. Formally, we are given a graph with vertices \(\mathcal V = \{v_i\}\) and edges \(\mathcal E = \{e_{i, j}\}\). By abuse of notation, \(v_i\) also denotes the embedding for vertex \(i\) (e.g. obtained via a CNN) and \(e_{i, j}\) is 1 where \(i\) and \(j\) are linked, 0 otherwise. To each node we associate a hidden state \(h_i^t\) at iteration \(t\), which will be updated via message passing. After a few iterations, the resulting state is passed through an MLP \(r\) to output the result (either for each node or for the whole graph):
\[\begin{align} h_i^0 &= v_i\\ h_i^{t + 1} &= f_{\phi} \left( h_i^t, v_i, \sum_{j} e_{i, j} g_{\theta}(h^t_i, h^t_j) \right)\\ o_i &= r(h_i^T) \mbox{ or } o = r(\sum_i h_i^T) \end{align}\]
Comparing to the original Relation Network:
- Each update rule is a Relation Network that only looks at pairwise relations between linked vertices. The message passing scheme additionally introduces the notion of recurrence, and the dependency on the previous hidden state.
- The dependence on \(h_i^t\) could in theory be avoided by adding self-edges from \(v_i\) to \(v_i\), to make it closer to the Relation Network formulation.
- Adding \(v_i\) as input of \(f_\phi\) looks like a simple trick to avoid long-term memory problems.
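One iteration of this message passing scheme can be sketched with scalar hidden states. The message and update functions below are hand-picked toys (the message is just the neighbor's state, the update adds it to \(v_i\)), chosen so that the propagation is easy to follow by hand; they are not the learned \(g_{\theta}\) and \(f_{\phi}\).

```python
def message_passing_step(h, v, adj, g, f):
    """h_i <- f(h_i, v_i, sum_j adj[i][j] * g(h_i, h_j)) for every node i."""
    new_h = []
    for i in range(len(h)):
        message = sum(adj[i][j] * g(h[i], h[j]) for j in range(len(h)))
        new_h.append(f(h[i], v[i], message))
    return new_h


g = lambda h_i, h_j: h_j               # toy message: the neighbor's state
f = lambda h_i, v_i, m: v_i + m        # toy update: input embedding plus messages

adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # path graph 0 - 1 - 2
v = [1.0, 0.0, 0.0]                      # only node 0 carries a signal
h1 = message_passing_step(list(v), v, adj, g, f)
h2 = message_passing_step(h1, v, adj, g, f)
```

Node 2 is not linked to node 0, so it only receives the signal after the second iteration: this is the sense in which recurrence lets the model reason over more than direct pairwise relations.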
The experiments essentially compare the proposed RRNN model to the Relation Network and classical recurrent architectures such as LSTM. They consider three datasets:
- bAbI. NLP question answering task with some reasoning involved. The model solves 19.7 (out of 20) tasks on average, while the simple RN solved around 18 of them reliably.
- Pretty CLEVR. A CLEVR-like dataset (only with simple 2D shapes) with questions involving various steps of reasoning, e.g. “which is the shape \(n\) steps of the red circle?”
- Sudoku. The graph contains 81 nodes (one for each cell in the sudoku), with edges between cells belonging to the same row, column or block.
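For the Sudoku case, the graph structure is fully determined by the description above and can be built explicitly. The sketch below numbers the cells from 0 to 80 in row-major order, which is an arbitrary choice made for this example.

```python
def sudoku_edges():
    """Undirected edges between cells that share a row, a column or a 3x3 block."""
    edges = set()
    for a in range(81):
        row_a, col_a = divmod(a, 9)
        for b in range(a + 1, 81):
            row_b, col_b = divmod(b, 9)
            same_block = (row_a // 3, col_a // 3) == (row_b // 3, col_b // 3)
            if row_a == row_b or col_a == col_b or same_block:
                edges.add((a, b))
    return edges


edges = sudoku_edges()
degree_of_first_cell = sum(1 for a, b in edges if a == 0 or b == 0)
```

Each cell ends up with 20 neighbors (8 in its row, 8 in its column, and 4 more in its block), so the graph is fairly sparse compared to the all-pairs structure of the original Relation Network.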
Jahrens and Martinetz
This paper presents a very simple trick to make the Relation Network consider higher-order relations than pairwise, while retaining some efficiency. Essentially, the model can be written as follows:
\[\begin{align} h_{i, j}^0 &= g^0_{\theta}(x_i, x_j) \\ h_{i, j}^t &= g^{t}_{\theta}\left(\sum_k h_{i, k}^{t - 1}, \sum_k h_{j, k}^{t - 1}\right) \\ MLRN(O) &= f_{\phi}(\sum_{i, j} h^T_{i, j}) \end{align}\]
It is not clear why this model would be equivalent to explicitly considering higher-level relations (as it is rather combining pairwise terms for a finite number of steps). According to the experiments it seems that indeed this architecture could be better fitted for the studied tasks (e.g. over the Relation Network or Recurrent Relation Network) but it also makes the model even harder to interpret.
CRL) to learn at the same time both the structure of the task and its components.
A problem \(P_i\) is defined as a transformation \(x_i : t_x \mapsto y_i : t_y\), where \(t_x\) and \(t_y\) are the respective types of \(x\) and \(y\). However, since we only consider recursive problems here, \(t_x = t_y\).
We define a family of problems \(\mathcal P\) as a set of composite recursive problems that share regularities. The goal of CRL is to extrapolate to new compositions of these tasks, using knowledge from the limited subset of tasks it has seen during training.
In essence, the problem can be formulated as a sequential decision-making task via a meta-level MDP (\(\mathcal X\), \(\mathcal F\), \(\mathcal P_{\mbox{meta}}\), \(r\), \(\gamma\)), where \(\mathcal X\) is the set of states, i.e., representations; \(\mathcal F\) is a set of computations, i.e., instances of the transformations we consider (for instance, neural networks), plus an additional special function HALT that stops the execution; \(\mathcal P_{\mbox{meta}}: (x_t, f_t, x_{t + 1}) \mapsto c \in [0, 1]\) is the policy which assigns a probability to each possible transition. Finally, \(r\) is the reward function and \(\gamma\) a decay factor.
More specifically, the CRL is implemented as a set of neural networks, \(f_k \in \mathcal F\), and a controller \(\pi(f\ |\ \mathbf{x}, t_y)\) which selects the best course of action given the current history of representations \(\mathbf{x}\) and target type \(t_y\).
The loss is back-propagated through the functions \(f\), and the controller is trained as a Reinforcement Learning (RL) agent with a sparse reward (it only knows the final target result).
An additional important training scheme is the use of curriculum learning i.e., start by learning small transformations and then consider more complex compositions, increasing the state space little by little.
Figure: (top-left) CRL is a symbiotic relationship between a controller and evaluator: the controller selects a module `m` given an intermediate representation `x` and the evaluator applies `m` on `x` to create a new representation. (bottom-left) CRL dynamically learns the structure of a program customized for its problem, and this program can be viewed as a finite state machine. (right) A series of computations in the program is equivalent to a traversal through a Meta-MDP, where modules can be reused across different stages of computation, allowing for recursive computation.
The learner will aim to solve recursive arithmetic expressions across 5 languages: English, Numerals, PigLatin, Reversed-English, Spanish. The input is a tuple \((x^s, t_y)\), where \(x^s\) is the arithmetic expression expressed in source language \(s\), and \(t_y\) is the output language.
Training: The learner trains on a curriculum of a limited set of 2, 3, 4, 5-length expressions. During training, each source language is seen with four target languages (and one held out for testing) and each target language is seen with four source languages (and one held out for testing).
Testing: The learner is asked to generalize to 5-length expressions (test set) and to extrapolate to 10-length expressions (extrapolation set) with unseen language pairs.
The authors consider two main types of functional units for this task: a reducer, which takes as input a window of three terms in the input expression and outputs a softmax distribution over the vocabulary, and a translator, which applies a function to every element of the input sequence and outputs a sequence of the same size.
The CRL is compared to a baseline RNN architecture that directly tries to map a variable-length input sequence to the target output. On the test set, RNN and CRL yield similar accuracies, although CRL usually requires fewer training samples and/or fewer training iterations. On the extrapolation set, however, CRL more clearly outperforms RNN.
Interestingly, the CRL results usually have a much bigger variance, which would be interesting to analyze qualitatively. Moreover, the use of curriculum learning significantly improves the model performance. Finally, qualitative results show that the reducers and translators are interpretable to some degree: e.g., it is possible to map some of the reducers to specific operations; however, due to the unsupervised nature of the task, the mapping is not always straightforward.
This time the functional units are composed of three specialized Spatial Transformer Networks [1] to learn rotation, scale and translation, and an identity function. Overall, this setting does not yield very good quantitative results. More precisely, one of the main challenges, since we are acting on a visual domain, is to deduce the structure of the task from information which lacks clear structure (pixel matrices). Additionally, the fact that all inputs and outputs have the same domain (images) and that only a sparse reward is available makes it more difficult for the controller to distinguish between functionalities, i.e., it could collapse to using only one transformer.
SAT problems with weak supervision: In that case, a model is trained only to predict the satisfiability of a formula in conjunctive normal form. As a byproduct, if the formula is satisfiable, an actual satisfying assignment can be worked out from the network's activations in most cases.
SAT solvers.
We consider boolean logic formulas in their conjunctive normal form (CNF), i.e. each input formula is represented as a conjunction (\(\land\)) of clauses, which are themselves disjunctions (\(\lor\)) of literals (positive or negative instances of variables). The goal is to learn a classifier to predict whether such a formula is satisfiable.
A first problem is how to encode the input formula in such a way that it preserves the CNF invariances (invariance to negating a literal in all clauses, invariance to permutations in \(\lor\) and \(\land\) etc.). The authors use a standard undirected graph representation where:
The graph relations are encoded as an adjacency matrix, \(A\), with as many rows as there are literals and as many columns as there are clauses. Note that this structure does not constrain the vertices ordering, and does not make any preferential treatment between positive or negative literals. However it still has some caveats, which can be avoided by pre-processing the formula. For instance when there are disconnected components in the graph, the averaging decision rule (see next paragraph) can lead to false positives.
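Here is a small sketch of how such a literal-by-clause adjacency matrix could be built. The input encoding (signed integers, where `k` stands for \(x_k\) and `-k` for its negation) and the row ordering (all positive literals first, then all negated ones) are conventions I picked for the example, not necessarily the ones used in the paper.

```python
def cnf_adjacency(clauses, n_vars):
    """Return a (2 * n_vars) x len(clauses) 0/1 matrix; A[l][c] = 1 iff literal l is in clause c.

    Rows 0..n_vars-1 are x_1..x_n, rows n_vars..2*n_vars-1 are their negations.
    """
    A = [[0] * len(clauses) for _ in range(2 * n_vars)]
    for c, clause in enumerate(clauses):
        for lit in clause:
            row = lit - 1 if lit > 0 else n_vars + (-lit) - 1
            A[row][c] = 1
    return A


# (x1 OR NOT x2) AND (x2 OR x3)
A = cnf_adjacency([[1, -2], [2, 3]], n_vars=3)
```

Reordering the clauses only permutes the columns, and reordering literals inside a clause changes nothing, which is how the representation preserves the permutation invariances mentioned above.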
In a high-level view, the model keeps track of an embedding for each literal and each clause (\(L^t\) and \(C^t\)), updated via message-passing on the graph, and combined via a Multi Layer Perceptron (MLP) to output the model prediction of the formula’s satisfiability. The model updates are as follows:
where \(h\) designates a hidden context vector for the LSTMs. The operator \(L \mapsto \overline{L}\) returns the embedding matrix \(L\) where the row of each literal is swapped with the one corresponding to the literal’s negation. In other words, in (1) each clause embedding is updated based on the literals that compose it, while in (2) each literal embedding is updated based on the clauses it appears in and its negated counterpart.
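The swap operator \(L \mapsto \overline{L}\) becomes a one-liner once a row ordering is fixed. The sketch below assumes (my convention, for illustration) that the first half of the rows holds the positive literals and the second half their negations in the same order.

```python
def flip(L):
    """Swap each literal's row with the row of its negation.

    Assumes rows are ordered x_1..x_n followed by NOT x_1..NOT x_n.
    """
    n = len(L) // 2
    return L[n:] + L[:n]


L = [[1.0], [2.0], [3.0], [4.0]]  # embeddings of x1, x2, NOT x1, NOT x2
flipped = flip(L)
```

Applying the operator twice returns the original matrix, and no learned parameter is involved: it only makes the embedding of a literal's negation available when that literal is updated.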
After \(T\) iterations of this message-passing scheme, the model computes a logit for the satisfiability classification problem, which is trained via sigmoid cross-entropy:
\[\begin{align} L^t_{\mbox{vote}} &= \texttt{MLP}_{\texttt{vote}}(L^t)\\ y^t &= \mbox{mean}(L^t_{\mbox{vote}}) \end{align}\]
The training set is built such that for any satisfiable training formula \(S\), it also includes an unsatisfiable counterpart \(S'\) which differs from \(S\) only by negating one literal in one clause. These carefully curated samples should constrain the model to pick up substantial characteristics of the formula. In practice, the model is trained on formulas containing up to 40 variables, and on average 200 clauses. At this size, the SAT problem can still be solved by state-of-the-art solvers (yielding the supervision required to train the model), but the formulas are large enough to prove challenging for Machine Learning models.
When a formula is satisfiable, one often also wants to know a valuation (variable assignment) that satisfies it. Recall that \(L^t_{\mbox{vote}}\) encodes a “vote” for every literal and its negative counterpart. Qualitative experiments show that those scores cannot be directly used for inferring the variable assignment, however they do induce a nice clustering of the variables (once the message passing has converged). Hence an assignment can be found as follows:
In practice, this method retrieves a satisfying assignment for over 70% of the satisfiable test formulas.
In practice, the NeuroSAT model is trained with embeddings of dimension 128 and 26 message passing iterations. The MLP architectures are very standard: 3 layers followed by ReLU activations. The final model obtains 85% accuracy in predicting a formula’s satisfiability on the test set.
It can also generalize to larger problems, although this requires increasing the number of message passing iterations. However, the classification performance significantly decreases (e.g. 25% for 200 variables) and the number of iterations scales linearly with the number of variables (at least in the paper experiments).
Figure: (left) Success rate of a NeuroSAT model trained on 40 variables for test set involving formulas with up to 200 variables, as a function of the number of message-passing iterations. (right) The sequence of literal votes across message-passing iterations on a satisfiable formula. The vote matrix is reshaped such that each row contains the votes for a literal and its negated counterpart. For several iterations, most literals vote unsat with low confidence (light blue). After a few iterations, there is a phase transition and all literals vote sat with very high confidence (dark red), until convergence.
Interestingly, the model generalizes well to other classes of problems that were reduced to SAT
(using SAT
’s NP-completeness), although they have a different structure than the random formulas generated for training, which suggests that the model does learn some general characteristics of Boolean formulas.
To summarize, the model takes advantage of the structure of Boolean formulas, and is able to predict whether an input formula is satisfiable or not with high accuracy. Moreover, even though trained only with this weak supervisory signal, it can work out a valid assignment most of the time. However it is still subpar compared to standard SAT solvers, which makes its applicability limited.
]]>VAE
s or GAN
s) and easily parallelizable training and inference (unlike the sequential generative process in auto-regressive models). This paper proposes a new, more flexible, form of invertible flow for generative models, which builds on [3].
PixelCNN
that allows for faster parallelized sample generation) would be nice.
Given input data \(x\), invertible flow-based generative models are built as two-step processes that generate data from an intermediate latent representation \(z\):
\[\begin{align} z &\sim p_{\theta}(z)\\ x &= g_\theta^{-1}(z) \end{align}\]where \(g_\theta\) is an invertible function, i.e. a bijection, \(g_\theta: \mathcal X \rightarrow \mathcal Z\). It acts as an encoder from the input data to the latent space, and its inverse generates data from a latent sample. \(g\) is usually built as a sequence of smaller invertible functions \(g = g_n \circ \dots \circ g_1\). Such a sequence is also called a normalizing flow [1]. Under this construction, the change of variables formula applied to \(z = g(x)\) gives the following relation between the input and latent densities:
\[\begin{align} \log p(x) &= \log p(z) + \log\ \left| \det \left( \frac{d z}{d x} \right)\right|\\ &= \log p(z) + \sum_{i=1}^n \log\ \left| \det \left( \frac{d g_{\leq i}(x)}{d g_{\leq i - 1}(x)} \right)\right| \end{align}\]where \(\forall i \in [1; n],\ g_{\leq i} = g_i \circ \dots \circ g_1\). In particular, this means \(g_{\leq n}(x) = z\) and \(g_{\leq 0}(x) = x\). \(p_\theta(z)\) is usually chosen as a simple density such as a unit Gaussian distribution, \(p_\theta(z) = \mathcal N(z; 0, \mathbf{I})\). In order to efficiently evaluate the likelihood, the functions \(g_1, \dots, g_n\) are usually chosen such that the log-determinant of the Jacobian, \(\log\ \left\vert \det \left( \frac{d g_{\leq i}}{d g_{\leq i - 1}} \right) \right\vert\), is easily computed, for instance by choosing transformations whose Jacobian is a triangular matrix.
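As a sanity check of the change of variables formula, the sketch below evaluates it for a chain of one-dimensional affine maps, where every Jacobian is a scalar. The affine form of each step and the name `flow_log_prob` are assumptions chosen for illustration; they are not taken from the paper.

```python
import math

def log_standard_normal(z):
    """Log-density of a unit Gaussian at z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def flow_log_prob(x, shifts, scales):
    """log p(x) for a chain of 1D affine maps g_i(h) = (h - shift_i) / scale_i."""
    h, log_det = x, 0.0
    for shift, scale in zip(shifts, scales):
        h = (h - shift) / scale
        log_det += -math.log(abs(scale))  # log |d g_i / d h| for this step
    # h is now the latent z = g(x); add the accumulated log-determinants.
    return log_standard_normal(h) + log_det
```

With a single step, `flow_log_prob(x, [mu], [sigma])` coincides with the log-density of a Gaussian with mean `mu` and standard deviation `sigma`, which is the expected result of pushing a unit Gaussian through an affine map.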
Each flow step function \(g_i\) is a sequence of three operations as follows. Given an input tensor of dimensions \(h \times w \times c\):
Step Description | Functional Form of flow \(g_i\) | Inverse Function of the flow, \(g_i^{-1}\) | Log-determinant Expression |
---|---|---|---|
ActNorm, with \(\sigma: [c,]\) and \(\mu: [c,]\) | \(y = \sigma\odot x + \mu\) | \(x = (y - \mu) / \sigma\) | \(hw\ \mbox{sum} \log(\vert\sigma\vert)\) |
1x1 conv, with \(W: [c,c]\) | \(y = Wx\) | \(x = W^{-1}y\) | \(h w \log \vert \det (W) \vert\) |
Affine Coupling (ACL) [2] | \(x_a,\ x_b = \mbox{split}(x)\); \((\log \sigma, \mu) = \mbox{NN}(x_b)\); \(y_a = \sigma \odot x_a + \mu\); \(y = \mbox{concat}(y_a, x_b)\) | \(y_a,\ y_b = \mbox{split}(y)\); \((\log \sigma, \mu) = \mbox{NN}(y_b)\); \(x_a = (y_a - \mu) / \sigma\); \(x = \mbox{concat}(x_a, y_b)\) | \(\mbox{sum} (\log \vert\sigma\vert)\) |
ActNorm. The activation normalization layer is introduced as a replacement for Batch Normalization (BN
) to avoid degraded performance with small mini-batch sizes, e.g. when training with batch size 1. This layer has the same form as BN
, however the bias, \(\mu\), and scale, \(\sigma\), are trainable parameters: they are initialized based on an initial mini-batch of data (data-dependent initialization), and are then optimized during training with the rest of the parameters, rather than estimated from the statistics of each input mini-batch.
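The data-dependent initialization can be sketched as follows, assuming the goal is zero mean and unit variance per channel on the initial mini-batch. Activations are given as a flat list of per-position channel vectors; the small constant added to the standard deviation and the function names are illustrative choices.

```python
import math

def actnorm_init(rows):
    """rows: activation vectors (one per batch/spatial position), each of length c."""
    n, c = len(rows), len(rows[0])
    sigma, mu = [], []
    for k in range(c):
        col = [r[k] for r in rows]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        # Choose sigma and mu so that sigma * x + mu is standardized on this batch.
        sigma.append(1.0 / (std + 1e-6))
        mu.append(-mean / (std + 1e-6))
    return sigma, mu

def actnorm_forward(rows, sigma, mu, h, w):
    """Apply y = sigma * x + mu per channel; the log-det is h * w * sum(log|sigma|)."""
    y = [[s * v + m for v, s, m in zip(r, sigma, mu)] for r in rows]
    log_det = h * w * sum(math.log(abs(s)) for s in sigma)
    return y, log_det
```

After initialization, `sigma` and `mu` would simply be treated as ordinary parameters and updated by the optimizer.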
1x1 convolution. This is a simple 1x1 convolutional layer. In particular, the cost of computing the determinant of \(W\) can be reduced by writing \(W\) in its LU decomposition, although this increases the number of parameters to be learned.
Affine Coupling Layer. The ACL
was introduced in [2]. The input tensor \(x\) is first split in half along the channel dimension. The second half, \(x_b\), is fed through a small neural network to get parameters \(\sigma\) and \(\mu\), and the corresponding affine transformation is applied to the first half, \(x_a\).
The rescaled \(x_a\) is the actual transformed output of the layer, however \(x_b\) also has to be propagated in order to make the transformation invertible, such that \(\sigma\) and \(\mu\) can also be estimated in the reverse flow.
Finally, note that the previous 1x1 convolution can be seen as a generalized permutation of the input channels, and guarantees that different channel combinations are seen during the split
operation.
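The affine coupling layer described above can be sketched on a flat vector of even length. The function `toy_nn` is a hypothetical stand-in for the small neural network NN; any function of \(x_b\) would preserve invertibility, since \(x_b\) is passed through unchanged.

```python
import math

def toy_nn(x_b):
    """Hypothetical stand-in for NN: returns (log_sigma, mu), each shaped like x_b."""
    log_sigma = [math.tanh(v) for v in x_b]
    mu = [0.5 * v for v in x_b]
    return log_sigma, mu

def coupling_forward(x):
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_sigma, mu = toy_nn(x_b)
    y_a = [math.exp(ls) * a + m for a, ls, m in zip(x_a, log_sigma, mu)]
    # The Jacobian is triangular, so log|det| is the sum of the log-scales.
    return y_a + x_b, sum(log_sigma)

def coupling_inverse(y):
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_sigma, mu = toy_nn(y_b)  # y_b equals x_b, so the same parameters are recovered
    x_a = [(a - m) * math.exp(-ls) for a, ls, m in zip(y_a, log_sigma, mu)]
    return x_a + y_b
```

Note that the inverse never needs to invert `toy_nn` itself, which is why NN can be an arbitrary, non-invertible network.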
These operations are then combined in a multi-scale architecture as described in [3], which in particular relies on a squeezing operation to trade off spatial resolution for number of output channels. Given an input tensor of size \(s \times s \times c\), the squeezing operator takes blocks of size \(2 \times 2 \times c\) and flattens them to size \(1 \times 1 \times 4c\), which can easily be inverted by reshaping. The final pipeline consists of \(L\) levels that operate on different scales: each level is composed of \(K\) flow steps and a final squeezing operation.
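A minimal sketch of the squeezing operation on nested lists of shape `[s][s][c]` is given below. The order in which the four pixels of a block are concatenated is an assumption made for illustration; actual implementations may order the channels differently.

```python
def squeeze(x):
    """x: nested list of shape [s][s][c] with even s; returns shape [s/2][s/2][4c]."""
    s = len(x)
    out = []
    for i in range(0, s, 2):
        row = []
        for j in range(0, s, 2):
            # Concatenate the channel vectors of the four pixels of a 2x2 block.
            row.append(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1])
        out.append(row)
    return out

def unsqueeze(y):
    """Inverse of squeeze: redistribute each 4c-vector over a 2x2 block."""
    half = len(y)
    c = len(y[0][0]) // 4
    x = [[None] * (2 * half) for _ in range(2 * half)]
    for i in range(half):
        for j in range(half):
            v = y[i][j]
            x[2 * i][2 * j] = v[0:c]
            x[2 * i][2 * j + 1] = v[c:2 * c]
            x[2 * i + 1][2 * j] = v[2 * c:3 * c]
            x[2 * i + 1][2 * j + 1] = v[3 * c:4 * c]
    return x
```

Since the operation only rearranges values, it is volume-preserving and contributes nothing to the log-determinant.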
Figure: Overview of the multi-layer GLOW
architecture.
In summary, the main differences with [3] are:
In practice, the authors implement NN
as a convolutional neural network of depth 3 in the ACL
, which means that each flow step contains 4 convolutions in total. They also use \(K = 32\) flow steps in each level. Finally, the number of levels \(L\) is 3 for small-scale experiments (32x32 images) and 6 for large-scale ones (256x256 ImageNet images).
In particular, this means that the model contains many parameters (\(L \times K \times 4\) convolutions), which might be a practical disadvantage compared to other methods that produce samples of similar quality, e.g. GAN
s. However, contrary to these models, GLOW
provides exact likelihood inference.
GLOW
outperforms RealNVP
[3] in terms of data likelihood, as evaluated on standard benchmarks (ImageNet, CIFAR-10, LSUN). In particular, the 1x1 convolution performs better than other, more specific permutation operations, and only introduces a small computational overhead.
Qualitatively, the samples are of great quality and the model seems to scale well with higher resolution. However this greatly increases the memory requirements. Leveraging the model’s invertibility to avoid storing activations during the feed-forward pass such as in [4] could be used to (partially) palliate the problem.
ResNet
) [3] have greatly advanced the state-of-the-art in Deep Learning by making it possible to train much deeper networks via the addition of skip connections. However, in order to compute gradients during the backpropagation pass, all the units' activations have to be stored during the feed-forward pass, leading to high memory requirements for these very deep networks.
Instead, the authors propose a reversible architecture in which activations at one layer can be computed from the ones of the next. Leveraging this invertibility property, they design a more efficient implementation of backpropagation, effectively trading compute power for memory storage.
i-RevNets
[4])
This paper proposes to incorporate ideas from previous reversible architectures, such as NICE
[1], into a standard ResNet
. The resulting model is called RevNet
and is composed of reversible blocks, inspired by additive coupling [1, 2]:
RevNet block | Inverse |
---|---|
$$ \begin{align} \mathbf{input }\ x&\\ x_1, x_2 &= \mbox{split}(x)\\ y_1 &= x_1 + \mathcal{F}(x_2)\\ y_2 &= x_2 + \mathcal{G}(y_1)\\ \mathbf{output}\ y &= (y_1, y_2) \end{align} $$ | $$ \begin{align} \mathbf{input }\ y&\\ y_1, y_2 &= \mbox{split}(y)\\ x_2 &= y_2 - \mathcal{G}(y_1)\\ x_1 &= y_1 - \mathcal{F}(x_2)\\ \mathbf{output}\ x &= (x_1, x_2) \end{align} $$ |
where \(\mathcal F\) and \(\mathcal G\) are residual functions, composed of sequences of convolutions, ReLU
and Batch Normalization layers, analogous to the ones in a standard ResNet
block, although operations in the reversible blocks need to have a stride of 1 to avoid information loss and preserve invertibility. Finally, for the split
operation, the authors consider splitting the input Tensor across the channel dimension as in [1, 2].
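The reversible block and its inverse can be sketched on flat vectors of even length. The residual functions `F` and `G` below are arbitrary placeholders chosen for illustration (in the paper they are stacks of convolutions, ReLU and Batch Normalization); the point is that they do not need to be invertible themselves.

```python
import math

def F(v):
    """Hypothetical residual function; not invertible in general, which is fine."""
    return [math.tanh(u) for u in v]

def G(v):
    """Second hypothetical residual function."""
    return [0.5 * u * u for u in v]

def rev_forward(x):
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    y1 = [a + f for a, f in zip(x1, F(x2))]
    y2 = [b + g for b, g in zip(x2, G(y1))]
    return y1 + y2

def rev_inverse(y):
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    # Undo the two additive updates in reverse order.
    x2 = [b - g for b, g in zip(y2, G(y1))]
    x1 = [a - f for a, f in zip(y1, F(x2))]
    return x1 + x2
```

Calling `rev_inverse(rev_forward(x))` recovers `x` up to floating-point error, which is what allows activations to be recomputed instead of stored.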
Similarly to ResNet
, the final RevNet
architecture is composed of these invertible residual blocks, as well as non-reversible subsampling operations (e.g., pooling) for which activations have to be stored. However the number of such operations is much smaller than the number of residual blocks in a typical ResNet
architecture.
The backpropagation algorithm is derived from the chain rule and is used to compute the total gradients of the loss with respect to the parameters in a neural network: given a loss function \(L\), we want to compute the gradients of \(L\) with respect to the parameters of each layer, indexed by \(n \in [1, N]\), i.e., the quantities \(\overline{\theta_{n}} = \partial L /\ \partial \theta_n\) (where \(\forall x, \bar{x} = \partial L / \partial x\)). We roughly summarize the algorithm in the left column of Table 1: in order to compute the gradients for the \(n\)-th block, backpropagation requires the input and output activations of this block, \(y_{n - 1}\) and \(y_{n}\), which have been stored, and the derivative of the loss with respect to the output, \(\overline{y_{n}}\), which has been computed in the backpropagation iteration of the upper layer; hence the name backpropagation.
Since activations are not stored in RevNet
, the algorithm needs to be slightly modified, which we describe in the right column of Table 1. In summary, we first need to recover the input activations of the RevNet
block using its invertibility. These activations will be propagated to the earlier layers for further backpropagation. Secondly, we need to compute the gradients of the loss with respect to the inputs, i.e. \(\overline{y_{n - 1}} = (\overline{y_{n -1, 1}}, \overline{y_{n - 1, 2}})\), using the fact that:
Once again, this result will be propagated further down the network. Finally, once we have computed both these quantities we can obtain the gradients with respect to the parameters of this block, \(\theta_n\).
 | ResNet Architecture | RevNet Architecture |
---|---|---|
Format of a Block | $$ y_{n} = y_{n - 1} + \mathcal F(y_{n - 1}) $$ | $$ \begin{align} y_{n - 1, 1}, y_{n - 1, 2} &= \mbox{split}(y_{n - 1})\\ y_{n, 1} &= y_{n - 1, 1} + \mathcal{F}(y_{n - 1, 2})\\ y_{n, 2} &= y_{n - 1, 2} + \mathcal{G}(y_{n, 1})\\ y_{n} &= (y_{n, 1}, y_{n, 2}) \end{align} $$ |
Parameters | $$ \begin{align} \theta = \theta_{\mathcal F} \end{align} $$ | $$\begin{align} \theta = (\theta_{\mathcal F}, \theta_{\mathcal G}) \end{align} $$ |
Backpropagation | $$\begin{align} &\mathbf{in:}\ y_{n - 1}, y_{n}, \overline{ y_{n}}\\ \overline{\theta_n} &=\overline{y_n} \frac{\partial y_n}{\partial \theta_n}\\ \overline{y_{n - 1}} &= \overline{y_{n}}\ \frac{\partial y_{n}}{\partial y_{n-1}} \\ &\mathbf{out:}\ \overline{\theta_n}, \overline{y_{n -1}} \end{align}$$ | $$\begin{align} &\mathbf{in:}\ y_{n}, \overline{y_{n }}\\ \texttt{# recover}& \texttt{ input activations} \\ y_{n, 1}, y_{n, 2} &= \mbox{split}(y_{n})\\ y_{n - 1, 2} &= y_{n, 2} - \mathcal{G}(y_{n, 1})\\ y_{n - 1, 1} &= y_{n, 1} - \mathcal{F}(y_{n - 1, 2})\\ \texttt{# compute}& \texttt{ gradients wrt. inputs} \\ \overline{y_{n -1, 1}} &= \overline{y_{n, 1}} + \overline{y_{n,2}} \frac{\partial \mathcal G}{\partial y_{n,1}} \\ \overline{y_{n -1, 2}} &= \overline{y_{n, 1}} \frac{\partial \mathcal F}{\partial y_{n - 1,2}} + \overline{y_{n,2}} \left(1 + \frac{\partial \mathcal G}{\partial y_{n,1}} \frac{\partial \mathcal F}{\partial y_{n - 1,2}} \right) \\ \texttt{# compute}& \texttt{ gradients wrt. parameters} \\ \overline{\theta_{n, \mathcal G}} &= \overline{y_{n, 2}} \frac{\partial \mathcal G}{\partial \theta_{n, \mathcal G}}\\ \overline{\theta_{n, \mathcal F}} &= \overline{y_{n,1}} \frac{\partial \mathcal F}{\partial \theta_{n, \mathcal F}} + \overline{y_{n, 2}} \frac{\partial \mathcal G}{\partial y_{n,1}} \frac{\partial \mathcal F}{\partial \theta_{n, \mathcal F}}\\ &\mathbf{out:}\ \overline{\theta_{n}}, \overline{y_{n -1}}, y_{n - 1} \end{align}$$ |
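
The right column of the table can be sketched for a scalar block, which makes the chain rule easy to check by hand. The parametric forms \(\mathcal F(v) = \tanh(a v)\) and \(\mathcal G(v) = \tanh(b v)\), with scalar parameters `a` and `b`, are assumptions chosen only to keep the example small.

```python
import math

def forward(x1, x2, a, b):
    y1 = x1 + math.tanh(a * x2)   # F(x2) = tanh(a * x2)
    y2 = x2 + math.tanh(b * y1)   # G(y1) = tanh(b * y1)
    return y1, y2

def reversible_backward(y1, y2, dy1, dy2, a, b):
    """From the block output and its gradient, recover the inputs and all gradients."""
    # Recover the input activations using invertibility.
    g = math.tanh(b * y1)
    x2 = y2 - g
    f = math.tanh(a * x2)
    x1 = y1 - f
    # Local derivatives of G (at y1) and F (at x2).
    dG_dy1, dG_db = b * (1 - g * g), y1 * (1 - g * g)
    dF_dx2, dF_da = a * (1 - f * f), x2 * (1 - f * f)
    # Total gradient reaching y1: direct path plus the path through G.
    dy1_total = dy1 + dy2 * dG_dy1
    # Gradients with respect to the inputs.
    dx1 = dy1_total
    dx2 = dy2 + dy1_total * dF_dx2
    # Gradients with respect to the parameters.
    da = dy1_total * dF_da
    db = dy2 * dG_db
    return (x1, x2), (dx1, dx2), (da, db)
```

Expanding `dx2` gives `dy1 * dF_dx2 + dy2 * (1 + dG_dy1 * dF_dx2)`, which matches the expression for \(\overline{y_{n-1, 2}}\) in the table.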
RevNet
s trade off memory requirements, by avoiding storing activations, against computations. Compared to other methods that focus on improving memory requirements in deep networks, RevNet
provides the best trade-off: no activations have to be stored, the spatial complexity is \(O(1).\) For the computation complexity, it is linear in the number of layers, i.e. \(O(L)\).
One disadvantage is that RevNet
s introduce additional parameters, as each block is composed of two residual functions, \(\mathcal F\) and \(\mathcal G\), and their number of channels is also halved as the input is first split into two.
In the experiments section, the authors compare ResNet
architectures to their RevNets
“counterparts”: they build a RevNet
with roughly the same number of parameters by halving the number of residual units and doubling the number of channels.
Interestingly, RevNets
achieve similar performances to their ResNet
counterparts, both in terms of final accuracy and in terms of training dynamics. The authors also analyze the impact of floating-point errors that might occur when reconstructing activations rather than storing them. However, these errors appear to be of small magnitude and do not seem to negatively impact the model.
To summarize, reversible networks seem like a very promising direction to efficiently train very deep networks under memory budget constraints.
In the Conditional Neural Processes (CNP
) setting, we are given \(n\) labeled points, called observations \(O = \{(x_i, y_i)\}_{i=1}^n\), and another set of \(m\) unlabeled targets \(T = \{x_i\}_{i=n + 1}^{n + m}\). We assume
that the outputs are a realization of the following process: Given \(\mathcal P\) a distribution over functions in \(X \rightarrow Y\), sample \(f \sim \mathcal P\), and set \(y_i = f(x_i)\) for all \(x_i\) in the targets set.
The goal is to learn a prediction model for the output samples while trying to obtain the same flexibility as Gaussian processes, rather than using the standard supervised learning paradigm of deep neural networks. The main drawback of standard Gaussian Processes is that they do not scale well (\(O((n + m)^3)\)).
CNP
s give up on the theoretical guarantees of the Gaussian Process framework in exchange for more flexibility. In particular, observations are encoded in a representation of fixed dimension, independent of \(n\).
In other words, we first encode each observation and combine these embeddings via an operator \(\oplus\) to obtain a fixed representation of the observations, \(r\). Then for each new target \(x_i\) we obtain parameters \(\phi_i\) conditioned on \(r\) and \(x_i\), which determine the stochastic process to draw outputs from.
Figure: Conditional Neural Processes Architecture.
In practice, \(\oplus\) is taken to be the mean operation, i.e., \(r\) is the average of \(r_i\)s over all observations. For regression tasks, \(Q\) is a Gaussian distribution parameterized by mean and variance \(\phi = (\mu_i, \sigma_i).\) For classification tasks, \(\phi\) simply encodes a discrete distribution over classes.
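The encode, aggregate and decode pipeline can be sketched as follows for a one-dimensional regression task. The encoder `h` and decoder `g` below are hand-written toy functions standing in for neural networks, and the embedding size of 3 is arbitrary; only the mean aggregation is taken from the description above.

```python
import math

def h(x, y):
    """Hypothetical encoder: maps one observation to a 3-dimensional embedding r_i."""
    return [x * y, math.sin(x), y]

def aggregate(observations):
    """Mean over per-observation embeddings; the output size does not depend on n."""
    embs = [h(x, y) for x, y in observations]
    n = len(embs)
    return [sum(e[k] for e in embs) / n for k in range(len(embs[0]))]

def g(x_target, r):
    """Hypothetical decoder: returns phi = (mu, sigma) of a Gaussian over the output."""
    mu = r[0] * x_target + r[2]
    sigma = 0.1 + math.log1p(math.exp(r[1] * x_target))  # softplus keeps sigma positive
    return mu, sigma

def predict(observations, targets):
    r = aggregate(observations)         # one pass over the n observations
    return [g(x, r) for x in targets]   # one decoder call per target
```

Because the mean is commutative, the representation `r` is invariant to the order of the observations, and prediction costs one encoder call per observation plus one decoder call per target.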
Given a function \(f \sim P\) and observations \(\{(x_i, y_i = f(x_i))\}_{i=1}^n\), we sample \(N\) uniformly in \([1, \dots, n]\) and train the model to predict labels for the whole observation set, conditioned only on the subset \(\{(x_i, y_i)\}_{i=1}^N\), by minimizing the negative log-likelihood:
\[\begin{align} \mathcal{L}(\theta) = - \mathbb{E}_{f \sim P} \mathbb{E}_N \left( \log Q_{\theta}(\{y_i\}_{i=1}^n |\ \{(x_i, y_i)\}_{i=1}^N, \{x_i\}_{i=1}^n) \right) \end{align}\]Note: The sampling step \(f \sim P\) is not clear, but it seems to model the stochasticity in output \(y\) given \(x\).
The model scales with \(O(n + m)\), i.e., linear time, which is much better than the cubic rate of Gaussian Processes.
1D regression: We generate a dataset that consists of functions generated from a GP with an exponential kernel. At every training step we sample a curve from the GP (\(f \sim P\)), select a subset of \(n\) points \((x_i, y_i)\) as observations, and a subset of points \((x_t, y_t)\) as target points. The output distribution on the target labels is parameterized as a Gaussian whose mean and variance are output by \(g\).
Image completion. \(f\) is a function that maps a pixel coordinate to an RGB triple. At each iteration, an image from the training set is chosen, a subset of its pixels is selected, and the aim is to predict the RGB value of the remaining pixels while conditioned on those. In particular, it is interesting to see that CNP
allows for flexible conditioning patterns, contrary to other conditional generative models which are often constrained by either the architecture (e.g. PixelCNN
) or training scheme.
Few shot classification. Consider a dataset with many classes but only few examples per class (e.g. Omniglot
). In this set of experiments, the model is trained to predict labels for samples in a select subset of classes, while being conditioned on all remaining samples in the dataset.
Word2Vec
embedding [1]: More specifically, they tackle the joint task of inferring a transformation from a given (source, target) pair, and applying the same relation to a new source image.
Definition: Informally, a visual analogy, denoted by “a:b :: c:d”, means that the entity a is to b what the entity c is to d. This paper focuses on the problem of generating image d after inferring the relation a:b and given a source image c.
The authors propose to use an encoder-decoder based model for generation and to model analogies as simple transformations of the latent space, for instance addition between vectors, as was the case in word embeddings such as GloVe
[2] or Word2Vec
[1].
Additive objective. Let \(f\) denote the encoder that maps images to the latent space \(\mathbb{R}^K\) and \(g\) the decoder. The first, most straightforward, objective the authors consider is to model analogies as additions in the latent space:
\[\begin{align} \mathcal L_{\mbox{add}}(c, d; a, b) = \|d - g \left(f(c) + f(b) - f(a) \right)\|^2 \end{align}\]One disadvantage of this purely linear transformation is that it cannot learn complex structures such as periodic transformations: for instance, if a:b is a rotation, then after enough repeated applications the embedding of the decoded image should ideally come back to \(f(c)\), which is not possible as we keep adding the non-zero vector \(f(b) - f(a)\). To capture more complex transformations of the latent space, the authors introduce two variants of the previous objective.
Multiplicative objective.
\[\begin{align} \mathcal L_{\mbox{mult}}(c, d; a, b) = \|d - g \left( f(c) + W \odot [f(b) - f(a)] \odot f(c) \right)\|^2 \end{align}\]where \(W \in \mathbb{R}^{K\times K\times K}\), \(K\) is the dimension of the embedding, and the three-way multiplication operator is defined as \(\forall k,\ (A \odot B \odot C)_k = \sum_{i, j} A_{ijk} B_i C_j\)
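The three-way product is easy to misread, so here is a direct transcription of its definition on plain Python lists, together with the latent transformation used inside the multiplicative objective. The function names are illustrative, and `W` is indexed as `W[i][j][k]`.

```python
def three_way(W, b, c):
    """(W ⊙ b ⊙ c)_k = sum over i, j of W[i][j][k] * b[i] * c[j]."""
    K = len(b)
    return [
        sum(W[i][j][k] * b[i] * c[j] for i in range(K) for j in range(K))
        for k in range(K)
    ]

def multiplicative_transform(W, fa, fb, fc):
    """Latent code fed to the decoder: f(c) + W ⊙ [f(b) - f(a)] ⊙ f(c)."""
    diff = [q - p for p, q in zip(fa, fb)]
    t = three_way(W, diff, fc)
    return [u + v for u, v in zip(fc, t)]

# Example with K = 2 and a "diagonal" tensor: W[i][j][k] = 1 only when i == j == k,
# in which case the three-way product reduces to an element-wise product.
K = 2
W = [[[1 if i == j == k else 0 for k in range(K)] for j in range(K)] for i in range(K)]
out = multiplicative_transform(W, [1, 2], [3, 5], [10, 20])
```

Unlike the additive variant, the increment now depends on \(f(c)\) itself, which is what allows the transformation to vary with the position in the latent space. The price is a tensor with \(K^3\) entries.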
Deep objective.
\[\begin{align} \mathcal L_{\mbox{deep}}(c, d; a, b) = \|d - g \left( f(c) + \mbox{MLP}([ f(b) - f(a); f(c)]) \right)\|^2 \end{align}\]where MLP
is a Multi Layer Perceptron. The \([ \cdot; \cdot]\) operator denotes concatenation. This allows for very generic transformations, but can introduce a significant number of parameters for the model to train, depending on the depth of the network.
Figure: Illustration of the network structure for analogy making. The top portion shows the encoder, transformation module, and decoder. The bottom portion illustrates each of the transformation variants. We share weights with all three encoder networks shown on the top left.
While the previous losses acted at the pixel level to match the decoded image with the target image d, the authors introduce a regularization loss that additionally matches the analogy at the feature level with the source analogy a:b. Formally, each objective can be written in the form \(\| d - g(f(c) + T(f(b) - f(a), f(c)))\|^2\) and the corresponding regularization loss term is defined as:
\[\begin{align} R(c, d; a, b) = \|(f(d) - f(c)) - T(f(b) - f(a), f(c))\|^2 \end{align}\]where \(T\) is defined accordingly to match the chosen objective, \(\mathcal L_{\mbox{add}}\), \(\mathcal L_{\mbox{mult}}\) or \(\mathcal L_{\mbox{deep}}\). For instance, \(T: (x, y) \mapsto x\) in the additive variant.
The authors consider another solution to the visual analogy problem, in which they aim to learn a disentangled feature space that can be freely manipulated by smoothly modifying the appropriate latent variables, rather than learning a specific operation.
In that setting, the problem is slightly different, as we require additional supervision to control the different factors of variation. It is denoted as (a, b):s :: c: given two input images a and b, and a boolean mask s on the latent space, retrieve image c which matches the features of a according to the pattern of s, and the features of b on the remaining latent variables.
Let us denote by \(S\) the number of possible axes of variations (e.g., change in illumination, elevation, rotation etc) then \(s \in \{0, 1\}^S\) is a one-hot block vector encoding the current transformation, called the switch vector. The disentangling objective is thus
\[\begin{align} \mathcal{L}_{\mbox{dis}} = \|c - g(f(a) \times s + f(b) \times (1 - s))\|^2 \end{align}\]In other words, the decoder tries to match c by exploiting separate and disentangled information from a and b. Contrary to the previous analogy objectives, only three images are needed, but it also requires extra supervision in the form of the switch vector s, which can be hard to obtain.
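The mixing step inside the decoder input can be sketched as below. It assumes that the latent vector is partitioned into contiguous blocks, one per factor of variation, and that the switch has one bit per block; the block sizes and the name `mix_latents` are hypothetical.

```python
def mix_latents(fa, fb, s_blocks, block_sizes):
    """Take the latent units of f(a) where the switch is 1, and those of f(b) elsewhere."""
    # Expand the per-factor switch into a per-unit binary mask.
    mask = [bit for bit, size in zip(s_blocks, block_sizes) for _ in range(size)]
    return [m * u + (1 - m) * v for m, u, v in zip(mask, fa, fb)]

# Three factors occupying 2, 2 and 1 latent units; only the second is taken from a.
mixed = mix_latents([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [0, 1, 0], [2, 2, 1])
```

The mixed vector would then be passed to the decoder \(g\) and compared with the target image c.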
The authors consider three main experimental settings:
Synthetic experiments on geometric shapes. The dataset consists of 48 × 48 images scaled to [0, 1] with 4 shapes, 8 colors, 4 scales, 5 row and column positions, and 24 rotation angles. No disentangling training was performed in this setting.
Sprites dataset. The dataset consists of 60 × 60 color images of sprites scaled to [0, 1], with 7 attributes and 672 total unique characters. For each character, there are 5 animations each from 4 viewpoints. Each animation has between 6 and 13 frames. The data is split by characters. For the disentanglement experiments, the authors try two methods: dist
, where they only try to separate the pose from identity (i.e., only two axes of variations), and dist+cls
, where they actually consider all available attributes separately.
3D Cars. For each of the 199 car models, the authors generated 64 × 64 color renderings from 24 rotation angles each offset by 15 degrees.
The authors report results in terms of pixel prediction error. Out of the three manipulation methods, \(\mathcal{L}_{\mbox{deep}}\) usually performs best. However, qualitative samples show that \(\mathcal L_{\mbox{add}}\) and \(\mathcal L_{\mbox{mult}}\) both also perform well, although they fail for the case of rotation in the first set of experiments, which justifies the use of more complex training objectives.
Disentanglement methods usually outperform the other baselines, especially in few-shot experiments. In particular the dist+cls
method usually wins by a large margin, which shows that the additional supervision really helps in learning a structured representation. However such supervisory signal sounds hard to obtain in practice in more generic scenarios.
Figure 2: Examples of samples from the three visual analogy datasets considered in experiments
Sadeghi et al., [link]
In this paper, the authors tackle the visual analogy problem in natural images by learning a joint embedding on relation and visual appearances using a Siamese architecture. The main idea is to learn an embedding space where the analogy transformation can be modeled by simple latent vector transformations. The model consists of a Siamese quadruple architecture, where the four heads correspond to the three context images and the candidate image for the visual analogy task, respectively. They consider a restricted set of analogies, in particular those based on attributes or actions of animals, or on geometric viewpoint changes. Given an analogy problem \(I_1 : I_2 :: I_3 : I_4\) with label \(y\) (1 if \(I_4\) fits the analogy, 0 otherwise), the model is trained with the following objective:
\[\begin{align} \mathcal{L}(x_{1, 2}, x_{3, 4}) = y \max (\| x_{1, 2} - x_{3, 4} \| - m_P, 0) + (1 - y) \max (m_N - \| x_{1, 2} - x_{3, 4} \|, 0) \end{align}\]
where \(x_{i, j}\) refers to the embedding for the image pair \(i, j\). Intuitively, the model pushes embeddings with a similar analogy close, and others apart (up to a certain margin \(m_N\)). The \(m_P\) margin is introduced as a heuristic to avoid overfitting: embeddings are only made closer if their distance is above the margin threshold \(m_P\). The pairwise embeddings are obtained by subtracting the individual image embeddings. This implies the assumption that \(x_2 = x_1 + r\), where \(r\) is the transformation from image \(I_1\) to image \(I_2\).
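The double-margin behavior described above can be sketched as follows. The sketch assumes the Euclidean norm and the convention \(x_{i,j} = x_j - x_i\) for pair embeddings; the function names and the margin values used in the example are illustrative.

```python
import math

def pair_embedding(x_i, x_j):
    """Embedding of an image pair, obtained by subtracting individual embeddings."""
    return [q - p for p, q in zip(x_i, x_j)]

def analogy_loss(x1, x2, x3, x4, y, m_p, m_n):
    x12, x34 = pair_embedding(x1, x2), pair_embedding(x3, x4)
    dist = math.sqrt(sum((u - v) ** 2 for u, v in zip(x12, x34)))
    # Positive pairs are pulled together only while their distance exceeds m_p;
    # negative pairs are pushed apart only while their distance is below m_n.
    return y * max(dist - m_p, 0.0) + (1 - y) * max(m_n - dist, 0.0)

# Both pairs encode the same offset (1, 0), so a positive label incurs no loss.
same = analogy_loss([0, 0], [1, 0], [2, 2], [3, 2], y=1, m_p=0.2, m_n=1.0)
```

With the same inputs and `y=0`, the loss equals the full negative margin `m_n`, since the two pair embeddings coincide.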
The authors additionally create a visual analogy dataset. Generating the dataset is rather intuitive as long as we have an attribute-style representation of the domain. Typically, the analogies considered are transformations over properties (object, action, pose) of different categories (dog, cat, chair etc). As negative data points, they consider (i) fully random quadruples, or (ii) valid quadruples where one of \(I_3\) or \(I_4\) is swapped with a random image. The evaluation is done with image retrieval metrics. They also consider generalization scenarios: for instance, removing the analogy \(white \rightarrow black\) during training, but keeping e.g. \(white \rightarrow red\) and \(green \rightarrow black\). There is a lack of details about the missing pairs to really get a full idea of the generalization ability of the model (i.e. if an analogy is missing from the training set, does that mean its reverse is missing too? Does “analogy” refer to the high-level relation, or is it instantiated relative to the category too?).