<h1>A Style-Based Generator Architecture for Generative Adversarial Networks</h1>
<p style="text-align: right"><small>Amélie Royer, 2021-01-14</small></p>
<div class="summary">
In this work, the authors propose <code>VQ-VAE</code>, a variant of the Variational Autoencoder (<code>VAE</code>) framework with a discrete latent space, using ideas from vector quantization. The two main motivations are <b>(i)</b> discrete variables are potentially a better fit for capturing the structure of data such as text, and <b>(ii)</b> discretization can prevent the posterior collapse that leads to latent variables being ignored in <code>VAE</code>s when the decoder is too powerful.
<ul>
<li><span class="pros">Pros (+):</span> Simple method to incorporate a discretized latent space in VAEs.</li>
<li><span class="cons">Cons (-):</span> The paragraph about the learned prior is not very clear and lacks corresponding ablation experiments to evaluate its importance.</li>
</ul>
</div>
<h2 class="section proposed"> Proposed </h2>
<h3 id="discrete-latent-space">Discrete latent space</h3>
<p>The model is based on <code class="language-plaintext highlighter-rouge">VAE</code> <span class="citations">[1]</span>, where an image \(x\) is generated from a random latent variable \(z\) by a <em>decoder</em> \(p(x\ \vert\ z)\). The posterior (<em>encoder</em>) captures the latent variable distribution \(q_{\phi}(z\ \vert\ x)\) and is generally trained to match a certain distribution \(p(z)\), from which \(z\) is sampled at inference time.
Contrary to the standard framework, in this work <em>the latent space is discrete</em>, i.e., \(z \in \mathbb{R}^{K \times D}\), where \(K\) is the number of codes in the latent space and \(D\) their dimensionality. More precisely, the input image is first fed to an encoder \(z_e\), which outputs a continuous vector; this vector is then mapped to one of the latent codes in the discrete space via <em>nearest-neighbor search</em>.</p>
\[\begin{align}
q(z = z_k\ |\ x) = [\!| k = \arg\min_j \| z_e(x) - z_j \|^2 |\!]
\end{align}\]
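This nearest-neighbor code assignment can be sketched in a few lines (NumPy; the codebook size \(K\) and dimension \(D\) below are illustrative, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 8, 4                           # illustrative codebook size and code dimension
codebook = rng.normal(size=(K, D))    # the K discrete latent codes

def quantize(z_e):
    """Map a continuous encoder output z_e (shape (D,)) to its nearest code."""
    dists = np.sum((codebook - z_e) ** 2, axis=1)   # squared L2 to every code
    k = int(np.argmin(dists))                       # deterministic assignment
    return k, codebook[k]

z_e = rng.normal(size=D)
k, z_q = quantize(z_e)
```

Since the argmin is deterministic, the posterior puts all of its mass on a single code for each input.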
<p>Adapting the \(\mathcal{L}_{\text{ELBO}}\) to this formalism, the KL divergence term greatly simplifies and we obtain:</p>
\[\begin{align}
\mathcal{L}_{\text{ELBO}}(x) &= \text{KL}(q(z | x) \| p(z)) - \mathbb{E}_{z \sim q(\cdot | x)}(\log p(x | z))\\
&= - \log(p(z_k)) - \log p(x | z_k)\\
\mbox{where }& z_k = z_q(x) = \arg\min_z \| z_e(x) - z \|^2 \tag{1}
\end{align}\]
<p>In practice, the authors use a <em>uniform categorical prior</em> over the latent codes, meaning the KL divergence is constant and the objective reduces to the reconstruction loss.</p>
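A quick numerical check of this claim: for a deterministic (one-hot) posterior and a uniform prior over \(K\) codes, the KL divergence is exactly \(\log K\), independent of the chosen code:

```python
import numpy as np

K = 8
q = np.zeros(K); q[3] = 1.0          # deterministic (one-hot) posterior q(z|x)
p = np.full(K, 1.0 / K)              # uniform categorical prior
# KL(q || p) = sum_k q_k log(q_k / p_k), with the convention 0 log 0 = 0
kl = sum(qk * np.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)
```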
<div class="figure">
<img src="/images/posts/vqvae.png" />
<p><b>Figure:</b> A figure describing the <code>VQ-VAE</code> (<b>left</b>). Visualization of the embedding space (<b>right</b>). The output of the encoder \(z_e(x)\) is mapped to the nearest point. The gradient (in <span style="color:red">red</span>) pushes the encoder to change its output, which can alter the configuration, and hence the code assignment, in the next forward pass.</p>
</div>
<h3 id="training-objective">Training Objective</h3>
<p>As mentioned previously, the \(\mathcal{L}_{\text{ELBO}}\) objective reduces to the <em>reconstruction loss</em> and is used to learn the encoder and decoder parameters. However, the mapping from \(z_e\) to \(z_q\) is not differentiable in a straightforward manner (Equation <strong>(1)</strong>).
To remedy this, the authors use a <em>straight-through estimator</em>: the gradients with respect to the decoder input \(z_q(x)\) (quantized) are directly copied to the encoder output \(z_e(x)\) (continuous).
However, this means that the latent codes involved in the mapping from \(z_e\) to \(z_q\) do not receive any gradient updates this way.</p>
<p>Hence in order to train the discrete embedding space, the authors propose to use <em>Vector Quantization</em> (<code class="language-plaintext highlighter-rouge">VQ</code>), a dictionary learning technique, which uses mean squared error to make the latent code closer to the continuous vector it was matched to:</p>
\[\begin{align}
\mathcal{L}_{\text{VQ-VAE}}(x) = - \log p(x | z_q(x)) + \| \overline{z_e(x)} - e \|^2 + \beta \| z_e(x) - \bar{e} \|^2
\end{align}\]
<p>where \(x \mapsto \overline{x}\) denotes the <code class="language-plaintext highlighter-rouge">stop gradient</code> operator. The first term is the reconstruction loss stemming from the ELBO, the second term is the vector quantization contribution. Finally, the last term is a <em>commitment loss</em> to control the volume of the latent space by forcing the encoder to “commit” to the latent code it matched with, and not grow its output space unbounded.</p>
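The three terms can be written out as follows (NumPy sketch; `sg` is a placeholder for the stop-gradient operator, which in an autograd framework would be e.g. a `detach` call; here it is the identity and only marks which arguments would receive no gradient):

```python
import numpy as np

def sg(x):
    """Stop-gradient placeholder: identity in the forward pass. In an autograd
    framework this would block gradients from flowing through its argument."""
    return x

def vqvae_loss(recon_nll, z_e, e, beta=0.25):
    """recon_nll: -log p(x | z_q(x));  z_e: encoder output;  e: matched code."""
    vq_term     = np.sum((sg(z_e) - e) ** 2)   # moves the code e toward z_e
    commit_term = np.sum((z_e - sg(e)) ** 2)   # keeps the encoder close to e
    return recon_nll + vq_term + beta * commit_term

loss = vqvae_loss(recon_nll=1.0, z_e=np.array([1.0, 0.0]), e=np.array([0.5, 0.0]))
```

The value of \(\beta\) above is only an example; the two squared terms are numerically equal in the forward pass and differ only in which parameters they update.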
<h3 id="learned-prior">Learned Prior</h3>
<p>A second contribution of this work is to <em>learn the prior distribution</em>. As mentioned, during the training phase, the prior \(p(z)\) is a uniform categorical distribution. After training is done, an <em>autoregressive distribution</em> is fit over the space of latent codes. This is enabled in particular by the fact that the latent space is discrete.</p>
<p><strong>Note:</strong> It is not clear to me if the autoregressive model is trained on latent codes sampled from the prior \(z \sim p(z)\) or from the encoder distribution \(x \sim \mathcal{D};\ z \sim q(z\ \vert\ x)\)</p>
<hr />
<h2 class="section experiments"> Experiments </h2>
<p>The proposed model is mostly compared to the standard continuous <code class="language-plaintext highlighter-rouge">VAE</code> framework. It achieves similar log-likelihood and sample quality, while taking advantage of the discrete latent space.
On ImageNet for instance, they consider \(K = 512\) latent codes of dimension \(1\). The output of the fully-convolutional encoder \(z_e\) is a feature map of size \(32 \times 32 \times 1\), which is then quantized <em>pixel-wise</em>. Interestingly, the model still performs well when using a powerful decoder (here, PixelCNN <span class="citations">[2]</span>), which seems to indicate that it does not suffer from <em>posterior collapse</em> as strongly as the standard continuous <code class="language-plaintext highlighter-rouge">VAE</code>.</p>
<p>A second set of experiments tackles the problem of audio modeling. The performance of the model is once again satisfactory. Furthermore, the discrete latent space does seem to capture relevant characteristics of the input data structure, although this is a purely qualitative observation.</p>
<hr />
<h2 class="section references"> References </h2>
<ul>
<li><span class="citations">[1]</span> Autoencoding Variational Bayes, <i>Kingma and Welling, ICLR 2014</i></li>
<li><span class="citations">[2]</span> Pixel Recurrent Neural Networks, <i>van den Oord et al, arXiv 2016</i></li>
</ul>
<p style="text-align: right"><small>Karras et al.</small></p>

<h1>Domain Adversarial Training of Neural Networks</h1>
<p style="text-align: right"><small>2019-05-23</small></p>
<div class="summary">
In this article, the authors tackle the problem of <b>unsupervised domain adaptation</b>: Given labeled samples from a source distribution \(\mathcal D_S\) and unlabeled samples from a target distribution \(\mathcal D_T\), the goal is to learn a function that solves the task for both the source and target domains. In particular, the proposed model is trained on <b>both</b> source and target data jointly, and aims to directly learn an <b>aligned representation</b> of the domains, while retaining meaningful information with respect to the source labels.
<ul>
<li><span class="pros">Pros (+):</span> Theoretical justification, simple model, easy to implement.</li>
<li><span class="cons">Cons (-):</span> Some training instability in practice.</li>
</ul>
</div>
<h2 class="section theory"> Generalized Bound on the Expected Risk </h2>
<p>Several theoretical studies of the domain adaptation problem have proposed upper bounds of the <em>risk on the target domain</em>, involving the risk on the source domain and a notion of <em>distance</em> between the source and target distribution, \(\mathcal D_S\) and \(\mathcal D_T\). Here, the authors specifically consider the work of <span class="citations">[1]</span>. First, they define the \(\mathcal H\)-divergence:</p>
\[\begin{align}
d_{\mathcal H}(\mathcal D_S, \mathcal D_T) = 2 \sup_{h \in \mathcal H} \left| \Pr_{x\sim\mathcal{D}_S} (h(x) = 1) - \Pr_{x\sim\mathcal{D}_T} (h(x) = 1) \right| \tag{1}
\end{align}\]
<p>where \(\mathcal H\) is a space of (here, binary) hypothesis functions. In the case where \(\mathcal H\) is a <em>symmetric hypothesis class</em> (i.e., \(h \in \mathcal H \implies -h \in \mathcal H\)), one can reduce <strong>(1)</strong> to the empirical form:</p>
\[\begin{align}
d_{\mathcal H}(\mathcal D_S, \mathcal D_T) &\simeq 2 \sup_{h \in \mathcal H} \left|\frac{1}{|D_S|} \sum_{x \in D_S} [\!|h(x) = 1 |\!] - \frac{1}{|D_T|} \sum_{x \in D_T} [\!|h(x) = 1 |\!] \right|\\
&= 2 \sup_{h \in \mathcal H} \left|\frac{1}{|D_S|} \sum_{x \in D_S} 1 - [\!|h(x) = 0 |\!] - \frac{1}{|D_T|} \sum_{x \in D_T} [\!|h(x) = 1 |\!] \right|\\
&= 2 - 2 \min_{h \in \mathcal H} \left|\frac{1}{|D_S|} \sum_{x \in D_S} [\!|h(x) = 0 |\!] + \frac{1}{|D_T|} \sum_{x \in D_T} [\!|h(x) = 1 |\!] \right| \tag{2}
\end{align}\]
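For a small hypothesis class, Equation (2) can be computed exactly. Below, a toy illustration with 1-D threshold classifiers and their negations (a symmetric class; this setup is purely illustrative and not the paper's): well-separated samples give a divergence close to 2, while identical samples give 0.

```python
import numpy as np

def empirical_h_divergence(xs, xt):
    """Empirical H-divergence of Eq. (2) for the symmetric class of 1-D
    threshold classifiers h(x) = [x > c] and their negations h(x) = [x <= c]."""
    cs = np.concatenate([xs, xt, [-np.inf]])     # candidate thresholds
    best = np.inf
    for c in cs:
        for h in (lambda x: x > c, lambda x: x <= c):
            # term inside the min of Eq. (2): source labeled 0, target labeled 1
            risk = np.mean(h(xs) == 0) + np.mean(h(xt) == 1)
            best = min(best, risk)
    return 2 - 2 * best

sep  = empirical_h_divergence(np.array([0., 1., 2.]), np.array([5., 6., 7.]))
same = empirical_h_divergence(np.array([0., 1., 2.]), np.array([0., 1., 2.]))
```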
<p>It is difficult to estimate the minimum over the hypothesis class \(\mathcal H\). Instead, <span class="citations">[1]</span> propose to <em>approximate</em> Equation <strong>(2)</strong> by training a classifier \(\hat{h}\) on samples \(\mathbf{x_S} \in \mathcal{D}_S\) with label 0 and \(\mathbf{x_T} \in \mathcal D_T\) with label 1, and replacing the <code class="language-plaintext highlighter-rouge">minimum</code> term by the empirical risk of \(\hat h\).
Given this definition of the \(\mathcal H\)-divergence, <span class="citations">[1]</span> further derive an <em>upper bound</em> on the risk on the target domain, which in particular involves a trade-off between the empirical risk on the source domain, \(\mathcal{R}_{D_S}(h)\), and the divergence between the source and target distributions, \(d_{\mathcal H}(D_S, D_T)\).</p>
\[\begin{align}
\mathcal{R}_{D_T}(h) \leq \mathcal{R}_{D_S}(h) + d_{\mathcal H}(D_S, D_T) + f\left(\mbox{VC}(\mathcal H), \frac{1}{n}\right) \tag{upper-bound}
\end{align}\]
<p>where \(\mbox{VC}\) designates the <em>Vapnik–Chervonenkis</em> dimension and \(n\) the number of samples.
The rest of the paper directly stems from this intuition: in order to minimize the <em>target risk</em> the proposed <em>Domain Adversarial Neural Network</em> (<code class="language-plaintext highlighter-rouge">DANN</code>) aims to build an “<i>internal representation that contains no discriminative information about the origin of the input (source or target), while preserving a low risk on the source (labeled) examples</i>”.</p>
<hr />
<h2 class="section proposed"> Proposed </h2>
<p>The goal of the model is to learn a classifier \(\phi\), which can be decomposed as \(\phi = G_y \circ G_f\), where \(G_f\) is a feature extractor and \(G_y\) a small classifier on top that outputs the target label. This architecture is trained with a standard classification objective to <em>minimize</em>:</p>
\[\begin{align}
\mathcal{L}_y(\theta_f, \theta_y) = \frac{1}{N_s} \sum_{(x, y) \in D_s} \ell(G_y(G_f(x)), y)
\end{align}\]
<p>Additionally <code class="language-plaintext highlighter-rouge">DANN</code> introduces a <em>domain prediction branch</em>, which is another classifier \(G_d\) on top of the feature representation \(G_f\) and whose goal is to approximate the domain discrepancy as <strong>(2)</strong>, which leads to the following training objective to <em>maximize</em>:</p>
\[\begin{align}
\mathcal{L}_d(\theta_f, \theta_d) = \frac{1}{N_s} \sum_{x \in D_s} \ell(G_d(G_f(x)), s) + \frac{1}{N_t} \sum_{x \in D_t} \ell(G_d(G_f(x)), t)
\end{align}\]
<p>The <strong><em>final objective</em></strong> can thus be written as:</p>
\[\begin{align}
E(\theta_f, \theta_y, \theta_d) &= \mathcal{L}_y(\theta_f, \theta_y) - \lambda \mathcal{L}_d(\theta_f, \theta_d) \tag{1}\\
\theta_f^\ast, \theta_y^\ast &= \arg\min E(\theta_f, \theta_y, \theta_d) \tag{2}\\
\theta_d^\ast &= \arg\max E(\theta_f, \theta_y, \theta_d) \tag{3}
\end{align}\]
<h3 id="gradient-reversal-layer">Gradient Reversal Layer</h3>
<p>Applying standard gradient descent, the <code class="language-plaintext highlighter-rouge">DANN</code> objective leads to the following gradient update rules:</p>
\[\begin{align}
\theta_f &= \theta_f - \alpha \left( \frac{\partial \mathcal{L}_y}{\partial \theta_f} - \lambda \frac{\partial \mathcal{L}_d}{\partial \theta_f} \right)\\
\theta_y &= \theta_y - \alpha \frac{\partial \mathcal{L}_y}{\partial \theta_y} \\
\theta_d &= \theta_d - \alpha \lambda \frac{\partial \mathcal{L}_d}{\partial \theta_d}
\end{align}\]
<p>In the case of neural networks, the gradients of the loss with respect to parameters are obtained with the <em>backpropagation algorithm</em>. The current system equations are very similar to the standard backpropagation scheme, except for the opposite sign in the derivative of \(\mathcal{L}_d\) with respect to \(\theta_d\) and \(\theta_f\). The authors introduce the <strong><em>gradient reversal layer</em></strong> (<code class="language-plaintext highlighter-rouge">GRL</code>) to evaluate both gradients in one standard backpropagation step.</p>
<p>The idea is that the output of \(\theta_f\) is normally propagated to \(\theta_d\), however during backpropagation, its gradient is multiplied by a negative constant:</p>
\[\begin{align}
\frac{\partial \mathcal L_d}{\partial \theta_f} = \frac{\bf{\color{red}{-}} \partial \mathcal L_d}{\partial G_f(x)} \frac{\partial G_f(x)}{\partial \theta_f}
\end{align}\]
<p>In other words, for the update of \(\theta_d\), the gradients of \(\mathcal L_d\) with respect to the activations are computed normally (<em>minimization</em>), but they are then propagated with a minus sign into the feature extraction part of the network (<em>maximization</em>).
Augmented with the gradient reversal layer, the final model is trained by minimizing the sum of losses \(\mathcal L_d + \mathcal L_y\), which corresponds to the optimization problem in <strong>(1-3)</strong>.</p>
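A framework-free sketch of the GRL (the class name and \(\lambda\) value are illustrative): the forward pass is the identity, and the hand-written backward pass multiplies the incoming gradient by \(-\lambda\):

```python
import numpy as np

class GradientReversal:
    """Forward: identity. Backward: multiply the incoming gradient by -lambda,
    so the feature extractor below receives the reversed domain gradient."""
    def __init__(self, lam):
        self.lam = lam
    def forward(self, x):
        return x
    def backward(self, grad_out):
        return -self.lam * grad_out

grl = GradientReversal(lam=0.1)
features = np.array([1.0, -2.0])
identity_out = grl.forward(features)           # unchanged activations
reversed_grad = grl.backward(np.array([0.5, 0.5]))  # sign-flipped, scaled
```

In an autograd framework, the same effect is obtained by defining a custom operator with this forward/backward pair and inserting it between \(G_f\) and \(G_d\).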
<div class="figure">
<img src="/images/posts/dann.png" />
<p><b>Figure:</b> The proposed architecture includes a <span style="color: green">deep feature extractor</span> and a <span style="color: blue">deep label predictor</span>.
Unsupervised domain adaptation is achieved by adding a <span style="color: fuchsia">domain classifier</span> connected to the feature extractor via a gradient reversal layer that multiplies
the gradient by a certain negative constant during backpropagation. </p>
</div>
<hr />
<h2 class="section experiments"> Experiments </h2>
<h3 id="datasets">Datasets</h3>
<p>The paper presents extensive results on the following settings:</p>
<ul>
<li><strong>Toy dataset</strong>: A toy example based on the <em>two half-moons dataset</em>, where the source domain consists of the standard binary classification task on the two half-moons, and the target is the same but with a 30 degree rotation. They compare the <code class="language-plaintext highlighter-rouge">DANN</code> to a <code class="language-plaintext highlighter-rouge">NN</code> model which has the same architecture but without the <code class="language-plaintext highlighter-rouge">GRL</code>; in other words, the baseline directly minimizes both the task and domain classification losses.</li>
<li><strong>Sentiment Analysis</strong>: These experiments are performed on the <em>Amazon reviews dataset</em> which contains product reviews from four different domains (hence 12 different source to target scenarios) which have to be classified as either positive or negative reviews.</li>
<li><strong>Image Classification</strong>: Here the model is evaluated on various image classification tasks, including MNIST \(\rightarrow\) SVHN, or different domain pairs from the OFFICE dataset <span class="citations">[2]</span>.</li>
<li><strong>Person Re-identification</strong>: The task of person identification across various visual domains.</li>
</ul>
<h3 id="validation">Validation</h3>
<p>Setting hyperparameters is a difficult problem, as we cannot directly evaluate the model on the target domain (no labeled data available). Instead of standard cross-validation, the authors use <em>reverse validation</em> based on a technique introduced in <span class="citations">[3]</span>: First, the (labeled) source set \(S\) and (unlabeled) target set \(T\) are each <em>split into a training and validation set</em>, \(S'\) and \(S_V\) (resp. \(T'\) and \(T_V\)).
Using these splits, a model \(\eta\) is trained on \(S'\rightarrow T'\). Then a second model \(\eta_r\) is trained for the <em>reverse direction</em> on the set \(\{ (x, \eta(x)),\ x \in T'\} \rightarrow S'\). This reverse classifier \(\eta_r\) is then finally evaluated on the labeled validation set \(S_V\), and this accuracy is used as a validation score.</p>
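The reverse-validation loop can be sketched as follows (NumPy, with a 1-nearest-neighbor classifier as an illustrative stand-in for the trained models \(\eta\) and \(\eta_r\); the synthetic data and split sizes are arbitrary):

```python
import numpy as np

def nn_predict(train_x, train_y, test_x):
    """1-nearest-neighbor classifier, a toy stand-in for the adapted model."""
    dists = np.abs(test_x[:, None] - train_x[None, :])   # pairwise 1-D distances
    return train_y[np.argmin(dists, axis=1)]

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, 40)
ys = (xs > 0).astype(int)          # labeled source set S
xt = rng.normal(0.5, 1.0, 40)      # unlabeled target set T

s_tr, s_val = xs[:30], xs[30:]     # S' and S_V
y_tr, y_val = ys[:30], ys[30:]
t_tr = xt[:30]                     # T'

eta_t = nn_predict(s_tr, y_tr, t_tr)          # eta pseudo-labels T'
rev_pred = nn_predict(t_tr, eta_t, s_val)     # reverse model eta_r, applied to S_V
reverse_score = float(np.mean(rev_pred == y_val))   # validation score
```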
<h3 id="conclusions">Conclusions</h3>
<p>In general, the proposed method seems to perform very well for aligning the source and target domains in an <em>unsupervised domain adaptation</em> framework. Its main advantage is its <em>simplicity</em>, both in terms of theoretical motivation and implementation. In fact, the <code class="language-plaintext highlighter-rouge">GRL</code> is easily implemented in standard Deep Learning frameworks and can be added to any architectures.</p>
<p>The main shortcomings of the method are that <strong>(i)</strong> all experiments deal with only two domains, and extensions <em>to multiple domains</em> might require some tweaks (e.g., considering the sum of pairwise discrepancies as an upper bound), and <strong>(ii)</strong> in practice, training can become <em>unstable</em> due to the adversarial training scheme; in particular, the experiments section shows that some stability tricks have to be used during training, such as using momentum or slowly increasing the contribution of the domain classification branch.</p>
<div class="figure">
<img src="/images/posts/dann_mnist_embeddings.png" />
<p><b>Figure:</b> <code>t-SNE</code> projections of the embeddings for the <span style="color: blue">source</span> (MNIST) and <span style="color: red">target</span> (SVHN) datasets without (<b>left</b>) and with (<b>right</b>) <code>DANN</code> adaptation. </p>
</div>
<hr />
<h2 class="section followup">Closely related</h2>
<h4 style="margin-bottom: 0px"> Conditional Adversarial Domain Adaptation.</h4>
<p style="text-align: right"><small>Long et al, NeurIPS 2018<a href="https://arxiv.org/abs/1705.10667">[link]</a></small></p>
<blockquote>
<p>In this work, the authors propose a class-conditional extension of Domain Adversarial Networks. More specifically, the domain classifier is conditioned on the input’s class; however, since part of the samples are unlabeled, the conditioning uses the <em>output of the target classifier branch</em> as a proxy for the class information. Instead of simply concatenating the feature input with the condition, the authors consider a <em>multilinear conditioning</em> technique which relies on the <em>cross-covariance</em> operator. Another related paper is <span class="citations">[4]</span>: it also uses the multi-class information of the input domain, although in a simpler way.</p>
</blockquote>
<hr />
<h2 class="section references"> References </h2>
<ul>
<li><span class="citations">[1]</span> Analysis of representations for Domain Adaptation, <i>Ben-David et al, NeurIPS 2006</i></li>
<li><span class="citations">[2]</span> Adapting visual category models to new domains, <i>Saenko et al, ECCV 2010</i></li>
<li><span class="citations">[3]</span> Person re-identification via structured prediction, <i>Zhang and Saligrama, arXiv 2014</i></li>
<li><span class="citations">[4]</span> Multi-Adversarial Domain Adaptation, <i>Pei et al, AAAI 2018</i></li>
</ul>
<p style="text-align: right"><small>Ganin et al.</small></p>

<h1>Deep Image Prior</h1>
<p style="text-align: right"><small>2019-05-14</small></p>
<div class="summary">
Deep Neural Networks are widely used in image generation tasks for capturing a general prior on natural images from a large set of observations. However, this paper shows that the <b>structure of the network itself is able to capture a good prior</b>, at least for local cues of image statistics. More precisely, a randomly initialized convolutional neural network can be a good handcrafted prior for low-level tasks such as denoising and inpainting.
<ul>
<li><span class="pros">Pros (+):</span> Interesting results, with connections to Style Transfer and Network inversion.</li>
<li><span class="cons">Cons (-):</span> Seems like the results might depend a lot on parameter initialization, learning rate etc.</li>
</ul>
</div>
<h2 class="section theory"> Background </h2>
<p>Given a random noise vector \(z\) and conditioned on an image \(x_0\), the goal of <em>conditional image generation</em> is to generate image \(x = f_{\theta}(z; x_0)\) (where the random nature of \(z\) provides a sampling strategy for \(x\)); for instance, the task of generating a high quality image \(x\) from its lower resolution counterpart \(x_0\).</p>
<p>In particular, this encompasses <em>inverse tasks</em> such as denoising, super-resolution and inpainting that acts at the <em>local pixel level</em>. Such tasks can often be phrased with an objective of the following form:</p>
\[\begin{align}
x^{\ast} = \arg\min_x E(x, x_0) + R(x)
\end{align}\]
<p>where \(E\) is a cost function and \(R\) is a <em>prior on the output space</em> acting as a regularizer. \(R\) is often a hand-crafted prior, for instance a smoothness constraint like Total Variation <span class="citations">[1]</span>, or, for more recent techniques, it can be implemented with adversarial training (e.g.,<code class="language-plaintext highlighter-rouge"> GAN</code>s).</p>
<hr />
<h2 class="section proposed">Deep Image Prior</h2>
<p>In this paper, the goal is to replace \(R\) by an <em>implicit prior captured by the neural network</em>, relative to the input noise \(z\). In other words:</p>
\[\begin{align}
R(x) &= 0\ \mbox{if}\ \exists \theta\ \mbox{s.t.}\ x = f_{\theta}(z)\\
R(x) &= + \infty,\ \mbox{otherwise}
\end{align}\]
<p>Which results in the following workflow:</p>
\[\begin{align}
\theta^{\ast} = \arg\min_{\theta} E(f_{\theta}(z; x_0), x_0) \mbox{ and } x^{\ast} = f_{\theta^{\ast}}(z; x_0)
\end{align}\]
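As a toy illustration of this workflow, the sketch below fits a linear generator \(f_\theta(z) = \theta z\) by gradient descent (standing in for the actual U-Net; since a linear map carries no architectural prior, this only demonstrates the optimization loop, not the prior effect):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=16)                  # observed image (flattened), toy data
z = rng.normal(size=32)                   # fixed random input noise
theta = 0.1 * rng.normal(size=(16, 32))   # parameters of a toy linear "network"

lr = 0.005
for _ in range(2000):                     # theta* = argmin_theta E(f_theta(z), x0)
    x = theta @ z                         # f_theta(z)
    grad = np.outer(2 * (x - x0), z)      # d/dtheta of ||theta z - x0||^2
    theta -= lr * grad

x_star = theta @ z                        # restored output x* = f_{theta*}(z)
err = np.sum((x_star - x0) ** 2)
```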
<p>One could wonder if this is a <em>good choice for a prior</em> at all. In fact, \(f\), being instantiated as a neural network, should be powerful enough that any image \(x\) can be generated from \(z\) for a certain choice of parameters \(\theta\), which means the prior should not be constraining.</p>
<p>However, the <strong>structure of the network</strong> itself effectively affects how optimization algorithms such as gradient descent browse the output space.
To quantify this effect, the authors perform a reconstruction experiment (i.e., \(E(x) = \| x - x_0 \|\)) for different choices of the input image \(x_0\): <strong><em>(i)</em></strong> a natural image, <strong><em>(ii)</em></strong> the same image with small perturbations, <strong><em>(iii)</em></strong> with large perturbations, and <strong><em>(iv)</em></strong> white noise, using a <code class="language-plaintext highlighter-rouge">U-Net</code> <span class="citations">[2]</span> inspired architecture. Experimental results show that the network descends faster to natural-looking images (cases <strong><em>(i)</em></strong> and <strong><em>(ii)</em></strong>) than to random noise (cases <strong><em>(iii)</em></strong> and <strong><em>(iv)</em></strong>).</p>
<div class="figure">
<img src="/images/posts/dip_toyexp.png" />
<p><b>Figure:</b> Learning curves for the reconstruction task using: a natural image, the same plus i.i.d. noise, the same but randomly scrambled, and white noise.</p>
</div>
<hr />
<h2 class="section experiments"> Experiments </h2>
<p>The experiments focus on three <em>image analysis tasks</em>:</p>
<ul>
<li><strong>Image denoising</strong> (\(E(x, x_0) = \|x - x_0\|\)), based on the previous observation that the model converges more easily to natural-looking images than noisy ones.</li>
<li><strong>Super Resolution</strong> (\(E(x, x_0) = \| \mbox{downscale}(x) - x_0 \|\)), to upscale the resolution of input image \(x_0\)</li>
<li><strong>Image inpainting</strong> (\(E(x, x_0) = \|(x - x_0) \odot m\|\)) where the input image \(x_0\) is masked by a mask \(m\) and the goal is to recover the missing pixels.</li>
</ul>
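The tasks above differ only in the choice of the energy \(E\); as plain functions (NumPy sketch; the block-averaging downscaler is an illustrative stand-in for the actual downscaling operator):

```python
import numpy as np

def e_denoise(x, x0):
    """Denoising energy: plain squared distance to the noisy observation."""
    return np.sum((x - x0) ** 2)

def e_superres(x, x0, factor=2):
    """Super-resolution energy; block-averaging stands in for downscale(.)."""
    down = x.reshape(-1, factor).mean(axis=1)
    return np.sum((down - x0) ** 2)

def e_inpaint(x, x0, m):
    """Inpainting energy; m is 1 on observed pixels, 0 on the missing region."""
    return np.sum(((x - x0) * m) ** 2)

x = np.array([1.0, 1.0, 2.0, 2.0])
m = np.array([1.0, 0.0, 1.0, 1.0])
```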
<p>The method seems to <em>outperform most non-trained methods</em>, when available (e.g., bicubic upsampling for super-resolution), but is still often outperformed by learning-based ones. The <em>inpainting results</em> are particularly interesting, and I do not know of any other non-trained baselines for this task. The method obviously performs poorly when the obscured region requires highly semantic knowledge, but it seems to perform well on more reasonable benchmarks.</p>
<p>Additionally, the authors test the proposed prior for diagnosing neural networks by <em>generating natural pre-images</em> for neural activations of deep layers. Qualitative images look better than other handcrafted priors (total variation) and are not biased to specific datasets as are trained methods.</p>
<div class="figure">
<img src="/images/posts/dip_full.png" />
<p><b>Figure:</b> Example comparison between the proposed Deep Image Prior and various baselines for the task of Super-Resolution.</p>
</div>
<h2 class="section followup">Closely related (follow-up work)</h2>
<h4 style="margin-bottom: 0px">Deep Decoder: Concise Image Representations from Untrained Non-Convolutional Networks</h4>
<p style="text-align: right"><small>Heckel and Hand, <a href="https://arxiv.org/abs/1810.03982">[link]</a></small></p>
<blockquote>
<p>This paper builds on Deep Image Prior but proposes a much simpler architecture which is <em>under-parametrized</em> and <em>non-convolutional</em>. In particular, there are fewer weight parameters than the dimensionality of the output image (in comparison, DIP was using a <code class="language-plaintext highlighter-rouge">U-Net</code> based architecture). This property implies that <em>the weights of the network can additionally be used as a compressed representation</em> of the image. In order to test for compression, the authors use their architecture to reconstruct an image \(x\) for different compression ratios \(k\) (i.e., the number of network parameters \(N\) is \(k\) times smaller than the output dimension of the images).</p>
</blockquote>
<blockquote>
<p>The deep decoder architecture combines standard blocks, including linear combinations of channels (\(1 \times 1\) convolutions), ReLU, batch normalization and upscaling. Note that since here we have the special case of batch size 1, the batch norm operator essentially normalizes the activations channel-wise. In particular, the paper contains a nice <em>theoretical justification for the denoising case</em>, in which the authors show that the model can only fit a certain amount of noise, which explains why it would converge to more natural-looking images; however, the analysis only applies to small networks (one layer; possibly generalizable to multiple layers without batch norm).</p>
</blockquote>
<hr />
<h2 class="section references"> References </h2>
<ul>
<li><span class="citations">[1]</span> An introduction to Total Variation for Image Analysis, <i>Chambolle et al., Technical Report, 2009</i></li>
<li><span class="citations">[2]</span> U-Net: Convolutional Networks for Biomedical Image Segmentation, <i>Ronneberger et al., MICCAI 2015</i></li>
</ul>
<p style="text-align: right"><small>Ulyanov et al.</small></p>

<h1>A simple Neural Network Module for Relational Reasoning</h1>
<p style="text-align: right"><small>2019-05-14</small></p>
<div class="summary">
The authors propose a <b>relation module</b> to equip <code>CNN</code> architectures with a notion of relational reasoning, particularly useful for tasks such as visual question answering or dynamics understanding.
<ul>
<li><span class="pros">Pros (+):</span> Simple architecture, relies on small and flexible modules.</li>
<li><span class="cons">Cons (-):</span> Still a black-box module, hard to quantify how much "reasoning" happens.</li>
</ul>
</div>
<h2 class="section proposed"> Proposed Model</h2>
<p>The main idea of <em>Relation Networks</em> (<code class="language-plaintext highlighter-rouge">RN</code>) is to constrain the functional form of convolutional neural networks so as to explicitly learn relations between entities, rather than hoping for this property to emerge in the representation during training. Formally, let \(O\) be a set of objects of interest \(O = \{o_1 \dots o_n\}\); the Relation Network is trained to learn a representation that considers all <em>pairwise relations</em> across the objects:</p>
\[\begin{align}
\mbox{RN}(O) &= f_{\phi} \left(\sum_{i, j} g_{\theta}(o_i, o_j) \right)
\end{align}\]
<p>\(f_{\phi}\) and \(g_{\theta}\) are defined as <em>Multi-Layer Perceptrons</em>. By definition, the Relation Network <strong><em>(i)</em></strong> considers all pairs of objects, <strong><em>(ii)</em></strong> operates directly on the set of objects, hence is not constrained to a specific organization of the data, and <strong><em>(iii)</em></strong> is data-efficient in the sense that only one function, \(g_{\theta}\), is learned to capture all the possible relations: \(g\) and \(f\) are typically light modules, and most of the overhead comes from the sum over the \(n^2\) pairwise terms.</p>
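A minimal forward pass of this module (NumPy, with random one-layer networks standing in for \(f_{\phi}\) and \(g_{\theta}\); the sizes are illustrative). Because the objects only enter through a sum over all pairs, the output is invariant to their ordering:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 5, 8, 16                       # 5 objects of dim 8, hidden size 16
Wg = rng.normal(size=(2 * d, h))         # g_theta: one linear layer + ReLU
Wf = rng.normal(size=(h, 1))             # f_phi:  one linear layer

def relation_network(objects):
    """RN(O) = f_phi( sum_{i,j} g_theta(o_i, o_j) ), over all ordered pairs."""
    acc = np.zeros(h)
    for oi in objects:
        for oj in objects:               # n^2 pairwise terms
            acc += np.maximum(np.concatenate([oi, oj]) @ Wg, 0.0)
    return acc @ Wf

objects = rng.normal(size=(n, d))
out = relation_network(objects)
```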
<p>The <em>objects</em> are the basic elements of the relational process we want to model. They are defined with regard to the task at hand, for instance:</p>
<ul>
<li>
<p><strong>Attending to relations between objects in an image</strong>: The image is first processed through a fully-convolutional network. Each of the resulting cells is taken as an object: a feature vector of dimension \(k\), additionally tagged with its position in the feature map.</p>
</li>
<li>
<p><strong>Sequence of images.</strong> In that case, each image is first fed through a feature extractor and the resulting embedding is used as an object. The goal is to model relations between images across the sequence.</p>
</li>
</ul>
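<p>The first case above can be sketched as follows; <code>feature_map_to_objects</code> is a hypothetical helper that tags each cell of an \((H, W, k)\) feature map with its position:</p>

```python
import numpy as np

def feature_map_to_objects(fmap):
    """Turn an (H, W, k) feature map into H * W objects of dimension k + 2,
    each cell tagged with its (row, col) position in the map."""
    H, W, k = fmap.shape
    objects = [np.concatenate([fmap[r, c], [r, c]])
               for r in range(H) for c in range(W)]
    return np.stack(objects)  # shape (H * W, k + 2)

fmap = np.zeros((2, 3, 5))  # toy 2x3 feature map with k = 5 channels
objs = feature_map_to_objects(fmap)
```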
<div class="figure">
<img src="/images/posts/relation_network.png" />
<p><b>Figure:</b> Example of applying the Relation Network to <b>Visual Question Answering</b>. Questions are processed with an <code>LSTM</code> to produce a question embedding, and images are processed with a <code>CNN</code> to produce a set of objects for the <code>RN</code>.</p>
</div>
<hr />
<h2 class="section experiments"> Experiments </h2>
<p>The main evaluation is done on the <code class="language-plaintext highlighter-rouge">CLEVR</code> dataset <span class="citations">[2]</span>. The main message seems to be that the proposed module is very simple and yet often improves the model accuracy when added to various architectures (<code class="language-plaintext highlighter-rouge">CNN</code>, <code class="language-plaintext highlighter-rouge">CNN + LSTM</code> etc.) introduced in <span class="citations">[1]</span>. The main baseline they compare to (and outperform) is <em>Spatial Attention</em> (<code class="language-plaintext highlighter-rouge">SA</code>) which is another simple method to integrate some form of relational reasoning in a neural architecture.</p>
<hr />
<h2 class="section followup">Closely related</h2>
<h4 style="margin-bottom: 0px"> Recurrent Relational Neural Networks <span class="citations">[3]</span></h4>
<p style="text-align: left">Palm et al, <a href="https://arxiv.org/pdf/1711.08028.pdf">[link]</a></p>
<blockquote>
<p>This paper builds on the Relation Network architecture and proposes to explore <em>more complex relational structures</em>, defined as a graph, using a <em>message passing</em> approach: Formally, we are given a graph with vertices \(\mathcal V = \{v_i\}\) and edges \(\mathcal E = \{e_{i, j}\}\). By abuse of notation, \(v_i\) also denotes the embedding of vertex \(i\) (e.g. obtained via a CNN), and \(e_{i, j}\) is 1 when \(i\) and \(j\) are linked, 0 otherwise. To each node we associate a <em>hidden state</em> \(h_i^t\) at iteration \(t\), which is updated via message passing. After a few iterations, the resulting state is passed through an <code class="language-plaintext highlighter-rouge">MLP</code> \(r\) to output the result (either for each node or for the whole graph):</p>
</blockquote>
\[\begin{align}
h_i^0 &= v_i\\
h_i^{t + 1} &= f_{\phi} \left( h_i^t, v_i, \sum_{j} e_{i, j} g_{\theta}(h^t_i, h^t_j) \right)\\
o_i &= r(h_i^T) \mbox{ or } o = r(\sum_i h_i^T)
\end{align}\]
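<p>A minimal sketch of one such message-passing update, with the learned functions \(f_{\phi}\) and \(g_{\theta}\) replaced by toy deterministic stand-ins (illustrative only, not the paper's parameterization):</p>

```python
import numpy as np

def rrnn_step(h, v, e, f, g):
    """One message-passing update:
    h_i <- f(h_i, v_i, sum_j e[i, j] * g(h_i, h_j))."""
    n = len(h)
    return [f(h[i], v[i],
              sum(e[i, j] * g(h[i], h[j]) for j in range(n)))
            for i in range(n)]

# Toy setup: a path graph 0-1-2-3.
rng = np.random.default_rng(1)
v = [rng.standard_normal(3) for _ in range(4)]       # node embeddings
e = np.array([[0, 1, 0, 0], [1, 0, 1, 0],
              [0, 1, 0, 1], [0, 0, 1, 0]])           # path-graph edges
g = lambda hi, hj: hj                                # message = neighbor state
f = lambda hi, vi, m: 0.5 * (vi + m)                 # simple update rule
h = list(v)                                          # h^0 = v
for _ in range(3):                                   # T = 3 iterations
    h = rrnn_step(h, v, e, f, g)
o = sum(h)                                           # graph-level readout
```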
<blockquote>
<p>Comparing to the original Relation Network:</p>
<ul>
<li>Each update rule is a Relation Network that only looks at <em>pairwise relations between linked vertices</em>. The message passing scheme additionally introduces the notion of recurrence, and the dependency on the previous hidden state.</li>
<li>The dependence on \(h_i^t\) could <em>in theory</em> be avoided by adding self-edges from \(v_i\) to \(v_i\), to make it closer to the Relation Network formulation.</li>
<li>Adding \(v_i\) as input of \(f_\phi\) looks like a simple trick to avoid long-term memory problems.</li>
</ul>
</blockquote>
<blockquote>
<p>The <em>experiments</em> essentially compare the proposed <code class="language-plaintext highlighter-rouge">RRNN</code> model to the Relation Network and classical recurrent architectures such as <code class="language-plaintext highlighter-rouge">LSTM</code>. They consider three datasets:</p>
<ul>
<li><strong>bAbI.</strong> NLP question answering task with some reasoning involved. Solves 19.7 (out of 20) tasks on average, while the simple RN solved around 18 of them reliably.</li>
<li><strong>Pretty CLEVR.</strong> A CLEVR-like dataset (only with simple 2D shapes) with questions involving various steps of reasoning, e.g., “what is the shape \(n\) steps from the red circle?”</li>
<li><strong>Sudoku.</strong> The graph contains 81 nodes (one for each cell in the sudoku grid), with edges between cells belonging to the same row, column or block.</li>
</ul>
</blockquote>
<h4 style="margin-bottom: 0px; margin-top:50px"> Multi-Layer Relation Neural Networks <span class="citations">[4]</span></h4>
<p style="text-align: left">Jahrens and Martinetz, <a href="https://arxiv.org/pdf/1811.01838.pdf">[link]</a></p>
<blockquote>
<p>This paper presents a very simple trick to make the Relation Network consider higher-order relations than pairwise ones, while retaining some efficiency. Essentially the model can be written as follows:</p>
</blockquote>
\[\begin{align}
h_{i, j}^0 &= g^0_{\theta}(x_i, x_j) \\
h_{i, j}^t &= g^{t}_{\theta}\left(\sum_k h_{i, k}^{t - 1}, \sum_k h_{j, k}^{t - 1}\right) \\
\mbox{MLRN}(O) &= f_{\phi}\left(\sum_{i, j} h^T_{i, j}\right)
\end{align}\]
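<p>The recursion can be sketched in pure Python as follows, with the learned maps \(g^t_{\theta}\) and \(f_{\phi}\) replaced by simple arithmetic stand-ins on toy scalar objects:</p>

```python
def mlrn(xs, gs, f):
    """Multi-layer RN: at each layer, the pairwise map g^t acts on the
    row-sums of the previous layer's pairwise table h^{t-1}."""
    n = len(xs)
    h = [[gs[0](xs[i], xs[j]) for j in range(n)] for i in range(n)]
    for g in gs[1:]:
        row = [sum(h[i][k] for k in range(n)) for i in range(n)]
        h = [[g(row[i], row[j]) for j in range(n)] for i in range(n)]
    return f(sum(h[i][j] for i in range(n) for j in range(n)))

# Toy scalar objects with g^t(a, b) = a + b and f the identity.
g0 = lambda a, b: a + b
one_layer = mlrn([1.0, 2.0], [g0], lambda s: s)        # pairwise sums only
two_layers = mlrn([1.0, 2.0], [g0, g0], lambda s: s)   # one extra layer
```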
<blockquote>
<p>It is not clear why this model would be equivalent to explicitly considering higher-order relations (as it rather combines pairwise terms for a <em>finite number of steps</em>). According to the experiments, this architecture can indeed be better suited to the studied tasks (e.g. compared to the Relation Network or the Recurrent Relational Network), but it also makes the model even harder to interpret.</p>
</blockquote>
<hr />
<h2 class="section references">References</h2>
<ul>
<li><span class="citations">[1]</span> Inferring and executing programs for visual reasoning, <i>Johnson et al, ICCV 2017</i></li>
<li><span class="citations">[2]</span> CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, <i>Johnson et al, CVPR 2017</i></li>
<li><span class="citations">[3]</span> Recurrent Relational Neural Networks, <i>Palm et al, NeurIPS 2018</i></li>
<li><span class="citations">[4]</span> Multi-Layer Relation Neural Networks, <i>Jahrens and Martinetz, arXiv 2018</i></li>
</ul>Santoro et al.The authors propose a relation module to equip CNN architectures with notion of relational reasoning, particularly useful for tasks such as visual question answering, dynamics understanding etc.Automatically Composing Representation Transformations as a Mean for Generalization2019-05-14T08:59:24+02:002019-05-14T08:59:24+02:00https://ameroyer.github.io/domain%20adaptation/automatically_composing_representation_transformations_as_a_mean_for_generalization<div class="summary">
The authors focus on solving <b>recursive</b> tasks which can be decomposed into a sequence of simpler algorithmic procedures (e.g., arithmetic problems, geometric transformations). The main difficulties of this approach are <b>(i)</b> how to actually decompose the task into simpler blocks and <b>(ii)</b> how to extrapolate to more complex problems from learning on simpler individual tasks.
The authors propose the <b>compositional recursive learner</b> (<code>CRL</code>) to learn at the same time both the structure of the task and its components.
<ul>
<li><span class="pros">Pros (+):</span> This problem is well motivated, and seems a very promising direction for learning domain-agnostic components.</li>
<li><span class="cons">Cons (-):</span> The actual implementation description lacks crucial details and I am not sure how easy it would be to reimplement.</li>
</ul>
</div>
<h2 class="section proposed"> Proposed model</h2>
<h3 id="problem-definition">Problem definition</h3>
<p>A <em>problem</em> \(P_i\) is defined as a transformation \(x_i : t_x \mapsto y_i : t_y\), where \(t_x\) and \(t_y\) are the respective types of \(x\) and \(y\). Since we only consider recursive problems here, \(t_x = t_y\).
We define a <em>family of problems</em> \(\mathcal P\) as a set of composite recursive problems that share regularities. The goal of <code class="language-plaintext highlighter-rouge">CRL</code> is to extrapolate to solve new compositions of these tasks, using knowledge from the limited subset of tasks it has seen during training.</p>
<h3 id="implementation">Implementation</h3>
<p>In essence, the problem can be formulated as a sequential decision-making task via a <em>meta-level MDP</em> (\(\mathcal X\), \(\mathcal F\), \(\mathcal P_{\mbox{meta}}\), \(r\), \(\gamma\)), where \(\mathcal X\) is the <em>set of states</em>, i.e., representations; \(\mathcal F\) is a <em>set of computations</em>, i.e., instances of the transformations we consider, for instance as neural networks, plus an additional special function <code class="language-plaintext highlighter-rouge">HALT</code> that stops the execution; \(\mathcal P_{\mbox{meta}}: (x_t, f_t, x_{t + 1}) \mapsto c \in [0, 1]\) is the <em>transition distribution</em>, which assigns a probability to each possible transition. Finally, \(r\) is the <em>reward function</em> and \(\gamma\) a discount factor.</p>
<p>More specifically, the <code class="language-plaintext highlighter-rouge">CRL</code> is implemented as <em>a set of neural networks</em>, \(f_k \in \mathcal F\), and a <em>controller</em> \(\pi(f\ |\ \mathbf{x}, t_y)\) which selects the best course of action given the current history of representations \(\mathbf{x}\) and target type \(t_y\).
The loss is back-propagated through the functions \(f\), and the controller is trained as a Reinforcement Learning (<code class="language-plaintext highlighter-rouge">RL</code>) agent with a sparse reward (it only knows the final target result).
An additional important training scheme is the use of <em>curriculum learning</em> i.e., start by learning small transformations and then consider more complex compositions, increasing the state space little by little.</p>
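<p>The controller/evaluator interaction described above can be sketched as the following loop; the module set, policy, and halting rule here are toy stand-ins for the learned components:</p>

```python
def crl_forward(x, modules, policy, max_steps=10):
    """CRL execution loop: the controller (policy) inspects the history of
    representations and either picks a module to apply or halts."""
    history = [x]
    for _ in range(max_steps):
        action = policy(history)      # module index, or "HALT"
        if action == "HALT":
            break
        x = modules[action](x)        # evaluator applies the chosen module
        history.append(x)
    return x, history

# Toy instantiation: one "+1" module; the policy halts once x reaches 3.
modules = [lambda v: v + 1]
policy = lambda hist: "HALT" if hist[-1] >= 3 else 0
result, history = crl_forward(0, modules, policy)
```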
<div class="figure">
<img src="/images/posts/crl.png" />
<p><b>Figure:</b> <b>(top-left)</b> <code>CRL</code> is a symbiotic relationship between a
controller and an evaluator: the controller selects a module <code>m</code> given an intermediate representation <code>x</code>, and the
evaluator applies <code>m</code> to <code>x</code> to create a new representation. <b>(bottom-left)</b> <code>CRL</code> dynamically learns the
structure of a program customized for its problem, and this program can be viewed as a finite state machine.
<b>(right)</b> A series of computations in the program is equivalent to a traversal through a Meta-MDP, where modules
can be reused across different stages of computation, allowing for recursive computation.</p>
</div>
<hr />
<h2 class="section experiments"> Experiments</h2>
<h3 id="multilingual-arithmetic">Multilingual Arithmetic</h3>
<p>The learner aims to solve recursive arithmetic expressions across five languages: <code class="language-plaintext highlighter-rouge">English</code>, <code class="language-plaintext highlighter-rouge">Numerals</code>, <code class="language-plaintext highlighter-rouge">PigLatin</code>, <code class="language-plaintext highlighter-rouge">Reversed-English</code>, <code class="language-plaintext highlighter-rouge">Spanish</code>. The input is a tuple \((x^s, t_y)\), where \(x^s\) is the arithmetic expression expressed in source language \(s\), and \(t_y\) is the output language.</p>
<ul>
<li>
<p><strong>Training:</strong> The learner trains on a curriculum of a limited set of 2, 3, 4, 5-length expressions. During training, each source language is seen with four target languages (and one held out for testing) and each target language is seen with four source languages (and one held out for testing).</p>
</li>
<li>
<p><strong>Testing:</strong> The learner is asked to generalize to 5-length expressions (<em>test set</em>) and to extrapolate to 10-length expressions (<em>extrapolation set</em>) with unseen language pairs.</p>
</li>
</ul>
<p>The authors consider two main types of functional units for this task: a <em>reducer</em>, which takes as input a window of three terms in the input expression and outputs a softmax distribution over the vocabulary, and a <em>translator</em>, which applies a function to every element of the input sequence and outputs a sequence of the same size.</p>
<p>The <code class="language-plaintext highlighter-rouge">CRL</code> is compared to a baseline <code class="language-plaintext highlighter-rouge">RNN</code> architecture that directly tries to map a variable-length input sequence to the target output. On the test set, <code class="language-plaintext highlighter-rouge">RNN</code> and <code class="language-plaintext highlighter-rouge">CRL</code> yield similar accuracies, although <code class="language-plaintext highlighter-rouge">CRL</code> usually requires fewer training samples and/or fewer training iterations. On the extrapolation set however, <code class="language-plaintext highlighter-rouge">CRL</code> more clearly outperforms the <code class="language-plaintext highlighter-rouge">RNN</code>.
Interestingly the <code class="language-plaintext highlighter-rouge">CRL</code> results usually have a much bigger <em>variance</em> which would be interesting to qualitatively analyze. Moreover, the use of <em>curriculum learning</em> significantly improves the model performance. Finally, qualitative results show that the reducers and translators are interpretable <em>to some degree</em>: e.g., it is possible to map some of the reducers to specific operations, however due to the unsupervised nature of the task, the mapping is not always straight-forward.</p>
<h3 id="image-transformations">Image Transformations</h3>
<p>This time the functional units are composed of three specialized <em>Spatial Transformer Networks</em> <span class="citations">[1]</span>, which learn rotation, scaling and translation respectively, plus an identity function. Overall this setting does not yield very good quantitative results.
More precisely, one of the main challenges, since we are acting on a visual domain, is to <em>deduce the structure of the task from information which lacks clear structure</em> (pixel matrices). Additionally the fact that all inputs and outputs have the same domain (images) and that only a sparse reward is available make it more difficult for the controller to distinguish between functionalities, i.e., it could collapse to using only one transformer.</p>
<hr />
<h2 class="section references"> References </h2>
<ul>
<li><span class="citations">[1]</span> Spatial Transformer Networks, <i>Jaderberg et al., NeurIPS 2016</i></li>
</ul>Chang et al.The authors focus on solving recursive tasks which can be decomposed into a sequence of simpler algorithmic procedures (e.g., arithmetic problems, geometric transformations). The main difficulties of this approach are (i) how to actually decompose the task into simpler blocks and (ii) how to extrapolate to more complex problems from learning on simpler individual tasks. The authors propose the compositional recursive learner (CRL) to learn at the same time both the structure of the task and its components.Learning a SAT Solver from Single-Bit Supervision2019-05-14T08:59:24+02:002019-05-14T08:59:24+02:00https://ameroyer.github.io/structured%20learning/Learning_a_sat_solver_from_single_bit_supervision<div class="summary">
The goal is to solve <code>SAT</code> problems with weak supervision: In that case, a model is trained only to predict the <b>satisfiability</b> of a formula in conjunctive normal form. As a byproduct, if the formula is satisfiable, an actual satisfying assignment can be worked out from the network's activations in most cases.
<ul>
<li><span class="pros">Pros (+):</span> No need for extensive annotation, seems to extrapolate nicely to harder problems by increasing the number message passing iterations.</li>
<li><span class="cons">Cons (-):</span> Limited practical applicability since it is outperformed by classical <code>SAT</code> solvers.</li>
</ul>
</div>
<h2 class="section proposed"> Model: NeuroSAT</h2>
<h3 id="input">Input</h3>
<p>We consider boolean logic formulas in their <em>conjunctive normal form</em> (CNF), i.e. each input formula is represented as a conjunction (\(\land\)) of <em>clauses</em>, which are themselves disjunctions (\(\lor\)) of literals (positive or negative instances of variables). The goal is to learn a classifier to predict whether such a formula is satisfiable.</p>
<p>A first problem is how to encode the input formula in such a way that it preserves the CNF invariances (invariance to negating a literal in all clauses, invariance to permutations in \(\lor\) and \(\land\) etc.). The authors use a standard <em>undirected graph representation</em> where:</p>
<ul>
<li>\(\mathcal V\): vertices are the literals (positive and negative form of variables, denoted as \(x\) and \(\bar x\)) and the clauses occurring in the input formula</li>
<li>\(\mathcal E\): Edges are added to connect (i) the literals with clauses they appear in and (ii) each literal to its negative counterpart.</li>
</ul>
<p>The graph relations are encoded as an <em>adjacency matrix</em>, \(A\), with as many rows as there are literals and as many columns as there are clauses. Note that this structure <em>does not constrain the vertices ordering</em>, and does not make any preferential treatment between positive or negative literals. However it still has some caveats, which can be avoided by pre-processing the formula. For instance when there are disconnected components in the graph, the averaging decision rule (see next paragraph) can lead to false positives.</p>
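<p>A sketch of this adjacency-matrix construction, assuming the common convention of encoding literals as signed integers (the exact row layout here is one plausible choice, not necessarily the paper's):</p>

```python
import numpy as np

def cnf_to_adjacency(n_vars, clauses):
    """A[l, c] = 1 iff literal l occurs in clause c.  Literal x_v is mapped
    to row v - 1 and its negation to row n_vars + v - 1."""
    A = np.zeros((2 * n_vars, len(clauses)))
    for c, clause in enumerate(clauses):
        for lit in clause:
            row = lit - 1 if lit > 0 else n_vars - lit - 1
            A[row, c] = 1.0
    return A

# (x1 or not x2) and (x2 or x3), with literals as signed integers
A = cnf_to_adjacency(3, [[1, -2], [2, 3]])
```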
<h3 id="message-passing-model">Message-passing model</h3>
<p>In a high-level view, the model keeps track of an embedding for each literal and each clause (\(L^t\) and \(C^t\)), updated via <em>message-passing on the graph</em>, and combined via a Multi Layer Perceptron (<code class="language-plaintext highlighter-rouge">MLP</code>) to output the model prediction of the formula’s satisfiability. The model updates are as follows:</p>
\[\begin{align}
C^t, h_C^t &= \texttt{LSTM}_\texttt{C}(h_C^{t - 1}, A^T \texttt{MLP}_{\texttt{L}}(L^{t - 1}) )\ \ \ \ \ \ \ \ \ \ \ (1)\\
L^t, h_L^t &= \texttt{LSTM}_\texttt{L}(h_L^{t - 1}, \overline{L^{t - 1}}, A\ \texttt{MLP}_{\texttt{C}}(C^{t }) )\ \ \ \ \ \ (2)\\
\end{align}\]
<p>where \(h\) designates a hidden context vector for the LSTMs. The operator \(L \mapsto \bar{L}\) returns \(\overline{L}\), the embedding matrix \(L\) where the row of each literal is swapped with the one corresponding to the literal’s negation.
In other words, in <strong><em>(1)</em></strong> each clause embedding is updated based on the literals that compose it, while in <strong><em>(2)</em></strong> each literal embedding is updated based on the clauses it appears in and its negated counterpart.</p>
<p>After \(T\) iterations of this message-passing scheme, the model computes a <em>logit for the satisfiability classification problem</em>, which is trained via sigmoid cross-entropy:</p>
\[\begin{align}
L^t_{\mbox{vote}} &= \texttt{MLP}_{\texttt{vote}}(L^t)\\
y^t &= \mbox{mean}(L^t_{\mbox{vote}})
\end{align}\]
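<p>A structural sketch of one round of updates (1)-(2), with the LSTMs and MLPs replaced by a plain tanh of the aggregated messages; this preserves only the wiring of the updates, not the actual parameterization:</p>

```python
import numpy as np

def flip(L):
    """The L -> L-bar operator: swap each literal's row with its negation's."""
    n = L.shape[0] // 2
    return np.concatenate([L[n:], L[:n]], axis=0)

def neurosat_step(L, C, A):
    """One round of updates (1)-(2) with the LSTMs/MLPs replaced by tanh of
    the aggregated messages -- only the wiring matches the paper."""
    C = np.tanh(A.T @ L)            # (1) clauses gather their literals
    L = np.tanh(A @ C + flip(L))    # (2) literals gather clauses + negation
    return L, C

def sat_logit(L, w_vote):
    """Mean over per-literal votes (w_vote stands in for MLP_vote)."""
    return float(np.mean(L @ w_vote))
```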
<h3 id="building-the-training-set">Building the training set</h3>
<p>The training set is built such that for any satisfiable training formula \(S\), it also includes an unsatisfiable counterpart \(S'\) which differs from \(S\) <em>only by negating one literal in one clause</em>. These carefully curated samples should constrain the model to pick up substantial characteristics of the formula. In practice, the model is trained on formulas containing up to <em>40 variables</em> and, on average, <em>200 clauses</em>. At this size, the SAT problem can still be solved by state-of-the-art solvers (yielding the supervision required to train the model) but is large enough to prove challenging for Machine Learning models.</p>
<h3 id="inferring-the-sat-assignment">Inferring the SAT assignment</h3>
<p>When a formula is satisfiable, one often also wants to know a <em>valuation</em> (variable assignment) that satisfies it.
Recall that \(L^t_{\mbox{vote}}\) encodes a “vote” for every literal and its negative counterpart. Qualitative experiments show that those scores cannot be directly used for inferring the variable assignment, however they do induce a nice clustering of the variables (once the message passing has converged). Hence an assignment can be found as follows:</p>
<ul>
<li><strong>(1)</strong> Reshape \(L^T_{\mbox{vote}}\) to size \((n, 2)\) where \(n\) is the number of literals.</li>
<li><strong>(2)</strong> Cluster the literals into two clusters with centers \(\Delta_1\) and \(\Delta_2\) using the following criterion:
\begin{align}
|x_i - \Delta_1|^2 + |\overline{x_i} - \Delta_2|^2 \leq |x_i - \Delta_2|^2 + |\overline{x_i} - \Delta_1|^2
\end{align}</li>
<li><strong>(3)</strong> Try the two resulting assignments (set \(\Delta_1\) to true and \(\Delta_2\) to false, or vice-versa) and choose the one that yields satisfiability if any.</li>
</ul>
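<p>The pairing criterion of step <strong>(2)</strong> can be sketched as follows, assuming (for simplicity) scalar votes per literal; in the model the clustered quantities are higher-dimensional:</p>

```python
import numpy as np

def decode_assignment(votes, delta1, delta2):
    """Pair (x_i, xbar_i) goes to (cluster 1, cluster 2) iff
    |x_i - D1|^2 + |xbar_i - D2|^2 <= |x_i - D2|^2 + |xbar_i - D1|^2."""
    assign = []
    for x, xb in votes:
        lhs = (x - delta1) ** 2 + (xb - delta2) ** 2
        rhs = (x - delta2) ** 2 + (xb - delta1) ** 2
        assign.append(bool(lhs <= rhs))   # True: x_i in cluster 1
    return assign

votes = np.array([[1.0, -1.0], [-1.0, 1.0]])  # toy (x_i, xbar_i) votes
assign = decode_assignment(votes, 1.0, -1.0)
```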
<p>In practice, this method retrieves a satisfying assignment for over 70% of the satisfiable test formulas.</p>
<hr />
<h2 class="section experiments"> Experiments </h2>
<p>In practice, the <code class="language-plaintext highlighter-rouge">NeuroSAT</code> model is trained with embeddings of dimension 128 and 26 message passing iterations. The <code class="language-plaintext highlighter-rouge">MLP</code> architectures are very standard: 3 layers followed by ReLU activations. The final model obtains 85% accuracy in predicting a formula’s satisfiability on the test set.</p>
<p>It can also generalize to <em>larger problems</em>, although this requires increasing the number of message-passing iterations. However the classification performance significantly decreases (e.g. 25% for 200 variables) and the number of iterations <em>scales linearly</em> with the number of variables (at least in the paper’s experiments).</p>
<div class="figure">
<img style="width:30%" src="/images/posts/neurosat1.png" /> <img style="width:69%" src="/images/posts/neurosat2.png" />
<p><b>Figure:</b> <b>(left)</b> Success rate of a <code>NeuroSAT</code> model trained on 40 variables for test set involving formulas with up to 200 variables, as a function of the number of message-passing iterations. <b>(right)</b> The sequence of literal votes across message-passing iterations on a satisfiable formula. The vote matrix is reshaped such that each row contains the votes for a literal and its negated counterpart. For several iterations, most literals vote unsat with low confidence (<span style="color: lightblue">light blue</span>). After a few iterations, there is a phase transition and all literals vote sat with very high confidence (<span style="color: red">dark red</span>), until convergence. </p>
</div>
<p>Interestingly, the model generalizes well to other classes of problems that were <em>reduced to <code class="language-plaintext highlighter-rouge">SAT</code></em> (using <code class="language-plaintext highlighter-rouge">SAT</code>’s NP-completeness), although they have a different structure than the random formulas generated for training, which seems to show that the model does learn some general characteristics of boolean formulas.</p>
<p>To summarize, the model takes advantage of the structure of Boolean formulas, and is able to predict whether an input formula is satisfiable or not with high accuracy. Moreover, even though trained only with this weak supervisory signal, it can work out a valid assignment most of the time. However it is still subpar <em>compared to standard SAT solvers</em>, which makes its applicability limited.</p>Selsam et al.The goal is to solve SAT problems with weak supervision: In that case, a model is trained only to predict the satisfiability of a formula in conjunctive normal form. As a byproduct, if the formula is satisfiable, an actual satisfying assignment can be worked out from the network's activations in most cases.Glow: Generative Flow with Invertible 1×1 Convolutions2019-05-07T14:59:24+02:002019-05-07T14:59:24+02:00https://ameroyer.github.io/generative%20models/glow_generative_flow_with_invertible_1x1_convolution<div class="summary">
Invertible flow based generative models such as <span class="citations">[2, 3]</span> have several advantages including exact likelihood inference process (unlike <code>VAE</code>s or <code>GAN</code>s) and easily parallelizable training and inference (unlike the sequential generative process in auto-regressive models). This paper proposes a new, more flexible, form of <b>invertible flow</b> for generative models, which builds on <span class="citations">[3]</span>.
<ul>
<li><span class="pros">Pros (+):</span> Very clear presentation, promising results both quantitative and qualitative.</li>
<li><span class="cons">Cons (-):</span> One of the disadvantages of the models seem to be a large number of parameters, it would be interesting to have a more detailed report on training time. Also a comparison to <span class="citations">[5]</span> (a variant of <code>PixelCNN</code> that allows for faster parallelized sample generation) would be nice.</li>
</ul>
</div>
<h2 class="section theory"> Invertible Flow-Based Generative Models </h2>
<p>Given input data \(x\), invertible flow-based generative models are built as two-step processes that generate data from an intermediate latent representation \(z\):</p>
\[\begin{align}
z &\sim p_{\theta}(z)\\
x &= g_\theta^{-1}(z)
\end{align}\]
<p>where \(g_\theta\) is an <em>invertible</em> function, i.e., a bijection, \(g_\theta: \mathcal X \rightarrow \mathcal Z\), which acts as an encoder from the input data to the latent space; its inverse \(g_\theta^{-1}\) generates data from latent samples.
\(g\) is usually built as a sequence of smaller invertible functions \(g = g_n \circ \dots \circ g_1\). Such a sequence is also called a <em>normalizing flow</em> <span class="citations">[1]</span>. Under this construction, the <em>change of variables formula</em> applied to \(z = g(x)\) gives the following equivalence between the input and latent densities:</p>
\[\begin{align}
\log p(x) &= \log p(z) + \log\ \left| \det \left( \frac{d z}{d x} \right)\right|\\
&= \log p(z) + \sum_{i=1}^n \log\ \left| \det \left( \frac{d\, g_{\leq i}(x)}{d\, g_{\leq i - 1}(x)} \right)\right|
\end{align}\]
<p>where \(\forall i \in [1; n],\ g_{\leq i} = g_i \circ \dots \circ g_1\); in particular, this means \(g_{\leq n}(x) = z\) and \(g_{\leq 0}(x) = x\). \(p_\theta(z)\) is usually chosen as a simple density such as a unit Gaussian distribution, \(p_\theta(z) = \mathcal N(z; 0, \mathbf{I})\).
In order to efficiently estimate the likelihood, the functions \(g_1, \dots g_n\) are usually chosen such that the <em>log-determinant of the Jacobian</em>, \(\log\ \left\vert \det \left( \frac{d\, g_{\leq i}(x)}{d\, g_{\leq i - 1}(x)} \right) \right\vert\), is easily computed, for instance by choosing transformations whose Jacobian is a triangular matrix.</p>
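<p>The likelihood computation can be sketched as follows, with each flow step returning its output together with its log-det-Jacobian; a toy elementwise-scaling step stands in for the real layers:</p>

```python
import numpy as np

def flow_log_likelihood(x, flows):
    """log p(x) = log N(z; 0, I) + sum_i log|det J_i|, where each flow step
    returns its output together with its log-det-Jacobian."""
    z, total_logdet = x, 0.0
    for g in flows:
        z, logdet = g(z)
        total_logdet += logdet
    log_pz = -0.5 * (z.size * np.log(2 * np.pi) + np.sum(z ** 2))
    return log_pz + total_logdet

# Toy invertible step: elementwise scaling z = s * x, log|det| = d * log|s|
scale = lambda s: (lambda x: (s * x, x.size * np.log(abs(s))))
ll = flow_log_likelihood(np.zeros(4), [scale(2.0)])
```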
<hr />
<h2 class="section proposed"> Proposed Flow Construction: GLOW</h2>
<h3 id="flow-step">Flow step</h3>
<p>Each flow step function \(g_i\) is a sequence of three operations as follows. Given an input tensor of dimensions \(h \times w \times c\):</p>
<table>
<thead>
<tr>
<th>Step Description</th>
<th>Functional Form of flow \(g_i\)</th>
<th>Inverse Function of the flow, \(g_i^{-1}\)</th>
<th>Log-determinant Expression</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>ActNorm</strong> <br /> \(\sigma: [c,]\) <br /> \(\mu: [c,]\)</td>
<td>\(y = \sigma\odot x + \mu\)</td>
<td>\(x = (y - \mu) / \sigma\)</td>
<td>\(hw \cdot \mbox{sum}(\log \vert\sigma\vert)\)</td>
</tr>
<tr>
<td><strong>1x1 conv</strong> <br /> \(W: [c,c]\)</td>
<td>\(y = Wx\)</td>
<td>\(x = W^{-1}y\)</td>
<td>\(h w \log \vert \det (W) \vert\)</td>
</tr>
<tr>
<td><strong>Affine Coupling</strong> <br /> <strong>(ACL)</strong> [2]</td>
<td>\(x_a,\ x_b = \mbox{split}(x)\) <br /> \((\log \sigma, \mu) = \mbox{NN}(x_b)\) <br /> \(y_a = \sigma \odot x_a + \mu\) <br /> \(y = \mbox{concat}(y_a, x_b)\)</td>
<td>\(y_a,\ y_b = \mbox{split}(y)\) <br /> \((\log \sigma, \mu) = \mbox{NN}(y_b)\) <br /> \(x_a = (y_a - \mu) / \sigma\) <br /> \(x = \mbox{concat}(x_a, y_b)\)</td>
<td>\(\mbox{sum} (\log \vert\sigma\vert)\)</td>
</tr>
</tbody>
</table>
<p><br /></p>
<ul>
<li>
<p><strong>ActNorm.</strong> The activation normalization layer is introduced as a replacement for <em>Batch Normalization</em> (<code class="language-plaintext highlighter-rouge">BN</code>) to avoid degraded performance with small mini-batch sizes, e.g. when training with batch size 1. This layer has the same form as <code class="language-plaintext highlighter-rouge">BN</code>, however the <em>bias, \(\mu\), and scale, \(\sigma\), are data-independent variables</em>: They are initialized based on an initial mini-batch of data (data-dependent initialization), but are optimized during training with the rest of the parameters, rather than estimated from the input minibatch statistics.</p>
</li>
<li>
<p><strong>1x1 convolution.</strong> This is a simple 1x1 convolutional layer. In particular, the cost of computing the determinant of \(W\) can be reduced by writing \(W\) in its <em>LU decomposition</em>, although this increases the number of parameters to be learned.</p>
</li>
<li>
<p><strong>Affine Coupling Layer.</strong> The <code class="language-plaintext highlighter-rouge">ACL</code> was introduced in <span class="citations">[2]</span>. The input tensor \(x\) is first split in half along the channel dimension. The second half, \(x_b\), is fed through a small neural network to get parameters \(\sigma\) and \(\mu\), and the corresponding affine transformation is applied to the first half, \(x_a\).
The rescaled \(x_a\) is the actual transformed output of the layer, however \(x_b\) also has to be propagated in order to make the transformation invertible, such that \(\sigma\) and \(\mu\) can also be estimated in the reverse flow.
Finally, note that the previous 1x1 convolution can be seen as a generalized <em>permutation of the input channels</em>, and guarantees that different channels combinations are seen during the <code class="language-plaintext highlighter-rouge">split</code> operation.</p>
</li>
</ul>
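<p>The forward and inverse passes of the Affine Coupling Layer from the table can be sketched as follows; <code>nn</code> is an arbitrary deterministic stand-in for the small network producing \((\log \sigma, \mu)\):</p>

```python
import numpy as np

def acl_forward(x, nn):
    """y_a = sigma * x_a + mu with (log sigma, mu) = nn(x_b); x_b passes
    through unchanged so the transform stays invertible."""
    xa, xb = np.split(x, 2, axis=-1)
    log_s, mu = nn(xb)
    ya = np.exp(log_s) * xa + mu
    return np.concatenate([ya, xb], axis=-1), float(np.sum(log_s))

def acl_inverse(y, nn):
    ya, yb = np.split(y, 2, axis=-1)
    log_s, mu = nn(yb)               # same params, recomputed from y_b
    xa = (ya - mu) * np.exp(-log_s)
    return np.concatenate([xa, yb], axis=-1)

# Any deterministic nn works; here a toy closed-form stand-in.
nn = lambda xb: (np.tanh(xb), xb ** 2)
x = np.random.default_rng(0).standard_normal(6)
y, logdet = acl_forward(x, nn)
```

<p>Note that <code>nn</code> itself never needs to be inverted, which is why it can be an arbitrary neural network.</p>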
<h3 id="general-pipeline">General Pipeline</h3>
<p>These operations are then combined in a <em>multi-scale architecture</em> as described in <span class="citations">[3]</span>, which in particular relies on a <em>squeezing</em> operation to trade off spatial resolution for number of output channels.
Given an input tensor of size \(s \times s \times c\), the squeezing operator takes blocks of size \(2 \times 2 \times c\) and flattens them to size \(1 \times 1 \times 4c\), which can easily be inverted by reshaping.
The final pipeline consists of \(L\) levels that operate on different scales: each level is composed of \(K\) flow steps and a final squeezing operation.</p>
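<p>The squeezing operator and its inverse can be sketched with reshapes and transposes; this is one plausible implementation of the 2x2-block flattening described above:</p>

```python
import numpy as np

def squeeze(x):
    """(s1, s2, c) -> (s1/2, s2/2, 4c): flatten each 2x2 spatial block."""
    s1, s2, c = x.shape
    x = x.reshape(s1 // 2, 2, s2 // 2, 2, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(s1 // 2, s2 // 2, 4 * c)

def unsqueeze(y):
    """Exact inverse of squeeze, by reshaping back."""
    t1, t2, c4 = y.shape
    c = c4 // 4
    y = y.reshape(t1, t2, 2, 2, c).transpose(0, 2, 1, 3, 4)
    return y.reshape(2 * t1, 2 * t2, c)

x = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
```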
<div class="figure">
<img src="/images/posts/glow.png" />
<p><b>Figure:</b> Overview of the multi-layer <code>GLOW</code> architecture.</p>
</div>
<p>In summary, the <em>main differences</em> with <span class="citations">[3]</span> are:</p>
<ul>
<li>Batch Normalization is replaced with Activation Normalization</li>
<li>1x1 convolutions are considered as a more generic operation to replace permutations</li>
<li>Only channel-wise splitting is considered in the Affine Coupling Layer, while <span class="citations">[3]</span> also considered a binary spatial checkerboard pattern to split the input tensor in two.</li>
</ul>
<hr />
<h2 class="section experiments"> Experiments </h2>
<h3 id="implementation">Implementation</h3>
<p>In practice, the authors implement <code class="language-plaintext highlighter-rouge">NN</code> as a convolutional neural network of depth 3 in the <code class="language-plaintext highlighter-rouge">ACL</code>; which means that each flow step contains 4 convolutions in total. They also use \(K = 32\) flow steps in each level. Finally the number of levels \(L\) is 3 for small-scale experiments (32x32 images) and 6 for large scale (256x256 ImageNet images).
In particular this means that the model contains <em>a lot of parameters</em> (\(L \times K \times 4\) convolutions) which might be a practical disadvantage compared to other method that produce samples of similar quality, e.g. <code class="language-plaintext highlighter-rouge">GAN</code>s. However, contrary to these models, <code class="language-plaintext highlighter-rouge">GLOW</code> provides <em>exact likelihood inference</em>.</p>
<h3 id="results">Results</h3>
<p><code class="language-plaintext highlighter-rouge">GLOW</code> outperforms <code class="language-plaintext highlighter-rouge">RealNVP</code> <span class="citations">[3]</span> in terms of data likelihood, as evaluated on standard benchmarks (ImageNet, CIFAR-10, LSUN). In particular, the 1x1 convolutions perform better than other, more specific, permutation operations, and only introduce a small computational overhead.</p>
<p>Qualitatively, the samples are of great quality and the model seems to scale well to higher resolutions, although this greatly increases the <em>memory requirements</em>. Leveraging the model’s invertibility to avoid storing activations during the feed-forward pass, as in <span class="citations">[4]</span>, could (at least partially) mitigate this problem.</p>
<hr />
<h2 class="section references"> References </h2>
<ul>
<li><span class="citations">[1]</span> Variational inference with normalizing flows, <i>Rezende and Mohamed, ICML 2015</i></li>
<li><span class="citations">[2]</span> NICE: Non-linear Independent Components Estimation, <i>Dinh et al., ICLR 2015</i></li>
<li><span class="citations">[3]</span> Density estimation using Real NVP, <i>Dinh et al., ICLR 2017</i></li>
<li><span class="citations">[4]</span> The Reversible Residual Network: Backpropagation Without Storing Activations, <i>Gomez et al., NeurIPS 2017 </i></li>
<li><span class="citations">[5]</span> Parallel Multiscale Autoregressive Density Estimation, <i>S.Reed et al, ICML 2017</i></li>
</ul>D. Kingma and P. DhariwalInvertible flow based generative models such as [2, 3] have several advantages including exact likelihood inference process (unlike VAEs or GANs) and easily parallelizable training and inference (unlike the sequential generative process in auto-regressive models). This paper proposes a new, more flexible, form of invertible flow for generative models, which builds on [3].The Reversible Residual Network: Backpropagation Without Storing Activations2019-05-07T08:59:24+02:002019-05-07T08:59:24+02:00https://ameroyer.github.io/architectures/the_reversible_residual_network<div class="summary">
Residual Networks (<code>ResNet</code>) <span class="citations">[3]</span> have greatly advanced the state-of-the-art in Deep Learning by making it possible to train much deeper networks via the addition of skip connections. However, in order to compute gradients during the backpropagation pass, all the units' activations have to be stored during the feed-forward pass, leading to high memory requirements for these very deep networks.
Instead, the authors propose a <b>reversible architecture</b> in which activations at one layer can be computed from the ones of the next. Leveraging this invertibility property, they design a more efficient implementation of backpropagation, effectively trading compute power for memory storage.
<ul>
<li><span class="pros">Pros (+):</span> The change does not negatively impact model accuracy (for equivalent number of model parameters) and it only requires a small change in the backpropagation algorithm.</li>
<li><span class="cons">Cons (-):</span> Increased number of parameters, not fully reversible (see <code>i-RevNets</code> <span class="citations">[4]</span>)</li>
</ul>
</div>
<h2 class="section proposed"> Proposed Architecture</h2>
<h3 id="revnet">RevNet</h3>
<p>This paper proposes to incorporate ideas from previous reversible architectures, such as <code class="language-plaintext highlighter-rouge">NICE</code> <span class="citations">[1]</span>, into a standard <code class="language-plaintext highlighter-rouge">ResNet</code>. The resulting model is called <code class="language-plaintext highlighter-rouge">RevNet</code> and is composed of reversible blocks, inspired by <em>additive coupling</em> <span class="citations">[1, 2]</span>:</p>
<center>
<table>
<tr>
<th> ResNet block </th>
<th> Inverse </th>
</tr>
<tr>
<td> $$ \begin{align}
\mathbf{input }\ x&\\
x_1, x_2 &= \mbox{split}(x)\\
y_1 &= x_1 + \mathcal{F}(x_2)\\
y_2 &= x_2 + \mathcal{G}(y_1)\\
\mathbf{output}\ y &= (y_1, y_2)
\end{align} $$ </td>
<td> $$ \begin{align}
\mathbf{input }\ y&\\
y_1, y_2 &= \mbox{split}(y)\\
x_2 &= y_2 - \mathcal{G}(y_1)\\
x_1 &= y_1 - \mathcal{F}(x_2)\\
\mathbf{output}\ x &= (x_1, x_2)
\end{align} $$</td>
</tr>
</table>
</center>
<p><br /></p>
<p>where \(\mathcal F\) and \(\mathcal G\) are residual functions, composed of sequences of convolutions, <code class="language-plaintext highlighter-rouge">ReLU</code> and Batch Normalization layers, analogous to the ones in a standard <code class="language-plaintext highlighter-rouge">ResNet</code> block, although operations in the reversible blocks need to have a stride of 1 to <em>avoid information loss</em> and preserve invertibility. Finally, for the <code class="language-plaintext highlighter-rouge">split</code> operation, the authors consider splitting the input Tensor across the channel dimension as in <span class="citations">[1, 2]</span>.</p>
<p>Similarly to <code class="language-plaintext highlighter-rouge">ResNet</code>, the final <code class="language-plaintext highlighter-rouge">RevNet</code> architecture is composed of these invertible residual blocks, as well as non-reversible subsampling operations (e.g., pooling) for which activations have to be stored. However the number of such operations is much smaller than the number of residual blocks in a typical <code class="language-plaintext highlighter-rouge">ResNet</code> architecture.</p>
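<p>The forward and inverse computations above can be sketched in a few lines of numpy (\(\mathcal F\) and \(\mathcal G\) are arbitrary toy nonlinear maps here; this is an illustrative sketch, not the paper's code):</p>

```python
import numpy as np

# Minimal sketch of a reversible (additive coupling) block.
rng = np.random.default_rng(0)
Wf, Wg = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
F = lambda v: np.tanh(Wf @ v)   # toy residual functions
G = lambda v: np.tanh(Wg @ v)

def forward(x):
    x1, x2 = np.split(x, 2)      # channel-wise split
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return np.concatenate([y1, y2])

def inverse(y):
    y1, y2 = np.split(y, 2)
    x2 = y2 - G(y1)              # undo the second addition first
    x1 = y1 - F(x2)
    return np.concatenate([x1, x2])

x = rng.normal(size=16)
assert np.allclose(inverse(forward(x)), x)   # activations are recoverable
```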
<h3 id="backpropagation">Backpropagation</h3>
<p>The backpropagation algorithm is derived from the chain rule and is used to compute the total gradients of the loss with respect to the parameters in a neural network: given a loss function \(L\), we want to compute <em>the gradients of \(L\) with respect to the parameters of each layer</em>, indexed by \(n \in [1, N]\), i.e., the quantities \(\overline{\theta_{n}} = \partial L /\ \partial \theta_n\) (where \(\forall x, \bar{x} = \partial L / \partial x\)).
We roughly summarize the algorithm in the left column of <strong>Table 1</strong>: In order to compute the gradients for the \(n\)-th block, backpropagation requires the input and output activations of this block, \(y_{n - 1}\) and \(y_{n}\), which have been stored during the forward pass, and the derivative of the loss with respect to the output, \(\overline{y_{n}}\), which has been computed in the backpropagation step of the layer above; hence the name <em>backpropagation</em>.</p>
<p>Since <em>activations are not stored in <code class="language-plaintext highlighter-rouge">RevNet</code></em>, the algorithm needs to be slightly modified, which we describe in the right column of <strong>Table 1</strong>. In summary, we first need to recover the input activations of the <code class="language-plaintext highlighter-rouge">RevNet</code> block using its invertibility. These activations will be propagated to the earlier layers for further backpropagation. Secondly, we need to compute the gradients of the loss with respect to the inputs, i.e. \(\overline{y_{n - 1}} = (\overline{y_{n -1, 1}}, \overline{y_{n - 1, 2}})\), using the fact that:</p>
\[\begin{align}
\overline{y_{n - 1, i}} = \overline{y_{n, 1}}\ \frac{\partial y_{n, 1}}{\partial y_{n - 1, i}} + \overline{y_{n, 2}}\ \frac{\partial y_{n, 2}}{\partial y_{n - 1, i}}
\end{align}\]
<p>Once again, this result will be propagated further down the network.
Finally, once we have computed both these quantities we can obtain the gradients with respect to the parameters of this block, \(\theta_n\).</p>
<center>
<table>
<tr>
<th> </th>
<th> <b>ResNet Architecture</b></th>
<th> <b>RevNet Architecture</b></th>
</tr>
<tr>
<td><b>Format of a Block</b></td>
<td> $$
y_{n} = y_{n - 1} + \mathcal F(y_{n - 1})
$$</td>
<td>$$
\begin{align}
y_{n - 1, 1}, y_{n - 1, 2} &= \mbox{split}(y_{n - 1})\\
y_{n, 1} &= y_{n - 1, 1} + \mathcal{F}(y_{n - 1, 2})\\
y_{n, 2} &= y_{n - 1, 2} + \mathcal{G}(y_{n, 1})\\
y_{n} &= (y_{n, 1}, y_{n, 2})
\end{align}
$$</td>
</tr>
<tr>
<td><b>Parameters</b></td>
<td>$$
\begin{align}
\theta = \theta_{\mathcal F}
\end{align}
$$</td>
<td> $$\begin{align}
\theta = (\theta_{\mathcal F}, \theta_{\mathcal G})
\end{align}
$$</td>
</tr>
<tr>
<td><b>Backpropagation</b></td>
<td>$$\begin{align}
&\mathbf{in:}\ y_{n - 1}, y_{n}, \overline{ y_{n}}\\
\overline{\theta_n} &=\overline{y_n} \frac{\partial y_n}{\partial \theta_n}\\
\overline{y_{n - 1}} &= \overline{y_{n}}\ \frac{\partial y_{n}}{\partial y_{n-1}} \\
&\mathbf{out:}\ \overline{\theta_n}, \overline{y_{n -1}}
\end{align}$$</td>
<td>$$\begin{align}
&\mathbf{in:}\ y_{n}, \overline{y_{n }}\\
\texttt{# recover}& \texttt{ input activations} \\
y_{n, 1}, y_{n, 2} &= \mbox{split}(y_{n})\\
y_{n - 1, 2} &= y_{n, 2} - \mathcal{G}(y_{n, 1})\\
y_{n - 1, 1} &= y_{n, 1} - \mathcal{F}(y_{n - 1, 2})\\
\texttt{# compute}& \texttt{ gradients wrt. inputs} \\
\overline{y_{n -1, 1}} &= \overline{y_{n, 1}} + \overline{y_{n,2}} \frac{\partial \mathcal G}{\partial y_{n,1}} \\
\overline{y_{n -1, 2}} &= \overline{y_{n, 1}} \frac{\partial \mathcal F}{\partial y_{n,2}} + \overline{y_{n,2}} \left(1 + \frac{\partial \mathcal F}{\partial y_{n,2}} \frac{\partial \mathcal G}{\partial y_{n,1}} \right) \\
\texttt{# compute}& \texttt{ gradients wrt. parameters} \\
\overline{\theta_{n, \mathcal G}} &= \overline{y_{n, 2}} \frac{\partial \mathcal G}{\partial \theta_{n, \mathcal G}}\\
\overline{\theta_{n, \mathcal F}} &= \overline{y_{n,1}} \frac{\partial \mathcal F}{\partial \theta_{n, \mathcal F}} + \overline{y_{n, 2}} \frac{\partial \mathcal F}{\partial \theta_{n, \mathcal F}} \frac{\partial \mathcal G}{\partial y_{n,1}}\\
&\mathbf{out:}\ \overline{\theta_{n}}, \overline{y_{n -1}}, y_{n - 1}
\end{align}$$ </td>
</tr>
</table>
<b>Table 1:</b> Backpropagation in the standard case and for Reversible blocks
</center>
<p><br /></p>
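<p>The backward rules in <strong>Table 1</strong> can be sanity-checked numerically in the special case of linear residuals \(\mathcal F(v) = W_f v\) and \(\mathcal G(v) = W_g v\), where the Jacobians are simply \(W_f\) and \(W_g\) (a toy verification sketch under these simplifying assumptions, not the paper's implementation):</p>

```python
import numpy as np

# Check the reversible backward pass with linear residuals, so that
# the Jacobians of F and G are just the matrices Wf and Wg.
rng = np.random.default_rng(0)
d = 5
Wf, Wg = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def forward(x1, x2):
    y1 = x1 + Wf @ x2
    y2 = x2 + Wg @ y1
    return y1, y2

x1, x2 = rng.normal(size=d), rng.normal(size=d)
g1, g2 = rng.normal(size=d), rng.normal(size=d)   # upstream grads wrt (y1, y2)

# gradients wrt the inputs, following the reversible backward rules
gx1 = g1 + Wg.T @ g2
gx2 = Wf.T @ g1 + g2 + Wf.T @ (Wg.T @ g2)

# finite-difference check on the scalar loss L = g1.y1 + g2.y2
def loss(a1, a2):
    y1, y2 = forward(a1, a2)
    return g1 @ y1 + g2 @ y2

eps = 1e-6
for i in range(d):
    e = np.zeros(d); e[i] = eps
    assert abs((loss(x1 + e, x2) - loss(x1 - e, x2)) / (2 * eps) - gx1[i]) < 1e-4
    assert abs((loss(x1, x2 + e) - loss(x1, x2 - e)) / (2 * eps) - gx2[i]) < 1e-4
```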
<h3 id="computational-efficiency">Computational Efficiency</h3>
<p><code class="language-plaintext highlighter-rouge">RevNet</code>s <em>trade off memory requirements against computation</em> by avoiding storing activations and recomputing them instead. Compared to other methods that reduce the memory requirements of deep networks, <code class="language-plaintext highlighter-rouge">RevNet</code> provides the best trade-off: no activations have to be stored, so the activation storage cost is \(O(1)\), while the computational cost remains linear in the number of layers, i.e., \(O(L)\).
One disadvantage is that <code class="language-plaintext highlighter-rouge">RevNet</code>s introduce <em>additional parameters</em>, as each block is composed of two residual functions, \(\mathcal F\) and \(\mathcal G\), each operating on half of the channels since the input is first split in two.</p>
<hr />
<h2 class="section experiments"> Experiments </h2>
<p>In the experiments section, the authors compare <code class="language-plaintext highlighter-rouge">ResNet</code> architectures to their <code class="language-plaintext highlighter-rouge">RevNet</code> “counterparts”: they build a <code class="language-plaintext highlighter-rouge">RevNet</code> with roughly the same number of parameters by halving the number of residual units and doubling the number of channels.</p>
<p>Interestingly, <code class="language-plaintext highlighter-rouge">RevNets</code> achieve <em>similar performance</em> to their <code class="language-plaintext highlighter-rouge">ResNet</code> counterparts, both in terms of final accuracy and in terms of training dynamics. The authors also analyze the impact of the floating-point errors that occur when reconstructing activations rather than storing them; these errors are of small magnitude and do not seem to negatively impact the model.
To summarize, reversible networks seem like a very promising direction for efficiently training very deep networks under memory budget constraints.</p>
<hr />
<h2 class="section references"> References </h2>
<ul>
<li><span class="citations">[1]</span> NICE: Non-linear Independent Components Estimation, <i>Dinh et al., ICLR 2015</i></li>
<li><span class="citations">[2]</span> Density estimation using Real NVP, <i>Dinh et al., ICLR 2017</i></li>
<li><span class="citations">[3]</span> Deep Residual Learning for Image Recognition, <i>He et al., CVPR 2016</i></li>
<li><span class="citations">[4]</span> \(i\)-RevNet: Deep Invertible Networks, <i>Jacobsen et al., ICLR 2018</i></li>
</ul>Gomez et al.Residual Networks (ResNet) [3] have greatly advanced the state-of-the-art in Deep Learning by making it possible to train much deeper networks via the addition of skip connections. However, in order to compute gradients during the backpropagation pass, all the units' activations have to be stored during the feed-forward pass, leading to high memory requirements for these very deep networks. Instead, the authors propose a reversible architecture in which activations at one layer can be computed from the ones of the next. Leveraging this invertibility property, they design a more efficient implementation of backpropagation, effectively trading compute power for memory storage.Conditional Neural Processes2019-05-06T14:00:00+02:002019-05-06T14:00:00+02:00https://ameroyer.github.io/structured%20learning/conditional_neural_processes<div class="summary">
<b>Gaussian Processes</b> are models that consider a <b>family of functions</b> (typically under a Gaussian distribution) and aim to quickly fit one of these functions at test time based on some observations. In that sense they are orthogonal to Neural Networks, which instead aim to learn a single function from a large training set, hoping it generalizes well to any new unseen test input. This work is an attempt at bridging both approaches.
<ul>
<li><span class="pros">Pros (+):</span> Novel and well justified, wide range of applications.</li>
<li><span class="cons">Cons (-):</span> Not clear how easy the method is to put in practice, e.g. dependency to initialization.</li>
</ul>
</div>
<h2 class="section proposed"> Proposed Model</h2>
<h3 id="statistical-background">Statistical Background</h3>
<p>In the Conditional Neural Processes (<code class="language-plaintext highlighter-rouge">CNP</code>) setting, we are given \(n\) labeled points, called <em>observations</em> \(O = \{(x_i, y_i)\}_{i=1}^{n}\), and another set of \(m\) unlabeled <strong><em>targets</em></strong> \(T = \{x_i\}_{i=n + 1}^{n + m}\). We assume that the outputs are a realization of the following process: Given \(\mathcal P\) a distribution over functions in \(X \rightarrow Y\), sample \(f \sim \mathcal P\), and set \(y_i = f(x_i)\) for all \(x_i\) in the targets set.</p>
<p>The goal is to learn a prediction model for the output samples while trying to obtain the same flexibility as Gaussian processes, rather than using the standard supervised learning paradigm of deep neural networks.
The main drawback of standard Gaussian Processes is that they do not scale well: inference is cubic in the number of points, \(O((n + m)^3)\).</p>
<h3 id="conditional-neural-processes">Conditional Neural Processes</h3>
<p><code class="language-plaintext highlighter-rouge">CNP</code>s give up on the theoretical guarantees of the Gaussian Process framework in exchange for more flexibility. In particular, observations are encoded in a representation of <em>fixed dimension</em>, independent of \(n\).</p>
\[\begin{align}
r_i &= h(x_i, y_i), \forall i \in [1, n]\\
r &= r_1 \oplus \dots \oplus r_n\\
\phi_i &= g_{\theta}(x_i, r), \forall i \in [n + 1, n + m]\\
y_i &\sim Q(f(x_i)\ |\ \phi_i)
\end{align}\]
<p>In other words, we first encode each observation and combine these embeddings via an operator \(\oplus\) to obtain a fixed representation of the observations, \(r\). Then for each new target \(x_i\) we obtain parameters \(\phi_i\) conditioned on \(r\) and \(x_i\), which determine the stochastic process to draw outputs from.</p>
<div class="figure">
<img src="/images/posts/conditional_neural_processes.png" />
<p><b>Figure:</b> Conditional Neural Processes Architecture.</p>
</div>
<p>In practice, \(\oplus\) is taken to be the mean operation, i.e., \(r\) is the average of \(r_i\)s over all observations. For regression tasks, \(Q\) is a Gaussian distribution parameterized by mean and variance \(\phi = (\mu_i, \sigma_i).\) For classification tasks, \(\phi\) simply encodes a discrete distribution over classes.</p>
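<p>Putting the four equations together, a schematic forward pass for 1D regression could look as follows (the encoder \(h\) and decoder \(g\) are toy single-layer networks here; all shapes and names are illustrative assumptions, not the paper's architecture):</p>

```python
import numpy as np

# Schematic CNP forward pass for 1D regression (toy sketch).
rng = np.random.default_rng(0)
D = 16                                   # embedding size

W_h = rng.normal(size=(D, 2)) * 0.5      # encoder h: (x_i, y_i) -> r_i
W_g = rng.normal(size=(2, D + 1)) * 0.5  # decoder g: (x, r) -> (mu, log sigma)

def cnp_predict(obs_x, obs_y, target_x):
    # encode each observation and aggregate with a mean (the operator ⊕)
    r_i = np.tanh(W_h @ np.stack([obs_x, obs_y]))            # (D, n)
    r = r_i.mean(axis=1)                                     # fixed-size summary
    # decode every target conditioned on (x, r)
    r_tiled = np.repeat(r[:, None], len(target_x), axis=1)   # (D, m)
    phi = W_g @ np.concatenate([target_x[None], r_tiled])    # (2, m)
    mu, log_sigma = phi
    return mu, np.exp(log_sigma)

obs_x = np.linspace(-1, 1, 5)
obs_y = np.sin(obs_x)
mu, sigma = cnp_predict(obs_x, obs_y, np.array([0.0, 0.5]))
assert mu.shape == sigma.shape == (2,) and (sigma > 0).all()
```

Note that the summary \(r\) has fixed dimension \(D\) regardless of the number of observations, which is the key difference from a Gaussian Process.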
<h3 id="training">Training</h3>
<p>Given observations \(\{(x_i, y_i = f(x_i))\}_{i=1}^n\) for \(f \sim \mathcal P\), we sample \(N\) uniformly in \([1, \dots, n]\) and train the model to predict labels for the whole observation set, conditioned only on the subset \(\{(x_i, y_i)\}_{i=1}^N\), by minimizing the negative log-likelihood:</p>
\[\begin{align}
\mathcal{L}(\theta) = - \mathbb{E}_{f \sim \mathcal P}\ \mathbb{E}_N \left[ \log Q_{\theta}\left(\{y_i\}_{i=1}^{n}\ \vert\ \{(x_i, y_i)\}_{i=1}^N, \{x_i\}_{i=1}^{n}\right) \right]
\end{align}\]
<p><strong>Note:</strong> The sampling step \(f \sim P\) is not clear, but it seems to model the stochasticity in output \(y\) given \(x\).</p>
<p>The model scales as \(O(n + m)\), i.e., linear time, which is much better than the cubic cost of Gaussian Processes.</p>
<hr />
<h2 class="section experiments"> Experiments and Applications </h2>
<ul>
<li>
<p><strong>1D regression:</strong> The authors generate a dataset that consists of functions sampled from a GP with an exponential kernel. At every training step they sample a curve from the GP (\(f \sim \mathcal P\)), select a subset of \(n\) points \((x_i, y_i)\) as observations, and a subset of points \((x_t, y_t)\) as target points. The output distribution on the target labels is parameterized as a Gaussian whose mean and variance are output by \(g\).</p>
</li>
<li>
<p><strong>Image completion.</strong> \(f\) is a function that maps a pixel coordinate to a RGB triple. At each iteration, an image from the training set is chosen, a subset of its pixels is selected, and the aim is to predict the RGB value of the remaining pixels while conditioned on those. In particular, it is interesting to see that <em><code class="language-plaintext highlighter-rouge">CNP</code> allows for flexible conditioning patterns</em>, contrary to other conditional generative models which are often constrained by either the architecture (e.g. <code class="language-plaintext highlighter-rouge">PixelCNN</code>) or training scheme.</p>
</li>
<li>
<p><strong>Few shot classification.</strong> Consider a dataset with many classes but only few examples per class (e.g. <code class="language-plaintext highlighter-rouge">Omniglot</code>). In this set of experiments, the model is trained to predict labels for samples in a select subset of classes, while being conditioned on all remaining samples in the dataset.</p>
</li>
</ul>Garnelo et al.Gaussian Processes are models that consider a family of functions (typically under a Gaussian distribution) and aim to quickly fit one of these functions at test time based on some observations. In that sense there are orthogonal to Neural Networks which instead aim to learn one function based on a large training set and hoping it generalizes well on any new unseen test input. This work is an attempt at bridging both approaches.Deep Visual Analogy Making2019-05-06T12:40:24+02:002019-05-06T12:40:24+02:00https://ameroyer.github.io/visual%20reasoning/deep_visual_analogy_making<div class="summary">
In this paper, the authors propose to learn <b>visual analogies</b> akin to the semantic and syntactic analogies naturally emerging in the <code>Word2Vec</code> embedding <span class="citations">[1]</span>: More specifically, they tackle the joint task of inferring a transformation from a given (source, target) pair, and applying the same relation to a new source image.
<ul>
<li><span class="pros">Pros (+):</span> Intuitive formulation; Introduces two datasets for the visual analogy task.</li>
<li><span class="cons">Cons (-):</span> Only consider "local" changes, i.e. geometric transformations or single attribute modifications, and rather clean images (e.g., no background).</li>
</ul>
</div>
<h2 class="section proposed"> Proposed Model</h2>
<p><strong>Definition:</strong> <i>Informally, a visual analogy, denoted by “<strong>a:b :: c:d</strong>”, means that the entity <strong>a</strong> is to <strong>b</strong> what the entity <strong>c</strong> is to <strong>d</strong>. This paper focuses on the problem of generating image <strong>d</strong> after inferring the relation <strong>a:b</strong> and given a source image <strong>c</strong>.</i></p>
<p>The authors propose to use an encoder-decoder based model for generation and to model analogies as <em>simple transformations of the latent space</em>, for instance addition between vectors, as was the case in words embeddings such as <code class="language-plaintext highlighter-rouge">GloVe</code> <span class="citations">[2]</span> or <code class="language-plaintext highlighter-rouge">Word2Vec</code> <span class="citations">[1]</span>.</p>
<h3 id="learning-to-generate-analogies-via-manipulation-of-the-embedding-space">Learning to generate analogies via manipulation of the embedding space</h3>
<p><strong>Additive objective.</strong> Let \(f\) denote the encoder that maps images to the latent space \(\mathbb{R}^K\) and \(g\) the decoder. The first, most straightforward, objective the authors consider is to model analogies as additions in the latent space:</p>
\[\begin{align}
\mathcal L_{\mbox{add}}(c, d; a, b) = \|d - g \left(f(c) + f(b) - f(a) \right)\|^2
\end{align}\]
<p>One disadvantage of this <em>purely linear transformation</em> is that it cannot capture complex structures such as periodic transformations: for instance, if <strong>a:b</strong> is a rotation, then the embedding for the decoded image should ideally come back to \(f(c)\) after a full cycle, which is not possible as we keep adding the non-zero vector \(f(b) - f(a)\). To capture more complex transformations of the latent space, the authors introduce two variants of the previous objective.</p>
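<p>The limitation can be seen on a toy example where both the encoder and decoder are the identity on 2D points: a translation analogy is modeled exactly by the additive rule, while repeatedly applying a rotation analogy keeps drifting instead of cycling back (an illustrative sketch, not from the paper):</p>

```python
import numpy as np

# With identity encoder/decoder (f = g = id), a translation analogy is
# captured exactly by the additive rule d = c + (f(b) - f(a)).
a, b = np.array([0.0, 0.0]), np.array([1.0, 0.0])   # a:b is "shift right by 1"
c = np.array([2.0, 3.0])
d = c + (b - a)
assert np.allclose(d, [3.0, 3.0])

# A rotation analogy, however, keeps adding the same non-zero offset
# instead of cycling back after a full turn, as noted above.
rot = np.array([[0.0, -1.0], [1.0, 0.0]])           # 90-degree rotation
a = np.array([1.0, 0.0])
b = rot @ a
x = np.array([1.0, 0.0])
for _ in range(4):                                   # four 90-degree steps
    x = x + (b - a)
assert not np.allclose(x, [1.0, 0.0])                # does not return to start
```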
<p><strong>Multiplicative objective.</strong></p>
\[\begin{align}
\mathcal L_{\mbox{mult}}(c, d; a, b) = \|d - g \left( f(c) + W \odot [f(b) - f(a)] \odot f(c) \right)\|^2
\end{align}\]
<p>where \(W \in \mathbb{R}^{K\times K\times K}\), \(K\) is the dimension of the embedding, and the three-way multiplication operator is defined as \(\forall k,\ (A \odot B \odot C)_k = \sum_{i, j} A_{ijk} B_i C_j\).</p>
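<p>The three-way product has a direct expression as an <code class="language-plaintext highlighter-rouge">einsum</code>; the short sketch below checks the definition against a brute-force sum (illustrative, not the paper's code):</p>

```python
import numpy as np

# The three-way product (A ⊙ B ⊙ C)_k = sum_{i,j} A_{ijk} B_i C_j as an einsum.
rng = np.random.default_rng(0)
K = 4
W = rng.normal(size=(K, K, K))
B, C = rng.normal(size=K), rng.normal(size=K)

out = np.einsum('ijk,i,j->k', W, B, C)

# brute-force version of the same sum, for verification
ref = np.array([sum(W[i, j, k] * B[i] * C[j]
                    for i in range(K) for j in range(K))
                for k in range(K)])
assert np.allclose(out, ref)
```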
<p><strong>Deep objective.</strong></p>
\[\begin{align}
\mathcal L_{\mbox{deep}}(c, d; a, b) = \|d - g \left( f(c) + \mbox{MLP}([ f(b) - f(a); f(c)]) \right)\|^2
\end{align}\]
<p>where <code class="language-plaintext highlighter-rouge">MLP</code> is a Multi Layer Perceptron. The \([ \cdot; \cdot]\) operator denotes concatenation. This allows for very generic transformations, but can introduce a significant number of parameters for the model to train, depending on the depth of the network.</p>
<div class="figure">
<img src="/images/posts/deep_visual_analogy_1.png" />
<p><b>Figure:</b> Illustration of the network structure for analogy making. The top portion shows the encoder, transformation module, and decoder. The bottom portion illustrates each of the transformation variants. Weights are shared across the three encoder networks shown on the top left.</p>
</div>
<h3 id="regularizing-the-latent-space">Regularizing the latent space</h3>
<p>While the previous losses acted at the pixel-level to match the decoded image with the target image <strong>D</strong>, the authors introduce an additional regularization loss that additionally matches the analogy <em>at the feature level</em> with the source analogy <strong>a:b</strong>. Formally, each objective can be written in the form \(\| d - g(f(c) + T(f(b) - f(a), f(c)))\|^2\) and the corresponding regularization loss term is defined as:</p>
\[\begin{align}
R(c, d; a, b) = \|(f(d) - f(c)) - T(f(b) - f(a), f(c))\|^2
\end{align}\]
<p>where \(T\) is defined accordingly to match the chosen objective, \(\mathcal L_{\mbox{add}}\), \(\mathcal L_{\mbox{mult}}\) or \(\mathcal L_{\mbox{deep}}\). For instance, \(T: (x, y) \mapsto x\) in the additive variant.</p>
<h3 id="disentangling-the-feature-space">Disentangling the feature space</h3>
<p>The authors consider another solution to the visual analogy problem, in which they aim to learn a disentangled <em>feature space</em> that can be freely manipulated by smoothly modifying the appropriate latent variables, rather than learning a specific operation.</p>
<p>In that setting, the problem is slightly different, as we require additional supervision to control the different factors of variation. It is denoted as <strong>(a, b):s :: c</strong>: given two input images <strong>a</strong> and <strong>b</strong>, and a boolean mask <strong>s</strong> on the latent space, retrieve image <strong>c</strong> which matches the features of <strong>a</strong> according to the pattern of <strong>s</strong>, and the features of <strong>b</strong> on the remaining latent variables.</p>
<p>Let us denote by \(S\) the number of possible axes of variations (e.g., change in illumination, elevation, rotation etc) then \(s \in \{0, 1\}^S\) is a <em>one-hot block vector encoding the current transformation, called the switch vector</em>. The disentangling objective is thus</p>
\[\begin{align}
\mathcal{L}_{\mbox{dis}} = \|c - g\left(f(a) \times s + f(b) \times (1 - s)\right)\|^2
\end{align}\]
<p>In other words the decoder tries to match <strong>c</strong> by exploiting separate and disentangled information from <strong>a</strong> and <strong>b</strong>. Contrary to the previous analogy objectives, only three images are needed, but it also requires <em>extra supervision</em> in the form of the switch vector <strong>s</strong> which can be hard to obtain.</p>
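<p>With an identity encoder and decoder, the switch-vector mixing inside this loss reduces to the following selection (a toy illustration of the masking, not the paper's model):</p>

```python
import numpy as np

# Switch-vector mixing: s selects which latent units come from a,
# the complement comes from b (identity encoder/decoder for illustration).
f_a = np.array([1.0, 2.0, 3.0, 4.0])     # features of image a
f_b = np.array([9.0, 8.0, 7.0, 6.0])     # features of image b
s = np.array([1.0, 1.0, 0.0, 0.0])       # switch: first block from a, rest from b

mixed = f_a * s + f_b * (1 - s)
assert np.allclose(mixed, [1.0, 2.0, 7.0, 6.0])
```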
<hr />
<h2 class="section experiments"> Experiments </h2>
<p>The authors consider three main experimental settings:</p>
<ul>
<li>
<p><strong>Synthetic experiments on geometric shapes.</strong> The dataset consists of 48 × 48 images scaled to [0, 1] with 4 shapes, 8 colors, 4 scales, 5 row and column positions, and 24 rotation angles. No disentangling training was performed in this setting.</p>
</li>
<li>
<p><strong>Sprites dataset.</strong> The dataset consists of 60 × 60 color images of sprites scaled to [0, 1], with 7 attributes and 672 total unique characters. For each character, there are 5 animations each from 4 viewpoints. Each animation has between 6 and 13 frames. The data is split by characters. For the disentanglement experiments, the authors try two methods: <code class="language-plaintext highlighter-rouge">dist</code>, where they only try to separate the pose from identity (i.e., only two axes of variations), and <code class="language-plaintext highlighter-rouge">dist+cls</code>, where they actually consider all available attributes separately.</p>
</li>
<li>
<p><strong>3D Cars.</strong> For each of the 199 car models, the authors generated 64 × 64 color renderings from 24 rotation angles each offset by 15 degrees.</p>
</li>
</ul>
<p>The authors report results in terms of <em>pixel prediction error</em>. Out of the three manipulation methods, \(\mathcal{L}_{\mbox{deep}}\) usually performs best. However, qualitative samples show that \(\mathcal L_{\mbox{add}}\) and \(\mathcal L_{\mbox{mult}}\) both also perform well, although they fail for the case of rotation in the first set of experiments, which justifies the use of more complex training objectives.</p>
<p>Disentanglement methods usually outperform the other baselines, especially in <em>few-shot experiments</em>. In particular, the <code class="language-plaintext highlighter-rouge">dist+cls</code> method usually wins by a large margin, which shows that the additional supervision really helps in learning a structured representation. However, such a supervisory signal may be hard to obtain in practice in more generic scenarios.</p>
<div class="figure">
<img src="/images/posts/deep_visual_analogy_2.png" />
<p><b>Figure 2:</b> Examples of samples from the three visual analogy datasets considered in experiments</p>
</div>
<hr />
<h2 class="section followup"> Closely Related</h2>
<h4 style="margin-bottom: 0px"> Visalogy: Answering Visual Analogy Questions <span class="citations">[3]</span></h4>
<p style="text-align: left">Sadeghi et al., <a href="https://arxiv.org/pdf/1510.08973.pdf">[link]</a></p>
<blockquote>
<p>In this paper, the authors tackle the visual analogy problem in natural images by learning a joint embedding of relations and visual appearance using a <em>Siamese architecture</em>. The main idea is to learn an embedding space where the analogy transformation can be modeled by <em>simple latent vector transformations</em>. The model consists of a Siamese quadruple architecture, where the four heads correspond to the three context images and the candidate image for the visual analogy task, respectively. They do consider a <em>restrained set of analogies</em>, in particular those based on attributes or actions of animals, or geometric viewpoint changes. Given analogy problem \(I_1 : I_2 :: I_3 : I_4\) with label \(y\) (1 if \(I_4\) fits the analogy, 0 otherwise), the model is trained with the following objective:</p>
</blockquote>
\[\begin{align}
\mathcal{L}(x_{1, 2}, x_{3, 4}) = y \max (\| x_{1, 2} - x_{3, 4} \| - m_P, 0) + (1 - y) \max (m_N - \| x_{1, 2} - x_{3, 4} \|, 0)
\end{align}\]
<blockquote>
<p>where \(x_{i, j}\) refers to the embedding for the image pair \((i, j)\). Intuitively, the model pushes embeddings with a similar analogy closer, and others apart (up to a certain margin \(m_N\)). The \(m_P\) margin is introduced as a <em>heuristic to avoid overfitting</em>: embeddings are only made closer if their distance is above the margin threshold \(m_P\). The pairwise embeddings are obtained by subtracting the individual image embeddings. This implies the assumption that \(x_2 = x_1 + r\), where \(r\) is the transformation from image \(I_1\) to image \(I_2\).</p>
</blockquote>
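<p>Reading the objective as a double-margin contrastive loss, with the two margins behaving as described above, a toy sketch could look like this (function name and margin values are illustrative assumptions):</p>

```python
import numpy as np

# Toy double-margin contrastive loss: positives are only attracted while
# further than m_P apart, negatives are only repelled while closer than m_N.
def analogy_loss(x12, x34, y, m_P=0.1, m_N=1.0):
    dist = np.linalg.norm(x12 - x34)
    pos = max(dist - m_P, 0.0)       # attract positive pairs beyond m_P
    neg = max(m_N - dist, 0.0)       # repel negative pairs within m_N
    return y * pos + (1 - y) * neg

close, far = np.zeros(4), np.ones(4) * 2
assert analogy_loss(close, close + 0.03, y=1) == 0.0   # positives within m_P: no loss
assert analogy_loss(close, far, y=0) == 0.0            # negatives beyond m_N: no loss
assert analogy_loss(close, far, y=1) > 0.0             # distant positives penalized
```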
<blockquote>
<p>The authors additionally create a <em>visual analogy dataset</em>. Generating the dataset is rather intuitive as long as we have an attribute-style representation of the domain. Typically, the analogies considered are transformations over <em>properties</em> (object, action, pose) of different <em>categories</em> (dog, cat, chair etc). As negative data points, they consider <strong>(i)</strong> fully random quadruples, or <strong>(ii)</strong> valid quadruples where one of \(I_3\) or \(I_4\) is swapped with a random image.
The evaluation is done with <em>image retrieval metrics</em>. They also consider generalization scenarios: for instance removing the analogy \(white \rightarrow black\) during training, but keeping e.g. \(white \rightarrow red\) and \(green \rightarrow black\). There is a lack of details about the missing pairs to get a full idea of the generalization ability of the model (i.e., if an analogy is missing from the training set, does that mean its reverse also is? And does “analogy” refer to the high-level relation, or is it instantiated relative to the category too?).</p>
</blockquote>
<hr />
<h2 class="section references"> References</h2>
<ul>
<li><span class="citations">[1]</span> Distributed representations of words and phrases and their compositionality, <i>Mikolov et al., NIPS 2013</i></li>
<li><span class="citations">[2]</span> GloVe: Global Vectors for Word Representation, <i>Pennington et al., EMNLP 2014</i></li>
<li><span class="citations">[3]</span> Visalogy: Answering Visual Analogy Questions, <i>Sadeghi et al., NeurIPS 2015</i></li>
</ul>Reed et al.In this paper, the authors propose to learn visual analogies akin to the semantic and synctatic analogies naturally emerging in the Word2Vec embedding [1]: More specifically hey tackle the joint task of inferring a transformation from a given (source, target) pair, and applying the same relation to a new source image.