Jekyll2021-04-01T21:01:32+02:00https://ameroyer.github.io/feed.xmlAmélie Royerpersonal descriptionAmélie Royeraroyer@ist.ac.atA Style-Based Generator Architecture for Generative Adversarial Networks2021-01-14T13:59:24+01:002021-01-14T13:59:24+01:00https://ameroyer.github.io/generative%20models/a_style_based_generator_architecture_for_generative_adversarial_networks<div class="summary">
In this work, the authors propose <code>VQ-VAE</code>, a variant of the Variational Autoencoder (<code>VAE</code>) framework with a discrete latent space, using ideas from vector quantization. The two main motivations are <b>(i)</b> discrete variables are potentially better fit to capture the structure of data such as text and <b>(ii)</b> to prevent the posterior collapse in <code>VAE</code>s that leads to latent variables being ignored when the decoder is too powerful.
<ul>
<li><span class="pros">Pros (+):</span> Simple method to incorporate a discretized latent space in VAEs.</li>
<li><span class="cons">Cons (-):</span> The paragraph about the learned prior is not very clear, and there are no corresponding ablation experiments to evaluate its importance.</li>
</ul>
</div>
<h2 class="section proposed"> Proposed </h2>
<h3 id="discrete-latent-space">Discrete latent space</h3>
<p>The model is based on <code class="language-plaintext highlighter-rouge">VAE</code> <span class="citations">[1]</span>, where image \(x\) is generated from random latent variable \(z\) by a <em>decoder</em> \(p(x\ \vert\ z)\). The posterior (<em>encoder</em>) captures the latent variable distribution \(q_{\phi}(z\ \vert\ x)\) and is generally trained to match a certain distribution \(p(z)\) from which \(z\) is sampled at inference time.
Contrary to the standard framework, in this work <em>the latent space is discrete</em>, i.e., \(z \in \mathbb{R}^{K \times D}\) where \(K\) is the number of codes in the latent space and \(D\) their dimensionality. More precisely, the input image is first fed to the encoder \(z_e\), which outputs a continuous vector; this vector is then mapped to one of the latent codes in the discrete space via <em>nearest-neighbor search</em>.</p>
\[\begin{align}
q(z = z_k\ |\ x) = [\!| k = \arg\min_j \| z_e(x) - z_j \|^2 |\!]
\end{align}\]
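<p>The nearest-neighbor assignment above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name <code>quantize</code>, the toy codebook and the toy encoder outputs are all assumptions made for the example.</p>

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder output to its nearest codebook entry.

    z_e:      (N, D) array of continuous encoder outputs z_e(x)
    codebook: (K, D) array of latent codes z_1 ... z_K
    Returns the selected indices k, shape (N,), and the quantized vectors, shape (N, D).
    """
    # Squared Euclidean distance between every output and every code: shape (N, K)
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]

# Toy values (hypothetical): K = 3 codes of dimension D = 2
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z_e = np.array([[0.9, 0.8], [-0.2, 0.1], [-0.7, 1.6]])
indices, z_q = quantize(z_e, codebook)
print(indices)  # [1 0 2]
```

<p>The <code>argmin</code> is what makes the posterior a one-hot distribution over the \(K\) codes.</p>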
<p>Adapting the \(\mathcal{L}_{\text{ELBO}}\) to this formalism, the KL divergence term greatly simplifies and we obtain:</p>
\[\begin{align}
\mathcal{L}_{\text{ELBO}}(x) &= \text{KL}(q(z | x) \| p(z)) - \mathbb{E}_{z \sim q(\cdot | x)}(\log p(x | z))\\
&= - \log(p(z_k)) - \log p(x | z_k)\\
\mbox{where }& z_k = z_q(x) = \arg\min_z \| z_e(x) - z \|^2 \tag{1}
\end{align}\]
<p>In practice, the authors use a categorical <em>uniform prior</em> for the latent codes, meaning the KL divergence is constant and the objective reduces to the reconstruction loss.</p>
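<p>To see why the KL term is constant, recall that the posterior is one-hot on the selected code \(z_k\). Assuming a uniform prior \(p(z_j) = 1/K\) and the usual convention \(0 \log 0 = 0\), the sum collapses to a single term:</p>
\[\begin{align}
\text{KL}(q(z | x) \| p(z)) = \sum_{j=1}^{K} q(z = z_j | x) \log \frac{q(z = z_j | x)}{p(z_j)} = \log \frac{1}{1 / K} = \log K
\end{align}\]
<p>This value does not depend on \(x\) or on the model parameters, so it can be dropped from the training objective.</p>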
<div class="figure">
<img src="/images/posts/vqvae.png" />
<p><b>Figure:</b> A figure describing the <code>VQ-VAE</code> (<b>left</b>). Visualization of the embedding space (<b>right</b>). The output of the encoder z(x) is mapped to the nearest point. The gradient (in <span style="color:red">red</span>) will push the
encoder to change its output, which could alter the configuration, hence the code assignment, in the next forward pass.</p>
</div>
<h3 id="training-objective">Training Objective</h3>
<p>As mentioned previously, the \(\mathcal{L}_{\text{ELBO}}\) objective reduces to the <em>reconstruction loss</em> and is used to learn the encoder and decoder parameters. However, the mapping from \(z_e\) to \(z_q\) is not directly differentiable (Equation <strong>(1)</strong>).
To work around this, the authors use a <em>straight-through estimator</em>, meaning the gradients with respect to the decoder input \(z_q(x)\) (quantized) are directly copied to the encoder output \(z_e(x)\) (continuous).
However, this means that the latent codes involved in the mapping from \(z_e\) to \(z_q\) do not receive any gradient updates that way.</p>
<p>Hence in order to train the discrete embedding space, the authors propose to use <em>Vector Quantization</em> (<code class="language-plaintext highlighter-rouge">VQ</code>), a dictionary learning technique, which uses mean squared error to make the latent code closer to the continuous vector it was matched to:</p>
\[\begin{align}
\mathcal{L}_{\text{VQ-VAE}}(x) = - \log p(x | z_q(x)) + \| \overline{z_e(x)} - e \|^2 + \beta \| z_e(x) - \bar{e} \|^2
\end{align}\]
<p>where \(x \mapsto \overline{x}\) denotes the <code class="language-plaintext highlighter-rouge">stop gradient</code> operator. The first term is the reconstruction loss stemming from the ELBO, the second term is the vector quantization contribution. Finally, the last term is a <em>commitment loss</em> to control the volume of the latent space by forcing the encoder to “commit” to the latent code it matched with, and not grow its output space unbounded.</p>
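<p>Since the vector quantization and commitment terms have the same value and differ only in where the gradient flows, it can help to write the gradients out by hand. The sketch below does this for a single latent vector; it is an illustration under assumptions, not the authors' code. The function name <code>vq_vae_latent_grads</code>, the toy values and the choice \(\beta = 0.25\) are all picked for the example.</p>

```python
import numpy as np

def vq_vae_latent_grads(z_e, e, grad_recon_wrt_z_q, beta=0.25):
    """Gradients of the three loss terms w.r.t. the encoder output z_e and the code e.

    z_e:                (D,) encoder output
    e:                  (D,) latent code that z_e was matched to
    grad_recon_wrt_z_q: (D,) gradient of the reconstruction loss at the decoder input
    """
    # Straight-through estimator: the decoder-input gradient is copied to z_e.
    grad_z_e = grad_recon_wrt_z_q.copy()
    # Commitment term beta * ||z_e - sg(e)||^2: only z_e receives a gradient.
    grad_z_e += 2.0 * beta * (z_e - e)
    # VQ term ||sg(z_e) - e||^2: only the code e receives a gradient.
    grad_e = -2.0 * (z_e - e)
    return grad_z_e, grad_e

# Toy values (hypothetical)
z_e = np.array([1.0, 2.0])
e = np.array([0.5, 2.5])
grad_recon = np.array([0.1, -0.2])
grad_z_e, grad_e = vq_vae_latent_grads(z_e, e, grad_recon)

# A gradient descent step on the code moves it towards the encoder output.
e_new = e - 0.1 * grad_e
```

<p>Note that the reconstruction gradient never reaches \(e\) in this scheme: the code is updated only through the vector quantization term.</p>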
<h3 id="learned-prior">Learned Prior</h3>
<p>A second contribution of this work consists in <em>learning the prior distribution</em>. As mentioned, during the training phase, the prior \(p(z)\) is a uniform categorical distribution. After the training is done, we fit an <em>autoregressive distribution</em> over the space of latent codes. This is in particular enabled by the fact that the latent space is discrete.</p>
<p><strong>Note:</strong> It is not clear to me if the autoregressive model is trained on latent codes sampled from the prior \(z \sim p(z)\) or from the encoder distribution \(x \sim \mathcal{D};\ z \sim q(z\ \vert\ x)\)</p>
<hr />
<h2 class="section experiments"> Experiments </h2>
<p>The proposed model is mostly compared to the standard continuous <code class="language-plaintext highlighter-rouge">VAE</code> framework. It seems to achieve similar log-likelihood and sample quality, while taking advantage of the discrete latent space.
For ImageNet, for instance, they consider \(K = 512\) latent codes with dimensions \(1\). The output of the fully-convolutional encoder \(z_e\) is a feature map of size \(32 \times 32 \times 1\) which is then quantized <em>pixel-wise</em>. Interestingly, the model still performs well when using a powerful decoder (here, PixelCNN <span class="citations">[2]</span>) which seems to indicate it does not suffer from <em>posterior collapse</em> as strongly as the standard continuous <code class="language-plaintext highlighter-rouge">VAE</code>.</p>
<p>A second set of experiments tackles the problem of audio modeling. The performance of the model is once again satisfying. Furthermore, it does seem like the discrete latent space actually captures relevant characteristics of the input data structure, although this is a purely qualitative observation.</p>
<hr />
<h2 class="section references"> References </h2>
<ul>
<li><span class="citations">[1]</span> Autoencoding Variational Bayes, <i>Kingma and Welling, ICLR 2014</i></li>
<li><span class="citations">[2]</span> Pixel Recurrent Neural Networks, <i>van den Oord et al, arXiv 2016</i></li>
</ul>Karras et al.In this work, the authors propose VQ-VAE, a variant of the Variational Autoencoder (VAE) framework with a discrete latent space, using ideas from vector quantization. The two main motivations are (i) discrete variables are potentially better fit to capture the structure of data such as text and (ii) to prevent the posterior collapse in VAEs that leads to latent variables being ignored when the decoder is too powerful.Domain Adversarial Training of Neural Networks2019-05-23T10:59:24+02:002019-05-23T10:59:24+02:00https://ameroyer.github.io/domain%20adaptation/domain_adversarial_training_of_neural_networks<div class="summary">
In this article, the authors tackle the problem of <b>unsupervised domain adaptation</b>: Given labeled samples from a source distribution \(\mathcal D_S\) and unlabeled samples from target distribution \(\mathcal D_T\), the goal is to learn a function that solves the task for both the source and target domains. In particular, the proposed model is trained on <b>both</b> source and target data jointly, and aims to directly learn an <b>aligned representation</b> of the domains, while retaining meaningful information with respect to the source labels.
<ul>
<li><span class="pros">Pros (+):</span> Theoretical justification, simple model, easy to implement.</li>
<li><span class="cons">Cons (-):</span> Some training instability in practice.</li>
</ul>
</div>
<h2 class="section theory"> Generalized Bound on the Expected Risk </h2>
<p>Several theoretical studies of the domain adaptation problem have proposed upper bounds of the <em>risk on the target domain</em>, involving the risk on the source domain and a notion of <em>distance</em> between the source and target distribution, \(\mathcal D_S\) and \(\mathcal D_T\). Here, the authors specifically consider the work of <span class="citations">[1]</span>. First, they define the \(\mathcal H\)-divergence:</p>
\[\begin{align}
d_{\mathcal H}(\mathcal D_S, \mathcal D_T) = 2 \sup_{h \in \mathcal H} \left| \mathbb{E}_{x\sim\mathcal{D}_S} (h(x) = 1) - \mathbb{E}_{x\sim\mathcal{D}_T} (h(x) = 1) \right| \tag{1}
\end{align}\]
<p>where \(\mathcal H\) is a space of (here, binary) hypothesis functions. In the case where \(\mathcal H\) is a <em>symmetric hypothesis class</em> (i.e., \(h \in \mathcal H \implies 1 - h \in \mathcal H\)), one can reduce <strong>(1)</strong> to the empirical form:</p>
\[\begin{align}
d_{\mathcal H}(\mathcal D_S, \mathcal D_T) &\simeq 2 \sup_{h \in \mathcal H} \left|\frac{1}{|D_S|} \sum_{x \in D_S} [\!|h(x) = 1 |\!] - \frac{1}{|D_T|} \sum_{x \in D_T} [\!|h(x) = 1 |\!] \right|\\
&= 2 \sup_{h \in \mathcal H} \left|\frac{1}{|D_S|} \sum_{x \in D_S} \left(1 - [\!|h(x) = 0 |\!]\right) - \frac{1}{|D_T|} \sum_{x \in D_T} [\!|h(x) = 1 |\!] \right|\\
&= 2 - 2 \min_{h \in \mathcal H} \left|\frac{1}{|D_S|} \sum_{x \in D_S} [\!|h(x) = 0 |\!] + \frac{1}{|D_T|} \sum_{x \in D_T} [\!|h(x) = 1 |\!] \right| \tag{2}
\end{align}\]
<p>It is difficult to estimate the minimum over the hypothesis class \(\mathcal H\). Instead, <span class="citations">[1]</span> proposes to <em>approximate</em> Equation <strong>(2)</strong> by training a classifier \(\hat{h}\) on samples \(\mathbf{x_S} \in \mathcal{D}_S\) with label 0 and \(\mathbf{x_T} \in \mathcal D_T\) with label 1, and replacing the <code class="language-plaintext highlighter-rouge">minimum</code> term by the empirical risk of \(\hat h\).
Given this definition of the \(\mathcal H\)-divergence, <span class="citations">[1]</span> further derives an <em>upper bound</em> on the empirical risk on the target domain, which in particular involves a trade-off between the empirical risk on the source domain, \(\mathcal{R}_{D_S}(h)\), and the divergence between the source and target distributions, \(d_{\mathcal H}(D_S, D_T)\).</p>
\[\begin{align}
\mathcal{R}_{D_T}(h) \leq \mathcal{R}_{D_S}(h) + d_{\mathcal H}(D_S, D_T) + f\left(\mbox{VC}(\mathcal H), \frac{1}{n}\right) \tag{upper-bound}
\end{align}\]
<p>where \(\mbox{VC}\) designates the <em>Vapnik–Chervonenkis</em> dimension and \(n\) the number of samples.
The rest of the paper directly stems from this intuition: in order to minimize the <em>target risk</em>, the proposed <em>Domain Adversarial Neural Network</em> (<code class="language-plaintext highlighter-rouge">DANN</code>) aims to build an “<i>internal representation that contains no discriminative information about the origin of the input (source or target), while preserving a low risk on the source (labeled) examples</i>”.</p>
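<p>To make Equation <strong>(2)</strong> more concrete, the sketch below evaluates it by brute force for a tiny symmetric hypothesis class. This is only an illustration: the class of one-dimensional threshold functions (and their complements), the function name <code>empirical_h_divergence</code> and the toy samples are assumptions made for the example. In practice the minimum is approximated with a trained domain classifier, as described above.</p>

```python
import numpy as np

def empirical_h_divergence(source, target, thresholds):
    """Empirical H-divergence of Equation (2) for 1D threshold classifiers.

    The hypothesis class contains h(x) = [x > t] and its complement 1 - h
    for every threshold t, which makes the class symmetric.
    """
    best = np.inf
    for t in thresholds:
        for flip in (False, True):
            h_s, h_t = source > t, target > t
            if flip:
                h_s, h_t = ~h_s, ~h_t
            # Fraction of source samples with h(x) = 0 plus fraction of target samples with h(x) = 1
            best = min(best, np.mean(~h_s) + np.mean(h_t))
    return 2.0 * (1.0 - best)

thresholds = np.linspace(-1.0, 14.0, 31)
source = np.array([0.0, 1.0, 2.0, 3.0])
far_target = np.array([10.0, 11.0, 12.0, 13.0])

d_far = empirical_h_divergence(source, far_target, thresholds)   # perfectly separable domains
d_same = empirical_h_divergence(source, source, thresholds)      # identical samples
print(d_far, d_same)  # 2.0 0.0
```

<p>The divergence is maximal when some hypothesis separates the two samples perfectly, and zero when no hypothesis can tell them apart, which is the situation <code>DANN</code> tries to create in feature space.</p>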
<hr />
<h2 class="section proposed"> Proposed </h2>
<p>The goal of the model is to learn a classifier \(\phi\), which can be decomposed as \(\phi = G_y \circ G_f\), where \(G_f\) is a feature extractor and \(G_y\) a small classifier on top that outputs the target label. This architecture is trained with a standard classification objective to <em>minimize</em>:</p>
\[\begin{align}
\mathcal{L}_y(\theta_f, \theta_y) = \frac{1}{N_s} \sum_{(x, y) \in D_s} \ell(G_y(G_f(x)), y)
\end{align}\]
<p>Additionally, <code class="language-plaintext highlighter-rouge">DANN</code> introduces a <em>domain prediction branch</em>, which is another classifier \(G_d\) on top of the feature representation \(G_f\) and whose goal is to approximate the domain discrepancy as in <strong>(2)</strong>. This leads to the following domain classification loss, which the domain classifier \(G_d\) minimizes and which the feature extractor \(G_f\) is trained to <em>maximize</em>:</p>
\[\begin{align}
\mathcal{L}_d(\theta_f, \theta_d) = \frac{1}{N_s} \sum_{x \in D_s} \ell(G_d(G_f(x)), s) + \frac{1}{N_t} \sum_{x \in D_t} \ell(G_d(G_f(x)), t)
\end{align}\]
<p>The <strong><em>final objective</em></strong> can thus be written as:</p>
\[\begin{align}
E(\theta_f, \theta_y, \theta_d) &= \mathcal{L}_y(\theta_f, \theta_y) - \lambda \mathcal{L}_d(\theta_f, \theta_d) \tag{1}\\
\theta_f^\ast, \theta_y^\ast &= \arg\min E(\theta_f, \theta_y, \theta_d) \tag{2}\\
\theta_d^\ast &= \arg\max E(\theta_f, \theta_y, \theta_d) \tag{3}
\end{align}\]
<h3 id="gradient-reversal-layer">Gradient Reversal Layer</h3>
<p>Applying standard gradient descent, the <code class="language-plaintext highlighter-rouge">DANN</code> objective leads to the following gradient update rules:</p>
\[\begin{align}
\theta_f &= \theta_f - \alpha \left( \frac{\partial \mathcal{L}_y}{\partial \theta_f} - \lambda \frac{\partial \mathcal{L}_d}{\partial \theta_f} \right)\\
\theta_y &= \theta_y - \alpha \frac{\partial \mathcal{L}_y}{\partial \theta_y} \\
\theta_d &= \theta_d + \alpha \frac{- \lambda \partial \mathcal{L}_d}{\partial \theta_d}
\end{align}\]
<p>In the case of neural networks, the gradients of the loss with respect to parameters are obtained with the <em>backpropagation algorithm</em>. These update equations are very similar to the standard backpropagation scheme, except that the derivative of \(\mathcal{L}_d\) enters the updates of \(\theta_d\) and \(\theta_f\) with opposite signs. The authors introduce the <strong><em>gradient reversal layer</em></strong> (<code class="language-plaintext highlighter-rouge">GRL</code>) to evaluate both gradients in one standard backpropagation step.</p>
<p>The idea is that the output of \(G_f\) is propagated normally to \(G_d\) in the forward pass; during backpropagation, however, its gradient is multiplied by a negative constant:</p>
\[\begin{align}
\frac{\partial \mathcal L_d}{\partial \theta_f} = \frac{\bf{\color{red}{-}} \partial \mathcal L_d}{\partial G_f(x)} \frac{\partial G_f(x)}{\partial \theta_f}
\end{align}\]
<p>In other words, for the update of \(\theta_d\), the gradients of \(\mathcal L_d\) with respect to activations are computed normally (<em>minimization</em>), but they are then propagated with a minus sign in the feature extraction part of the network (<em>maximization</em>).
Augmented with the gradient reversal layer, the final model is trained by minimizing the sum of losses \(\mathcal L_d + \mathcal L_y\), which corresponds to the optimization problem in <strong>(1-3)</strong>.</p>
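<p>The behaviour of the gradient reversal layer can be sketched with a hand-written forward and backward pass. The example below is a minimal NumPy illustration, not the authors' implementation: the class name <code>GradientReversal</code>, the linear feature extractor, the logistic domain classifier and all numerical values are assumptions made for the example.</p>

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; multiplies the gradient by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        return features

    def backward(self, grad_output):
        return -self.lam * grad_output

# Toy model (hypothetical): G_f(x) = W x and G_d(f) = sigmoid(v . f)
x = np.array([1.0, -2.0])
W = np.array([[0.5, 0.1], [-0.3, 0.8]])
v = np.array([1.0, -1.0])
label = 1.0  # domain label of x

grl = GradientReversal(lam=0.5)
f = W @ x
p = 1.0 / (1.0 + np.exp(-(v @ grl.forward(f))))

# Derivative of the binary cross-entropy L_d with respect to the logit
dlogit = p - label
grad_v = dlogit * f                          # domain classifier: standard gradient
grad_f = dlogit * v                          # gradient arriving at the GRL output
grad_W = np.outer(grl.backward(grad_f), x)   # reversed gradient reaches the feature extractor
```

<p>A single descent step on both <code>v</code> and <code>W</code> with these gradients then decreases \(\mathcal L_d\) with respect to the domain classifier while increasing it with respect to the feature extractor.</p>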
<div class="figure">
<img src="/images/posts/dann.png" />
<p><b>Figure:</b> The proposed architecture includes a <span style="color: green">deep feature extractor</span> and a <span style="color: blue">deep label predictor</span>.
Unsupervised domain adaptation is achieved by adding a <span style="color: fuchsia">domain classifier</span> connected to the feature extractor via a gradient reversal layer that multiplies
the gradient by a certain negative constant during backpropagation. </p>
</div>
<hr />
<h2 class="section experiments"> Experiments </h2>
<h3 id="datasets">Datasets</h3>
<p>The paper presents extensive results on the following settings:</p>
<ul>
<li><strong>Toy dataset</strong>: A toy example based on the <em>two half-moons dataset</em>, where the source domain consists of the standard binary classification task with the two half-moons, and the target is the same, but with a 30 degrees rotation. They compare the <code class="language-plaintext highlighter-rouge">DANN</code> to a <code class="language-plaintext highlighter-rouge">NN</code> model which has the same architecture but without the <code class="language-plaintext highlighter-rouge">GRL</code>: in other words, the baseline directly minimizes both the task and domain classification losses.</li>
<li><strong>Sentiment Analysis</strong>: These experiments are performed on the <em>Amazon reviews dataset</em> which contains product reviews from four different domains (hence 12 different source to target scenarios) which have to be classified as either positive or negative reviews.</li>
<li><strong>Image Classification</strong>: Here the model is evaluated on various image classification tasks including MNIST \(\rightarrow\) SVHN, or different domain pairs from the OFFICE dataset <span class="citations">[2]</span>.</li>
<li><strong>Person Re-identification</strong>: The task of person identification across various visual domains.</li>
</ul>
<h3 id="validation">Validation</h3>
<p>Setting hyperparameters is a difficult problem, as we cannot directly evaluate the model on the target domain (no labeled data available). Instead of standard cross-validation, the authors use <em>reverse validation</em> based on a technique introduced in <span class="citations">[3]</span>: First, the (labeled) source set \(S\) and (unlabeled) target set \(T\) are each <em>split into a training and validation set</em>, \(S'\) and \(S_V\) (resp. \(T'\) and \(T_V\)).
Using these splits, a model \(\eta\) is trained on \(S'\rightarrow T'\). Then a second model \(\eta_r\) is trained for the <em>reverse direction</em> on the set \(\{ (x, \eta(x)),\ x \in T'\} \rightarrow S'\). This reverse classifier \(\eta_r\) is then finally evaluated on the labeled validation set \(S_V\), and this accuracy is used as a validation score.</p>
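<p>The reverse validation procedure can be sketched as follows. This is a rough illustration under strong assumptions: the helper names are hypothetical, and a nearest-centroid classifier that simply ignores the unlabeled samples stands in for the domain adaptation model (a real run would train a <code>DANN</code> in both directions).</p>

```python
import numpy as np

def nearest_centroid_fit(X_labeled, y_labeled, X_unlabeled):
    """Stand-in learner (assumption): ignores X_unlabeled, unlike a real adaptation model."""
    classes = np.unique(y_labeled)
    centroids = np.stack([X_labeled[y_labeled == c].mean(axis=0) for c in classes])

    def predict(X_new):
        dists = ((X_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        return classes[dists.argmin(axis=1)]

    return predict

def reverse_validation_score(fit, S_train, y_train, T_train, S_val, y_val):
    # Forward model eta, trained on S' -> T'
    eta = fit(S_train, y_train, T_train)
    # Self-label the target training set with eta
    pseudo_labels = eta(T_train)
    # Reverse model eta_r, trained on {(x, eta(x))} -> S'
    eta_r = fit(T_train, pseudo_labels, S_train)
    # Accuracy of the reverse model on the labeled source validation set
    return np.mean(eta_r(S_val) == y_val)

# Toy data (hypothetical): the target is a shifted copy of the source
S_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])
T_train = S_train + 0.5
S_val = np.array([[0.2, 0.3], [5.1, 5.2]])
y_val = np.array([0, 1])

score = reverse_validation_score(nearest_centroid_fit, S_train, y_train, T_train, S_val, y_val)
print(score)  # 1.0
```

<p>The score only requires source labels, which is what makes it usable for hyperparameter selection in the unsupervised setting.</p>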
<h3 id="conclusions">Conclusions</h3>
<p>In general, the proposed method seems to perform very well for aligning the source and target domains in an <em>unsupervised domain adaptation</em> framework. Its main advantage is its <em>simplicity</em>, both in terms of theoretical motivation and implementation. In fact, the <code class="language-plaintext highlighter-rouge">GRL</code> is easily implemented in standard Deep Learning frameworks and can be added to any architecture.</p>
<p>The main shortcomings of the method are that <strong>(i)</strong> all experiments deal with only two domains (one source and one target), and extensions <em>to multiple domains</em> might require some tweaks (e.g., considering the sum of pairwise discrepancies as an upper-bound), and <strong>(ii)</strong> in practice, training can become <em>unstable</em> due to the adversarial training scheme. In particular, the experiment sections show that some stability tricks have to be used during training, such as using momentum or slowly increasing the contribution of the domain classification branch.</p>
<div class="figure">
<img src="/images/posts/dann_mnist_embeddings.png" />
<p><b>Figure:</b> <code>t-SNE</code> projections of the embeddings for the <span style="color: blue">source</span> (MNIST) and <span style="color: red">target</span> (SVHN) datasets without (<b>left</b>) and with (<b>right</b>) <code>DANN</code> adaptation. </p>
</div>
<hr />
<h2 class="section followup">Closely related</h2>
<h4 style="margin-bottom: 0px"> Conditional Adversarial Domain Adaptation.</h4>
<p style="text-align: right"><small>Long et al, NeurIPS 2018 <a href="https://arxiv.org/abs/1705.10667">[link]</a></small></p>
<blockquote>
<p>In this work, the authors propose a conditioning scheme for Domain Adversarial Networks. More specifically, the domain classifier is conditioned on the input’s class. However, since part of the samples are unlabeled, the conditioning uses the <em>output of the target classifier branch</em> as a proxy for the class information. Instead of simply concatenating the feature input with the condition, the authors consider a <em>multilinear conditioning</em> technique which relies on the <em>cross-covariance</em> operator. Another related paper is <span class="citations">[4]</span>. It also uses the multi-class information of the input domain, although in a simpler way.</p>
</blockquote>
<hr />
<h2 class="section references"> References </h2>
<ul>
<li><span class="citations">[1]</span> Analysis of representations for Domain Adaptation, <i>Ben-David et al, NeurIPS 2006</i></li>
<li><span class="citations">[2]</span> Adapting visual category models to new domains, <i>Saenko et al, ECCV 2010</i></li>
<li><span class="citations">[3]</span> Person re-identification via structured prediction, <i>Zhang and Saligrama, arXiv 2014</i></li>
<li><span class="citations">[4]</span> Multi-Adversarial Domain Adaptation, <i>Pei et al, AAAI 2018</i></li>
</ul>Ganin et al.In this article, the authors tackle the problem of unsupervised domain adaptation: Given labeled samples from a source distribution `\mathcal D_S` and unlabeled samples from target distribution `\mathcal D_T`, the goal is to learn a function that solves the task for both the source and target domains. In particular, the proposed model is trained on both source and target data jointly, and aims to directly learn an aligned representation of the domains, while retaining meaningful information with respect to the source labels.Deep Image Prior2019-05-14T14:59:24+02:002019-05-14T14:59:24+02:00https://ameroyer.github.io/image%20analsys/deep_image_prior<div class="summary">
Deep Neural Networks are widely used in image generation tasks for capturing a general prior on natural images from a large set of observations. However, this paper shows that the <b>structure of the network itself is able to capture a good prior</b>, at least for local cues of image statistics. More precisely, a randomly initialized convolutional neural network can be a good handcrafted prior for low-level tasks such as denoising and inpainting.
<ul>
<li><span class="pros">Pros (+):</span> Interesting results, with connections to Style Transfer and Network inversion.</li>
<li><span class="cons">Cons (-):</span> Seems like the results might depend a lot on parameter initialization, learning rate etc.</li>
</ul>
</div>
<h2 class="section theory"> Background </h2>
<p>Given a random noise vector \(z\) and conditioned on an image \(x_0\), the goal of <em>conditional image generation</em> is to generate image \(x = f_{\theta}(z; x_0)\) (where the random nature of \(z\) provides a sampling strategy for \(x\)); for instance, the task of generating a high quality image \(x\) from its lower resolution counterpart \(x_0\).</p>
<p>In particular, this encompasses <em>inverse tasks</em> such as denoising, super-resolution and inpainting that act at the <em>local pixel level</em>. Such tasks can often be phrased with an objective of the following form:</p>
\[\begin{align}
x^{\ast} = \arg\min_{x} E(x, x_0) + R(x)
\end{align}\]
<p>where \(E\) is a cost function and \(R\) is a <em>prior on the output space</em> acting as a regularizer. \(R\) is often a hand-crafted prior, for instance a smoothness constraint like Total Variation <span class="citations">[1]</span>, or, for more recent techniques, it can be implemented with adversarial training (e.g., <code class="language-plaintext highlighter-rouge">GAN</code>s).</p>
<hr />
<h2 class="section proposed">Deep Image Prior</h2>
<p>In this paper, the goal is to replace \(R\) by an <em>implicit prior captured by the neural network</em>, relative to input noise \(z\). In other words:</p>
\[\begin{align}
R(x) &= 0\ \mbox{if}\ \exists \theta\ \mbox{s.t.}\ x = f_{\theta}(z)\\
R(x) &= + \infty,\ \mbox{otherwise}
\end{align}\]
<p>Which results in the following workflow:</p>
\[\begin{align}
\theta^{\ast} = \arg\min_{\theta} E(f_{\theta}(z; x_0), x_0) \mbox{ and } x^{\ast} = f_{\theta^{\ast}}(z; x_0)
\end{align}\]
<p>One could wonder if this is a <em>good choice for a prior</em> at all. In fact, \(f\), being instantiated as a neural network, should be powerful enough that any image \(x\) can be generated from \(z\) for a certain choice of parameters \(\theta\), which means the prior should not be constraining.</p>
<p>However, the <strong>structure of the network</strong> itself effectively affects how optimization algorithms such as gradient descent will browse the output space.
To quantify this effect, the authors perform a reconstruction experiment (i.e., \(E(x, x_0) = \| x - x_0 \|\)) for different choices of the input image \(x_0\) (<strong><em>(i)</em></strong> natural image, <strong><em>(ii)</em></strong> same image with small perturbations, <strong><em>(iii)</em></strong> with large perturbations, and <strong><em>(iv)</em></strong> white noise) using a <code class="language-plaintext highlighter-rouge">U-Net</code> <span class="citations">[2]</span> inspired architecture. Experimental results show that the network descends faster to natural-looking images (cases <strong><em>(i)</em></strong> and <strong><em>(ii)</em></strong>) than to random noise (cases <strong><em>(iii)</em></strong> and <strong><em>(iv)</em></strong>).</p>
<div class="figure">
<img src="/images/posts/dip_toyexp.png" />
<p><b>Figure:</b> Learning curves for the reconstruction task using: a natural image, the same plus i.i.d. noise, the same but randomly scrambled, and white noise.</p>
</div>
<hr />
<h2 class="section experiments"> Experiments </h2>
<p>The experiments focus on three <em>image analysis tasks</em>:</p>
<ul>
<li><strong>Image denoising</strong> (\(E(x, x_0) = \|x - x_0\|\)), based on the previous observation that the model converges more easily to natural-looking images than noisy ones.</li>
<li><strong>Super Resolution</strong> (\(E(x, x_0) = \| \mbox{downscale}(x) - x_0 \|\)), to upscale the resolution of input image \(x_0\).</li>
<li><strong>Image inpainting</strong> (\(E(x, x_0) = \|(x - x_0) \odot m\|\)) where the input image \(x_0\) is masked by a mask \(m\) and the goal is to recover the missing pixels.</li>
</ul>
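<p>The three data terms listed above can be written down directly. The sketch below is only illustrative and relies on assumptions that are not from the paper: images are 2D grayscale NumPy arrays, <code>downscale</code> is implemented as average pooling, and the mask is 1 on known pixels and 0 on missing ones.</p>

```python
import numpy as np

def denoising_energy(x, x0):
    # E(x, x0) = ||x - x0||
    return np.linalg.norm(x - x0)

def downscale(x, factor):
    # Average pooling over non-overlapping factor x factor blocks (assumed operator)
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def super_resolution_energy(x, x0, factor):
    # E(x, x0) = ||downscale(x) - x0||, where x0 is the low-resolution observation
    return np.linalg.norm(downscale(x, factor) - x0)

def inpainting_energy(x, x0, mask):
    # E(x, x0) = ||(x - x0) * m||: only the known pixels (mask == 1) are compared
    return np.linalg.norm((x - x0) * mask)

# Toy 4 x 4 image (hypothetical values)
x = np.arange(16.0).reshape(4, 4)
x0_low = downscale(x, 2)

mask = np.ones_like(x)
mask[1, 2] = 0.0          # one missing pixel
x0_masked = x.copy()
x0_masked[1, 2] = -100.0  # arbitrary value in the missing region

print(super_resolution_energy(x, x0_low, 2))  # 0.0
print(inpainting_energy(x, x0_masked, mask))  # 0.0
```

<p>In all three cases the energy only constrains the output where observations exist; the rest of the image is determined by the implicit prior of the network.</p>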
<p>The method seems to <em>outperform most non-trained methods</em>, when available (e.g., bicubic upsampling for Super-Resolution), but is still often outperformed by learning-based ones. The <em>inpainting results</em> are particularly interesting, and I do not know of any other non-trained baselines for this task. The method obviously performs poorly when the obscured region requires highly semantic knowledge, but it seems to perform well on more reasonable benchmarks.</p>
<p>Additionally, the authors test the proposed prior for diagnosing neural networks by <em>generating natural pre-images</em> for neural activations of deep layers. Qualitative images look better than other handcrafted priors (total variation) and are not biased to specific datasets as are trained methods.</p>
<div class="figure">
<img src="/images/posts/dip_full.png" />
<p><b>Figure:</b> Example comparison between the proposed Deep Image Prior and various baselines for the task of Super-Resolution.</p>
</div>
<h2 class="section followup">Closely related (follow-up work)</h2>
<h4 style="margin-bottom: 0px">Deep Decoder: Concise Image Representations from Untrained Non-Convolutional Networks</h4>
<p style="text-align: right"><small>Heckel and Hand, <a href="https://arxiv.org/abs/1810.03982">[link]</a></small></p>
<blockquote>
<p>This paper builds on Deep Image Prior but proposes a much simpler architecture which is <em>under-parametrized</em> and <em>non-convolutional</em>. In particular, there are fewer weight parameters than the dimensionality of the output image (in comparison, DIP was using a <code class="language-plaintext highlighter-rouge">U-Net</code> based architecture). This property implies that <em>the weights of the network can additionally be used as a compressed representation</em> of the image. In order to test for compression, the authors use their architecture to reconstruct image \(x\) for different compression ratios \(k\) (i.e., the number of network parameters \(N\) is \(k\) times smaller than the output dimension of the images).</p>
</blockquote>
<blockquote>
<p>The deep decoder architecture combines standard blocks, including linear combinations of channels (\(1 \times 1\) convolutions), ReLU, batch-normalization and upscaling. Note that since here we have a special case of batch size 1, the Batch Norm operator essentially normalizes the activation channel-wise. In particular, the paper contains a nice <em>theoretical justification for the denoising case</em>, in which they show that the model can only fit a certain amount of noise, which explains why it would converge to more natural-looking images, although it only applies to small networks (1 layer? possibly generalizable to multi-layer and no batch-norm).</p>
</blockquote>
<hr />
<h2 class="section references"> References </h2>
<ul>
<li><span class="citations">[1]</span> An introduction to Total Variation for Image Analysis, <i>Chambolle et al., Technical Report, 2009</i></li>
<li><span class="citations">[2]</span> U-Net: Convolutional Networks for Biomedical Image Segmentation, <i>Ronneberger et al., MICCAI 2015</i></li>
</ul>Ulyanov et al.Deep Neural Networks are widely used in image generation tasks for capturing a general prior on natural images from a large set of observations. However, this paper shows that the structure of the network itself is able to capture a good prior, at least for local cues of image statistics. More precisely, a randomly initialized convolutional neural network can be a good handcrafted prior for low-level tasks such as denoising, inpainting.A simple Neural Network Module for Relational Reasoning2019-05-14T08:59:24+02:002019-05-14T08:59:24+02:00https://ameroyer.github.io/architectures/a_simple_neural_network_module_for_relational_reasoning<div class="summary">
The authors propose a <b>relation module</b> to equip <code>CNN</code> architectures with a notion of relational reasoning, particularly useful for tasks such as visual question answering, dynamics understanding etc.
<ul>
<li><span class="pros">Pros (+):</span> Simple architecture, relies on small and flexible modules.</li>
<li><span class="cons">Cons (-):</span> Still a black-box module, hard to quantify how much "reasoning" happens.</li>
</ul>
</div>
<h2 class="section proposed"> Proposed Model</h2>
<p>The main idea of <em>Relation Networks</em> (<code class="language-plaintext highlighter-rouge">RN</code>) is to constrain the functional form of convolutional neural networks so as to explicitly learn relations between entities, rather than hoping for this property to emerge in the representation during training. Formally, let \(O\) be a set of objects of interest \(O = \{o_1 \dots o_n\}\). The Relation Network is trained to learn a representation that considers all <em>pairwise relations</em> across the objects:</p>
\[\begin{align}
\mbox{RN}(O) = f_{\phi}& \left(\sum_{i, j} g_{\theta}(o_i, o_j) \right)
\end{align}\]
<p>\(f_{\phi}\) and \(g_{\theta}\) are defined as <em>Multi Layer Perceptrons</em>. By definition, the Relation Network <strong><em>(i)</em></strong> has to consider all pairs of objects, <strong><em>(ii)</em></strong> operates directly on the set of objects hence is not constrained to a specific organization of the data, and <strong><em>(iii)</em></strong> is data-efficient in the sense that only one function, \(g_{\theta}\), is learned to capture all the possible relations: \(g\) and \(f\) are typically light modules and most of the overhead comes from the sum of pairwise components (\(n^2\)).</p>
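<p>The forward pass of the relation module can be sketched as below. This is an illustrative NumPy sketch only: the helper names <code>mlp</code>, <code>init_mlp</code> and <code>relation_network</code>, the layer sizes and the random weights are assumptions made for the example, and the pair representation is taken to be a plain concatenation of the two objects.</p>

```python
import numpy as np

def mlp(params, x):
    """Apply a ReLU MLP given as a list of (W, b) pairs; no activation after the last layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def init_mlp(rng, sizes):
    return [(0.1 * rng.normal(size=(a, b)), np.zeros(b)) for a, b in zip(sizes[:-1], sizes[1:])]

def relation_network(objects, g_params, f_params):
    """RN(O) = f(sum_{i,j} g(o_i, o_j)) for objects of shape (n, k)."""
    n = objects.shape[0]
    left = np.repeat(objects, n, axis=0)            # o_i for all n^2 ordered pairs
    right = np.tile(objects, (n, 1))                # o_j for all n^2 ordered pairs
    pairs = np.concatenate([left, right], axis=1)   # shape (n^2, 2k)
    relations = mlp(g_params, pairs)                # g applied to every pair
    return mlp(f_params, relations.sum(axis=0))     # f applied to the summed relations

rng = np.random.default_rng(0)
n, k = 5, 4                                  # 5 objects with 4 features each (toy sizes)
objects = rng.normal(size=(n, k))
g_params = init_mlp(rng, [2 * k, 16, 8])
f_params = init_mlp(rng, [8, 16, 3])

out = relation_network(objects, g_params, f_params)
out_shuffled = relation_network(objects[::-1], g_params, f_params)
```

<p>Because the pairwise terms are summed, the output does not depend on the order in which the objects are presented, which matches property <strong><em>(ii)</em></strong> above.</p>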
<p>The <em>objects</em> are the basic elements of the relational process we want to model. They are defined with regard to the task at hand, for instance:</p>
<ul>
<li>
<p><strong>Attending relations between objects in an image</strong>: The image is first processed through a fully-convolutional network. Each of the resulting cells is taken as an object, which is a feature of dimensions \(k\), additionally tagged with its position in the feature map.</p>
</li>
<li>
<p><strong>Sequence of images.</strong> In that case, each image is first fed through a feature extractor and the resulting embedding is used as an object. The goal is to model relations between images across the sequence.</p>
</li>
</ul>
<div class="figure">
<img src="/images/posts/relation_network.png" />
<p><b>Figure:</b> Example of applying the Relation Network for <b>Visual Question Answering</b>. Questions are processed with an <code>LSTM</code> to produce a question embedding, and images are processed with a <code>CNN</code> to produce a set of objects for the <code>RN</code>.</p>
</div>
<hr />
<h2 class="section experiments"> Experiments </h2>
<p>The main evaluation is done on the <code class="language-plaintext highlighter-rouge">CLEVR</code> dataset <span class="citations">[2]</span>. The main message seems to be that the proposed module is very simple and yet often improves the model accuracy when added to various architectures (<code class="language-plaintext highlighter-rouge">CNN</code>, <code class="language-plaintext highlighter-rouge">CNN + LSTM</code> etc.) introduced in <span class="citations">[1]</span>. The main baseline they compare to (and outperform) is <em>Spatial Attention</em> (<code class="language-plaintext highlighter-rouge">SA</code>) which is another simple method to integrate some form of relational reasoning in a neural architecture.</p>
<hr />
<h2 class="section followup">Closely related</h2>
<h4 style="margin-bottom: 0px"> Recurrent Relational Neural Networks <span class="citations">[3]</span></h4>
<p style="text-align: left">Palm et al, <a href="https://arxiv.org/pdf/1711.08028.pdf">[link]</a></p>
<blockquote>
<p>This paper builds on the Relation Network architecture and proposes to explore <em>more complex relational structures</em>, defined as a graph, using a <em>message passing</em> approach. Formally, we are given a graph with vertices \(\mathcal V = \{v_i\}\) and edges \(\mathcal E = \{e_{i, j}\}\). By abuse of notation, \(v_i\) also denotes the embedding for vertex \(i\) (e.g. obtained via a CNN) and \(e_{i, j}\) is 1 where \(i\) and \(j\) are linked, 0 otherwise. To each node we associate a <em>hidden state</em> \(h_i^t\) at iteration \(t\), which will be updated via message passing. After a few iterations, the resulting state is passed through an <code class="language-plaintext highlighter-rouge">MLP</code> \(r\) to output the result (either for each node or for the whole graph):</p>
</blockquote>
\[\begin{align}
h_i^0 &= v_i\\
h_i^{t + 1} &= f_{\phi} \left( h_i^t, v_i, \sum_{j} e_{i, j} g_{\theta}(h^t_i, h^t_j) \right)\\
o_i &= r(h_i^T) \mbox{ or } o = r(\sum_i h_i^T)
\end{align}\]
<blockquote>
<p>Comparing to the original Relation Network:</p>
<ul>
<li>Each update rule is a Relation Network that only looks at <em>pairwise relations between linked vertices</em>. The message passing scheme additionally introduces the notion of recurrence, and the dependency on the previous hidden state.</li>
<li>The dependence on \(h_i^t\) could <em>in theory</em> be avoided by adding self-edges from \(v_i\) to \(v_i\), to make it closer to the Relation Network formulation.</li>
<li>Adding \(v_i\) as input of \(f_\phi\) looks like a simple trick to avoid long-term memory problems.</li>
</ul>
</blockquote>
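<p>The update rule above can be sketched in a few lines of Python. This is only a toy version under stated assumptions: node states are scalars, and <code>g_theta</code> and <code>f_phi</code> are simple hand-written functions standing in for the learned networks.</p>

```python
def g_theta(h_i, h_j):
    """Toy message function (learned in the paper): passes on the neighbor's state."""
    return h_j

def f_phi(h_i, v_i, message):
    """Toy node update (learned in the paper): average of the three inputs."""
    return (h_i + v_i + message) / 3.0

def message_passing(v, edges, num_steps):
    """Iterates h_i^{t+1} = f_phi(h_i^t, v_i, sum_j e_ij * g_theta(h_i^t, h_j^t))."""
    h = list(v)  # h_i^0 = v_i
    n = len(v)
    for _ in range(num_steps):
        new_h = []
        for i in range(n):
            message = sum(edges[i][j] * g_theta(h[i], h[j]) for j in range(n))
            new_h.append(f_phi(h[i], v[i], message))
        h = new_h  # all nodes are updated synchronously
    return h

# Chain graph 0 - 1 - 2, with a non-zero embedding on node 0 only.
v = [3.0, 0.0, 0.0]
edges = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]
h1 = message_passing(v, edges, num_steps=1)
h2 = message_passing(v, edges, num_steps=2)
```

<p>In this example, node 2 is two hops away from node 0, so its state only becomes non-zero after the second iteration: the number of iterations bounds how far information can travel in the graph.</p>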
<blockquote>
<p>The <em>experiments</em> essentially compare the proposed <code class="language-plaintext highlighter-rouge">RRNN</code> model to the Relation Network and classical recurrent architectures such as <code class="language-plaintext highlighter-rouge">LSTM</code>. They consider three datasets:</p>
<ul>
<li><strong>bAbI.</strong> NLP question answering task with some reasoning involved. Solves 19.7 (out of 20) tasks on average, while simple RN solved around 18 of them reliably.</li>
<li><strong>Pretty CLEVR.</strong> A CLEVR-like dataset (only with simple 2D shapes) with questions involving various steps of reasoning, e.g. “which is the shape \(n\) steps of the red circle?”</li>
<li><strong>Sudoku.</strong> The graph contains 81 nodes (one for each cell in the sudoku), with edges between cells belonging to the same row, column or block.</li>
</ul>
</blockquote>
<h4 style="margin-bottom: 0px; margin-top:50px"> Multi-Layer Relation Neural Networks <span class="citations">[4]</span></h4>
<p style="text-align: left">Jahrens and Martinetz, <a href="https://arxiv.org/pdf/1811.01838.pdf">[link]</a></p>
<blockquote>
<p>This paper presents a very simple trick to make the Relation Network consider higher-order relations than pairwise, while retaining some efficiency. Essentially the model can be written as follows:</p>
</blockquote>
\[\begin{align}
h_{i, j}^0 &= g^0_{\theta}(x_i, x_j) \\
h_{i, j}^t &= g^{t}_{\theta}\left(\sum_k h_{i, k}^{t - 1}, \sum_k h_{j, k}^{t - 1}\right) \\
\mbox{MLRN}(O) &= f_{\phi}(\sum_{i, j} h^T_{i, j})
\end{align}\]
<blockquote>
<p>It is not clear why this model would be equivalent to explicitly considering higher-level relations (as it is rather combining pairwise terms for a <em>finite number of steps</em>). According to the experiments it seems that indeed this architecture could be better fitted for the studied tasks (e.g. over the Relation Network or Recurrent Relation Network) but it also makes the model even harder to interpret.</p>
</blockquote>
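<p>To make the layered construction more concrete, here is a minimal Python sketch of the recursion on scalar inputs. The per-layer functions are toy choices made up for this example (a product for the first layer, a sum for the following ones, and the identity for \(f_{\phi}\)), not the learned networks of the paper.</p>

```python
def mlrn(xs, num_layers):
    """Toy multi-layer relation network on scalar objects."""
    n = len(xs)
    # Layer 0: one pairwise term per (i, j); toy g^0 is a product.
    h = [[xs[i] * xs[j] for j in range(n)] for i in range(n)]
    for _ in range(num_layers):
        # Aggregate each row: sum_k h_{i,k} from the previous layer.
        row_sums = [sum(h[i]) for i in range(n)]
        # Toy g^t: adds the two aggregated rows of the pair (i, j).
        h = [[row_sums[i] + row_sums[j] for j in range(n)] for i in range(n)]
    # Toy f_phi: identity applied to the sum over all pairs.
    return sum(sum(row) for row in h)

single_layer = mlrn([1.0, 2.0], num_layers=0)  # plain pairwise sum
two_layers = mlrn([1.0, 2.0], num_layers=1)
```

<p>After one additional layer, each entry \(h_{i, j}\) depends on every object through the row sums, although it is still assembled from pairwise terms only, which is the concern raised above.</p>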
<hr />
<h2 class="section references">References</h2>
<ul>
<li><span class="citations">[1]</span> Inferring and executing programs for visual reasoning, <i>Johnson et al, ICCV 2017</i></li>
<li><span class="citations">[2]</span> CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, <i>Johnson et al, CVPR 2017</i></li>
<li><span class="citations">[3]</span> Recurrent Relational Neural Networks, <i>Palm et al, NeurIPS 2018</i></li>
<li><span class="citations">[4]</span> Multi-Layer Relation Neural Networks, <i>Jahrens and Martinetz, arXiv 2018</i></li>
</ul>
Santoro et al.
The authors propose a relation module to equip CNN architectures with notion of relational reasoning, particularly useful for tasks such as visual question answering, dynamics understanding etc.
Automatically Composing Representation Transformations as a Mean for Generalization
2019-05-14T08:59:24+02:00
https://ameroyer.github.io/domain%20adaptation/automatically_composing_representation_transformations_as_a_mean_for_generalization
<div class="summary">
The authors focus on solving <b>recursive</b> tasks which can be decomposed into a sequence of simpler algorithmic procedures (e.g., arithmetic problems, geometric transformations). The main difficulties of this approach are <b>(i)</b> how to actually decompose the task into simpler blocks and <b>(ii)</b> how to extrapolate to more complex problems from learning on simpler individual tasks.
The authors propose the <b>compositional recursive learner</b> (<code>CRL</code>) to learn at the same time both the structure of the task and its components.
<ul>
<li><span class="pros">Pros (+):</span> This problem is well motivated and seems a very promising direction for learning domain-agnostic components.</li>
<li><span class="cons">Cons (-):</span> The actual implementation description lacks crucial details and I am not sure how easy it would be to reimplement.</li>
</ul>
</div>
<h2 class="section proposed"> Proposed model</h2>
<h3 id="problem-definition">Problem definition</h3>
<p>A <em>problem</em> \(P_i\) is defined as a transformation \(x_i : t_x \mapsto y_i : t_y\), where \(t_x\) and \(t_y\) are the respective types of \(x\) and \(y\). However, since we only consider recursive problems here, \(t_x = t_y\).
We define a <em>family of problems</em> \(\mathcal P\) as a set of composite recursive problems that share regularities. The goal of <code class="language-plaintext highlighter-rouge">CRL</code> is to extrapolate to solve new compositions of these tasks, using knowledge from the limited subset of tasks it has seen during training.</p>
<h3 id="implementation">Implementation</h3>
<p>In essence, the problem can be formulated as a sequential decision-making task via a <em>meta-level MDP</em> (\(\mathcal X\), \(\mathcal F\), \(\mathcal P_{\mbox{meta}}\), \(r\), \(\gamma\)), where \(\mathcal X\) is the <em>set of states</em>, i.e., representations; \(\mathcal F\) is a <em>set of computations</em>, i.e., instances of the transformations we consider, for instance as neural networks, and an additional special function <code class="language-plaintext highlighter-rouge">HALT</code> that stops the execution; \(\mathcal P_{\mbox{meta}}: (x_t, f_t, x_{t + 1}) \mapsto c \in [0, 1]\) is the <em>transition function</em>, which assigns a probability to each possible transition. Finally, \(r\) is the <em>reward function</em> and \(\gamma\) a decay factor.</p>
<p>More specifically, the <code class="language-plaintext highlighter-rouge">CRL</code> is implemented as <em>a set of neural networks</em>, \(f_k \in \mathcal F\), and a <em>controller</em> \(\pi(f\ |\ \mathbf{x}, t_y)\) which selects the best course of action given the current history of representations \(\mathbf{x}\) and target type \(t_y\).
The loss is back-propagated through the functions \(f\), and the controller is trained as a Reinforcement Learning (<code class="language-plaintext highlighter-rouge">RL</code>) agent with a sparse reward (it only knows the final target result).
An additional important training scheme is the use of <em>curriculum learning</em> i.e., start by learning small transformations and then consider more complex compositions, increasing the state space little by little.</p>
<div class="figure">
<img src="/images/posts/crl.png" />
<p><b>Figure:</b> <b>(top-left)</b> <code>CRL</code> is a symbiotic relationship between a
controller and evaluator: the controller selects a module <code>m</code> given an intermediate representation <code>x</code> and the
evaluator applies <code>m</code> on <code>x</code> to create a new representation. <b>(bottom-left)</b> <code>CRL</code> dynamically learns the
structure of a program customized for its problem, and this program can be viewed as a finite state machine.
<b>(right)</b> A series of computations in the program is equivalent to a traversal through a Meta-MDP, where modules
can be reused across different stages of computation, allowing for recursive computation.</p>
</div>
<hr />
<h2 class="section experiments"> Experiments</h2>
<h3 id="multilingual-arithmetic">Multilingual Arithmetic</h3>
<p>The learner will aim to solve recursive arithmetic expressions across 5 languages: <code class="language-plaintext highlighter-rouge">English</code>, <code class="language-plaintext highlighter-rouge">Numerals</code>, <code class="language-plaintext highlighter-rouge">PigLatin</code>, <code class="language-plaintext highlighter-rouge">Reversed-English</code>, <code class="language-plaintext highlighter-rouge">Spanish</code>. The input is a tuple \((x^s, t_y)\), where \(x^s\) is the arithmetic expression expressed in source language \(s\), and \(t_y\) is the output language.</p>
<ul>
<li>
<p><strong>Training:</strong> The learner trains on a curriculum of a limited set of 2, 3, 4, 5-length expressions. During training, each source language is seen with four target languages (and one held out for testing) and each target language is seen with four source languages (and one held out for testing).</p>
</li>
<li>
<p><strong>Testing:</strong> The learner is asked to generalize to 5-length expressions (<em>test set</em>) and to extrapolate to 10-length expressions (<em>extrapolation set</em>) with unseen language pairs.</p>
</li>
</ul>
<p>The authors consider two main types of functional units for this task: a <em>reducer</em> takes as input a window of three terms in the input expression and outputs a softmax distribution over the vocabulary, while a <em>translator</em> applies a function to every element of the input sequence and outputs a sequence of the same size.</p>
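<p>The interface of these two units can be sketched as follows. Everything here is hand-coded for illustration: in the paper both units are learned networks and the reducer outputs a distribution over tokens rather than a single token. The vocabulary table, the left-to-right reduction order and the modulo 10 (used to keep every result a single token) are assumptions of this sketch.</p>

```python
NUMERAL_TO_ENGLISH = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
    "+": "plus", "-": "minus", "*": "times",
}

def reducer(window):
    """Hand-coded stand-in for a learned reducer: (a, op, b) -> one token."""
    a, op, b = window
    a, b = int(a), int(b)
    result = {"+": a + b, "-": a - b, "*": a * b}[op]
    return str(result % 10)  # assumption: results stay single-digit tokens

def translator(sequence, table):
    """Applies the same token-level function to every element of the sequence."""
    return [table[token] for token in sequence]

def solve(expression):
    """Repeatedly reduces the leftmost window of three terms (no operator precedence)."""
    while len(expression) >= 3:
        expression = [reducer(expression[:3])] + expression[3:]
    return expression

expression = ["2", "+", "3", "*", "4"]
reduced = solve(expression)
answer = translator(reduced, NUMERAL_TO_ENGLISH)
translated_input = translator(expression, NUMERAL_TO_ENGLISH)
```

<p>Note that a reducer shortens the sequence by two tokens at each call, while a translator preserves its length, so the two kinds of units can be chained in any order.</p>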
<p>The <code class="language-plaintext highlighter-rouge">CRL</code> is compared to a baseline <code class="language-plaintext highlighter-rouge">RNN</code> architecture that directly tries to map a variable length input sequence to the target output. On the test set, <code class="language-plaintext highlighter-rouge">RNN</code> and <code class="language-plaintext highlighter-rouge">CRL</code> yield similar accuracies, although <code class="language-plaintext highlighter-rouge">CRL</code> usually requires fewer training samples and/or fewer training iterations. On the extrapolation set however, <code class="language-plaintext highlighter-rouge">CRL</code> more clearly outperforms <code class="language-plaintext highlighter-rouge">RNN</code>.
Interestingly, the <code class="language-plaintext highlighter-rouge">CRL</code> results usually have a much bigger <em>variance</em>, which would be interesting to analyze qualitatively. Moreover, the use of <em>curriculum learning</em> significantly improves the model performance. Finally, qualitative results show that the reducers and translators are interpretable <em>to some degree</em>: e.g., it is possible to map some of the reducers to specific operations; however, due to the unsupervised nature of the task, the mapping is not always straightforward.</p>
<h3 id="image-transformations">Image Transformations</h3>
<p>This time the functional units are composed of three specialized <em>Spatial Transformer Networks</em> <span class="citations">[1]</span> to learn rotation, scale and translation, and an identity function. Overall this setting does not yield very good quantitative results.
More precisely, one of the main challenges, since we are acting on a visual domain, is to <em>deduce the structure of the task from information which lacks clear structure</em> (pixel matrices). Additionally, the fact that all inputs and outputs have the same domain (images) and that only a sparse reward is available makes it more difficult for the controller to distinguish between functionalities, i.e., it could collapse to using only one transformer.</p>
<hr />
<h2 class="section references"> References </h2>
<ul>
<li><span class="citations">[1]</span> Spatial Transformer Networks, <i>Jaderberg et al., NeurIPS 2015</i></li>
</ul>
Chang et al.
Learning a SAT Solver from Single-Bit Supervision
2019-05-14T08:59:24+02:00
https://ameroyer.github.io/structured%20learning/Learning_a_sat_solver_from_single_bit_supervision
<div class="summary">
The goal is to solve <code>SAT</code> problems with weak supervision: In that case, a model is trained only to predict the <b>satisfiability</b> of a formula in conjunctive normal form. As a byproduct, if the formula is satisfiable, an actual satisfying assignment can be worked out from the network's activations in most cases.
<ul>
<li><span class="pros">Pros (+):</span> No need for extensive annotation; seems to extrapolate nicely to harder problems by increasing the number of message-passing iterations.</li>
<li><span class="cons">Cons (-):</span> Limited practical applicability since it is outperformed by classical <code>SAT</code> solvers.</li>
</ul>
</div>
<h2 class="section proposed"> Model: NeuroSAT</h2>
<h3 id="input">Input</h3>
<p>We consider boolean logic formulas in their <em>conjunctive normal form</em> (CNF), i.e. each input formula is represented as a conjunction (\(\land\)) of <em>clauses</em>, which are themselves disjunctions (\(\lor\)) of literals (positive or negative instances of variables). The goal is to learn a classifier to predict whether such a formula is satisfiable.</p>
<p>A first problem is how to encode the input formula in such a way that it preserves the CNF invariances (invariance to negating a literal in all clauses, invariance to permutations in \(\lor\) and \(\land\) etc.). The authors use a standard <em>undirected graph representation</em> where:</p>
<ul>
<li>\(\mathcal V\): vertices are the literals (positive and negative form of variables, denoted as \(x\) and \(\bar x\)) and the clauses occurring in the input formula</li>
<li>\(\mathcal E\): Edges are added to connect (i) the literals with clauses they appear in and (ii) each literal to its negative counterpart.</li>
</ul>
<p>The graph relations are encoded as an <em>adjacency matrix</em>, \(A\), with as many rows as there are literals and as many columns as there are clauses. Note that this structure <em>does not constrain the vertices ordering</em>, and does not give preferential treatment to either positive or negative literals. However, it still has some caveats, which can be avoided by pre-processing the formula. For instance, when there are disconnected components in the graph, the averaging decision rule (see next paragraph) can lead to false positives.</p>
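<p>As an illustration, the adjacency matrix of a small formula could be built as follows. The signed-integer clause encoding (<code>k</code> for \(x_k\), <code>-k</code> for \(\bar x_k\)) and the row layout (all positive literals first, then all negative ones) are conventions chosen for this sketch, not something prescribed by the paper.</p>

```python
def literal_index(lit, num_vars):
    """Row of a literal: x_k -> k - 1, and its negation -> num_vars + k - 1."""
    return abs(lit) - 1 if lit > 0 else num_vars + abs(lit) - 1

def adjacency_matrix(clauses, num_vars):
    """A[l][c] = 1 iff literal l appears in clause c (2n rows, m columns)."""
    A = [[0] * len(clauses) for _ in range(2 * num_vars)]
    for c, clause in enumerate(clauses):
        for lit in clause:
            A[literal_index(lit, num_vars)][c] = 1
    return A

# (x1 or not x2) and (x2 or x3)
clauses = [[1, -2], [2, 3]]
A = adjacency_matrix(clauses, num_vars=3)

# Reordering the literals inside each clause yields the same matrix.
A_permuted = adjacency_matrix([[-2, 1], [3, 2]], num_vars=3)
```

<p>Reordering the clauses themselves only permutes the columns of \(A\), which the message-passing updates are insensitive to since they aggregate over rows and columns by summation.</p>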
<h3 id="message-passing-model">Message-passing model</h3>
<p>At a high level, the model keeps track of an embedding for each literal and each clause (\(L^t\) and \(C^t\)), updated via <em>message-passing on the graph</em>, and combined via a Multi Layer Perceptron (<code class="language-plaintext highlighter-rouge">MLP</code>) to output the model prediction of the formula’s satisfiability. The model updates are as follows:</p>
\[\begin{align}
C^t, h_C^t &= \texttt{LSTM}_\texttt{C}(h_C^{t - 1}, A^T \texttt{MLP}_{\texttt{L}}(L^{t - 1}) )\ \ \ \ \ \ \ \ \ \ \ (1)\\
L^t, h_L^t &= \texttt{LSTM}_\texttt{L}(h_L^{t - 1}, \overline{L^{t - 1}}, A\ \texttt{MLP}_{\texttt{C}}(C^{t }) )\ \ \ \ \ \ (2)\\
\end{align}\]
<p>where \(h\) designates a hidden context vector for the LSTMs. The operator \(L \mapsto \bar{L}\) returns \(\overline{L}\), the embedding matrix \(L\) where the row of each literal is swapped with the one corresponding to the literal’s negation.
In other words, in <strong><em>(1)</em></strong> each clause embedding is updated based on the literals that compose it, while in <strong><em>(2)</em></strong> each literal embedding is updated based on the clauses it appears in and its negated counterpart.</p>
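<p>The data flow of one such round can be sketched in plain Python. This is a deliberately simplified version: the MLPs are replaced by the identity and the LSTM updates by a plain sum, so only the role of \(A\), \(A^T\) and the flip operator is shown. The row layout (positive literals first, then their negations in the same order) is an assumption of this sketch.</p>

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(M, N):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*N)] for row in M]

def flip(L):
    """Swaps the row of each literal with the row of its negation."""
    n = len(L) // 2
    return L[n:] + L[:n]

def message_round(L, A):
    """One simplified round: clauses gather literals, then literals gather clauses."""
    C = matmul(transpose(A), L)       # (1) each clause sums its literals
    from_clauses = matmul(A, C)       # (2) each literal sums its clauses ...
    from_negation = flip(L)           # ... and also sees its negated counterpart
    new_L = [[c + f for c, f in zip(c_row, f_row)]
             for c_row, f_row in zip(from_clauses, from_negation)]
    return new_L, C

# (x1 or not x2) and (x2 or x3); rows: x1, x2, x3, not x1, not x2, not x3
A = [[1, 0], [0, 1], [0, 1], [0, 0], [1, 0], [0, 0]]
L = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]  # 1-dimensional toy embeddings
new_L, C = message_round(L, A)
```

<p>Literals that appear in no clause (here \(\bar x_1\) and \(\bar x_3\)) receive no clause message and are only updated through their negation, which is why the flip term matters.</p>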
<p>After \(T\) iterations of this message-passing scheme, the model computes a <em>logit for the satisfiability classification problem</em>, which is trained via sigmoid cross-entropy:</p>
\[\begin{align}
L^t_{\mbox{vote}} &= \texttt{MLP}_{\texttt{vote}}(L^t)\\
y^t &= \mbox{mean}(L^t_{\mbox{vote}})
\end{align}\]
<h3 id="building-the-training-set">Building the training set</h3>
<p>The training set is built such that for any satisfiable training formula \(S\), it also includes an unsatisfiable counterpart \(S'\) which differs from \(S\) <em>only by negating one literal in one clause</em>. These carefully curated samples should constrain the model to pick up substantial characteristics of the formula. In practice, the model is trained on formulas containing up to <em>40 variables</em>, and on average <em>200 clauses</em>. At this size, the SAT problems can still be solved by state-of-the-art solvers (yielding the supervision required to train the model) but are large enough to prove challenging for Machine Learning models.</p>
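<p>To give an idea of what such a pair looks like, the sketch below checks a tiny hand-picked example. The brute-force check stands in for the state-of-the-art solver used to label the data and is only viable for a handful of variables; the two formulas and the signed-integer clause encoding are made up for this illustration.</p>

```python
from itertools import product

def is_satisfiable(clauses, num_vars):
    """Brute-force check over all 2^n assignments (tiny formulas only)."""
    for values in product([False, True], repeat=num_vars):
        if all(any(values[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

# All four possible clauses over two variables: unsatisfiable.
unsat_formula = [[1, 2], [-1, 2], [1, -2], [-1, -2]]
# Same formula with the last literal of the last clause negated: satisfiable.
sat_formula = [[1, 2], [-1, 2], [1, -2], [-1, 2]]

num_differences = sum(
    a != b
    for clause_a, clause_b in zip(unsat_formula, sat_formula)
    for a, b in zip(clause_a, clause_b)
)
```

<p>The two formulas share almost all of their structure, so a classifier cannot separate them using superficial statistics such as the number of clauses or the clause lengths.</p>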
<h3 id="inferring-the-sat-assignment">Inferring the SAT assignment</h3>
<p>When a formula is satisfiable, one often also wants to know a <em>valuation</em> (variable assignment) that satisfies it.
Recall that \(L^t_{\mbox{vote}}\) encodes a “vote” for every literal and its negative counterpart. Qualitative experiments show that those scores cannot be directly used for inferring the variable assignment, however they do induce a nice clustering of the variables (once the message passing has converged). Hence an assignment can be found as follows:</p>
<ul>
<li><strong>(1)</strong> Reshape \(L^T_{\mbox{vote}}\) to size \((n, 2)\) where \(n\) is the number of variables, such that each row contains the votes for a literal and its negation.</li>
<li><strong>(2)</strong> Cluster the literals into two clusters with centers \(\Delta_1\) and \(\Delta_2\) using the following criterion:
\begin{align}
|x_i - \Delta_1|^2 + |\overline{x_i} - \Delta_2|^2 \leq |x_i - \Delta_2|^2 + |\overline{x_i} - \Delta_1|^2
\end{align}</li>
<li><strong>(3)</strong> Try the two resulting assignments (set \(\Delta_1\) to true and \(\Delta_2\) to false, or vice-versa) and choose the one that yields satisfiability if any.</li>
</ul>
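<p>The three steps above can be sketched as follows. The vote values are invented for the example, and the cluster centers are obtained with a plain 1-D 2-means (Lloyd's algorithm), which is one possible choice rather than necessarily the procedure used by the authors.</p>

```python
def satisfies(assignment, clauses):
    """True if the assignment (one boolean per variable) satisfies every clause."""
    return all(any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses)

def two_means(points, num_iters=10):
    """Plain 1-D 2-means, initialised at the smallest and largest point."""
    c1, c2 = min(points), max(points)
    for _ in range(num_iters):
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        if g1:
            c1 = sum(g1) / len(g1)
        if g2:
            c2 = sum(g2) / len(g2)
    return c1, c2

def decode(votes, clauses):
    """votes[i] = (vote of x_i, vote of its negation); returns an assignment or None."""
    c1, c2 = two_means([v for pair in votes for v in pair])
    # Criterion of step (2): is (x_i, not x_i) closer to (c1, c2) than to (c2, c1)?
    in_first = [(x - c1) ** 2 + (nx - c2) ** 2 <= (x - c2) ** 2 + (nx - c1) ** 2
                for x, nx in votes]
    # Step (3): try both ways of mapping the clusters to true / false.
    for candidate in (in_first, [not b for b in in_first]):
        if satisfies(candidate, clauses):
            return candidate
    return None

clauses = [[1, -2], [2, 3]]                       # (x1 or not x2) and (x2 or x3)
votes = [(0.9, -0.8), (-0.7, 0.8), (1.0, -0.9)]   # hypothetical vote values
assignment = decode(votes, clauses)
```

<p>Here the first candidate (\(x_2\) true, others false) violates the first clause, so the procedure falls back on the complementary assignment, which satisfies the formula.</p>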
<p>In practice, this method retrieves a satisfying assignment for over 70% of the satisfiable test formulas.</p>
<hr />
<h2 class="section experiments"> Experiments </h2>
<p>In practice, the <code class="language-plaintext highlighter-rouge">NeuroSAT</code> model is trained with embeddings of dimension 128 and 26 message passing iterations. The <code class="language-plaintext highlighter-rouge">MLP</code> architectures are very standard: 3 layers followed by ReLU activations. The final model obtains 85% accuracy in predicting a formula’s satisfiability on the test set.</p>
<p>It can also generalize to <em>larger problems</em>, although this requires increasing the number of message-passing iterations. However, the classification performance decreases significantly (e.g. 25% for 200 variables) and the number of iterations <em>scales linearly</em> with the number of variables (at least in the paper's experiments).</p>
<div class="figure">
<img style="width:30%" src="/images/posts/neurosat1.png" /> <img style="width:69%" src="/images/posts/neurosat2.png" />
<p><b>Figure:</b> <b>(left)</b> Success rate of a <code>NeuroSAT</code> model trained on 40 variables for test set involving formulas with up to 200 variables, as a function of the number of message-passing iterations. <b>(right)</b> The sequence of literal votes across message-passing iterations on a satisfiable formula. The vote matrix is reshaped such that each row contains the votes for a literal and its negated counterpart. For several iterations, most literals vote unsat with low confidence (<span style="color: lightblue">light blue</span>). After a few iterations, there is a phase transition and all literals vote sat with very high confidence (<span style="color: red">dark red</span>), until convergence. </p>
</div>
<p>Interestingly, the model generalizes well to other classes of problems that were <em>reduced to <code class="language-plaintext highlighter-rouge">SAT</code></em> (using <code class="language-plaintext highlighter-rouge">SAT</code>’s NP-completeness), although they have a different structure than the random formulas generated for training, which seems to show that the model does learn some general characteristics of boolean formulas.</p>
<p>To summarize, the model takes advantage of the structure of Boolean formulas, and is able to predict whether an input formula is satisfiable or not with high accuracy. Moreover, even though trained only with this weak supervisory signal, it can work out a valid assignment most of the time. However it is still subpar <em>compared to standard SAT solvers</em>, which makes its applicability limited.</p>
Selsam et al.
Glow: Generative Flow with Invertible 1×1 Convolutions
2019-05-07T14:59:24+02:00
https://ameroyer.github.io/generative%20models/glow_generative_flow_with_invertible_1x1_convolution
<div class="summary">
Invertible flow based generative models such as <span class="citations">[2, 3]</span> have several advantages including exact likelihood inference process (unlike <code>VAE</code>s or <code>GAN</code>s) and easily parallelizable training and inference (unlike the sequential generative process in auto-regressive models). This paper proposes a new, more flexible, form of <b>invertible flow</b> for generative models, which builds on <span class="citations">[3]</span>.
<ul>
<li><span class="pros">Pros (+):</span> Very clear presentation, promising results both quantitative and qualitative.</li>
<li><span class="cons">Cons (-):</span> One of the disadvantages of the model seems to be its large number of parameters; it would be interesting to have a more detailed report on training time. Also, a comparison to <span class="citations">[5]</span> (a variant of <code>PixelCNN</code> that allows for faster parallelized sample generation) would be nice.</li>
</ul>
</div>
<h2 class="section theory"> Invertible flow-Based Generative Models </h2>
<p>Given input data \(x\), invertible flow-based generative models are built as two-step processes that generate data from an intermediate latent representation \(z\):</p>
\[\begin{align}
z \sim p_{\theta}(z)\\
x = g_\theta^{-1}(z)
\end{align}\]
<p>where \(g_\theta\) is an <em>invertible</em> function, i.e. a bijection, \(g_\theta: \mathcal X \rightarrow \mathcal Z\). It acts as an encoder from the input data to the latent space, and its inverse generates data from latent samples.
\(g\) is usually built as a sequence of smaller invertible functions \(g = g_n \circ \dots \circ g_1\). Such a sequence is also called a <em>normalizing flow</em> <span class="citations">[1]</span>. Under this construction, the <em>change of variables formula</em> applied to \(z = g(x)\) gives the following relation between the input and latent densities:</p>
\[\begin{align}
\log p(x) &= \log p(z) + \log\ \left| \det \left( \frac{d z}{d x} \right)\right|\\
&= \log p(z) + \sum_{i=1}^n \log\ \left| \det \left( \frac{d g_{\leq i}(x)}{d g_{\leq i - 1}(x)} \right)\right|
\end{align}\]
<p>where \(\forall i \in [1; n],\ g_{\leq i} = g_i \circ \dots \circ g_1\). In particular, this means \(g_{\leq n}(x) = z\) and \(g_{\leq 0}(x) = x\). \(p_\theta(z)\) is usually chosen as a simple density such as a unit Gaussian distribution, \(p_\theta(z) = \mathcal N(z; 0, \mathbf{I})\).
In order to efficiently estimate the likelihood, the functions \(g_1, \dots g_n\) are usually chosen such that the <em>log-determinant of the Jacobian</em>, \(\log\ \left\vert \det \left( \frac{d g_{\leq i}}{d g_{\leq i - 1}} \right) \right\vert\), is easily computed, for instance by choosing transformations such that the Jacobian is a triangular matrix.</p>
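<p>A one-dimensional example may help to see the formula at work. The sketch below uses scalar affine steps \(h \mapsto a_i h + b_i\) (so each Jacobian is simply \(a_i\)) with arbitrary coefficients chosen for the example, and compares the flow likelihood with the closed-form density that the composed map implies.</p>

```python
import math

def log_normal(x, mean=0.0, std=1.0):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2

def flow_log_likelihood(x, steps):
    """log p(x) = log p(z) + sum_i log|a_i| for steps h -> a_i * h + b_i."""
    h, log_det = x, 0.0
    for a, b in steps:
        h = a * h + b                 # apply the i-th step
        log_det += math.log(abs(a))   # accumulate its log-Jacobian
    return log_normal(h) + log_det    # h is now z, scored under N(0, 1)

steps = [(2.0, 1.0), (0.5, -3.0), (4.0, 0.0)]  # composes to z = 4 x - 10
x = 0.3
log_px = flow_log_likelihood(x, steps)

# Sanity check: if z = 4 x - 10 with z ~ N(0, 1), then x ~ N(2.5, 0.25^2).
closed_form = log_normal(x, mean=2.5, std=0.25)
```

<p>Both quantities agree up to floating-point error, and the log-determinant is obtained by summing one cheap term per step, without ever forming the full Jacobian of the composed function.</p>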
<hr />
<h2 class="section proposed"> Proposed Flow Construction: GLOW</h2>
<h3 id="flow-step">Flow step</h3>
<p>Each flow step function \(g_i\) is a sequence of three operations as follows. Given an input tensor of dimensions \(h \times w \times c\):</p>
<table>
<thead>
<tr>
<th>Step Description</th>
<th>Functional Form of flow \(g_i\)</th>
<th>Inverse Function of the flow, \(g_i^{-1}\)</th>
<th>Log-determinant Expression</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>ActNorm</strong> <br /> \(\sigma: [c,]\) <br /> \(\mu: [c,]\)</td>
<td>\(y = \sigma\odot x + \mu\)</td>
<td>\(x = (y - \mu) / \sigma\)</td>
<td>\(hw\ \mbox{sum} \log(\vert\sigma\vert)\)</td>
</tr>
<tr>
<td><strong>1x1 conv</strong> <br /> \(W: [c,c]\)</td>
<td>\(y = Wx\)</td>
<td>\(x = W^{-1}y\)</td>
<td>\(h w \log \vert \det (W) \vert\)</td>
</tr>
<tr>
<td><strong>Affine Coupling</strong> <br /> <strong>(ACL)</strong> [2]</td>
<td>\(x_a,\ x_b = \mbox{split}(x)\) <br /> \((\log \sigma, \mu) = \mbox{NN}(x_b)\) <br /> \(y_a = \sigma \odot x_a + \mu\) <br /> \(y = \mbox{concat}(y_a, x_b)\)</td>
<td>\(y_a,\ y_b = \mbox{split}(y)\) <br /> \((\log \sigma, \mu) = \mbox{NN}(y_b)\) <br /> \(x_a = (y_a - \mu) / \sigma\) <br /> \(x = \mbox{concat}(x_a, y_b)\)</td>
<td>\(\mbox{sum} (\log \vert\sigma\vert)\)</td>
</tr>
</tbody>
</table>
<p><br /></p>
<ul>
<li>
<p><strong>ActNorm.</strong> The activation normalization layer is introduced as a replacement for <em>Batch Normalization</em> (<code class="language-plaintext highlighter-rouge">BN</code>) to avoid degraded performance with small mini-batch sizes, e.g. when training with batch size 1. This layer has the same form as <code class="language-plaintext highlighter-rouge">BN</code>, however the <em>bias, \(\mu\), and scale, \(\sigma\), are data-independent variables</em>: they are initialized based on an initial mini-batch of data (data-dependent initialization), but are optimized during training with the rest of the parameters, rather than estimated from the input minibatch statistics.</p>
</li>
<li>
<p><strong>1x1 convolution.</strong> This is a simple 1x1 convolutional layer. In particular, the cost of computing the determinant of \(W\) can be reduced by writing \(W\) in its <em>LU decomposition</em>, although this increases the number of parameters to be learned.</p>
</li>
<li>
<p><strong>Affine Coupling Layer.</strong> The <code class="language-plaintext highlighter-rouge">ACL</code> was introduced in <span class="citations">[2]</span>. The input tensor \(x\) is first split in half along the channel dimension. The second half, \(x_b\), is fed through a small neural network to get parameters \(\sigma\) and \(\mu\), and the corresponding affine transformation is applied to the first half, \(x_a\).
The rescaled \(x_a\) is the actual transformed output of the layer, however \(x_b\) also has to be propagated in order to make the transformation invertible, such that \(\sigma\) and \(\mu\) can also be estimated in the reverse flow.
Finally, note that the previous 1x1 convolution can be seen as a generalized <em>permutation of the input channels</em>, and guarantees that different channel combinations are seen during the <code class="language-plaintext highlighter-rouge">split</code> operation.</p>
</li>
</ul>
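<p>The three operations and their inverses can be sketched in plain Python for a single spatial position (\(h = w = 1\)) with \(c = 2\) channels. The function <code>toy_nn</code> is a hypothetical stand-in for the learned network <code>NN</code>, and the parameter values are arbitrary; the sketch only aims to show that each step is exactly invertible and contributes a cheap log-determinant term.</p>

```python
import math

def actnorm(x, sigma, mu):
    """Per-channel affine map; log-det is sum(log|sigma|) for one pixel."""
    y = [s * xi + m for xi, s, m in zip(x, sigma, mu)]
    return y, sum(math.log(abs(s)) for s in sigma)

def actnorm_inverse(y, sigma, mu):
    return [(yi - m) / s for yi, s, m in zip(y, sigma, mu)]

def det2(W):
    return W[0][0] * W[1][1] - W[0][1] * W[1][0]

def conv1x1(x, W):
    """Channel mixing y = W x (2 channels); log-det is log|det W| for one pixel."""
    y = [W[0][0] * x[0] + W[0][1] * x[1], W[1][0] * x[0] + W[1][1] * x[1]]
    return y, math.log(abs(det2(W)))

def conv1x1_inverse(y, W):
    d = det2(W)
    return [(W[1][1] * y[0] - W[0][1] * y[1]) / d,
            (W[0][0] * y[1] - W[1][0] * y[0]) / d]

def toy_nn(x_b):
    """Hypothetical stand-in for NN: maps x_b to (log_sigma, mu)."""
    return 0.5 * x_b, x_b + 1.0

def coupling(x):
    x_a, x_b = x                      # split along the channel dimension
    log_sigma, mu = toy_nn(x_b)
    return [math.exp(log_sigma) * x_a + mu, x_b], log_sigma

def coupling_inverse(y):
    y_a, y_b = y
    log_sigma, mu = toy_nn(y_b)       # y_b equals x_b, so the parameters are recovered
    return [(y_a - mu) / math.exp(log_sigma), y_b]

def flow_step(x, sigma, mu, W):
    y, ld1 = actnorm(x, sigma, mu)
    y, ld2 = conv1x1(y, W)
    y, ld3 = coupling(y)
    return y, ld1 + ld2 + ld3

def flow_step_inverse(y, sigma, mu, W):
    x = coupling_inverse(y)
    x = conv1x1_inverse(x, W)
    return actnorm_inverse(x, sigma, mu)

x = [0.5, -1.0]
sigma, mu, W = [2.0, 0.5], [0.1, -0.2], [[1.0, 2.0], [3.0, 4.0]]
y, log_det = flow_step(x, sigma, mu, W)
x_back = flow_step_inverse(y, sigma, mu, W)
```

<p>The inverse of the coupling layer never needs to invert <code>toy_nn</code> itself, which is why an arbitrarily complex network can be used there. For a full \(h \times w\) feature map, the ActNorm and 1x1 convolution terms are multiplied by \(hw\), as in the table.</p>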
<h3 id="general-pipeline">General Pipeline</h3>
<p>These operations are then combined in a <em>multi-scale architecture</em> as described in <span class="citations">[3]</span>, which in particular relies on a <em>squeezing</em> operation to trade off spatial resolution for number of output channels.
Given an input tensor of size \(s \times s \times c\), the squeezing operator takes blocks of size \(2 \times 2 \times c\) and flattens them to size \(1 \times 1 \times 4c\), which can easily be inverted by reshaping.
The final pipeline consists of \(L\) levels that operate on different scales: each level is composed of \(K\) flow steps and a final squeezing operation.</p>
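<p>A minimal sketch of the squeezing operator and its inverse on nested Python lists is given below. The order in which the four positions of a block are concatenated along the channel axis is an arbitrary choice made here; any fixed order works as long as the inverse uses the same one.</p>

```python
def squeeze(x):
    """Maps a tensor of shape [s][s][c] to shape [s/2][s/2][4c]."""
    s = len(x)
    out = []
    for i in range(0, s, 2):
        row = []
        for j in range(0, s, 2):
            # Concatenate the channels of the four positions of the 2x2 block.
            row.append(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1])
        out.append(row)
    return out

def unsqueeze(y):
    """Inverse of squeeze: maps shape [s/2][s/2][4c] back to [s][s][c]."""
    s = 2 * len(y)
    c = len(y[0][0]) // 4
    x = [[None] * s for _ in range(s)]
    for i, row in enumerate(y):
        for j, block in enumerate(row):
            x[2 * i][2 * j] = block[0:c]
            x[2 * i][2 * j + 1] = block[c:2 * c]
            x[2 * i + 1][2 * j] = block[2 * c:3 * c]
            x[2 * i + 1][2 * j + 1] = block[3 * c:4 * c]
    return x

# 4 x 4 input with a single channel holding the values 0..15.
x = [[[4 * i + j] for j in range(4)] for i in range(4)]
y = squeeze(x)
x_back = unsqueeze(y)
```

<p>Since the operation only rearranges values, its Jacobian is a permutation matrix and it contributes nothing to the log-determinant.</p>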
<div class="figure">
<img src="/images/posts/glow.png" />
<p><b>Figure:</b> Overview of the multi-layer <code>GLOW</code> architecture.</p>
</div>
<p>In summary, the <em>main differences</em> with <span class="citations">[3]</span> are:</p>
<ul>
<li>Batch Normalization is replaced with Activation Normalization</li>
<li>1x1 convolutions are considered as a more generic operation to replace permutations</li>
<li>Only channel-wise splitting is considered in the Affine Coupling Layer, while <span class="citations">[3]</span> also considered a binary spatial checkerboard pattern to split the input tensor in two.</li>
</ul>
<hr />
<h2 class="section experiments"> Experiments </h2>
<h3 id="implementation">Implementation</h3>
<p>In practice, the authors implement <code class="language-plaintext highlighter-rouge">NN</code> as a convolutional neural network of depth 3 in the <code class="language-plaintext highlighter-rouge">ACL</code>, which means that each flow step contains 4 convolutions in total. They also use \(K = 32\) flow steps in each level. Finally the number of levels \(L\) is 3 for small-scale experiments (32x32 images) and 6 for large scale (256x256 ImageNet images).
In particular this means that the model contains <em>a lot of parameters</em> (\(L \times K \times 4\) convolutions) which might be a practical disadvantage compared to other methods that produce samples of similar quality, e.g. <code class="language-plaintext highlighter-rouge">GAN</code>s. However, contrary to these models, <code class="language-plaintext highlighter-rouge">GLOW</code> provides <em>exact likelihood inference</em>.</p>
<h3 id="results">Results</h3>
<p><code class="language-plaintext highlighter-rouge">GLOW</code> outperforms <code class="language-plaintext highlighter-rouge">RealNVP</code> <span class="citations">[3]</span> in terms of data likelihood, as evaluated on standard benchmarks (ImageNet, CIFAR-10, LSUN). In particular, the 1x1 convolution performs better than other, more specific permutation operations, and only introduces a small computational overhead.</p>
<p>Qualitatively, the samples are of great quality and the model seems to scale well with higher resolution. However this greatly increases the <em>memory requirements</em>. Leveraging the model’s invertibility to avoid storing activations during the feed-forward pass, as in <span class="citations">[4]</span>, could (partially) alleviate the problem.</p>
<hr />
<h2 class="section references"> References </h2>
<ul>
<li><span class="citations">[1]</span> Variational inference with normalizing flows, <i>Rezende and Mohamed, ICML 2015</i></li>
<li><span class="citations">[2]</span> NICE: Non-linear Independent Components Estimation, <i>Dinh et al., ICLR 2015</i></li>
<li><span class="citations">[3]</span> Density estimation using Real NVP, <i>Dinh et al., ICLR 2017</i></li>
<li><span class="citations">[4]</span> The Reversible Residual Network: Backpropagation Without Storing Activations, <i>Gomez et al., NeurIPS 2017 </i></li>
<li><span class="citations">[5]</span> Parallel Multiscale Autoregressive Density Estimation, <i>Reed et al., ICML 2017</i></li>
</ul>D. Kingma and P. DhariwalInvertible flow based generative models such as [2, 3] have several advantages including exact likelihood inference process (unlike VAEs or GANs) and easily parallelizable training and inference (unlike the sequential generative process in auto-regressive models). This paper proposes a new, more flexible, form of invertible flow for generative models, which builds on [3].The Reversible Residual Network: Backpropagation Without Storing Activations2019-05-07T08:59:24+02:002019-05-07T08:59:24+02:00https://ameroyer.github.io/architectures/the_reversible_residual_network<div class="summary">
Residual Networks (<code>ResNet</code>) <span class="citations">[3]</span> have greatly advanced the state-of-the-art in Deep Learning by making it possible to train much deeper networks via the addition of skip connections. However, in order to compute gradients during the backpropagation pass, all the units' activations have to be stored during the feed-forward pass, leading to high memory requirements for these very deep networks.
Instead, the authors propose a <b>reversible architecture</b> in which activations at one layer can be computed from the ones of the next. Leveraging this invertibility property, they design a more efficient implementation of backpropagation, effectively trading compute power for memory storage.
<ul>
<li><span class="pros">Pros (+):</span> The change does not negatively impact model accuracy (for equivalent number of model parameters) and it only requires a small change in the backpropagation algorithm.</li>
<li><span class="cons">Cons (-):</span> Increased number of parameters, not fully reversible (see <code>i-RevNets</code> <span class="citations">[4]</span>)</li>
</ul>
</div>
<h2 class="section proposed"> Proposed Architecture</h2>
<h3 id="revnet">RevNet</h3>
<p>This paper proposes to incorporate ideas from previous reversible architectures, such as <code class="language-plaintext highlighter-rouge">NICE</code> <span class="citations">[1]</span>, into a standard <code class="language-plaintext highlighter-rouge">ResNet</code>. The resulting model is called <code class="language-plaintext highlighter-rouge">RevNet</code> and is composed of reversible blocks, inspired by <em>additive coupling</em> <span class="citations">[1, 2]</span>:</p>
<center>
<table>
<tr>
<th> RevNet block </th>
<th> Inverse </th>
</tr>
<tr>
<td> $$ \begin{align}
\mathbf{input }\ x&\\
x_1, x_2 &= \mbox{split}(x)\\
y_1 &= x_1 + \mathcal{F}(x_2)\\
y_2 &= x_2 + \mathcal{G}(y_1)\\
\mathbf{output}\ y &= (y_1, y_2)
\end{align} $$ </td>
<td> $$ \begin{align}
\mathbf{input }\ y&\\
y_1, y_2 &= \mbox{split}(y)\\
x_2 &= y_2 - \mathcal{G}(y_1)\\
x_1 &= y_1 - \mathcal{F}(x_2)\\
\mathbf{output}\ x &= (x_1, x_2)
\end{align} $$</td>
</tr>
</table>
</center>
<p><br /></p>
<p>where \(\mathcal F\) and \(\mathcal G\) are residual functions, composed of sequences of convolutions, <code class="language-plaintext highlighter-rouge">ReLU</code> and Batch Normalization layers, analogous to the ones in a standard <code class="language-plaintext highlighter-rouge">ResNet</code> block, although operations in the reversible blocks need to have a stride of 1 to <em>avoid information loss</em> and preserve invertibility. Finally, for the <code class="language-plaintext highlighter-rouge">split</code> operation, the authors consider splitting the input Tensor across the channel dimension as in <span class="citations">[1, 2]</span>.</p>
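<p>The forward and inverse computations of the reversible block can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: features are flat lists of floats, <code>split</code> halves the list in place of a channel split, and <code>residual_f</code> and <code>residual_g</code> are arbitrary hypothetical functions standing in for the convolutional residuals \(\mathcal F\) and \(\mathcal G\).</p>

```python
import math


def split(x):
    """Split a flat feature list into two halves (stand-in for a channel split)."""
    half = len(x) // 2
    return x[:half], x[half:]


def residual_f(x):
    # Hypothetical residual function F; it does not need to be invertible itself.
    return [math.tanh(2.0 * v) for v in x]


def residual_g(x):
    # Hypothetical residual function G.
    return [0.5 * v * v for v in x]


def rev_block_forward(x):
    x1, x2 = split(x)
    y1 = [a + b for a, b in zip(x1, residual_f(x2))]   # y1 = x1 + F(x2)
    y2 = [a + b for a, b in zip(x2, residual_g(y1))]   # y2 = x2 + G(y1)
    return y1 + y2


def rev_block_inverse(y):
    y1, y2 = split(y)
    x2 = [a - b for a, b in zip(y2, residual_g(y1))]   # x2 = y2 - G(y1)
    x1 = [a - b for a, b in zip(y1, residual_f(x2))]   # x1 = y1 - F(x2)
    return x1 + x2


x = [0.3, -1.2, 0.7, 2.0]
y = rev_block_forward(x)
x_rec = rev_block_inverse(y)
max_err = max(abs(a - b) for a, b in zip(x, x_rec))
print(max_err)  # reconstruction error, expected to be at floating-point precision
```

<p>Note that the inverse only ever evaluates \(\mathcal F\) and \(\mathcal G\) in the forward direction, which is why the residual functions themselves can be arbitrary.</p>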
<p>Similarly to <code class="language-plaintext highlighter-rouge">ResNet</code>, the final <code class="language-plaintext highlighter-rouge">RevNet</code> architecture is composed of these invertible residual blocks, as well as non-reversible subsampling operations (e.g., pooling) for which activations have to be stored. However the number of such operations is much smaller than the number of residual blocks in a typical <code class="language-plaintext highlighter-rouge">ResNet</code> architecture.</p>
<h3 id="backpropagation">Backpropagation</h3>
<p>The backpropagation algorithm is derived from the chain rule and is used to compute the total gradients of the loss with respect to the parameters in a neural network: given a loss function \(L\), we want to compute <em>the gradients of \(L\) with respect to the parameters of each layer</em>, indexed by \(n \in [1, N]\), i.e., the quantities \(\overline{\theta_{n}} = \partial L /\ \partial \theta_n\) (where \(\forall x, \bar{x} = \partial L / \partial x\)).
We roughly summarize the algorithm in the left column of <strong>Table 1</strong>: in order to compute the gradients for the \(n\)-th block, backpropagation requires the input and output activations of this block, \(y_{n - 1}\) and \(y_{n}\), which have been stored, and the derivative of the loss with respect to the output, \(\overline{y_{n}}\), which has been computed in the backpropagation iteration of the upper layer, hence the name <em>backpropagation</em>.</p>
<p>Since <em>activations are not stored in <code class="language-plaintext highlighter-rouge">RevNet</code></em>, the algorithm needs to be slightly modified, which we describe in the right column of <strong>Table 1</strong>. In summary, we first need to recover the input activations of the <code class="language-plaintext highlighter-rouge">RevNet</code> block using its invertibility. These activations will be propagated to the earlier layers for further backpropagation. Secondly, we need to compute the gradients of the loss with respect to the inputs, i.e. \(\overline{y_{n - 1}} = (\overline{y_{n -1, 1}}, \overline{y_{n - 1, 2}})\), using the fact that:</p>
\[\begin{align}
\overline{y_{n - 1, i}} = \overline{y_{n, 1}}\ \frac{\partial y_{n, 1}}{\partial y_{n - 1, i}} + \overline{y_{n, 2}}\ \frac{\partial y_{n, 2}}{\partial y_{n - 1, i}}
\end{align}\]
<p>Once again, this result will be propagated further down the network.
Finally, once we have computed both these quantities we can obtain the gradients with respect to the parameters of this block, \(\theta_n\).</p>
<center>
<table>
<tr>
<th> </th>
<th> <b>ResNet Architecture</b></th>
<th> <b>RevNet Architecture</b></th>
</tr>
<tr>
<td><b>Format of a Block</b></td>
<td> $$
y_{n} = y_{n - 1} + \mathcal F(y_{n - 1})
$$</td>
<td>$$
\begin{align}
y_{n - 1, 1}, y_{n - 1, 2} &= \mbox{split}(y_{n - 1})\\
y_{n, 1} &= y_{n - 1, 1} + \mathcal{F}(y_{n - 1, 2})\\
y_{n, 2} &= y_{n - 1, 2} + \mathcal{G}(y_{n, 1})\\
y_{n} &= (y_{n, 1}, y_{n, 2})
\end{align}
$$</td>
</tr>
<tr>
<td><b>Parameters</b></td>
<td>$$
\begin{align}
\theta = \theta_{\mathcal F}
\end{align}
$$</td>
<td> $$\begin{align}
\theta = (\theta_{\mathcal F}, \theta_{\mathcal G})
\end{align}
$$</td>
</tr>
<tr>
<td><b>Backpropagation</b></td>
<td>$$\begin{align}
&\mathbf{in:}\ y_{n - 1}, y_{n}, \overline{ y_{n}}\\
\overline{\theta_n} &=\overline{y_n} \frac{\partial y_n}{\partial \theta_n}\\
\overline{y_{n - 1}} &= \overline{y_{n}}\ \frac{\partial y_{n}}{\partial y_{n-1}} \\
&\mathbf{out:}\ \overline{\theta_n}, \overline{y_{n -1}}
\end{align}$$</td>
<td>$$\begin{align}
&\mathbf{in:}\ y_{n}, \overline{y_{n }}\\
\texttt{# recover}& \texttt{ input activations} \\
y_{n, 1}, y_{n, 2} &= \mbox{split}(y_{n})\\
y_{n - 1, 2} &= y_{n, 2} - \mathcal{G}(y_{n, 1})\\
y_{n - 1, 1} &= y_{n, 1} - \mathcal{F}(y_{n - 1, 2})\\
\texttt{# compute}& \texttt{ gradients wrt. inputs} \\
\overline{y_{n -1, 1}} &= \overline{y_{n, 1}} + \overline{y_{n,2}} \frac{\partial \mathcal G}{\partial y_{n,1}} \\
\overline{y_{n -1, 2}} &= \overline{y_{n, 1}} \frac{\partial \mathcal F}{\partial y_{n - 1,2}} + \overline{y_{n,2}} \left(1 + \frac{\partial \mathcal F}{\partial y_{n - 1,2}} \frac{\partial \mathcal G}{\partial y_{n,1}} \right) \\
\texttt{# compute}& \texttt{ gradients wrt. parameters} \\
\overline{\theta_{n, \mathcal G}} &= \overline{y_{n, 2}} \frac{\partial \mathcal G}{\partial \theta_{n, \mathcal G}}\\
\overline{\theta_{n, \mathcal F}} &= \overline{y_{n,1}} \frac{\partial \mathcal F}{\partial \theta_{n, \mathcal F}} + \overline{y_{n, 2}} \frac{\partial \mathcal F}{\partial \theta_{n, \mathcal F}} \frac{\partial \mathcal G}{\partial y_{n,1}}\\
&\mathbf{out:}\ \overline{\theta_{n}}, \overline{y_{n -1}}, y_{n - 1}
\end{align}$$ </td>
</tr>
</table>
<b>Table 1:</b> Backpropagation in the standard case and for Reversible blocks
</center>
<p><br /></p>
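<p>To make the right column of <strong>Table 1</strong> concrete, here is a minimal scalar sketch of the reversible backward pass, checked against finite differences. Everything specific in it is an assumption made for illustration: the residuals are taken to be \(\mathcal F(x) = \tanh(w_f x)\) and \(\mathcal G(x) = \tanh(w_g x)\), the loss is \(\frac{1}{2}(y_1^2 + y_2^2)\), and the function names are ours.</p>

```python
import math


def f(x, w):
    return math.tanh(w * x)   # hypothetical residual F


def g(x, w):
    return math.tanh(w * x)   # hypothetical residual G


def dtanh(u):
    return 1.0 - math.tanh(u) ** 2


def forward(x1, x2, wf, wg):
    y1 = x1 + f(x2, wf)
    y2 = x2 + g(y1, wg)
    return y1, y2


def reversible_backward(y1, y2, y1_bar, y2_bar, wf, wg):
    # 1. Recover the input activations from the outputs (nothing was stored).
    x2 = y2 - g(y1, wg)
    x1 = y1 - f(x2, wf)
    # Local derivatives of F and G, evaluated at the recovered activations.
    dG_dy1 = wg * dtanh(wg * y1)
    dF_dx2 = wf * dtanh(wf * x2)
    dG_dwg = y1 * dtanh(wg * y1)
    dF_dwf = x2 * dtanh(wf * x2)
    # 2. Gradients with respect to the inputs.
    x1_bar = y1_bar + y2_bar * dG_dy1
    x2_bar = y1_bar * dF_dx2 + y2_bar * (1.0 + dF_dx2 * dG_dy1)
    # 3. Gradients with respect to the parameters.
    wg_bar = y2_bar * dG_dwg
    wf_bar = y1_bar * dF_dwf + y2_bar * dF_dwf * dG_dy1
    return (x1, x2), (x1_bar, x2_bar), (wf_bar, wg_bar)


def total_loss(x1, x2, wf, wg):
    y1, y2 = forward(x1, x2, wf, wg)
    return 0.5 * (y1 ** 2 + y2 ** 2)


x1, x2, wf, wg = 0.4, -0.9, 1.3, 0.7
y1, y2 = forward(x1, x2, wf, wg)
# For this loss, the gradients with respect to the outputs are simply (y1, y2).
inputs, input_grads, param_grads = reversible_backward(y1, y2, y1, y2, wf, wg)

# Central finite differences as a reference.
eps = 1e-6
fd_x1 = (total_loss(x1 + eps, x2, wf, wg) - total_loss(x1 - eps, x2, wf, wg)) / (2 * eps)
fd_x2 = (total_loss(x1, x2 + eps, wf, wg) - total_loss(x1, x2 - eps, wf, wg)) / (2 * eps)
fd_wf = (total_loss(x1, x2, wf + eps, wg) - total_loss(x1, x2, wf - eps, wg)) / (2 * eps)
fd_wg = (total_loss(x1, x2, wf, wg + eps) - total_loss(x1, x2, wf, wg - eps)) / (2 * eps)
print(input_grads, (fd_x1, fd_x2))
print(param_grads, (fd_wf, fd_wg))
```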
<h3 id="computational-efficiency">Computational Efficiency</h3>
<p><code class="language-plaintext highlighter-rouge">RevNet</code>s <em>trade memory for computation</em> by avoiding storing activations. Compared to other methods that focus on reducing memory requirements in deep networks, <code class="language-plaintext highlighter-rouge">RevNet</code> provides the best trade-off: since no activations have to be stored for the reversible blocks, the memory cost of activations is \(O(1)\) in the number of layers, while the computational complexity remains linear in the number of layers, i.e. \(O(L)\).
One disadvantage is that <code class="language-plaintext highlighter-rouge">RevNet</code>s introduce <em>additional parameters</em>, as each block is composed of two residual functions, \(\mathcal F\) and \(\mathcal G\), each of which operates on half of the channels since the input is first split into two.</p>
<hr />
<h2 class="section experiments"> Experiments </h2>
<p>In the experiments section, the authors compare <code class="language-plaintext highlighter-rouge">ResNet</code> architectures to their <code class="language-plaintext highlighter-rouge">RevNet</code> “counterparts”: they build a <code class="language-plaintext highlighter-rouge">RevNet</code> with roughly the same number of parameters by halving the number of residual units and doubling the number of channels.</p>
<p>Interestingly, <code class="language-plaintext highlighter-rouge">RevNets</code> achieve <em>similar performance</em> to their <code class="language-plaintext highlighter-rouge">ResNet</code> counterparts, both in terms of final accuracy and in terms of training dynamics. The authors also analyze the impact of floating-point errors that might occur when reconstructing activations rather than storing them; these errors appear to be of small magnitude and do not seem to negatively impact the model.
To summarize, reversible networks seem like a very promising direction to efficiently train very deep networks under memory budget constraints.</p>
<hr />
<h2 class="section references"> References </h2>
<ul>
<li><span class="citations">[1]</span> NICE: Non-linear Independent Components Estimation, <i>Dinh et al., ICLR 2015</i></li>
<li><span class="citations">[2]</span> Density estimation using Real NVP, <i>Dinh et al., ICLR 2017</i></li>
<li><span class="citations">[3]</span> Deep Residual Learning for Image Recognition, <i>He et al., CVPR 2016</i></li>
<li><span class="citations">[4]</span> \(i\)RevNet: Deep Invertible Networks, <i>Jacobsen et al., ICLR 2018</i></li>
</ul>Gomez et al.Residual Networks (ResNet) [3] have greatly advanced the state-of-the-art in Deep Learning by making it possible to train much deeper networks via the addition of skip connections. However, in order to compute gradients during the backpropagation pass, all the units' activations have to be stored during the feed-forward pass, leading to high memory requirements for these very deep networks. Instead, the authors propose a reversible architecture in which activations at one layer can be computed from the ones of the next. Leveraging this invertibility property, they design a more efficient implementation of backpropagation, effectively trading compute power for memory storage.Conditional Neural Processes2019-05-06T14:00:00+02:002019-05-06T14:00:00+02:00https://ameroyer.github.io/structured%20learning/conditional_neural_processes<div class="summary">
<b>Gaussian Processes</b> are models that consider a <b>family of functions</b> (typically under a Gaussian distribution) and aim to quickly fit one of these functions at test time based on some observations. In that sense, they are orthogonal to Neural Networks, which instead aim to learn a single function from a large training set in the hope that it generalizes well to any new unseen test input. This work is an attempt at bridging both approaches.
<ul>
<li><span class="pros">Pros (+):</span> Novel and well justified, wide range of applications.</li>
<li><span class="cons">Cons (-):</span> Not clear how easy the method is to put into practice, e.g. its dependence on initialization.</li>
</ul>
</div>
<h2 class="section proposed"> Proposed Model</h2>
<h3 id="statistical-background">Statistical Background</h3>
<p>In the Conditional Neural Processes (<code class="language-plaintext highlighter-rouge">CNP</code>) setting, we are given \(n\) labeled points, called <em>observations</em> \(O = \{(x_i, y_i)\}_{i=1}^n\), and another set of \(m\) unlabeled <strong><em>targets</em></strong> \(T = \{x_i\}_{i=n + 1}^{n + m}\). We assume
that the outputs are a realization of the following process: Given \(\mathcal P\) a distribution over functions in \(X \rightarrow Y\), sample \(f \sim \mathcal P\), and set \(y_i = f(x_i)\) for all \(x_i\) in the targets set.</p>
<p>The goal is to learn a prediction model for the output samples while trying to obtain the same flexibility as Gaussian processes, rather than using the standard supervised learning paradigm of deep neural networks.
The main drawback of standard Gaussian Processes is that they do not scale well: inference costs \(O((n + m)^3)\).</p>
<h3 id="conditional-neural-processes">Conditional Neural Processes</h3>
<p><code class="language-plaintext highlighter-rouge">CNP</code>s give up on the theoretical guarantees of the Gaussian Process framework in exchange for more flexibility. In particular, observations are encoded in a representation of <em>fixed dimension</em>, independent of \(n\).</p>
\[\begin{align}
r_i &= h(x_i, y_i), \forall i \in [1, n]\\
r &= r_1 \oplus \dots \oplus r_n\\
\phi_i &= g_{\theta}(x_i, r), \forall i \in [n + 1, n + m]\\
y_i &\sim Q(f(x_i)\ |\ \phi_i)
\end{align}\]
<p>In other words, we first encode each observation and combine these embeddings via an operator \(\oplus\) to obtain a fixed representation of the observations, \(r\). Then for each new target \(x_i\) we obtain parameters \(\phi_i\) conditioned on \(r\) and \(x_i\), which determine the stochastic process to draw outputs from.</p>
<div class="figure">
<img src="/images/posts/conditional_neural_processes.png" />
<p><b>Figure:</b> Conditional Neural Processes Architecture.</p>
</div>
<p>In practice, \(\oplus\) is taken to be the mean operation, i.e., \(r\) is the average of the \(r_i\)s over all observations. For regression tasks, \(Q\) is a Gaussian distribution parameterized by its mean and variance, \(\phi_i = (\mu_i, \sigma_i^2)\). For classification tasks, \(\phi_i\) simply encodes a discrete distribution over classes.</p>
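<p>The encode, aggregate, decode pipeline can be sketched as follows for the regression case. The functions <code>encode</code> and <code>decode</code> are hypothetical hand-written stand-ins for the learned networks \(h\) and \(g_{\theta}\); only the mean aggregation reflects the choice described above. The sketch mainly illustrates that \(r\) has a fixed size regardless of \(n\) and does not depend on the order of the observations.</p>

```python
import math


def encode(x, y):
    """Hypothetical encoder h: maps one (x, y) pair to a fixed-size embedding r_i."""
    return [x, y, x * y, math.sin(x)]


def aggregate(embeddings):
    """Mean aggregation: r is the element-wise average of the r_i."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def decode(x_target, r):
    """Hypothetical decoder g: returns (mean, variance) of a Gaussian for one target."""
    mean = r[1] + r[2] * x_target
    variance = math.exp(r[3] - abs(x_target))  # the exponential keeps the variance positive
    return mean, variance


observations = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
targets = [0.5, 1.5]

r = aggregate([encode(x, y) for x, y in observations])
predictions = [decode(x, r) for x in targets]

# Same observations in a different order give the same representation.
r_shuffled = aggregate([encode(x, y) for x, y in reversed(observations)])
print(r, predictions)
```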
<h3 id="training">Training</h3>
<p>Given a function \(f \sim P\) and its observations \(\{(x_i, y_i = f(x_i))\}_{i=1}^n\), we sample \(N\) uniformly in \([1, \dots, n]\) and train the model to predict labels for the whole observation set, conditioned only on the subset \(\{(x_i, y_i)\}_{i=1}^N\), by minimizing the negative log likelihood:</p>
\[\begin{align}
\mathcal{L}(\theta) = - \mathbb{E}_{f \sim P} \mathbb{E}_N \left( \log Q_{\theta}(\{y_i\}_{i=1}^n\ |\ \{(x_i, y_i)\}_{i=1}^N, \{x_i\}_{i=1}^n) \right)
\end{align}\]
<p><strong>Note:</strong> The sampling step \(f \sim P\) is not clear, but it seems to model the stochasticity in output \(y\) given \(x\).</p>
<p>The model scales with \(O(n + m)\), i.e., linear time, which is much better than the cubic rate of Gaussian Processes.</p>
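<p>One training step of this objective could look like the sketch below, for a Gaussian \(Q\). It assumes that \(Q\) factorizes over points, so that the log likelihood of the set is a sum of per-point terms. The <code>toy_predict</code> function is a hypothetical placeholder for the full model (it just predicts the mean of the context labels with unit variance), and the points are assumed to be already shuffled so that taking the first \(N\) yields a random subset.</p>

```python
import math
import random


def gaussian_nll(y, mean, variance):
    """Negative log likelihood of y under a Gaussian N(mean, variance)."""
    return 0.5 * (math.log(2.0 * math.pi * variance) + (y - mean) ** 2 / variance)


def cnp_loss(points, predict, rng):
    """points: list of (x, y) pairs; predict(context, x) returns (mean, variance)."""
    n = len(points)
    num_context = rng.randint(1, n)    # N sampled uniformly in [1, n], both ends included
    context = points[:num_context]     # condition only on the first N observations
    # Score all n points, context included, as in the objective above.
    return sum(gaussian_nll(y, *predict(context, x)) for x, y in points)


def toy_predict(context, x):
    # Hypothetical stand-in for the model: context mean as prediction, unit variance.
    mean = sum(y for _, y in context) / len(context)
    return mean, 1.0


points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
rng = random.Random(0)
value = cnp_loss(points, toy_predict, rng)
print(value)
```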
<hr />
<h2 class="section experiments"> Experiments and Applications </h2>
<ul>
<li>
<p><strong>1D regression.</strong> We generate a dataset that consists of functions generated from a GP with an exponential kernel. At every training step we sample a curve from the GP (\(f \sim P\)), select a subset of \(n\) points \((x_i, y_i)\) as observations, and a subset of points \((x_t, y_t)\) as target points. The output distribution on the target labels is parameterized as a Gaussian whose mean and variance are output by \(g\).</p>
</li>
<li>
<p><strong>Image completion.</strong> \(f\) is a function that maps a pixel coordinate to an RGB triple. At each iteration, an image from the training set is chosen, a subset of its pixels is selected, and the aim is to predict the RGB values of the remaining pixels conditioned on those. In particular, it is interesting to see that <em><code class="language-plaintext highlighter-rouge">CNP</code> allows for flexible conditioning patterns</em>, contrary to other conditional generative models which are often constrained by either the architecture (e.g. <code class="language-plaintext highlighter-rouge">PixelCNN</code>) or the training scheme.</p>
</li>
<li>
<p><strong>Few shot classification.</strong> Consider a dataset with many classes but only few examples per class (e.g. <code class="language-plaintext highlighter-rouge">Omniglot</code>). In this set of experiments, the model is trained to predict labels for samples in a select subset of classes, while being conditioned on all remaining samples in the dataset.</p>
</li>
</ul>Garnelo et al.Gaussian Processes are models that consider a family of functions (typically under a Gaussian distribution) and aim to quickly fit one of these functions at test time based on some observations. In that sense there are orthogonal to Neural Networks which instead aim to learn one function based on a large training set and hoping it generalizes well on any new unseen test input. This work is an attempt at bridging both approaches.Deep Visual Analogy Making2019-05-06T12:40:24+02:002019-05-06T12:40:24+02:00https://ameroyer.github.io/visual%20reasoning/deep_visual_analogy_making<div class="summary">
In this paper, the authors propose to learn <b>visual analogies</b> akin to the semantic and syntactic analogies naturally emerging in the <code>Word2Vec</code> embedding <span class="citations">[1]</span>: More specifically, they tackle the joint task of inferring a transformation from a given (source, target) pair, and applying the same relation to a new source image.
<ul>
<li><span class="pros">Pros (+):</span> Intuitive formulation; Introduces two datasets for the visual analogy task.</li>
<li><span class="cons">Cons (-):</span> Only consider "local" changes, i.e. geometric transformations or single attribute modifications, and rather clean images (e.g., no background).</li>
</ul>
</div>
<h2 class="section proposed"> Proposed Model</h2>
<p><strong>Definition:</strong> <i>Informally, a visual analogy, denoted by “<strong>a:b :: c:d</strong>”, means that the entity <strong>a</strong> is to <strong>b</strong> what the entity <strong>c</strong> is to <strong>d</strong>. This paper focuses on the problem of generating image <strong>d</strong> after inferring the relation <strong>a:b</strong> and given a source image <strong>c</strong>.</i></p>
<p>The authors propose to use an encoder-decoder based model for generation and to model analogies as <em>simple transformations of the latent space</em>, for instance addition between vectors, as is the case in word embeddings such as <code class="language-plaintext highlighter-rouge">GloVe</code> <span class="citations">[2]</span> or <code class="language-plaintext highlighter-rouge">Word2Vec</code> <span class="citations">[1]</span>.</p>
<h3 id="learning-to-generate-analogies-via-manipulation-of-the-embedding-space">Learning to generate analogies via manipulation of the embedding space</h3>
<p><strong>Additive objective.</strong> Let \(f\) denote the encoder that maps images to the latent space \(\mathbb{R}^K\) and \(g\) the decoder. The first, most straightforward, objective the authors consider is to model analogies as additions in the latent space:</p>
\[\begin{align}
\mathcal L_{\mbox{add}}(c, d; a, b) = \|d - g \left(f(c) + f(b) - f(a) \right)\|^2
\end{align}\]
<p>One disadvantage of this <em>purely linear transformation</em> is that it cannot learn complex structures such as periodic transformations: For instance, if <strong>a:b</strong> is a rotation, then after applying the analogy repeatedly for a full turn, the embedding of the decoded image should ideally come back to \(f(c)\), which is not possible as we keep adding the non-zero vector \(f(b) - f(a)\). To capture more complex transformations of the latent space, the authors introduce two variants of the previous objective.</p>
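<p>The limitation of the additive transformation on periodic changes can be seen on a toy example. The 2D embeddings below are made up for illustration: \(f(a)\) and \(f(b)\) are placed a quarter turn apart on the unit circle, and the analogy is applied four times starting from \(f(c) = f(a)\). A true rotation would return to the starting point, while the additive update drifts away along a straight line.</p>

```python
def add_analogy(f_a, f_b, f_c):
    """Additive transformation in latent space: f(c) + f(b) - f(a)."""
    return [c + (b - a) for a, b, c in zip(f_a, f_b, f_c)]


# Hypothetical 2D embeddings: b is a rotated by 90 degrees on the unit circle.
f_a = [1.0, 0.0]
f_b = [0.0, 1.0]
f_c = [1.0, 0.0]

z = f_c
trajectory = []
for _ in range(4):          # four quarter turns would make a full rotation
    z = add_analogy(f_a, f_b, z)
    trajectory.append(z)

print(trajectory)           # the last point is far from f_c
```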
<p><strong>Multiplicative objective.</strong></p>
\[\begin{align}
\mathcal L_{\mbox{mult}}(c, d; a, b) = \|d - g \left( f(c) + W \odot [f(b) - f(a)] \odot f(c) \right)\|^2
\end{align}\]
<p>where \(W \in \mathbb{R}^{K\times K\times K}\), \(K\) is the dimension of the embedding, and the three-way multiplication operator is defined as \(\forall k,\ (A \odot B \odot C)_k = \sum_{i, j} A_{ijk} B_i C_j\)</p>
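<p>The three-way product can be written directly from its definition. In the sketch below, the tensor <code>W</code> is a hand-picked hypothetical example (in the model it is learned): it is chosen so that the update is perpendicular to \(f(c)\), which shows that, unlike in the additive case, the same difference \(f(b) - f(a)\) produces a different update depending on the query embedding.</p>

```python
def three_way_product(W, b, c):
    """(W . b . c)_k = sum over i, j of W[i][j][k] * b[i] * c[j]."""
    K = len(b)
    return [
        sum(W[i][j][k] * b[i] * c[j] for i in range(K) for j in range(K))
        for k in range(K)
    ]


def mult_analogy(W, f_a, f_b, f_c):
    """Multiplicative transformation: f(c) + W . [f(b) - f(a)] . f(c)."""
    diff = [b - a for a, b in zip(f_a, f_b)]
    update = three_way_product(W, diff, f_c)
    return [c + u for c, u in zip(f_c, update)]


K = 2
# Hypothetical K x K x K tensor, zero everywhere except two entries.
W = [[[0.0] * K for _ in range(K)] for _ in range(K)]
W[0][0][1] = 1.0
W[0][1][0] = -1.0

f_a, f_b = [0.0, 0.0], [1.0, 0.0]
out_1 = mult_analogy(W, f_a, f_b, [1.0, 0.0])
out_2 = mult_analogy(W, f_a, f_b, [0.0, 1.0])
print(out_1, out_2)  # same (a, b) pair, different updates for different queries
```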
<p><strong>Deep objective.</strong></p>
\[\begin{align}
\mathcal L_{\mbox{deep}}(c, d; a, b) = \|d - g \left( f(c) + \mbox{MLP}([ f(b) - f(a); f(c)]) \right)\|^2
\end{align}\]
<p>where <code class="language-plaintext highlighter-rouge">MLP</code> is a Multi Layer Perceptron. The \([ \cdot; \cdot]\) operator denotes concatenation. This allows for very generic transformations, but can introduce a significant number of parameters for the model to train, depending on the depth of the network.</p>
<div class="figure">
<img src="/images/posts/deep_visual_analogy_1.png" />
<p><b>Figure:</b> Illustration of the network structure for analogy making. The top portion shows the encoder, transformation module, and decoder. The bottom portion illustrates each of the transformation variants. Weights are shared across all three encoder networks shown on the top left.</p>
</div>
<h3 id="regularizing-the-latent-space">Regularizing the latent space</h3>
<p>While the previous losses act at the pixel level to match the decoded image with the target image <strong>d</strong>, the authors introduce an additional regularization loss that also matches the analogy <em>at the feature level</em> with the source analogy <strong>a:b</strong>. Formally, each objective can be written in the form \(\| d - g(f(c) + T(f(b) - f(a), f(c)))\|^2\) and the corresponding regularization loss term is defined as:</p>
\[\begin{align}
R(c, d; a, b) = \|(f(d) - f(c)) - T(f(b) - f(a), f(c))\|^2
\end{align}\]
<p>where \(T\) is defined to match the chosen objective, \(\mathcal L_{\mbox{add}}\), \(\mathcal L_{\mbox{mult}}\) or \(\mathcal L_{\mbox{deep}}\). For instance, \(T: (x, y) \mapsto x\) in the additive variant.</p>
<h3 id="disentangling-the-feature-space">Disentangling the feature space</h3>
<p>The authors consider another solution to the visual analogy problem, in which they aim to learn a disentangled <em>feature space</em> that can be freely manipulated by smoothly modifying the appropriate latent variables, rather than learning a specific operation.</p>
<p>In that setting, the problem is slightly different, as we require additional supervision to control the different factors of variation. It is denoted as <strong>(a, b):s :: c</strong>: given two input images <strong>a</strong> and <strong>b</strong>, and a boolean mask <strong>s</strong> on the latent space, retrieve image <strong>c</strong> which matches the features of <strong>a</strong> according to the pattern of <strong>s</strong>, and the features of <strong>b</strong> on the remaining latent variables.</p>
<p>Let us denote by \(S\) the number of possible axes of variations (e.g., change in illumination, elevation, rotation etc) then \(s \in \{0, 1\}^S\) is a <em>one-hot block vector encoding the current transformation, called the switch vector</em>. The disentangling objective is thus</p>
\[\begin{align}
\mathcal{L}_{\mbox{dis}} = |c - g(f(a) \times s + f(b) \times (1 - s))|
\end{align}\]
<p>In other words the decoder tries to match <strong>c</strong> by exploiting separate and disentangled information from <strong>a</strong> and <strong>b</strong>. Contrary to the previous analogy objectives, only three images are needed, but it also requires <em>extra supervision</em> in the form of the switch vector <strong>s</strong> which can be hard to obtain.</p>
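<p>The feature mixing performed before decoding can be sketched as below. The layout is an assumption made for illustration: the latent vector is taken to be divided into \(S\) contiguous blocks of equal size, one per axis of variation, and the switch vector is expanded into an element-wise mask over the latent units. The helper names are ours.</p>

```python
def expand_switch(s, block_size):
    """Expand a per-axis switch vector into an element-wise mask over latent units."""
    mask = []
    for bit in s:
        mask.extend([bit] * block_size)
    return mask


def mix_features(f_a, f_b, mask):
    """Element-wise f(a) * s + f(b) * (1 - s)."""
    return [m * a + (1 - m) * b for a, b, m in zip(f_a, f_b, mask)]


# Hypothetical setup: S = 2 axes of variation (e.g. pose, identity), 2 latent units each.
f_a = [1.0, 2.0, 3.0, 4.0]
f_b = [5.0, 6.0, 7.0, 8.0]
s = [1, 0]                      # take the first axis from a, the second from b

mask = expand_switch(s, block_size=2)
mixed = mix_features(f_a, f_b, mask)
print(mixed)                    # this mixed code is what the decoder g would receive
```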
<hr />
<h2 class="section experiments"> Experiments </h2>
<p>The authors consider three main experimental settings:</p>
<ul>
<li>
<p><strong>Synthetic experiments on geometric shapes.</strong> The dataset consists of 48 × 48 images scaled to [0, 1] with 4 shapes, 8 colors, 4 scales, 5 row and column positions, and 24 rotation angles. No disentangling training was performed in this setting.</p>
</li>
<li>
<p><strong>Sprites dataset.</strong> The dataset consists of 60 × 60 color images of sprites scaled to [0, 1], with 7 attributes and 672 total unique characters. For each character, there are 5 animations each from 4 viewpoints. Each animation has between 6 and 13 frames. The data is split by characters. For the disentanglement experiments, the authors try two methods: <code class="language-plaintext highlighter-rouge">dist</code>, where they only try to separate the pose from identity (i.e., only two axes of variations), and <code class="language-plaintext highlighter-rouge">dist+cls</code>, where they actually consider all available attributes separately.</p>
</li>
<li>
<p><strong>3D Cars.</strong> For each of the 199 car models, the authors generated 64 × 64 color renderings from 24 rotation angles each offset by 15 degrees.</p>
</li>
</ul>
<p>The authors report results in terms of <em>pixel prediction error</em>. Out of the three manipulation methods, \(\mathcal{L}_{\mbox{deep}}\) usually performs best. However, qualitative samples show that \(\mathcal L_{\mbox{add}}\) and \(\mathcal L_{\mbox{mult}}\) both also perform well, although they fail for the case of rotation in the first set of experiments, which justifies the use of more complex training objectives.</p>
<p>Disentanglement methods usually outperform the other baselines, especially in <em>few-shot experiments</em>. In particular, the <code class="language-plaintext highlighter-rouge">dist+cls</code> method usually wins by a large margin, which shows that the additional supervision really helps in learning a structured representation. However, such a supervisory signal sounds hard to obtain in practice in more generic scenarios.</p>
<div class="figure">
<img src="/images/posts/deep_visual_analogy_2.png" />
<p><b>Figure 2:</b> Examples of samples from the three visual analogy datasets considered in experiments</p>
</div>
<hr />
<h2 class="section followup"> Closely Related</h2>
<h4 style="margin-bottom: 0px"> Visalogy: Answering Visual Analogy Questions <span class="citations">[3]</span></h4>
<p style="text-align: left">Sadeghi et al., <a href="https://arxiv.org/pdf/1510.08973.pdf">[link]</a></p>
<blockquote>
<p>In this paper, the authors tackle the visual analogy problem in natural images by learning a joint embedding of relations and visual appearances using a <em>Siamese architecture</em>. The main idea is to learn an embedding space where the analogy transformation can be modeled by <em>simple latent vector transformations</em>. The model consists of a Siamese quadruple architecture, where the four heads correspond to the three context images and the candidate image for the visual analogy task, respectively. They consider a <em>restricted set of analogies</em>, in particular those based on attributes or actions of animals, or geometric viewpoint changes. Given an analogy problem \(I_1 : I_2 :: I_3 : I_4\) with label \(y\) (1 if \(I_4\) fits the analogy, 0 otherwise), the model is trained with the following objective</p>
</blockquote>
\[\begin{align}
\mathcal{L}(x_{1, 2}, x_{3, 4}) = y \max (\| x_{1, 2} - x_{3, 4} \| - m_P, 0) + (1 - y) \max (m_N - \| x_{1, 2} - x_{3, 4} \|, 0)
\end{align}\]
<blockquote>
<p>where \(x_{i, j}\) refers to the embedding for the image pair \(i, j\). Intuitively, the model pulls embeddings with a similar analogy close together, and pushes others apart (up to a certain margin \(m_N\)). The \(m_P\) margin is introduced as a <em>heuristic to avoid overfitting</em>: Embeddings are only made closer if their distance is above the margin threshold \(m_P\). The pairwise embeddings are obtained by subtracting the individual image embeddings. This implies the assumption that \(x_2 = x_1 + r\), where \(r\) is the transformation from image \(I_1\) to image \(I_2\).</p>
</blockquote>
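<p>As a rough sketch, this double-margin objective can be written as a small function of the distance between the two pair embeddings. It assumes that the positive term is clipped at zero, as suggested by the description of the \(m_P\) margin; the function name and the margin values below are illustrative choices, not taken from the paper.</p>

```python
def double_margin_loss(distance, label, margin_pos, margin_neg):
    """Contrastive loss with a margin for positive pairs and one for negative pairs."""
    pull = max(distance - margin_pos, 0.0)   # positives: only pulled if farther than m_P
    push = max(margin_neg - distance, 0.0)   # negatives: only pushed if closer than m_N
    return label * pull + (1 - label) * push


m_p, m_n = 0.5, 1.0  # illustrative margins

far_positive = double_margin_loss(1.5, 1, m_p, m_n)     # penalized
close_positive = double_margin_loss(0.25, 1, m_p, m_n)  # already close enough
close_negative = double_margin_loss(0.25, 0, m_p, m_n)  # penalized
far_negative = double_margin_loss(1.5, 0, m_p, m_n)     # already far enough
print(far_positive, close_positive, close_negative, far_negative)
```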
<blockquote>
<p>The authors additionally create a <em>visual analogy dataset</em>. Generating the dataset is rather intuitive as long as we have an attribute-style representation of the domain. Typically, the analogies considered are transformations over <em>properties</em> (object, action, pose) of different <em>categories</em> (dog, cat, chair etc). As negative data points, they consider <strong>(i)</strong> fully random quadruples, or <strong>(ii)</strong> valid quadruples where one of \(I_3\) or \(I_4\) is swapped with a random image.
The evaluation is done with <em>image retrieval metrics</em>. They also consider generalization scenarios: For instance, removing the analogy \(white \rightarrow black\) during training, but keeping e.g. \(white \rightarrow red\) and \(green \rightarrow black\). There is a lack of details about the missing pairs, which makes it hard to get a full idea of the generalization ability of the model (i.e., if an analogy is missing from the training set, does that mean its reverse is missing too? Does “analogy” refer to the high-level relation, or is it also instantiated relative to the category?).</p>
</blockquote>
<hr />
<h2 class="section references"> References</h2>
<ul>
<li><span class="citations">[1]</span> Distributed representations of words and phrases and their compositionality, <i>Mikolov et al., NIPS 2013</i></li>
<li><span class="citations">[2]</span> GloVe: Global Vectors for Word Representation, <i>Pennington et al., EMNLP 2014</i></li>
<li><span class="citations">[3]</span> Visalogy: Answering Visual Analogy Questions, <i>Sadeghi et al., NeurIPS 2015</i></li>
</ul>Reed et al.In this paper, the authors propose to learn visual analogies akin to the semantic and syntactic analogies naturally emerging in the Word2Vec embedding [1]: More specifically, they tackle the joint task of inferring a transformation from a given (source, target) pair, and applying the same relation to a new source image.