D. Kingma and P. Dhariwal, NeurIPS 2018, [link]
tags: generative models  reversible networks  neurips  2018
Invertible flow-based generative models have several advantages, including exact likelihood inference (unlike VAEs or GANs) and easily parallelizable training and inference (unlike the sequential generative process in autoregressive models). This paper proposes a new, more flexible form of invertible flow for generative models, which builds on [3].
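A comparison with autoregressive models (e.g. a PixelCNN variant that allows for faster parallelized sample generation) would be nice.

Given input data $x$, invertible flow-based generative models are built as two-step processes that generate data from an intermediate latent representation $z$:

$$
z \sim p_\theta(z), \qquad x = g_\theta(z)
$$

where $g_\theta$ is an invertible function, i.e. a bijection. Its inverse, $f_\theta = g_\theta^{-1}$, acts as an encoder from the input data to the latent space, $z = f_\theta(x)$. $f_\theta$ is usually built as a sequence of smaller invertible functions, $f_\theta = f_n \circ \dots \circ f_1$. Such a sequence is also called a normalizing flow [1]. Under this construction, the change of variables formula applied to $z = f_\theta(x)$ gives the following equivalence between the input and latent densities:

$$
\log p_\theta(x) = \log p_\theta(z) + \log \left| \det \frac{\partial z}{\partial x} \right| = \log p_\theta(z) + \sum_{i=1}^{n} \log \left| \det \frac{\partial h_i}{\partial h_{i-1}} \right|
$$

where $h_i = f_i(h_{i-1})$. In particular, this means $h_0 = x$ and $h_n = z$. $p_\theta(z)$ is usually chosen as a simple density such as a unit Gaussian distribution, $p_\theta(z) = \mathcal{N}(z; 0, I)$. In order to efficiently estimate the likelihood, the functions $f_i$ are usually chosen such that the log-determinant of the Jacobian, $\log \left| \det (\partial h_i / \partial h_{i-1}) \right|$, is easily computed, for instance by choosing transformations such that the Jacobian is a triangular matrix.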
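As a sanity check on the change of variables formula, here is a minimal sketch using a toy elementwise affine flow, $z = (x - \text{shift}) / \text{scale}$. This flow is not part of GLOW, and the names `affine_flow_logpdf`, `shift` and `scale` are made up for illustration. Its Jacobian is diagonal, so the log-determinant reduces to a sum of logs.

```python
import math

def standard_normal_logpdf(z):
    """Log-density of a unit Gaussian, summed over all dimensions."""
    return sum(-0.5 * (zi * zi + math.log(2.0 * math.pi)) for zi in z)

def affine_flow_logpdf(x, shift, scale):
    """log p(x) for the toy flow z = (x - shift) / scale (elementwise).

    The Jacobian dz/dx is diagonal with entries 1 / scale_i, so its
    log-determinant is -sum(log|scale_i|).
    """
    z = [(xi - m) / s for xi, m, s in zip(x, shift, scale)]
    log_det = -sum(math.log(abs(s)) for s in scale)
    return standard_normal_logpdf(z) + log_det

# Illustrative values only.
x = [0.3, -1.2]
shift = [0.5, 0.0]
scale = [2.0, 0.5]
log_px = affine_flow_logpdf(x, shift, scale)
print(log_px)
```

For this particular flow the result coincides with the log-density of a Gaussian with mean `shift` and standard deviation `scale`, which gives an easy way to verify the formula.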
Each flow step function $f_i$ is a sequence of three operations as follows. Given an input tensor $x$ of dimensions $h \times w \times c$:
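| Step description | Functional form of the flow | Inverse function of the flow | Log-determinant expression |
|---|---|---|---|
| ActNorm | $y = \sigma \odot x + \mu$ | $x = (y - \mu) / \sigma$ | $h \cdot w \cdot \sum_i \log \lvert \sigma_i \rvert$ |
| 1x1 conv | $y_{i,j} = W x_{i,j}$ | $x_{i,j} = W^{-1} y_{i,j}$ | $h \cdot w \cdot \log \lvert \det W \rvert$ |
| Affine Coupling (ACL) [2] | $x_a, x_b = \mathrm{split}(x)$, $(\log s, t) = \mathrm{NN}(x_b)$, $y = \mathrm{concat}(s \odot x_a + t, x_b)$ | $y_a, y_b = \mathrm{split}(y)$, $(\log s, t) = \mathrm{NN}(y_b)$, $x = \mathrm{concat}((y_a - t) / s, y_b)$ | $\sum_i \log \lvert s_i \rvert$ |
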
ActNorm. The activation normalization layer is introduced as a replacement for Batch Normalization (BN) to avoid degraded performance with small minibatch sizes, e.g. when training with batch size 1. This layer has the same form as BN; however, the bias, $\mu$, and scale, $\sigma$, are data-independent variables: they are initialized based on an initial minibatch of data (data-dependent initialization), but are then optimized during training with the rest of the parameters, rather than estimated from the input minibatch statistics.
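A minimal NumPy sketch of such a layer, assuming channel-last tensors of shape `(N, H, W, C)` and the form $y = \sigma \odot x + \mu$. The class and method names are hypothetical, and in a real model `mu` and `sigma` would be trainable parameters updated by the optimizer after initialization.

```python
import numpy as np

class ActNorm:
    """Per-channel affine layer y = sigma * x + mu on (N, H, W, C) tensors."""

    def __init__(self, num_channels):
        self.mu = np.zeros(num_channels)
        self.sigma = np.ones(num_channels)

    def initialize(self, batch, eps=1e-6):
        # Data-dependent initialization: pick mu and sigma so that the output
        # for this first minibatch has zero mean and unit variance per channel.
        mean = batch.mean(axis=(0, 1, 2))
        std = batch.std(axis=(0, 1, 2))
        self.sigma = 1.0 / (std + eps)
        self.mu = -mean * self.sigma

    def forward(self, x):
        h, w = x.shape[1], x.shape[2]
        y = self.sigma * x + self.mu
        # Same scale at every spatial position, hence the h * w factor.
        log_det = h * w * np.sum(np.log(np.abs(self.sigma)))
        return y, log_det

    def inverse(self, y):
        return (y - self.mu) / self.sigma

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 4, 3))
layer = ActNorm(num_channels=3)
layer.initialize(batch)
y, log_det = layer.forward(batch)
```

After `initialize`, the layer no longer looks at batch statistics, which is why it behaves the same for any minibatch size.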
1x1 convolution. This is a simple 1x1 convolutional layer with a $c \times c$ weight matrix $W$. In particular, the cost of computing the determinant of $W$ can be reduced by writing $W$ in its LU decomposition, although this increases the number of parameters to be learned.
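The sketch below illustrates both points with NumPy on a single channel-last `(H, W, C)` tensor. It assumes the factorization $W = P L (U + \mathrm{diag}(s))$, with $P$ a fixed permutation, $L$ unit lower triangular and $U$ strictly upper triangular, which is how I recall the paper parameterizing the LU variant; the function names are hypothetical.

```python
import numpy as np

def conv1x1_forward(x, W):
    """Apply the same c x c matrix W to the channel vector of every pixel."""
    h, w, _ = x.shape
    y = x @ W.T  # y[i, j] = W @ x[i, j]
    log_det = h * w * np.linalg.slogdet(W)[1]
    return y, log_det

def conv1x1_inverse(y, W):
    return y @ np.linalg.inv(W).T

rng = np.random.default_rng(0)
c = 4

# LU-style parameterization (assumed form): W = P @ L @ (U + diag(s)).
P = np.eye(c)[rng.permutation(c)]                       # fixed permutation
L = np.tril(rng.normal(size=(c, c)), k=-1) + np.eye(c)  # unit lower triangular
U = np.triu(rng.normal(size=(c, c)), k=1)               # strictly upper triangular
s = rng.uniform(0.5, 2.0, size=c)                       # diagonal entries
W = P @ L @ (U + np.diag(s))

# |det P| = 1 and det L = 1, so log|det W| is just a sum over the diagonal s.
cheap_log_det = np.sum(np.log(np.abs(s)))

x = rng.normal(size=(5, 5, c))
y, log_det = conv1x1_forward(x, W)
```

With this parameterization the log-determinant costs $O(c)$ instead of the $O(c^3)$ of a generic determinant.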
Affine Coupling Layer. The ACL was introduced in [2]. The input tensor $x$ is first split in half along the channel dimension into $x_a$ and $x_b$. The second half, $x_b$, is fed through a small neural network to get parameters $s$ and $t$, and the corresponding affine transformation is applied to the first half, $x_a$.
The rescaled $x_a$ is the actual transformed output of the layer; however, $x_b$ also has to be propagated in order to make the transformation invertible, such that $s$ and $t$ can also be estimated in the reverse flow.
Finally, note that the previous 1x1 convolution can be seen as a generalized permutation of the input channels, and guarantees that different channel combinations are seen during the split operation.
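A minimal NumPy sketch of the coupling layer and its inverse. The network `toy_nn` is a hypothetical two-layer stand-in for NN (the paper uses a small convolutional network), and the scale is taken as `exp(log_s)` here, which is an assumption made for simplicity.

```python
import numpy as np

def toy_nn(x_b, params):
    """Hypothetical stand-in for NN; it does not need to be invertible."""
    hidden = np.tanh(x_b @ params["w1"])
    out = hidden @ params["w2"]
    log_s, t = np.split(out, 2, axis=-1)
    return log_s, t

def coupling_forward(x, params):
    x_a, x_b = np.split(x, 2, axis=-1)  # split along the channel dimension
    log_s, t = toy_nn(x_b, params)
    y_a = np.exp(log_s) * x_a + t       # affine transform of the first half
    log_det = np.sum(log_s)             # Jacobian is triangular
    return np.concatenate([y_a, x_b], axis=-1), log_det

def coupling_inverse(y, params):
    y_a, y_b = np.split(y, 2, axis=-1)
    log_s, t = toy_nn(y_b, params)      # y_b equals x_b, so same parameters
    x_a = (y_a - t) * np.exp(-log_s)
    return np.concatenate([x_a, y_b], axis=-1)

rng = np.random.default_rng(0)
c = 6
params = {
    "w1": rng.normal(size=(c // 2, 16)),
    "w2": 0.1 * rng.normal(size=(16, c)),
}
x = rng.normal(size=(4, 4, c))
y, log_det = coupling_forward(x, params)
x_rec = coupling_inverse(y, params)
```

Because the second half passes through unchanged, the inverse can recompute exactly the same scale and translation without ever inverting the network itself.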
These operations are then combined in a multi-scale architecture as described in [3], which in particular relies on a squeezing operation to trade off spatial resolution for number of output channels. Given an input tensor of size $h \times w \times c$, the squeezing operator takes blocks of size $2 \times 2 \times c$ and flattens them to size $1 \times 1 \times 4c$, which can easily be inverted by reshaping. The final pipeline consists of $L$ levels that operate on different scales: each level is composed of $K$ flow steps and a final squeezing operation.
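The squeezing operation can be sketched with reshapes and a transpose. This assumes a single channel-last `(H, W, C)` tensor with even height and width; the function names are hypothetical.

```python
import numpy as np

def squeeze(x, factor=2):
    """Turn each factor x factor spatial block into extra channels."""
    h, w, c = x.shape
    x = x.reshape(h // factor, factor, w // factor, factor, c)
    x = x.transpose(0, 2, 1, 3, 4)  # bring the two block axes together
    return x.reshape(h // factor, w // factor, factor * factor * c)

def unsqueeze(y, factor=2):
    """Exact inverse of squeeze."""
    h, w, c = y.shape
    c_out = c // (factor * factor)
    y = y.reshape(h, w, factor, factor, c_out)
    y = y.transpose(0, 2, 1, 3, 4)
    return y.reshape(h * factor, w * factor, c_out)

x = np.arange(4 * 4 * 3).reshape(4, 4, 3)
y = squeeze(x)
print(y.shape)
```

Since the operation only moves values around, its Jacobian is a permutation matrix and it contributes nothing to the log-determinant.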
Figure: Overview of the multilayer GLOW architecture.
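In summary, the main differences with [3] are the ActNorm layer, which replaces Batch Normalization, and the 1x1 convolution, which replaces fixed channel permutations.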
In practice, the authors implement NN as a convolutional neural network of depth 3 in the ACL, which means that each flow step contains 4 convolutions in total. They also use $K = 32$ flow steps in each level. Finally, the number of levels $L$ is 3 for small-scale experiments (32x32 images) and 6 for large-scale ones (256x256 ImageNet images).
In particular, this means that the model contains a lot of parameters ($L \times K \times 4$ convolutions), which might be a practical disadvantage compared to other methods that produce samples of similar quality, e.g. GANs. However, contrary to these models, GLOW provides exact likelihood inference.
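To give a sense of scale, here is a quick count of convolutions, assuming 4 convolutions per flow step (three in NN plus the 1x1 convolution) and $K = 32$ steps per level. The value of $K$ is the one I recall from the paper, so treat it as an assumption; the helper name is hypothetical.

```python
def convs_per_model(num_levels, steps_per_level, convs_per_step=4):
    """Total number of convolutions: L levels x K steps x convs per step."""
    return num_levels * steps_per_level * convs_per_step

K = 32  # assumed number of flow steps per level
convs_small = convs_per_model(num_levels=3, steps_per_level=K)  # 384
convs_large = convs_per_model(num_levels=6, steps_per_level=K)  # 768
print(convs_small, convs_large)
```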
GLOW outperforms RealNVP [3] in terms of data likelihood, as evaluated on standard benchmarks (ImageNet, CIFAR-10, LSUN). In particular, the 1x1 convolutions perform better than other, more specific permutation operations, and only introduce a small computational overhead.
Qualitatively, the samples are of great quality and the model seems to scale well to higher resolutions. However, this greatly increases the memory requirements. Leveraging the model's invertibility to avoid storing activations during the feedforward pass, as in [4], could (partially) alleviate the problem.