D. Kingma and P. Dhariwal, NeurIPS 2018, [link]
tags: generative models - reversible networks - neurips - 2018
Invertible flow-based generative models offer exact likelihood inference (unlike GANs) and easily parallelizable training and inference (unlike the sequential generative process in auto-regressive models). This paper proposes a new, more flexible form of invertible flow for generative models, which builds on RealNVP.
PixelCNN that allows for faster parallelized sample generation) would be nice.
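Given input data $x$, invertible flow-based generative models are built as two-step processes that generate data from an intermediate latent representation $z$:
$$z \sim p_\theta(z), \qquad x = g_\theta(z)$$
where $g_\theta$ is an invertible function, i.e. a bijection, $g_\theta: \mathcal{Z} \rightarrow \mathcal{X}$. Its inverse, $f_\theta = g_\theta^{-1}$, acts as an encoder from the input data to the latent space. $f_\theta$ is usually built as a sequence of smaller invertible functions, $f_\theta = f_K \circ \dots \circ f_1$. Such a sequence is also called a normalizing flow. Under this construction, the change of variables formula applied to $z = f_\theta(x)$ gives the following equivalence between the input and latent densities:
$$\log p_\theta(x) = \log p_\theta(z) + \sum_{i=1}^{K} \log \left\lvert \det \frac{\partial h_i}{\partial h_{i-1}} \right\rvert$$
where $h_i = f_i(h_{i-1})$. In particular, this means $h_0 = x$ and $h_K = z$. $p_\theta(z)$ is usually chosen as a simple density such as a unit Gaussian distribution, $p_\theta(z) = \mathcal{N}(z; 0, I)$. In order to efficiently estimate the likelihood, the functions $f_i$ are usually chosen such that the log-determinant of the Jacobian, $\log \lvert \det (\partial h_i / \partial h_{i-1}) \rvert$, is easily computed, for instance by choosing transformations such that the Jacobian is a triangular matrix.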
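
As a minimal sketch of the change of variables formula, the plain-Python example below composes two invertible scalar maps and accumulates their log-determinants under a unit Gaussian prior. The two maps (`affine_step`, `sinh_step`), their parameters, and all function names are illustrative assumptions and do not come from the paper.

```python
import math

def standard_normal_logpdf(z):
    """Log-density of a unit Gaussian prior p(z) = N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

# Each flow step maps h -> (f(h), log|df/dh|). Both maps are hypothetical examples.
def affine_step(h, scale=2.0, shift=1.0):
    return scale * h + shift, math.log(abs(scale))

def sinh_step(h):
    # sinh is a bijection on the real line with derivative cosh(h) > 0.
    return math.sinh(h), math.log(math.cosh(h))

def log_likelihood(x, steps):
    """log p(x) = log p(z) + sum of per-step log|det Jacobian|, with h_0 = x."""
    h, total_logdet = x, 0.0
    for step in steps:
        h, logdet = step(h)
        total_logdet += logdet
    # After the last step, h is the latent code z.
    return standard_normal_logpdf(h) + total_logdet

flow = [affine_step, sinh_step]
print(log_likelihood(0.3, flow))
```

Because every step returns its own log-determinant, the likelihood costs one forward pass plus a running sum. This is why flow steps with cheap Jacobians are preferred.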
Each flow step function is a sequence of three operations, as follows. Given an input tensor of dimensions $h \times w \times c$:
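| Step Description | Functional Form of flow | Inverse Function of the flow | Log-determinant Expression |
|---|---|---|---|
| ActNorm | $y_{i,j} = s \odot x_{i,j} + b$ | $x_{i,j} = (y_{i,j} - b) / s$ | $h \cdot w \cdot \sum \log \lvert s \rvert$ |
| 1x1 convolution, $W \in \mathbb{R}^{c \times c}$ | $y_{i,j} = W x_{i,j}$ | $x_{i,j} = W^{-1} y_{i,j}$ | $h \cdot w \cdot \log \lvert \det W \rvert$ |
| Affine Coupling Layer | $x_a, x_b = \mathrm{split}(x)$; $(\log s, t) = \mathrm{NN}(x_b)$; $y_a = s \odot x_a + t$; $y = \mathrm{concat}(y_a, x_b)$ | $y_a, y_b = \mathrm{split}(y)$; $(\log s, t) = \mathrm{NN}(y_b)$; $x_a = (y_a - t) / s$; $x = \mathrm{concat}(x_a, y_b)$ | $\sum \log \lvert s \rvert$ |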
ActNorm. The activation normalization layer is introduced as a replacement for Batch Normalization (BN) to avoid degraded performance with small mini-batch sizes, e.g. when training with batch size 1. This layer has the same form as BN; however, the bias, $b$, and scale, $s$, are data-independent variables: they are initialized based on an initial mini-batch of data (data-dependent initialization), but are then optimized during training with the rest of the parameters, rather than estimated from the input mini-batch statistics.
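
The data-dependent initialization can be sketched as follows, assuming activations are given as a flat list of per-position channel vectors. The class name `ActNorm`, its method names, and the small epsilon added to the standard deviation are assumptions made for this illustration, not the authors' implementation.

```python
import math

class ActNorm:
    """Per-channel affine map y = s * x + b with data-dependent initialization."""

    def __init__(self, num_channels):
        self.s = [1.0] * num_channels
        self.b = [0.0] * num_channels
        self.initialized = False

    def initialize(self, pixels):
        # pixels: list of channel vectors, one per (sample, spatial position).
        n = len(pixels)
        for c in range(len(self.s)):
            values = [p[c] for p in pixels]
            mean = sum(values) / n
            var = sum((v - mean) ** 2 for v in values) / n
            std = math.sqrt(var) + 1e-6  # epsilon is an assumption, for stability
            self.s[c] = 1.0 / std
            self.b[c] = -mean / std
        self.initialized = True

    def forward(self, pixels):
        # Only the very first call uses batch statistics; afterwards s and b
        # are ordinary parameters (a real model would update them by gradient descent).
        if not self.initialized:
            self.initialize(pixels)
        out = [[s * v + b for v, s, b in zip(p, self.s, self.b)] for p in pixels]
        # Log-determinant contribution of a single spatial position.
        logdet_per_position = sum(math.log(abs(s)) for s in self.s)
        return out, logdet_per_position

    def inverse(self, pixels):
        return [[(v - b) / s for v, s, b in zip(p, self.s, self.b)] for p in pixels]

layer = ActNorm(num_channels=2)
first_batch = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
normalized, logdet = layer.forward(first_batch)
print(normalized, logdet)
```

After the first call, the output has roughly zero mean and unit variance per channel. Later batches no longer affect `s` and `b`, unlike the running statistics of BN.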
1x1 convolution. This is a simple 1x1 convolutional layer with a $c \times c$ weight matrix $W$. In particular, the cost of computing the determinant of $W$ can be reduced by writing $W$ in its LU decomposition, although this increases the number of parameters to be learned.
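
To illustrate why the LU form makes the determinant cheap, the sketch below builds a 3x3 weight matrix as a fixed permutation times a unit lower-triangular matrix times an upper-triangular matrix, then compares the sum of log-diagonal entries with a direct determinant. The numeric values and helper names (`matmul`, `det3`) are illustrative assumptions.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def det3(m):
    """Direct 3x3 determinant by cofactor expansion (reference only)."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Illustrative values for c = 3 channels.
P = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]                      # fixed permutation
L = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [-0.3, 0.2, 1.0]]   # unit lower-triangular
U = [[0.0, 0.4, -0.1], [0.0, 0.0, 0.7], [0.0, 0.0, 0.0]]   # strictly upper-triangular
diag = [1.5, -0.8, 2.0]                                    # diagonal of the upper factor

U_full = [[U[i][j] + (diag[i] if i == j else 0.0) for j in range(3)] for i in range(3)]
W = matmul(P, matmul(L, U_full))

# With the LU form, log|det W| is a sum over the c diagonal entries.
logdet_lu = sum(math.log(abs(d)) for d in diag)
logdet_direct = math.log(abs(det3(W)))
print(logdet_lu, logdet_direct)
```

The permutation only contributes a sign and the unit lower-triangular factor contributes 1, so the absolute determinant depends only on the diagonal entries.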
Affine Coupling Layer. The ACL was introduced in RealNVP. The input tensor $x$ is first split in half along the channel dimension. The second half, $x_b$, is fed through a small neural network to get parameters $s$ and $t$, and the corresponding affine transformation, $y_a = s \odot x_a + t$, is applied to the first half, $x_a$.
The rescaled $y_a$ is the actual transformed output of the layer; however, $x_b$ also has to be propagated in order to make the transformation invertible, so that $s$ and $t$ can also be recomputed in the reverse flow.
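
The coupling mechanism can be sketched on a single channel vector as below. The function `toy_nn` is a hypothetical stand-in for the small neural network; any function of the second half would do, since it is never inverted.

```python
import math

def toy_nn(x_b):
    """Hypothetical stand-in for the small network: returns (log_s, t) per channel."""
    log_s = [math.tanh(v) for v in x_b]
    t = [0.5 * v + 1.0 for v in x_b]
    return log_s, t

def coupling_forward(x):
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = toy_nn(x_b)
    y_a = [math.exp(ls) * a + ti for a, ls, ti in zip(x_a, log_s, t)]
    # The Jacobian is triangular, so log|det| = sum(log s) with s = exp(log_s) > 0.
    logdet = sum(log_s)
    return y_a + x_b, logdet  # x_b is passed through unchanged

def coupling_inverse(y):
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, t = toy_nn(y_b)  # same parameters as in the forward pass, since y_b == x_b
    x_a = [(a - ti) * math.exp(-ls) for a, ls, ti in zip(y_a, log_s, t)]
    return x_a + y_b

x = [0.2, -1.0, 0.7, 1.5]
y, logdet = coupling_forward(x)
print(y, logdet, coupling_inverse(y))
```

Only the first half is transformed in a given layer, which is why the channels need to be mixed between consecutive coupling layers.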
Finally, note that the previous 1x1 convolution can be seen as a generalized permutation of the input channels, and guarantees that different channel combinations are seen during the split operation.
These operations are then combined in a multi-scale architecture as described in RealNVP, which in particular relies on a squeezing operation to trade off spatial resolution for number of output channels. Given an input tensor of size $h \times w \times c$, the squeezing operator takes blocks of size $2 \times 2 \times c$ and flattens them to size $1 \times 1 \times 4c$, which can easily be inverted by reshaping. The final pipeline consists of $L$ levels that operate on different scales: each level is composed of $K$ flow steps and a final squeezing operation.
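
A minimal sketch of the squeezing operation and its inverse on nested lists is given below, assuming 2x2 spatial blocks. The order in which the four pixels of a block are concatenated is an assumption; actual implementations may order the channels differently.

```python
def squeeze(x):
    """Turn an h x w x c tensor (nested lists) into an h/2 x w/2 x 4c tensor."""
    h, w = len(x), len(x[0])
    assert h % 2 == 0 and w % 2 == 0, "spatial dimensions must be even"
    return [[x[2 * i][2 * j] + x[2 * i][2 * j + 1]
             + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]
             for j in range(w // 2)] for i in range(h // 2)]

def unsqueeze(y):
    """Inverse of squeeze: scatter each group of c channels back to its pixel."""
    h2, w2 = len(y), len(y[0])
    c = len(y[0][0]) // 4
    x = [[None] * (2 * w2) for _ in range(2 * h2)]
    for i in range(h2):
        for j in range(w2):
            blocks = [y[i][j][k * c:(k + 1) * c] for k in range(4)]
            (x[2 * i][2 * j], x[2 * i][2 * j + 1],
             x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]) = blocks
    return x

# A 4 x 4 x 3 tensor with distinct entries, so the round trip is easy to check.
tensor = [[[float(100 * i + 10 * j + k) for k in range(3)]
           for j in range(4)] for i in range(4)]
squeezed = squeeze(tensor)
print(len(squeezed), len(squeezed[0]), len(squeezed[0][0]))
```

The operation only rearranges entries, so it is trivially invertible and contributes nothing to the log-determinant.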
Figure: Overview of the multi-layer GLOW architecture.
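In summary, the main differences with RealNVP are the ActNorm layer, which replaces batch normalization, and the learned 1x1 convolution, which generalizes fixed channel permutations.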
In practice, the authors implement NN as a convolutional neural network of depth 3 in the ACL, which means that each flow step contains 4 convolutions in total. They also use $K = 32$ flow steps in each level. Finally, the number of levels $L$ is 3 for small-scale experiments (32x32 images) and 6 for large-scale experiments (256x256 ImageNet images).
In particular, this means that the model contains a lot of parameters ($4 \times K \times L$ convolutions, for $K$ flow steps per level and $L$ levels), which might be a practical disadvantage compared to other methods that produce samples of similar quality, e.g. GANs. However, contrary to these models, GLOW provides exact likelihood inference.
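
As a rough illustration of the model size, the helper below counts convolution layers from the description above (one 1x1 convolution plus a depth-3 network per flow step). The function name is hypothetical, and the value of 32 flow steps per level is used only as an example setting. It counts layers, not parameters.

```python
def num_convolutions(levels, steps_per_level, nn_depth=3):
    """Each flow step has one 1x1 convolution plus nn_depth convolutions in the coupling network."""
    return levels * steps_per_level * (1 + nn_depth)

# steps_per_level=32 is an assumed example value.
print(num_convolutions(levels=3, steps_per_level=32))  # 384
print(num_convolutions(levels=6, steps_per_level=32))  # 768
```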
GLOW outperforms RealNVP in terms of data likelihood, as evaluated on standard benchmarks (ImageNet, CIFAR-10, LSUN). In particular, the 1x1 convolution performs better than other, more specific permutation operations, and only introduces a small computational overhead.
Qualitatively, the samples are of great quality and the model seems to scale well with higher resolution. However, this greatly increases the memory requirements. Leveraging the model’s invertibility to avoid storing activations during the feed-forward pass, as done in prior work on reversible networks, could (partially) alleviate the problem.