A Style-Based Generator Architecture for Generative Adversarial Networks
StyleGAN, a novel generator architecture for Generative Adversarial Networks that disentangles high-level attributes (such as pose and identity when trained on human faces) from stochastic variation (such as freckles, hair) in the generated images. The architecture is based on mapping the latent code through a mapping network before injecting style at different scales.
- Pros (+): State-of-the-art image quality, disentangled latent space with nice interpolation properties.
- Cons (-): High computational cost, complex architecture with many hyperparameters.
Proposed
Discrete latent space
The model is based on VAE [1], where image is generated from random latent variable by a decoder . The posterior (encoder) captures the latent variable distribution and is generally trained to match a certain distribution from which is sampled from at inference time.
Contrary to the standard framework, in this work the latent space is discrete, i.e., where is the number of codes in the latent space and their dimensionality. More precisely, the input image is first fed to , that outputs a continuous vector, which is then mapped to one of the latent codes in the discrete space via nearest-neighbor search.
Adapting the to this formalism, the KL divergence term greatly simplifies and we obtain:
In practice, the authors use a categorical uniform prior for the latent codes, meaning the KL divergence is constant and the objective reduces to the reconstruction loss.
Figure: A figure describing the VQ-VAE (left). Visualization of the embedding space (right)). The output of the encoder z(x) is mapped to the nearest point. The gradient (in red) will push the
encoder to change its output, which could alter the configuration, hence the code assignment, in the next forward pass.
Training Objective
As we mentioned previously, the objective reduces to the reconstruction loss and is used to learn the encoder and decoder parameters. However the mapping from to is not straight-forward differentiable (Equation (1)). To palliate this, the authors use a straight-through estimator, meaning the gradients from the decoder input (quantized) are directly copied to the encoder output (continuous). However, this means that the latent codes that intervene in the mapping from to do not receive gradient updates that way.
Hence in order to train the discrete embedding space, the authors propose to use Vector Quantization (VQ), a dictionary learning technique, which uses mean squared error to make the latent code closer to the continuous vector it was matched to:
where denotes the stop gradient operator. The first term is the reconstruction loss stemming from the ELBO, the second term is the vector quantization contribution. Finally, the last term is a commitment loss to control the volume of the latent space by forcing the encoder to “commit” to the latent code it matched with, and not grow its output space unbounded.
Learned Prior
A second contribution of this work consists in learning the prior distribution. As mentioned, during the training phase, the prior is a uniform categorical distribution. After the training is done, we fit an autoregressive distribution over the space of latent codes. This is in particular enabled by the fact that the latent space is discrete.
Note: It is not clear to me if the autoregressive model is trained on latent codes sampled from the prior or from the encoder distribution
Experiments
The proposed model is mostly compared to the standard continuous VAE framework. It seems to achieve similar log-likelihood and sample quality, while taking advantage of the discrete latent space. In particular
For ImageNet for instance, they consider latent codes with dimensions . The output of the fully-convolutional encoder is a feature map of size which is then quantized pixel-wise. Interestingly, the model still performs well when using a powerful decoder (here, PixelCNN [2]) which seems to indicate it does not suffer from posterior collapse as strongly as the standard continuous VAE.
A second set of experiments tackles the problem of audio modeling. The performance of the model are once again satisfying. Furthermore, it does seem like the discrete latent space actually captures relevant characteristics of the input data structure, although this is a purely qualitative observation.
References
- [1] Autoencoding Variational Bayes, Kingma and Welling, ICLR 2014
- [2] Pixel Recurrent Neural Networks, van den Oord et al, arXiv 2016