Representation Learning

The Variational Fair Autoencoder

Louizos et al., ICLR 2016
This work proposes a model based on the variational autoencoder (VAE) that learns latent representations independent of sensitive information present in the data, while retaining enough information to solve the task at hand, e.g. classification. The independence constraint is enforced through a loss term based on the Maximum Mean Discrepancy (MMD).
  • Pros (+): Well justified, fast implementation trick, semi-supervised setting.
  • Cons (-): Requires explicit knowledge of the sensitive attribute.

Proposed

Given input data $x$, the goal is to learn a representation of $x$ that factorizes out nuisance or sensitive variables $s$ while retaining the task-relevant content $z$. Working in the variational autoencoder (VAE) framework, this is modeled as a generative process:

\begin{align}
z &\sim p(z)\\
x &\sim p_{\theta}(x \mid z, s)
\end{align}

where the prior on the latent $z$ is explicitly made invariant to the variables to filter out, $s$. Introducing an encoder $q_{\phi}: x, s \mapsto z$, this model can be trained using the standard variational lower bound objective ($\mathcal{L}_{\text{ELBO}}$) [1].

Semi-supervised model

To make the learned representations relevant to a specific task, the authors propose to incorporate label knowledge during the feature learning stage. This is particularly useful when the task label $y$ is correlated with the sensitive information $s$: in that case, an unsupervised model that only tries to remove $s$ could produce representations that are uninformative about $y$. In practice, this is done by considering two distinct, independent sources of information for the latent: $y$, the label of data point $x$ (a categorical variable in the classification scenario), and a continuous variable $c$ that captures the remaining variations in the data, which depend neither on $y$ nor on $s$. This yields the following generative process:

\begin{align}
y &\sim \text{Cat}(y) \text{ and } c \sim p(c)\\
z &\sim p_{\theta}(z \mid y, c)\\
x &\sim p_{\theta}(x \mid z, s)
\end{align}
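
The generative process above amounts to ancestral sampling, which can be sketched as follows. This is only an illustration: the dimensions, the linear maps `W_zy`, `W_zc`, `W_xz`, `W_xs`, the uniform categorical prior, the standard Gaussian prior on $c$ and the fixed noise scale are all assumptions made here, whereas the paper parameterizes the conditionals with neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_CLASSES, C_DIM, Z_DIM, X_DIM = 3, 2, 4, 6

# Hypothetical linear maps standing in for the networks p_theta.
W_zy = rng.normal(size=(N_CLASSES, Z_DIM))  # label -> contribution to the mean of z
W_zc = rng.normal(size=(C_DIM, Z_DIM))      # c -> contribution to the mean of z
W_xz = rng.normal(size=(Z_DIM, X_DIM))      # z -> contribution to the mean of x
W_xs = rng.normal(size=(1, X_DIM))          # s -> contribution to the mean of x


def sample_x(s, n_samples=1, noise=0.1):
    """Ancestral sampling: y ~ Cat, c ~ p(c), z ~ p(z | y, c), x ~ p(x | z, s)."""
    y = rng.integers(N_CLASSES, size=n_samples)        # assumed uniform categorical prior
    c = rng.normal(size=(n_samples, C_DIM))            # assumed prior c ~ N(0, I)
    z_mean = W_zy[y] + c @ W_zc
    z = z_mean + noise * rng.normal(size=z_mean.shape)
    s_col = np.full((n_samples, 1), float(s))
    x_mean = z @ W_xz + s_col @ W_xs                   # s only enters at this last step
    x = x_mean + noise * rng.normal(size=x_mean.shape)
    return x, z, y, c
```

Note that $s$ is only used when generating $x$, so $z$ is sampled without any access to the sensitive variable.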

In the fully supervised setting, when the label $y$ is known, the ELBO extends naturally to this two-stage latent variable model. Written as a loss to minimize, it reads:

\begin{align}
\mathcal{L}_{\text{ELBO}}(x, s, y) = & - \mathbb{E}_{z \sim q_{\phi}(\cdot \mid x, s)} \left[ \log p_{\theta}(x \mid z, s) \right] \tag{reconstruct}\\
& + \mathbb{E}_{z \sim q_{\phi}(\cdot \mid x, s)} \text{KL}(q_{\phi}(c \mid z, y) \,\|\, p(c)) \tag{prior on c}\\
& + \mathbb{E}_{c \sim q_{\phi}(\cdot \mid z, y)} \text{KL}(q_{\phi}(z \mid x, s) \,\|\, p_{\theta}(z \mid c, y)) \tag{prior on z}
\end{align}
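
As a rough sketch of how the three terms could be evaluated, assume every distribution involved is a diagonal Gaussian, $p(c)$ is a standard Gaussian, and each expectation is approximated with a single sample. The function names and the `(mean, log-variance)` convention below are hypothetical, and the Gaussian parameters are assumed to come from encoder, decoder and prior networks that are not shown.

```python
import numpy as np


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p)))), summed over dims."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0, axis=-1
    )


def gaussian_nll(x, x_mean, sigma=1.0):
    """-log N(x; x_mean, sigma^2 I), summed over dims (assumed Gaussian likelihood)."""
    d = x.shape[-1]
    sq_err = np.sum((x - x_mean) ** 2, axis=-1)
    return 0.5 * sq_err / sigma**2 + 0.5 * d * np.log(2.0 * np.pi * sigma**2)


def supervised_loss(x, x_mean, q_z, q_c, p_z):
    """One-sample estimate of the loss; q_z, q_c, p_z are (mean, log-variance) pairs.

    x_mean is the decoder mean computed from a sampled z and the observed s.
    """
    zeros = np.zeros_like(q_c[0])
    reconstruct = gaussian_nll(x, x_mean)                            # (reconstruct)
    prior_on_c = kl_diag_gaussians(q_c[0], q_c[1], zeros, zeros)     # (prior on c), p(c) = N(0, I)
    prior_on_z = kl_diag_gaussians(q_z[0], q_z[1], p_z[0], p_z[1])   # (prior on z)
    return reconstruct + prior_on_c + prior_on_z
```

The function returns one loss value per data point; averaging over the batch gives the training objective under these assumptions.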

In the semi-supervised scenario, the label $y$ can be missing for some of the samples. In that case, the ELBO can again be extended by treating $y$ as another latent variable, with encoder $q_{\phi}(y \mid z)$ and prior $p(y)$.

MMD objective

So far, the model builds explicit statistical independence between the sensitive variables to protect, $s$, and the latent information to capture from the input data, $z$, into its generative process. The authors additionally propose to regularize the marginal posterior $q_{\phi}(z \mid s)$ using the Maximum Mean Discrepancy (MMD) [2]. The MMD is a distance between probability distributions that compares mean statistics, and it is often combined with the kernel trick for efficient computation. Since the latent representation should not carry any information about $s$, the constraint can be enforced by making the distributions $q_{\phi}(z \mid s = 0)$ and $q_{\phi}(z \mid s = 1)$ close to one another, as measured by their MMD. This introduces a second loss term to minimize, $\mathcal{L}_{\text{MMD}}$.

The MMD is defined as a difference of mean statistics over two distributions. Empirically, given two sample sets $\mathbf{x^0} \sim P_0$ and $\mathbf{x^1} \sim P_1$ and a feature map $\phi$, the squared distance can be written as

\begin{align}
&\text{MMD}^2(P_0, P_1) = \left\| \frac{1}{|\mathbf{x^0}|} \sum_{i} \phi(\mathbf{x^0}_i) - \frac{1}{|\mathbf{x^1}|} \sum_{i} \phi(\mathbf{x^1}_i) \right\|^2\\
&= \frac{1}{|\mathbf{x^0}|^2} \left\| \sum_{i} \phi(\mathbf{x^0}_i) \right\|^2 + \frac{1}{|\mathbf{x^1}|^2} \left\| \sum_{i} \phi(\mathbf{x^1}_i) \right\|^2 - \frac{2}{|\mathbf{x^0}| |\mathbf{x^1}|} \sum_i \sum_j \langle \phi(\mathbf{x^0}_i), \phi(\mathbf{x^1}_j) \rangle\\
&= \frac{1}{|\mathbf{x^0}|^2} \sum_i \sum_j k(\mathbf{x^0}_i, \mathbf{x^0}_j) + \frac{1}{|\mathbf{x^1}|^2} \sum_i \sum_j k(\mathbf{x^1}_i, \mathbf{x^1}_j) - \frac{2}{|\mathbf{x^0}| |\mathbf{x^1}|} \sum_i \sum_j k(\mathbf{x^0}_i, \mathbf{x^1}_j)
\end{align}

where the last line follows from the kernel trick and $k: x, y \mapsto \langle \phi(x), \phi(y) \rangle$ is an arbitrary kernel function. In practice, the MMD loss term is computed over each batch. A naive implementation would compute the full kernel matrix of $k(x_i, x_j)$ for every pair of samples in the batch, so the number of kernel evaluations grows quadratically with the batch size, and each evaluation costs on the order of $K$ operations, where $K$ is the dimensionality of $x$. Instead, following [6], the authors use Random Kitchen Sinks [5], a low-rank random feature approximation of the kernel that makes the MMD loss cheaper to compute.
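
To make the difference between the two strategies concrete, here is a minimal NumPy sketch comparing the full kernel-matrix estimate with a random-feature estimate. It assumes a Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / \gamma)$ and a standard random Fourier feature construction; the function names, the bandwidth `gamma` and the way `w` and `b` are drawn are assumptions of this sketch, not necessarily the exact variant used by the authors.

```python
import numpy as np


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel matrix with entries k(a_i, b_j) = exp(-||a_i - b_j||^2 / gamma)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / gamma)


def mmd2_exact(x0, x1, gamma=1.0):
    """Squared MMD from full kernel matrices (quadratic in the batch size)."""
    return (
        rbf_kernel(x0, x0, gamma).mean()
        + rbf_kernel(x1, x1, gamma).mean()
        - 2.0 * rbf_kernel(x0, x1, gamma).mean()
    )


def random_features(x, w, b, gamma=1.0):
    """Random Fourier features psi(x) = sqrt(2 / D) * cos(sqrt(2 / gamma) * x W + b)."""
    n_features = w.shape[1]
    return np.sqrt(2.0 / n_features) * np.cos(np.sqrt(2.0 / gamma) * x @ w + b)


def mmd2_random_features(x0, x1, w, b, gamma=1.0):
    """Squared MMD as the distance between mean feature vectors (linear in the batch size)."""
    mean0 = random_features(x0, w, b, gamma).mean(axis=0)
    mean1 = random_features(x1, w, b, gamma).mean(axis=0)
    diff = mean0 - mean1
    return float(diff @ diff)
```

Here `w` is expected to be a $K \times D$ matrix with standard Gaussian entries and `b` a vector of $D$ offsets drawn uniformly from $[0, 2\pi]$, where $D$ is the (hypothetical) number of random features. The approximation applies the first line of the derivation directly with an explicit, finite-dimensional feature map, so no pairwise kernel matrix is ever formed.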

Experiments

The authors consider three experimental scenarios to test the proposed Variational Fair Autoencoder (VFAE):

  • Fairness: Here the goal is to learn a classifier while making the latent representation independent of the protected information $s$, which is a given attribute of the data (e.g., predicting “income” independently of “age”). In all the datasets considered, $s$ is correlated with the ground-truth label $y$, which makes the setting more challenging.
  • Domain adaptation: Using the Amazon Reviews dataset, the task is to classify reviews as either positive or negative, given a labeled source domain and an unlabeled target domain, while being independent of the domain (e.g., books, DVDs, electronics…).
  • Invariant representation: Finally, the last task is face identity recognition while being explicitly invariant to various nuisance features (lighting, pose, etc.).

The models are evaluated on (i) their performance on the target task and (ii) their ignorance of the sensitive data. The latter is evaluated by training a classifier to predict $s$ from $z$ and measuring its performance: the closer this classifier is to chance level, the less information about $s$ the representation retains.
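
The second criterion can be illustrated with a small probe classifier. The sketch below fits a logistic regression from $z$ to a binary $s$ with plain gradient descent; the function names, learning rate and number of steps are hypothetical choices, and this is not necessarily the classifier used in the paper.

```python
import numpy as np


def fit_logistic_probe(z, s, lr=0.5, n_steps=2000):
    """Fit s ~ sigmoid(z w + b) by batch gradient descent on the log-loss."""
    w, b = np.zeros(z.shape[1]), 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(z @ w + b)))
        grad = p - s                      # derivative of the log-loss w.r.t. each logit
        w -= lr * z.T @ grad / len(s)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(z, s, w, b):
    """Fraction of points whose sensitive attribute is predicted correctly."""
    return float(np.mean(((z @ w + b) > 0) == (s > 0.5)))
```

For a representation that leaks $s$, the probe accuracy should be well above chance, while for an invariant representation it should stay close to the frequency of the majority class. In a real evaluation the accuracy would be measured on held-out data rather than on the points used to fit the probe.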

On the Fairness task, the proposed method seems to be better at ignoring the sensitive information, although its trade-off between accuracy and fairness is not always the best. The main baseline is the model presented in [3], which also exploits a moment matching penalty term, although a simpler one, and does not use the VAE framework. In the Domain Adaptation task, the authors mostly compare to DANN [4], which proposes a simple adversarial training technique to align the representations of different domains. The VFAE results are on par, or even slightly better, in most scenarios.

References

  • [1] Autoencoding Variational Bayes, Kingma and Welling, ICLR 2014
  • [2] A Kernel Method for the Two-Sample-Problem, Gretton et al., NeurIPS 2006
  • [3] Learning Fair Representations, Zemel et al., ICML 2013
  • [4] Domain-Adversarial Training of Neural Networks, Ganin et al., JMLR 2016
  • [5] Random Features for Large-Scale Kernel Machines, Rahimi and Recht, NeurIPS 2007
  • [6] FastMMD: Ensemble of Circular Discrepancy for Efficient Two-Sample Test, Zhao and Meng, Neural Computation 2015