14 May 2019

# Automatically Composing Representation Transformations as a Mean for Generalization

Chang et al, ICLR 2019, [link]
tags: domain generalization - iclr - 2019

The authors focus on solving recursive tasks which can be decomposed into a sequence of simpler algorithmic procedures (e.g., arithmetic problems, geometric transformations). The main difficulties of this approach are (i) how to actually decompose the task into simpler blocks and (ii) how to extrapolate to more complex problems from learning on simpler individual tasks. The authors propose the compositional recursive learner (CRL) to learn at the same time both the structure of the task and its components.
• Pros (+): This problem is well motivated, and seems a very promising direction, for learning domain-agnostic components.
• Cons (-): The actual implementation description lacks crucial details and I am not sure how easy it would be to reimplement.

### Proposed model

#### Problem definition

A problem $P_i$ is defined as a transformation $x_i : t_x \mapsto y_i : t_y$, where $t_x$ and $t_y$ are the respective types of $x$ and $y$. However since we only consider recursive problem here, then $t_x = t_y$. We define a family of problems $\mathcal P$ as a set of composite recursive problems that share regularities. The goal of CRL is to extrapolate to solve new compositions of these tasks, using knowledge from the limited subset of tasks it has seen during training.

#### Implementation

In essence, the problem can be formulated as a sequence-decision making task via a meta-level MDP ($\mathcal X$, $\mathcal F$, $\mathcal P_{\mbox{meta}}$, r, $\gamma$), where $\mathcal X$ is the set of states, i.e., representations; $\mathcal F$ is a set of computations, i.e., istances of the transformations we consider, for instance as neural networks, and an additional special function HALT that stops the execution; $\mathcal P_{\mbox{meta}}: (x_t, f_t, x_{t + 1}) \mapsto c \in [0, 1]$ is the policy which assigns a probability to each possible transition. Finally $r$ is the reward function and $\gamma$ a decay factor.

More specifically, the CRL is implemented as a set of neural networks, $f_k \in \mathcal F$, and a controller $\pi(f\ |\ \mathbf{x}, t_y)$ which selects the best course of action given the current history of representations $\mathbf{x}$ and target type $t_y$. The loss is back-propagated through the functions $f$, and the controller is trained as a Reinforcement Learning (RL) agent with a sparse reward (it only knows the final target result). An additional important training scheme is the use of curriculum learning i.e., start by learning small transformations and then consider more complex compositions, increasing the state space little by little.

### Experiments

#### Multilingual Arithmetic

The learner will aim to solve recursive arithmetic expressions across 6 languages: English, Numerals, PigLatin, Reversed-English, Spanish. The input is a tuple $(x^s, t_y)$, where $x$ is the arithmetic expression expressed in source language $s$, and $t_y$ is the output language.

• Training: The learner trains on a curriculum of a limited set of 2, 3, 4, 5-length expressions. During training, each source language is seen with four target languages (and one held out for testing) and each target language is seen with four source languages (and one held out for testing).

• Testing: The learner is asked to generalize to 5-length expressions (test set) and to extrapolate to 10-length expressions (extrapolation set) with unseen language pairs.

The authors consider two main types of functional units for this task: A reducer, which takes as input a window of three terms in the input expression and outputs a softmax distribution over the vocabulary. While a translator applies a function to every element of the input sequence and outputs a sequence of the same size.

The CRL is compared to a baseline RNN architecture that directly tries to map a variable length input sequence to the target output. On the test set, RNN and CRL yield similar accuracies although CRL usually requires less training samples and/or less training iterations. On the extrapolation set however, CRL more clearly outperforms RNN. Interestingly the CRL results usually have a much bigger variance which would be interesting to qualitatively analyze. Moreover, the use of curriculum learning significantly improves the model performance. Finally, qualitative results show that the reducers and translators are interpretable to some degree: e.g., it is possible to map some of the reducers to specific operations, however due to the unsupervised nature of the task, the mapping is not always straight-forward.

#### Image Transformations

This time the functional units are composed of three specialized Spatial Transformer Networks [1] to learn rotation, scale and translation, and an identity function. Overall this setting does not yield very good quantitative results. More precisely, one of the main challenges, since we are acting on a visual domain, is to deduce the structure of the task from information which lacks clear structure (pixel matrices). Additionally the fact that all inputs and outputs have the same domain (images) and that only a sparse reward is available make it more difficult for the controller to distinguish between functionalities, i.e., it could collapse to using only one transformer.

### References

• [1] Spatial Transformer Networks, Jaderberg et al., NeurIPS 2016