Convergence of Online Adaptive and Recurrent Optimization Algorithms

This work was done with Yann Ollivier. The preprint is available on HAL and on arXiv.
The "Real Time Recurrent Learning" (RTRL) algorithm is a classical online optimisation algorithm for parameterised dynamical systems, which conducts a gradient descent on the parameter. It models wide classes of training algorithms, especially in machine learning. We describe it here when applied to a recurrent neural network, and present our results establishing the convergence of the training procedure.

Let us consider a recurrent neural network, whose activities at time $t$, $a_t(1)$, $a_t(2),\,\ldots,\,a_t(\text{number of neurons})$, are updated at each time step according to $$a_{t}=\mathrm{F}_\text{eedforward}(a_{t-1},\,w,\,\text{input}^t),$$ where $w$ is the set of weights of the connections between neurons, and $\mathrm{F}_\text{eedforward}$ is the feedforward pass in the neural net.
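To fix ideas, here is a minimal sketch of such a recurrent update (a hypothetical tanh network; the weight layout and dimensions are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def feedforward(a_prev, w, x):
    """One recurrent update a_t = tanh(W_rec a_{t-1} + W_in input_t);
    the dict layout of `w` is an illustrative choice."""
    return np.tanh(w["rec"] @ a_prev + w["in"] @ x)

rng = np.random.default_rng(0)
n_neurons, n_inputs = 4, 3
w = {"rec": 0.1 * rng.standard_normal((n_neurons, n_neurons)),
     "in":  0.1 * rng.standard_normal((n_neurons, n_inputs))}

a = np.zeros(n_neurons)
for t in range(10):                      # run the recurrence on random inputs
    a = feedforward(a, w, rng.standard_normal(n_inputs))
print(a.shape)  # (4,)
```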

To assess the quality of the weights with respect to the task at hand, we consider a loss function $\ell$, which evaluates the quality of the prediction performed by a set of activities $a=(a(1),\ldots,a(\text{number of neurons}))$ on some training example, by computing a real number $\ell(\text{example},a)$.

Let us denote by $\mathbf{a}_t(w)$ the set of activities of the network at time $t$ computed with weights $w$, and by $\ell_{\leadsto t}(w)=\ell(\text{example}^t,\mathbf{a}_t(w))$ the loss at time $t$ seen as a function of the weights $w$. Optimising the weights through gradient descent therefore amounts to performing weight updates of the form $$ w \leftarrow w - \text{stepsize} \times \frac{\partial \ell_{\leadsto t}}{\partial \, w}(w).$$ Since we are working with a recurrent system, computing this derivative involves going backwards in time, using backpropagation through time.

Differentiating the loss by backpropagation through time

We know that $$\frac{\partial \ell_{\leadsto t}}{\partial \, w}(w) = \frac{\partial \ell}{\partial a}(\text{example}^t,\mathbf{a}_t(w)) \cdot \frac{\partial \mathbf{a}_t}{\partial w}(w).$$

The differential of the activities at time $t$ has to be computed through backpropagation through time: $$ \frac{\partial \mathbf{a}_{t}}{\partial \, w}(w) =\frac{\partial F_{\text{eedforward}}}{\partial a}(\mathbf{a}_{t-1},\,w,\,\text{input}^t) \cdot \frac{\partial \mathbf{a}_{t-1}}{\partial \, w}(w) +\frac{\partial F_{\text{eedforward}}}{\partial w}(\mathbf{a}_{t-1},\,w,\,\text{input}^t). $$

Once we know how to differentiate the losses $\ell_{\leadsto t}$, we may define the RTRL algorithm.

The Real Time Recurrent Learning algorithm

The RTRL algorithm proceeds online by computing an approximation of the differential of the activities with respect to the weights, that is, of their Jacobian. It does so by mimicking the backpropagation through time update. In the updates below, the approximation coincides with the true Jacobian at order 0 in the stepsize: when the weights are not updated, RTRL computes the correct Jacobian.

The RTRL algorithm It maintains a set of activities $a_t$, weights $w_t$, and a Jacobian estimate $J_t$, initialised at $J_0=0$, subject to the evolution equations, for $t\geq 1$, $$ \left\{\begin{array}{l} a_{t}=F_{\text{eedforward}}(a_{t-1},\,w_{t-1},\,\text{input}^t) \\ J_t=\frac{\partial F_{\text{eedforward}}}{\partial a}(a_{t-1},\,w_{t-1},\,\text{input}^t) \cdot J_{t-1} +\frac{\partial F_{\text{eedforward}}}{\partial w}(a_{t-1},\,w_{t-1},\,\text{input}^t)\\ \text{gradient}_t= \frac{\partial \ell}{\partial a}(\text{example}^t,a_{t}) \cdot J_t \\ w_t=w_{t-1}-\text{stepsize}\times \text{gradient}_t. \end{array}\right. $$
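These four equations can be transcribed directly, say for a small tanh network $a_t=\tanh(W a_{t-1} + \text{input}^t)$ with a squared loss (an illustrative instance of ours; the network, the loss, and the row-major weight layout are assumptions). The sanity check at the end illustrates the order-0 claim above: with the stepsize set to zero, the maintained $J_t$ coincides with the exact Jacobian, here recovered by finite differences.

```python
import numpy as np

def rtrl_step(a, W, J, x, target, stepsize):
    """One RTRL update for the network a_t = tanh(W a_{t-1} + x_t), with
    squared loss and weight vector w = W.flatten() (row-major)."""
    n = a.size
    a_new = np.tanh(W @ a + x)                            # feedforward pass
    d = 1.0 - a_new ** 2                                  # tanh'
    dF_da = d[:, None] * W                                # dF/da
    dF_dw = d[:, None] * np.kron(np.eye(n), a[None, :])   # dF/dw
    J_new = dF_da @ J + dF_dw                             # Jacobian estimate
    grad = (a_new - target) @ J_new                       # dl/da . J
    W_new = W - stepsize * grad.reshape(n, n)
    return a_new, W_new, J_new

# Order-0 sanity check: with stepsize 0 (weights never updated), the estimate
# J coincides with the exact Jacobian of a_T w.r.t. w, recovered by finite
# differences.
n = 3
rng = np.random.default_rng(1)
W0 = 0.3 * rng.standard_normal((n, n))
xs = rng.standard_normal((10, n))

a, W, J = np.zeros(n), W0.copy(), np.zeros((n, n * n))
for x in xs:
    a, W, J = rtrl_step(a, W, J, x, np.zeros(n), stepsize=0.0)

def final_activity(w_flat):
    a = np.zeros(n)
    for x in xs:
        a = np.tanh(w_flat.reshape(n, n) @ a + x)
    return a

eps, w0 = 1e-5, W0.flatten()
J_fd = np.column_stack([
    (final_activity(w0 + eps * e) - final_activity(w0 - eps * e)) / (2 * eps)
    for e in np.eye(n * n)])
err = np.max(np.abs(J - J_fd))
print(err < 1e-6)  # True
```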

To prove its convergence, we first need a notion of locally optimal weights.

In independent and identically distributed settings, we would require that the common expectation of the gradients of the losses $\ell_{\leadsto t}$ vanish, and that the common expectation of the Hessians of the losses be positive definite.

However, since the inputs may issue from potentially infinite time series, exhibiting potentially unbounded time dependencies, we instead require that temporal averages of the gradients and Hessians of the losses satisfy similar first- and second-order conditions.

Local Optimum We therefore call a local optimum of the network a weight vector $w^*$ such that there exist some $0 \leq a \lt 1$ and some positive definite matrix $H$ such that, when $T\to\infty$, we have $$ \frac{1}{T} \sum_{t=0}^{T} \frac{\partial \ell_{\leadsto t}}{\partial w}(w^*) = O\left(\frac{T^a}{T}\right), $$ and $$ \frac{1}{T} \sum_{t=0}^{T} \frac{\partial^2 \ell_{\leadsto t}}{\partial w^2}(w^*) = H + O\left(\frac{T^a}{T}\right). $$

The exponent $a$ quantifies the speed at which the system averages itself towards its mean behaviour at the local optimum.

In turn, this speed determines the admissible range of stepsizes one may use during optimisation. Classical Robbins and Monro conditions require that the stepsize sequence $(\eta_t)$ satisfies $\sum_{t}\eta_t = \infty$, and $\sum_{t} \eta_t^2 \lt \infty$. The second condition stems from the fact that, in an i.i.d. setting, the averaged dynamics at equilibrium satisfies the conditions above for any $a\gt 1/2$.
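As a standard worked example (classical, and not specific to the preprint), the power-law schedules $\eta_t = t^{-b}$ satisfy both Robbins and Monro conditions exactly when $1/2 \lt b \leq 1$: $$\sum_{t\geq 1} t^{-b} = \infty \ \text{ whenever } b\leq 1, \qquad \sum_{t\geq 1} t^{-2b} \lt \infty \ \text{ whenever } b \gt 1/2.$$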

Therefore, our definition of local optimum extends these conditions to more general dynamics.

For the neural network to be able to "learn" something, it must, when run with the optimal weights, forget past activities which are not the optimal ones. We therefore require that it have spectral radius less than one. This is a well-known stability condition in the study of dynamical systems.

A linear mapping $A$ has spectral radius (strictly) less than $1$ when all its eigenvalues have modulus (strictly) less than $1$. In that case, there exists some integer $k\geq 1$ such that $A^k$ has operator norm less than $1$.

Since we work in a nonlinear, time-inhomogeneous setting, we extend the definition slightly, and say the network has spectral radius less than $1$ if there exist $0 \lt \alpha \leq 1$, and an integer $k\geq 1$, such that any product of $k$ consecutive differentials $\frac{\partial F_{\text{eedforward}}}{\partial a}(a_{t-1}^*,w^*,\text{input}^t)$ has operator norm less than $1-\alpha$. Here, the activities $a_t^*$ designate the optimal sequence of activities, defined by $a_t^*=F_{\text{eedforward}}(a_{t-1}^*,w^*,\text{input}^t)$, for $t\geq 1$, where $w^*$ and $a_0^*$ are introduced in the definition of local optimum.
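For intuition in the linear, time-homogeneous case (a numerical illustration of ours, not taken from the paper): a matrix may have spectral radius below one while its operator norm exceeds one, so the contraction only becomes visible at a horizon $k$ greater than one.

```python
import numpy as np

# Spectral radius 0.9 < 1 (both eigenvalues are 0.9), yet the operator
# norm is about 10 because of the large off-diagonal entry.
A = np.array([[0.9, 10.0],
              [0.0,  0.9]])

spec_radius = np.max(np.abs(np.linalg.eigvals(A)))
norm_A = np.linalg.norm(A, 2)

# find the smallest horizon k with ||A^k|| < 1
P, k = A.copy(), 1
while np.linalg.norm(P, 2) >= 1.0:
    P, k = P @ A, k + 1
print(spec_radius < 1, norm_A > 1, np.linalg.norm(P, 2) < 1)  # True True True
```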

Together with regularity assumptions in the vicinity of the optimal weights and optimal activities, the spectral radius condition enforces stability of the system: provided the weights remain close to the optimal weights, the activities remain close to the optimal activities.

Since the spectral radius condition is valid only at a horizon potentially greater than one, the sets which are guaranteed to contain the activities are more complicated than simple balls: in our work, we call them stable tubes.

To prove convergence, we construct an abstract model of a gradient descent algorithm applied to a dynamical system.

Our abstract model first comprises a memory variable, which represents everything that is maintained in memory by the system (in the case of neural networks, the activities and the Jacobian estimate). The behaviour of the latter depends on a parameter (in the case of neural networks, the weights). The transition operator takes as inputs the memory variable and the parameter, and updates the memory.

Then, an operator computes gradients, which are eventually fed to an operator which updates the parameter.

The abstract algorithm thus reads as follows.

Abstract training algorithm It maintains a memory variable and a parameter, subject to the evolution equations $$ \left\{\begin{array}{l} \text{memory}\leftarrow \text{Transition}(\text{memory, parameter}) \\ \text{gradient} = \text{Gradient computation}(\text{memory, parameter})\\ \text{parameter} \leftarrow \text{Update}(\text{parameter, gradient}). \end{array}\right. $$
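A direct transcription of this loop might look as follows (a sketch; the toy transition, gradient, and update operators below are ours, chosen only to make the loop run):

```python
import random

def train(memory, parameter, transition, gradient_computation, update, n_steps):
    """Abstract online training loop: a direct transcription of the three
    evolution equations; the concrete operators are supplied by the caller."""
    for _ in range(n_steps):
        memory = transition(memory, parameter)
        grad = gradient_computation(memory, parameter)
        parameter = update(parameter, grad)
    return memory, parameter

# Toy instantiation (illustrative, not from the paper): the memory is a
# running average of noisy observations of a target value, and the parameter
# tracks it by gradient descent on the squared error.
random.seed(0)
target = 3.0
mem, par = train(
    0.0, 0.0,
    transition=lambda m, p: 0.9 * m + 0.1 * (target + random.gauss(0, 0.01)),
    gradient_computation=lambda m, p: p - m,
    update=lambda p, g: p - 0.1 * g,
    n_steps=2000)
print(abs(par - target) < 0.1)  # True
```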

One of the main difficulties in showing this abstract algorithm converges is due to "everything moving at the same time": the memory influences the parameter update, through the computation of the gradient, but its update itself depends on the parameter.

To remedy this, we compare the true abstract trajectory to what we call an open-loop trajectory. In the latter, parameter updates are computed, but not applied: the memory and the gradients are computed with a fixed parameter. We compare these two trajectories on time intervals whose lengths depend on the dynamics of the losses.

The open-loop trajectory is amenable to analysis, and its difference with the true trajectory is of second order in the quantity controlling convergence, which allows us to obtain convergence of the true trajectory.

Since the RTRL algorithm maintains an approximation of the differential of the state of the system with respect to the parameter, its storage cost quickly becomes prohibitive, even for moderately large systems.

The NoBackTrack and UORO (for "Unbiased Online Recurrent Optimisation") algorithms replace this differential by an unbiased, rank-one, random approximation.

They thus reduce the required memory cost, at the expense of introducing noise in the learning process. Our results show that this does not prevent convergence: they converge to the same optimum as RTRL with probability arbitrarily close to one, provided the stepsize is small enough.
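The core trick behind such estimators can be sketched as follows (a simplified, variance-unreduced version of ours; NoBackTrack and UORO additionally rescale the two rank-one factors to control the variance): if $\nu$ is a vector of independent random signs, then $\mathbb{E}[\nu\,\nu^\top J]=J$, while $\nu\,(\nu^\top J)$ is rank one and only requires storing two vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((4, 6))   # stands in for the full Jacobian RTRL stores

def rank_one_estimate(J, rng):
    """Unbiased rank-one estimate of J: E[nu nu^T J] = J since E[nu nu^T] = I."""
    nu = rng.choice([-1.0, 1.0], size=J.shape[0])
    return np.outer(nu, nu @ J)

# averaging many independent estimates recovers J
avg = np.mean([rank_one_estimate(J, rng) for _ in range(20000)], axis=0)
print(np.max(np.abs(avg - J)) < 0.15)  # True
```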

On each interval on which we compare the true trajectory to the open-loop one, the noise added by NoBackTrack or UORO is of size $$ \text{stepsize}\,\sum_{\text{interval}} \, \xi_t, $$ where the $\xi_t$'s are uncorrelated random variables.

As a result, the size of the perturbation is $\text{stepsize}\times\sqrt{\text{length of interval}}$, which is negligible compared to the quantity controlling convergence, equal to $\text{stepsize}\times({\text{length of interval}})$.
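This square-root scaling of uncorrelated noise is easy to check numerically (an illustration of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
stepsize, L = 1e-3, 10_000
xi = rng.choice([-1.0, 1.0], size=L)    # uncorrelated, mean-zero noise terms

noise = stepsize * abs(xi.sum())        # typical size: stepsize * sqrt(L)
drift = stepsize * L                    # the quantity controlling convergence
print(noise < drift)  # True: on long intervals the noise is negligible
```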


Learning the Learning Rate algorithm

The preprint available here, written with Yann Ollivier, presents this algorithm as well as theoretical motivations for it. The algorithm was devised by Yann Ollivier.
The stochastic gradient algorithm and its variants are the main optimisation algorithms used to train deep systems.
To use one of them, the practitioner must choose a numerical value for the descent step size (or "learning rate"). Unfortunately, the descent trajectory is highly sensitive to this value, as may be seen in the animation on the Home page.
The LLR algorithm adaptively updates the step size, by conducting a gradient descent on it, at a cost similar to that of the original algorithm on top of which it is implemented.
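As a rough sketch of the "gradient descent on the step size" idea (our simplified toy version, using the correlation of successive gradients as the meta-gradient signal; the exact LLR update is given in the preprint):

```python
import numpy as np

def sgd_with_adaptive_stepsize(grad, theta0, eta0, meta_rate, n_steps):
    """SGD whose stepsize is itself adapted by a multiplicative gradient step
    on log(eta), driven by the correlation of successive gradients: a positive
    correlation g_t . g_{t-1} suggests eta is too small, a negative one that
    it overshoots. (A toy sketch, not the exact LLR update.)"""
    theta, eta = np.asarray(theta0, float), eta0
    g_prev = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad(theta)
        eta *= np.exp(meta_rate * np.dot(g, g_prev))  # adapt the stepsize
        theta = theta - eta * g                       # usual SGD step
        g_prev = g
    return theta, eta

# quadratic test problem: minimise 0.5 ||theta||^2, starting from a tiny
# stepsize which the meta-update then grows automatically
theta, eta = sgd_with_adaptive_stepsize(
    grad=lambda th: th, theta0=[5.0, -3.0], eta0=1e-4,
    meta_rate=0.01, n_steps=500)
print(np.linalg.norm(theta) < 0.5, eta > 1e-4)  # True True
```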


Adaptive confidence bands in non parametric regression

The article, written with William Meiniel, may be found here, and the arXiv version there.
We prove that adaptive confidence bands exist in the nonparametric fixed design regression model. In the course of the proof, we show that sup-norm adaptive estimators exist as well in regression.