** Next:** Real-time recurrent learning
** Up:** Gradient-based algorithms
** Previous:** Gradient-based algorithms
** Contents**
** Index**

###

Backpropagation through
time

Backpropagation through time (BPTT) may be considered as the earliest
learning algorithm for DTRNN. The most commonly used reference for
BPTT is the book chapter by Rumelhart et al. (1986), although an earlier
description of BPTT may be found in
Werbos (1974)'s PhD
dissertation (see also Werbos (1990)). The central idea to
BPTT is the *unfolding* of the discrete-time recurrent neural
network
into a multilayer feedforward neural network (FFNN)
each time a sequence is processed. The FFNN has a layer for each ``time step'' in the sequence; each layer has
units, that is, as many as there are state units in the original networks. It is as if we are using time to
index layers in the FFNN. Next state is implemented by connecting
state units in layer and inputs in time to state units in
layer . Output units (which are also repeated in each ``time step''
where targets are available) are connected to state units (and input
lines when the DTRNN is a Mealy NSM) as in the DTRNN itself.

The resulting FFNN is trained using the standard
backpropagation (BP) algorithm, but with one
restriction: since layers have been obtained by
replicating the DTRNN over and over, weights in all layers should be the same. To achieve this, BPTT
updates all equivalent weights using the sum of the gradients obtained for weights in
equivalent layers, which may be shown to be the exact gradient of the
error function for the DTRNN.

In BPTT, weights can only be updated after a complete forward step and
a complete backward step, just as in regular
backpropagation. When processing finite
sequences, weights are usually updated after a complete presentation
of the sequence.

The time complexity of BPTT is one
of its most attractive features: for first-order
DTRNN in which the number of states is larger than
the number of inputs (), the temporal cost of the backward
step used to compute the derivatives grows
as , that is, the same as the cost of the forward step used to
process the sequence and compute the outputs. The main drawback of
BPTT is its space complexity,
which comes from the need to replicate the DTRNN for each step of the
sequence. This also makes it a bit trickier to program than RTRL.

For more details on BPTT the reader is referred to
Haykin (1998, 751) and Hertz et al. (1991, 182).

** Next:** Real-time recurrent learning
** Up:** Gradient-based algorithms
** Previous:** Gradient-based algorithms
** Contents**
** Index**
Debian User
2002-01-21