Backpropagation through time (BPTT) may be considered as the earliest learning algorithm for DTRNN. The most commonly used reference for BPTT is the book chapter by Rumelhart et al. (1986), although an earlier description of BPTT may be found in Werbos (1974)'s PhD dissertation (see also Werbos (1990)). The central idea to BPTT is the unfolding of the discrete-time recurrent neural network into a multilayer feedforward neural network (FFNN) each time a sequence is processed. The FFNN has a layer for each ``time step'' in the sequence; each layer has units, that is, as many as there are state units in the original networks. It is as if we are using time to index layers in the FFNN. Next state is implemented by connecting state units in layer and inputs in time to state units in layer . Output units (which are also repeated in each ``time step'' where targets are available) are connected to state units (and input lines when the DTRNN is a Mealy NSM) as in the DTRNN itself.
The resulting FFNN is trained using the standard backpropagation (BP) algorithm, but with one restriction: since layers have been obtained by replicating the DTRNN over and over, weights in all layers should be the same. To achieve this, BPTT updates all equivalent weights using the sum of the gradients obtained for weights in equivalent layers, which may be shown to be the exact gradient of the error function for the DTRNN.
In BPTT, weights can only be updated after a complete forward step and a complete backward step, just as in regular backpropagation. When processing finite sequences, weights are usually updated after a complete presentation of the sequence.
The time complexity of BPTT is one of its most attractive features: for first-order DTRNN in which the number of states is larger than the number of inputs (), the temporal cost of the backward step used to compute the derivatives grows as , that is, the same as the cost of the forward step used to process the sequence and compute the outputs. The main drawback of BPTT is its space complexity, which comes from the need to replicate the DTRNN for each step of the sequence. This also makes it a bit trickier to program than RTRL.
For more details on BPTT the reader is referred to Haykin (1998, 751) and Hertz et al. (1991, 182).