- ...Motivation
^{1.1} - A slightly edited version of the original preface
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ... trained
^{2.1} - Of course, in reasonable time, as Marvin Minsky
remarks in his foreword to the book by
Siu et al. (1995).
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ... S.C.
^{3.1} - Available as a PDF file at
`http://www.dlsi.ua.es/~mlf/nnafmc/papers/kleene56representation.pdf`, retyped by Juan Antonio Pérez-Ortiz. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ...
.
^{3.2} - Some of these architectures
use sigmoid units whose output is
real and continuous instead of TLU whose output is discrete; for
details, see section 3.2.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ...
measured.
^{4.1} - Other authors (Stiles and Ghosh, 1997; Stiles et al., 1997) prefer
to see sequences as functions that take an integer and return a
value in .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ... classes:
^{4.2} - This classification
is inspired in the one given by Hertz et al. (1991, 177).
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ... DTRNN,
^{4.3} - Defined in
section 3.2.1.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ...state!units;
^{4.4} - All state units are assumed to be of the same kind.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ...
^{4.5} - Unlike in
eqs. (3.1-3.3), bold lettering is used here to emphasize the vectorial nature of
states, inputs, outputs, and next-state and output functions.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ... network!second-order
^{4.6} *Second-order*neural networks operate on products of*two*inputs, each input taken from a different subset; first-order networks instead, operate on raw inputs.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ... function
^{4.7} - Also called
*transfer function*,*gain function*and*squashing function*(Hertz et al., 1991, 4).. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ...clouse94t)
^{4.8} - The class of
finite-state recognizers representable in TDNN is that of
*definite-memory machines*(DMM), that is, finite-state machines whose state is observable because it may be determined by inspecting a finite window of past inputs (Kohavi, 1978). When the state is observable but has to be determined by inspecting both a finite window of past inputs and a finite window of past outputs, we have a*finite-memory machine*(FMM); the neural equivalent of FMM is therefore the NARX network just mentioned.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ... perform
^{4.9} - Although
the choice of a particular encoding scheme or a particular kind of
preprocessing --such as e.g. a normalization-- may indeed affect
the performance of the network.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ... parameter
^{4.10} - Learning the initial state is surprisingly not
too common in DTRNN literature, because it seems rather
straightforward to do so; we may cite the papers by
Bulsari and Saxén (1995) Forcada and Carrasco (1995), and
Blair and Pollack (1997).
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ...
once.
^{4.11} - For a detailed discussion of gradient-based learning
algorithms for DTRNN and their modes of application, the reader is
referred to an excellent survey by Williams and Zipser (1995), whose emphasis
is on continuously running DTRNN.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ...
zero.
^{4.12} - Derivatives with respect to the components of the
initial state may also be easily computed
(Forcada and Carrasco, 1995; Bulsari and Saxén, 1995; Blair and Pollack, 1997), by initializing them accordingly (that is,
).
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ... valence.
^{4.13} - The
*valence*(also*arity*) of a tree is the maximum number of daughters that can be found in any of its nodes.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ...
daughters)
^{4.14} - Stolcke and Wu (1992) added a special
unit to the state pattern which is
an indicator showing whether the pattern is a terminal
representation.
Another possibility would be to add a special output neuron.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ....
^{5.1} - Here, an abbreviated
notation for the concatenation of
languages,
, is used.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ...symbol!useless).
^{5.2} - This occurs also when a
string containing the variable may be derived in more than one step
from another string containing it, that is, when recursion is
indirect.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ... DTRNN.
^{5.3} - As usual, ,
with
, denotes the set of all
functions
such that, for any
there exists a positive constant such that, for all
, (upper bound). Similarly, a lower bound may
be defined; denotes the set of all functions
such that, for any there
exists a positive constant such that, for all ,
.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ... states,
^{5.4} - , with
, is the set of all functions
belonging both to and
(see the preceding footnote).
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ... nondenumerable
^{5.5} - A set is
*denumerable*(also*countable*) when there is a way to assign a unique natural number to each of the members of the set. Some infinite sets are nondenumerable (or*uncountable*); for example, real numbers form a nondenumerable set.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ...
0).
^{5.6} - In the same spirit,
Carrasco et al. (2000) have recently proposed
similar encodings for general Mealy and Moore machines on various first- and second-order
DTRNN architectures.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ...
nondeterministic
^{5.7} - Nondeterminism does not add any power to
Turing machines.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ...salomaa73b)
^{5.8} - These machines are called
*linearly-bounded automata*because a*deterministic*TM having its tape bounded by an amount which is a linear function of the length of the input would have the same computational power as them.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ...
computation
^{5.9} - Nonuniform TMs are
unrealizable because the length of is not bounded, and thus, the
number of possible advice strings is infinite and cannot be stored
in finite memory previous to any computation.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ... task)
^{6.1} - For some recognition
tasks, positive samples may be enough.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ... 0.3
^{6.2} - This is
presumably because in each state, and for the language studied by
Cleeremans et al. (1989), there is a maximum of two possible successors,
and 0.3 is a safe threshold below 0.5.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

- ...parsing!shift-reduce.
^{6.3} - Also known as LR,
*Left-to-right, Rightmost-derivation*parsing (Hopcroft and Ullman, 1979, 248).. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .