Consider first the unthresholded perceptron. Its output $o$ is
given by
$$o = \vec{w} \cdot \vec{x} = \sum_i w_i x_i$$
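We can express the training error $E$ as a function of the weight vector $\vec{w}$
as follows:
$$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
where $D$ is the set of training examples, $t_d$ is the target output for example $d$, and $o_d$ is the output the perceptron produces for it.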
The gradient of this surface,
$\nabla E(\vec{w})$,
then specifies a vector in weight space
in whose direction we obtain the greatest increase in $E$.
So we obtain the greatest decrease by moving in the direction opposite
to that of
$\nabla E(\vec{w})$.
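The direction argument above can be checked numerically. The following is a minimal sketch, assuming the usual squared-error definition of E (half the sum of squared differences between target and output) and a hypothetical three-example data set. The gradient is estimated by finite differences so that no analytic formula is presupposed; all function names are illustrative.

```python
def output(w, x):
    """Unthresholded perceptron: weighted sum of the inputs."""
    return sum(wi * xi for wi, xi in zip(w, x))

def training_error(w, examples):
    """Assumed error: E(w) = 1/2 * sum of (target - output)^2."""
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in examples)

def numerical_gradient(w, examples, h=1e-6):
    """Central-difference estimate of each partial derivative dE/dw_i."""
    grad = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + h] + w[i + 1:]
        down = w[:i] + [w[i] - h] + w[i + 1:]
        diff = training_error(up, examples) - training_error(down, examples)
        grad.append(diff / (2 * h))
    return grad

# Hypothetical toy data as (inputs, target); x[0] = 1 acts as a bias input.
examples = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
w = [0.0, 0.0]
grad = numerical_gradient(w, examples)

step = 0.01  # small step size chosen for illustration
uphill = [wi + step * gi for wi, gi in zip(w, grad)]
downhill = [wi - step * gi for wi, gi in zip(w, grad)]

e_here = training_error(w, examples)
e_up = training_error(uphill, examples)      # step along the gradient
e_down = training_error(downhill, examples)  # step against the gradient
print(e_down, e_here, e_up)
```

For this data, a small step along the estimated gradient raises E and a step against it lowers E, which is the behaviour gradient descent relies on.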
The following update rule is therefore used to train the net by
gradient descent:
$$\vec{w} \leftarrow \vec{w} - \eta \nabla E(\vec{w})$$
where again,
$\eta$
is a positive constant called the learning rate.
If
$\eta$
is sufficiently small, the system will not overstep a
minimum and can be guaranteed to settle into one (albeit a local one).
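The effect of the learning rate can be seen in a small experiment. This is a sketch under stated assumptions, not a definitive implementation: it assumes the squared-error definition of E, for which the partial derivative with respect to weight i is the negated sum over examples of (target - output) times input i, and it uses a hypothetical data set generated by t = 1 + 2x. The function names and the two learning rates are chosen for illustration only.

```python
def output(w, x):
    """Unthresholded perceptron: weighted sum of the inputs."""
    return sum(wi * xi for wi, xi in zip(w, x))

def training_error(w, examples):
    """Assumed error: E(w) = 1/2 * sum of (target - output)^2."""
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in examples)

def gradient(w, examples):
    """Analytic gradient of the assumed E: dE/dw_i = -sum (t - o) * x_i."""
    grad = [0.0] * len(w)
    for x, t in examples:
        err = t - output(w, x)
        for i, xi in enumerate(x):
            grad[i] -= err * xi
    return grad

def gradient_descent(examples, eta, steps):
    """Repeatedly move the weights against the gradient, scaled by eta."""
    w = [0.0] * len(examples[0][0])
    errors = [training_error(w, examples)]
    for _ in range(steps):
        grad = gradient(w, examples)
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
        errors.append(training_error(w, examples))
    return w, errors

# Hypothetical toy data as (inputs, target); x[0] = 1 acts as a bias input.
examples = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]

# A small learning rate settles into the minimum.
w_small, errs_small = gradient_descent(examples, eta=0.05, steps=2000)
# A learning rate that is too large oversteps it and the error grows.
w_large, errs_large = gradient_descent(examples, eta=0.5, steps=20)

print(w_small, errs_small[-1])
print(errs_large[0], errs_large[-1])
```

With the smaller rate the weights approach the generating values 1 and 2 and the error falls towards zero; with the larger rate each step overshoots and the error increases.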
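The update rule can also be written componentwise as:
$$w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}$$
where
$\frac{\partial E}{\partial w_i}$
is easy to calculate. Since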