
Gradient Descent in ANNs

Consider first the unthresholded perceptron. Its output $o$ is given by

\begin{displaymath}o(\vec{x}) = \vec{w}\cdot\vec{x}
\end{displaymath}

We can express the training error E as a function of $\vec{w}$ as follows:

\begin{displaymath}E(\vec{w}) = \frac{1}{2}\sum_{d \in D} (t_d - o_d)^2
\end{displaymath}

where $t_d$ and $o_d$ are the target and output values for training example $\vec{x}_d$.
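As a minimal sketch of the two definitions above, the following Python computes the output of an unthresholded perceptron and its training error. The function names and the three toy training examples are assumptions made up for illustration; they are not part of the original notes.

```python
def output(w, x):
    """Unthresholded perceptron: o(x) = w . x (a plain dot product)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def training_error(w, examples):
    """E(w) = 1/2 * sum over d of (t_d - o_d)^2.

    `examples` is a list of (x_d, t_d) pairs.
    """
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in examples)


# Hypothetical toy data: the targets were generated with w = (2, -1).
examples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]

print(training_error((2.0, -1.0), examples))  # 0.0: these weights fit every example
print(training_error((0.0, 0.0), examples))   # 3.0: 0.5 * (4 + 1 + 1)
```

The error is zero only for weights that reproduce every target exactly, and grows quadratically as the outputs move away from the targets.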

The gradient of this surface, $\nabla E$, then specifies a vector in whose direction we can obtain the greatest increase in E.

So we can get the greatest decrease in E by moving in the opposite direction, that is, along $-\nabla E$.

The following update rule is therefore used to train the net using gradient descent:

\begin{displaymath}\vec{w} \leftarrow \vec{w} -\eta \nabla E(\vec{w})
\end{displaymath}

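where again, $\eta$ is a positive constant called the learning rate. If $\eta$ is sufficiently small, the system will not overstep a minimum and can be guaranteed to settle into one. In general this is only a local minimum, but for the unthresholded perceptron E is quadratic in $\vec{w}$, so the minimum reached is also a global one.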
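The role of the learning rate can be illustrated with a small sketch. It assumes a single weight, a single made-up training example with $x = 1$ and $t = 3$, and hence $E(w) = \frac{1}{2}(3 - w)^2$; the helper name `gradient_descent` and the specific values of $\eta$ are hypothetical choices for this illustration only.

```python
def gradient_descent(grad, w0, eta, steps):
    """Repeatedly apply w <- w - eta * grad(w), starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w


# For E(w) = 1/2 * (3 - w)^2 the gradient is dE/dw = -(3 - w) = w - 3,
# and the minimum is at w = 3.
def grad(w):
    return w - 3.0


small = gradient_descent(grad, w0=0.0, eta=0.1, steps=200)
large = gradient_descent(grad, w0=0.0, eta=2.5, steps=20)

print(abs(small - 3.0) < 1e-6)  # True: a small eta settles into the minimum
print(abs(large - 3.0) > 1e3)   # True: too large an eta oversteps and diverges
```

In this one-dimensional case each step multiplies the distance to the minimum by $(1 - \eta)$, so the iteration shrinks that distance only when $0 < \eta < 2$. The threshold depends on the data in general; the value 2 is specific to this toy example.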

We can also write the above as:

\begin{displaymath}w_i \leftarrow w_i + \Delta w_i\end{displaymath}

where $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$

$\frac{\partial E}{\partial w_i}$ is easy to calculate. Since

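\begin{eqnarray*}
E &=& \frac{1}{2}\sum_{d \in D} (t_d - \vec{w}\cdot \vec{x}_d)^2\\
\frac{\partial E}{\partial w_i} &=& \sum_{d \in D} (t_d - o_d)\,\frac{\partial}{\partial w_i}(t_d - \vec{w}\cdot \vec{x}_d)\\
&=& -\sum_{d \in D} (t_d - o_d)\, x_{id}\\
\mbox{So,}\\
\Delta w_i &=& \eta \sum_{d \in D} (t_d - o_d)\, x_{id}
\end{eqnarray*}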
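The final expression for $\Delta w_i$ translates almost directly into code. The sketch below is one possible batch implementation, not a definitive one: the function names, the toy data, the zero initial weights, and the defaults `eta=0.1` and `epochs=500` are all assumptions chosen for illustration.

```python
def delta_w(w, examples, eta):
    """Delta w_i = eta * sum over d of (t_d - o_d) * x_id."""
    dw = [0.0] * len(w)
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(w, x))  # o_d = w . x_d
        for i, xi in enumerate(x):
            dw[i] += eta * (t - o) * xi
    return dw


def train(examples, n_weights, eta=0.1, epochs=500):
    """Batch gradient descent: apply w_i <- w_i + Delta w_i once per epoch."""
    w = [0.0] * n_weights
    for _ in range(epochs):
        dw = delta_w(w, examples, eta)
        w = [wi + dwi for wi, dwi in zip(w, dw)]
    return w


# Hypothetical toy data: the targets were generated with w = (2, -1).
examples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]

w = train(examples, n_weights=2)
print([round(wi, 4) for wi in w])  # [2.0, -1.0]
```

Each weight change is accumulated over all of the training examples before any weight is updated, which matches the sum over $d \in D$ in the formula.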



Anand Venkataraman
1999-09-16