
The Backpropagation Algorithm

The algorithm

1.
propagates inputs forward in the usual way, i.e., each unit computes its output by applying the sigmoid to the weighted sum of its inputs, layer by layer (a minimal code sketch of this forward phase follows the list), and then
2.
propagates the errors backwards by apportioning them to each unit according to the amount of this error the unit is responsible for.
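As a concrete illustration of the forward phase, here is a minimal Python/NumPy sketch; the function and variable names (forward, W_hidden, W_output) are ours, not part of the original notes. Each unit outputs the sigmoid of the weighted sum of its inputs, layer by layer.

import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, W_output):
    # W_hidden has shape (nh, ni) and W_output has shape (no, nh);
    # row j of a weight matrix holds the input weights w_ji of unit j.
    o_hidden = sigmoid(W_hidden @ x)         # hidden-layer outputs o_j
    o_output = sigmoid(W_output @ o_hidden)  # output-layer outputs
    return o_hidden, o_output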
We now derive the stochastic Backpropagation algorithm for the general case. The derivation is simple, but unfortunately the book-keeping is a little messy. Since we update after each training example, we can simplify the notation somewhat by imagining that the training set consists of exactly one example and so the error can simply be denoted by E.

We want to calculate $\frac{\partial E}{\partial w_{ji}}$ for each input weight $w_{ji}$ of each unit $j$ in the network. Note first that since $z_j$ is a function of $w_{ji}$ regardless of where in the network unit $j$ is located,

\begin{eqnarray*}
\frac{\partial E}{\partial w_{ji}} &=& \frac{\partial E}{\partial z_j} \cdot
\frac{\partial z_j}{\partial w_{ji}} \\
&=& \frac{\partial E}{\partial z_j} \, x_{ji}
\end{eqnarray*}

where the last step uses $z_j = \sum_i w_{ji} x_{ji}$, so that $\frac{\partial z_j}{\partial w_{ji}} = x_{ji}$.


Furthermore, $\frac{\partial E}{\partial z_j}$ is the same regardless of which input weight of unit j we are trying to update. So we denote this quantity by $\delta_j$.

Consider the case when $j \in Outputs$. We know

\begin{displaymath}E = \frac{1}{2}\sum_{k \in Outputs} (t_k - \sigma(z_k))^2
\end{displaymath}

Since the outputs of all units $k \ne j$ are independent of $w_{ji}$, we can drop the summation and consider just the contribution to $E$ by unit $j$.

\begin{eqnarray*}
\delta_j = \frac{\partial E}{\partial z_j}
&=& \frac{\partial }{\partial z_j} \, \frac{1}{2} (t_j - \sigma(z_j))^2 \\
&=& -(t_j - o_j)(1-\sigma(z_j))\sigma(z_j)\\
&=& -(t_j - o_j)(1-o_j)o_j\\
\end{eqnarray*}

where we have used $o_j = \sigma(z_j)$ and $\sigma'(z_j) = \sigma(z_j)(1-\sigma(z_j))$.


Thus

 \begin{displaymath}
\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} = -\eta \, \delta_j \, x_{ji}
\end{displaymath} (17)

(The explicit minus sign on the right appears because we defined $\delta_j$ as $\frac{\partial E}{\partial z_j}$ rather than its negative.)
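Read as code, Equation 17 for an output unit is an elementwise computation. A self-contained toy sketch in NumPy, with made-up numbers purely for illustration:

import numpy as np

eta = 0.05
t = np.array([1.0])              # target t_j for a single output unit j
o = np.array([0.73])             # its current output o_j = sigma(z_j)
x = np.array([0.5, 0.2, 1.0])    # inputs x_ji feeding unit j

# delta_j = dE/dz_j = -(t_j - o_j)(1 - o_j) o_j
delta = -(t - o) * (1.0 - o) * o

# Equation 17: Delta w_ji = -eta * delta_j * x_ji, one value per input weight
delta_w = -eta * np.outer(delta, x)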

Now consider the case when $j$ is a hidden unit. As before, we make the following two important observations.

1.
For each unit $k$ downstream from $j$, $z_k$ is a function of $z_j$.
2.
The contribution to the error by all units $l \ne j$ in the same layer as $j$ is independent of $w_{ji}$.
We want to calculate $\frac{\partial E}{\partial w_{ji}}$ for each input weight $w_{ji}$ of each hidden unit $j$. Note that $w_{ji}$ influences just $z_j$, which influences $o_j$, which influences $z_k$ for every $k \in Downstream(j)$, each of which influences $E$. So we can write

\begin{eqnarray*}
\frac{\partial E}{\partial w_{ji}} &=& \sum_{k \in Downstream(j)}
\frac{\partial E}{\partial z_k} \cdot
\frac{\partial z_k}{\partial o_j} \cdot
\frac{\partial o_j}{\partial z_j} \cdot
\frac{\partial z_j}{\partial w_{ji}} \\
&=& \sum_{k \in Downstream(j)}
\frac{\partial E}{\partial z_k} \cdot
\frac{\partial z_k}{\partial o_j} \cdot
\frac{\partial o_j}{\partial z_j} \cdot x_{ji}\\
\end{eqnarray*}


Again note that all the terms except $x_{ji}$ in the above product are the same regardless of which input weight of unit $j$ we are trying to update. As before, we denote this common quantity by $\delta_j$. Also note that $\frac{\partial E}{\partial z_k} = \delta_k$, $\frac{\partial z_k}{\partial o_j} = w_{kj}$, and $\frac{\partial o_j}{\partial z_j} = o_j (1-o_j)$. Substituting,

\begin{eqnarray*}
\delta_j &=& \sum_{k \in Downstream(j)}
\frac{\partial E}{\partial z_k} \cdot
\frac{\partial z_k}{\partial o_j} \cdot
\frac{\partial o_j}{\partial z_j} \\
&=& \sum_{k \in Downstream(j)} \delta_k w_{kj} o_j (1-o_j)\\
\end{eqnarray*}


Thus,

 \begin{displaymath}
\delta_j = o_j (1-o_j) \sum_{k \in Downstream(j)} \delta_k w_{kj}
\end{displaymath} (18)
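In code, Equation 18 is a weighted sum over the downstream deltas. A self-contained toy sketch, again with made-up numbers and illustrative names:

import numpy as np

o_j = 0.6                          # output of hidden unit j
delta_k = np.array([-0.10, 0.05])  # deltas of the units downstream from j
w_kj = np.array([0.30, -0.80])     # weights from j into each downstream unit k

# Equation 18: delta_j = o_j (1 - o_j) * sum_k delta_k * w_kj
delta_j = o_j * (1.0 - o_j) * np.sum(delta_k * w_kj)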

We are now in a position to state the Backpropagation algorithm formally.

Formal statement of the algorithm:

Stochastic Backpropagation(training examples, $\eta$, $n_i$, $n_h$, $n_o$)

Each training example is of the form $\langle \vec{x}, \vec{t} \rangle$, where $\vec{x}$ is the input vector and $\vec{t}$ is the target vector. $\eta$ is the learning rate (e.g., 0.05). $n_i$, $n_h$, and $n_o$ are the number of input, hidden, and output nodes respectively. The input from unit $i$ to unit $j$ is denoted $x_{ji}$ and its weight is denoted by $w_{ji}$.
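A minimal, self-contained Python sketch of the procedure for a single hidden layer of sigmoid units, implementing Equations 17 and 18 as derived above, might look as follows; all names (backprop_train, W_hidden, W_output, and so on) are illustrative rather than part of the original notes.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(examples, eta, ni, nh, no, epochs=1000):
    # examples: list of (x, t) pairs, x of length ni and t of length no.
    # eta: learning rate; ni, nh, no: number of input, hidden, output units.
    rng = np.random.default_rng(0)
    # Initialize all weights to small random values.
    W_hidden = rng.uniform(-0.05, 0.05, size=(nh, ni))
    W_output = rng.uniform(-0.05, 0.05, size=(no, nh))

    for _ in range(epochs):
        for x, t in examples:          # stochastic: update after each example
            x = np.asarray(x, dtype=float)
            t = np.asarray(t, dtype=float)

            # Forward phase: sigmoid of the weighted sum, layer by layer.
            o_hidden = sigmoid(W_hidden @ x)
            o_output = sigmoid(W_output @ o_hidden)

            # Backward phase: delta_j = dE/dz_j, as derived above.
            delta_out = -(t - o_output) * (1.0 - o_output) * o_output
            delta_hidden = o_hidden * (1.0 - o_hidden) * (W_output.T @ delta_out)

            # Weight updates, Equation 17: Delta w_ji = -eta * delta_j * x_ji.
            W_output -= eta * np.outer(delta_out, o_hidden)
            W_hidden -= eta * np.outer(delta_hidden, x)

    return W_hidden, W_output

Bias terms are omitted in this sketch; they are conventionally handled by appending a constant input $x_0 = 1$ to every unit, with its weight $w_{j0}$ learned like any other.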


Anand Venkataraman
1999-09-16