2 Some Basic Principles of Neural Computation

A simplified neuron

The central building block of a network is the neuron. Neurons link up with other neurons to form a web of connections called a neural network. The neuron can be dormant or active. Active neurons communicate their states to other neurons through this web of connections such that other neurons may become active (if their connections are excitatory) or dormant (if their connections are inhibitory). We can visualise an idealised neuron as possessing a cell body and multiple input and output connections as shown in Figure 2.1. Each input line or connection represents the flow of activity from some other neuron or from some external source (such as light falling on some photosensitive retinal cell). For most neurons, the cell body performs the simple operation of summing the net amount of activity to the neuron ($\Sigma$). If the net activity exceeds some threshold ($\theta$) then the neuron becomes active and this activity is passed forward along the output connection lines. Thus, each neuron has:

A procedure for determining the net input to the neuron ($\Sigma$).
An activation function for determining the activity of the neuron. In Figure 2.1, the activation function is the threshold function ($\theta$).
A pattern of connectivity indicating to which other neurons the cell body is connected.

It is worth emphasising that the artificial neuron typically used in connectionist modelling is a highly idealised form of its biologically inspired counterpart. For example, the neuron in Figure 2.1 makes no distinction between the electrical conduction of activity along its connecting elements (the axon and dendrites) and the chemical transmission of these impulses across the synaptic gaps. Neurotransmitters (such as epinephron) are known to modulate the effect of impulses at synaptic junctions. Furthermore, synaptic junctions do not just occur at the dendronal and axonal tips of a neuron. Projections to all parts of the neuronal surface have been observed. The firing of a neuron need not be an all or none affair. Graded electrical activity can influence the behaviour of a neuron, in addition to the aaction potentials that occur when a cell reaches its threshhold firing level. Furthermore, the idealised neuron does not habituate in the presence of continued excitation. All of these characteristics of biological neurons are likely to add to the complexity of their computation properties. It is important to remember, therefore, that the neurons used in our network models are not exact replicas of their biologically inspired counterparts. It is a matter for current research whether the simplifications made by modellers have crucial implications for the observed properties of the networks that they built. For current purposes, we make the simplifying assumption that the building blocks of networks are neurons that respond differentially to varying levels of activity and that neuronal responses are communicated as single pulses of activity to the other neurons with which the source neuron is directly connected.

Neural networks

two layer network — Figure 2.2: A layered neural network

A neural network consists of a collection of neurons. The network architecture defines a pattern of connectivity indicating how neurons are connected to each other. As we have seen, there are many architectural types. However, it is quite often the case that neurons are connected to each other in a layered fashion. For example, consider the neural network depicted in Figure 2.2. It consists of four neurons organised into two layers. In this example, the network consists simply of an input layer and an output layer. Within the input layer, all the neurons have connections which project to the output layer. There are no connections between neurons within a layer (no intra-level connections). Furthermore, in this architecture the neurons do not possess recurrent connections i.e., they do not have connections which project back to themselves. Similarly, the neurons in the output layer do not connect to each other, nor do they have connections that project back to the neurons in the input layer. Thus, in this network, the flow of activity is in one direction only—from the input layer to the output layer. This type of network is called a feedforward network. In contrast, recurrent networks may possess both intra- and inter-level connections, as well as feedback connections from one level to an earlier level.

Notice that the input neurons in Figure 2.2 have only a single connection projecting into them. Similarly, the output neurons have only a single connection projecting out from them. Again, this portrayal is most likely to be a gross simplification in comparison to biological neural networks. If we consider a simple neural network like that in Figure 2.2 to model a neural module in the brain, it is likely to receive inputs from multiple sources and send outputs to multiple destinations. Of course, there will be some biological neurons that receive only a single input. For example, retinal photoreceptors might be thought as neurons with just a single input—in this case, the light source that fires the neuron. More generally though it is appropriate to think of the single input to an input neuron in Figure 2.2 to summarise the input from multiple sources and likewise the single output from an output neuron as summarising the output to multiple destinations.

We can now begin to consider just how the neural network performs its task. First, assume that each input neuron has a certain level of activity associated with it. How does the activity of the input neurons influence the output neurons? To simplify the explanation, let’s consider the process from the point of view of just one output neuron. Let’s take the left-hand output neuron shown in Figure 2.2. This is highlighted in Figure 2.3. The input neurons have activities $a_1$ and $a_2$, respectively. How is the activity of the output neuron determined? One of the computations that the output neuron performs is to calculate its net input from other neurons. The output neuron in Figure 2.3 receives input from two input neurons. These two input neurons communicate with the output neurons via independent connections. These connections can vary in their strength or weight. The weight of a connection between two neurons determines how strongly the activity of the source neuron influences the target neuron. Thus, a weakly active neuron can still influence the activity of a target neuron if the connection is strong, whilst a moderately active neuron may have only a small effect on the target neuron if the connection between them is weak. Weights can be strong or weak. They can also be positive or negative. Positive weights between neurons result in the source neuron having an excitatory effect on the target neuron while negative weights result in inhibitory effects. The effect of $a_1$ on the output neuron is modified by the strength of the connection to the output neuron such that the input along the connection line $w_{31}$ is just the product of the activity of the source neuron $a_1$ and the weight itself $w_{31}$, i.e., $w_{31}a_{1}$. Similary, the activity of the source neuron $a_2$ on the target neuron is given by the expression $w_{32}a_{2}$. Thus the total input to the neuron is simply given by the sum of these two products, i.e., $w_{31}a_{1} + w_{32}a_{2}$.

The net input to the target neuron by itself does not determine the activity of the output neuron. It is also necessary to consider the activation function of the neuron. Artificial neurons, like their biological counterparts vary in their activation functions. For example, a threshold activation function stipulates that the neuron will fire if the net input exceeds a certain threshold value, otherwise it remains dormant. Examples of three different types of activation function are shown in Figure 2-4.

activation functions — Figure 2.4: A single output neuron

In biological terms, the firing of a neuron indicates that its action potential has been reached. A sequence of electrical spikes are transmitted along the neural axon to synapses with other neurons. The resulting activity of the neurotransmitters in the synaptic gaps may then lead to the spread of activation to other neurons. In the simplified neuron discussed here there is no temporal sequence of electrical spikes, just a single value corresponding to the activation of the neuron. This has the disadvantage of precluding the use of any information embedded in the chain of spikes of the biological neuron (e.g. rate of spiking) that might indicate habituation, etc. For activation functions with a graded response, such as the squashing function in Figure 2.4, the intensity of the response can be interpreted as an analogue of the rate of spiking.

We have now considered how a neural network can compute an output given an input. The important factors involve input activation values, weight strengths and (activation functions. It is a simple step to extend these computations from a single output neuron to multiple output neurons portrayed in the network depicted in Figure 2.2. The second output neuron is subject to the same dynamics as the first output neuron . Provided the input activations and the value of the weights connecting the input neurons to the second output neuron are known, it is possible to calculate its activation value.

The Mapping Principle

3x3 network — Figure 2.5: A 3x3 feedforward network

It is appropriate to think of a simple feedforward network as performing a mapping function. For example, consider the simple network depicted in Figure 2.5. There are three input neurons and three output neurons. Let us assume that each input neuron can take on a range of activation values (say 0.0 from up to 1.0). The three input activations thus represents a pattern of activity at the input level. Similarly, the three output neurons provide a pattern of activity at the output level. Since the value of the connections (together with the activation functions of the neurons) determine the output for any given input, we can consider the network as translating (or mapping) an input signal to an output signal. Alternatively, we can consider the pattern of activation across the input units as identifying a single point in three dimensional space—the level of activation for each unit defining the coordinate of a single dimension. This conceptualisation is depicted in Figure 2.6. The arrow V is a vector pointing to the position in the three dimensional space which corresponds to the activation state at the input level where x, y and z represent the activation level on the individual input neurons.

Normally, an network can be used to map many inputs to their respective outputs. We can thus think of the network as mapping a space of inputs to a space of outputs. This is depicted in Figure 2.7. For each point in the source areas S (the input activation state) there is a corresponding point in the target area T (the output activation state). The network performs the function F of mapping from S to T. This relation can be captured simply in mathematical form as in Equation 2.1: \[ T = F(S) \tag{2.1}\] For some networks, the function $F$ will be a well defined mathematical function such as $\log(S)$. In general, however, it will be difficult to state explicitly in a simple form the mathematical relation between the target and the source spaces.

ST Space — Figure 2.7: A 3D vector space

In the current example, which derives from the network in Figure 2.5, each point in $S$ corresponds to a point in three dimensional space. Typically, however, the vector of input activations (or output activations) will have a higher dimensionality than three — there will be more than three input neurons. Thus, each point in $S$ will represent a point in a multi-dimensional space. We will call this space a hyperspace. In general, then, a feedforward network maps from one hyperspace to another. The dimensionality of each hyperspace will differ if the number of input neurons is different to the number of output neurons. The matrix of connections in the network determines how one hyperspace (say $S$) maps to the target hyperspace $T$.

Hebbian Learning

Feedforward networks perform a mapping from an input to an output. The mapping is determined by the weights of the connections between the input and the output neurons. It is possible to train a network to produce different outputs in response to distinct input patterns. Training involves adjusting the connections (or weights) in the network by using a learning algorithm. One of the simplest (and yet most effective) learning algorithms used in neural network research is the Hebbian learning rule — named for the neuropsychologist Donald Hebb. Hebb suggested that:

‘’When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth processor metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.’’ (see Hebb 1949, chap. 4)

Hebb, Donald O. 1949. Organization of Behavior: A Neuropsychological Theory. New York: Wiley.

Figure 2.8: Hebbian learning in a simple network

We can capture the spirit of this suggestion by considering a very simple neural network. Consider the network depicted in Figure 2.8. Two neurons are connected by a single weight $w$. Now assume that both neurons are simultaneously activated by an outside source such that both are turned on. Let us call the activation values of the two neurons $a_{in}$ and $a_{out}$. According to Hebb’s suggestion, any axon (in our example, $w$) connecting the two neurons will undergo change. One way to determine the change that occurs is to require that the change in the weight $w$ is proportional to the product of the activations of the two neurons. Thus, if either of the two neurons are dormant (activation zero) no change will occur in the weight. The Hebbian learning rule, in its simplest form, can be written as in Equation 2.2:

\[\Delta w = \epsilon a_{in} a_{out} \tag{2.2}\]

The term $\Delta w$ stands for the change in the weight that is to be made as a result of calculating the product of the two activation values $a_{in}$ and $a_{out}$. The term $\epsilon$ is just a constant that determines the rate at which the weight is allowed to change. This constant $\epsilon$ is often referred to as the learning rate parameter. Thus, if the activation values of the two neurons are $0.75$ and $0.5$ respectively and the learning rate is $0.5$ then the change in the weight $\Delta w = 0.5 (0.75) (0.5) = 0.1875$. If the initial value of $w = 0.0$ then the new value of $w = 0.1875$. Each learning trial can result in a change in the weight. If there is a tendency for both neurons to be active at the same time, then the value of $w$ will gradually increase such that activity on just one of the neurons will result in the other becoming active. The network has thereby learnt a correlation between the pattern of activity on the two neurons over time. This principle of learning can be extended to networks with any numbers of neurons. In each case, connections between neurons will adapt gradually to reflect the correlation of activity between individual neuron pairs.

Principle of Superposition

There are many ways to represent the connectivity in a neural network. One of the more popular is to use a matrix format. For example, consider the matrix in Table 2.1. The rows and columns in the matrix represent a feedforward network with four input units and four output units — hence 16 connection weights. The circles along the top of Table 2.1 indicate a pattern of input activations presented to the four input units. The circles down the right of the matrix represent the resulting activity on the four output units. The output activations are easy to calculate. Just multiply the first input activation value (1) by the value of the weight from the first input neuron to the first output neuron (the first column, first row i.e.,( -0.25)). Then multiply the second input activation value (-1) by the weight from the second input neuron (the second column, first row i.e., (+0.25)) and so on until you have the total net input for the first output unit. This should add up to -1. Try it!

The net input to the second output neuron is calculated by using the input activation values together with the weight values in the second row of the matrix. And so on. Thus, the row of the matrix indexes the target neuron whilst the column of the matrix indexes the source neuron. You can now see the motivation behind the formalism used to identify a given weight in network. Here, the subscript indexes the row of the matrix (the target neuron) while the subscript indexes the column of the matrix (the source neuron). Verify the output activation values for the input activations and connection weights in Table 2.1.

Now consider a network with four input and four output neurons as shown in Table 2.2. Notice that the input activations and the weight matrix is different to that shown in Table 2.1. Again, verify that the mapping from input to output is indeed as shown. Note that the output activations are also different to those in Table 2.1.

Your next task is to construct a composite network of the two networks you have just considered. Do this simply by superimposing the value of the corresponding weights in each network upon each other. Thus, take the weight in the first row, first column of Table 2.1 (-0.25) and add it to the weight in the first row, first column of Table 2.2 (-0.25). Remember this is the connection that goes from the first input neuron to the first output neuron. The resulting connection for the composite network at this position is, therefore, 0.0. The complete composite network with all the weight values superimposed upon each other is shown in Table 2.3.

There is a remarkable fact about this composite network: The weight matrix preserves the mapping properties of the two individual weight matrices. You can see this for yourself by taking each of the input patterns from Table 2.1 and Table 2.2 and multiplying them through the weight matrix in Table 2.3. You will find that the composite network produces exactly the right output patterns for a given input pattern.

The capacity to superimpose different mapping functions within a network and yet maintain the functionality of the individual mappings is an important characteristic of neural networks. However, this capacity is not unconstrained. As you might expect, changing the weights in a network can interfere with the mapping characteristics of the network. We will see later that the principle of superposition has important implications for our understanding of representation and learning in these systems.

Training networks with an error signal

The Orthogonality Constraint

Hebbian learning provides an effective learning algorithm for discovering the correlations of activity between two sets of neurons. However, there is a limit to the number of pairs of independent input/output pairs that a Hebbian network can represent without interference due to superposition effects. Specifically, a network trained with the Hebbian learning rule will be able to map a set of input patterns to any arbitrarily selected set of output patterns, just so long as the input patterns are orthogonal to each other. In a network with two input units, there can be at most two input patterns which are orthogonal to each other. A third input pattern will always share some features with at least one of the other two patterns. Similarly, in a network with three input units there can be at most three input patterns which are simultaneously orthogonal to each other. Since the number of input units determines the dimensionality of the vector representing the input pattern, it follows that the maximum number of patterns that any network can manage to map to arbitrarily determined outputs using Hebbian learning is just the same as the number of input neurons.

The orthogonality constraint does not mean, of course, that a network trained using the Hebbian learning rule cannot be trained on many more input patterns than there are input neurons. However, it does mean that interference effects will arise from the competition between the input/output pairs for the network’s representational resources. In some cases, these interference effects are beneficial as they lead to generalisation between similar input patterns. In other cases, interference may be deemed a bad thing and more powerful learning algorithms are required. One such algorithm is the Perceptron Convergence Rule.

Perceptron Convergence Rule

In Hebbian learning the two neurons are clamped at their appropriate activations whilst weight adjustments take place. Another approach is to let the network determine the output given an input and then compare the output to a desired teacher signal. Any discrepancy between the actual and desired outputs is used to determine the adjustment to the weights in the network. Consider the simple network shown in Figure 2.9.

Figure 2.9: Perceptron learning in a simple network

Two neurons are connected by a single weight of strength $w$. The input activation is $a_{in} so the net input to the output neuron is given by Equation 2.3:

\[netinput = w a_{in} \tag{2.3}\]

Let us also assume that the output neuron in a linear threshold unit, i.e., it will fire if and only if the net input exceeds a threshold value. This activation function is summarised in Equation 2.4 where $\theta$ is the threshold level:

\[ a_{out} = \left\{ \begin{array}{ll} 1 & \mbox{if $netinput > \theta$} \\ 0 & \mbox{otherwise} \end{array} \right. \tag{2.4}\]

Let us also assume that the desired output from the network is $t_{out}$. The Perceptron Convergence Rule (Rosenblatt 1958) attempts to reduce any discrepancy between the actual output and the desired output by adjusting both the threshold value of the output neuron and the connection from the input neuron to the output neuron. If the output is too small then the threshold value is decreased and the connection is increased. If the output is too big then the output threshold is increased and the connection decreased. These adaptations are summarised in Equation 2.5 and Equation 2.6 where is a measure of the discrepancy between the actual and desired output and are the adjustments to the output neuron threshold and weight respectively:

Rosenblatt, Frank. 1958. “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review 65 (6): 386.

\[\Delta\theta = -\epsilon(t_{out} - a_{out}) = - \epsilon\delta \tag{2.5}\] \[ \Delta w = (t_{out} - a_{out}) \epsilon a_{in} = \epsilon\delta a_{in} \tag{2.6}\]

$\epsilon$ is a constant we shall call the learning rate. Notice that the threshold of the output neuron will always change in some direction if the actual output is incorrect. On the other hand, a change in the connection will occur only if the input neuron is active. Thus, if $a_{in} = 0$, then $\Delta w =0$. Equation 2.6 provides a means of apportioning blame to the input neuron for the output error. If an input neuron is dormant then no blame is apportioned to the connections coming out of that neuron.

Consider now the application of the Perceptron Convergence Rule to the problem Boolean Or. This problem requires a neural network with two input units and a single output unit as depicted in Figure 2.10.

The mapping problem that the network has to learn is also shown in Figure 2.10. Thus, it has to map 4 distinct input patterns to 2 distinct output patterns. It is useful to think of Boolean OR as a classification problem — the four input patterns are classified as belonging to either of two categories. Let us set the initial values of $\theta = 1.0, w_{20} = 0.2$ and $w_{21} = 0.1$. Table 2.4 shows the changes of the connections and threshold as the Perceptron Convergence Rule is applied after presentation of an input pattern. In this example, the learning rate $\epsilon = 0.5$. After 4 cycles of training, the network has solved the problem. Work through the calculations and verify that the weight and threshold updates are correct.

Table 2.4: Training the network on Boolean OR

The Perceptron Convergence Rule is a powerful learning algorithm. In fact, it guarantees to find a suitable configuration of weights and thresholds for a problem, provided such a configuration exists. This guarantee is important and can be proved mathematically (Minsky and Papert 1969), though we shall not attempt to do so here. Later, we shall return to describing the conditions under which the Perceptron Convergence Rule succeeds and fails to find the right network configuration for a problem.

Minsky, M., and S. A. Papert. 1969. “Perceptrons.” Cambridge, MA: MIT Press 6 (318-362): 7.

Gradient Descent

Another technique for adjusting the weights in a network is called least-mean-square (LMS) learning procedure (Widrow and Hoff 1988). Again, consider a simple network with a single weight connecting an input neuron and an output neuron as in Figure 2.9. Let us use the same notation to denote the actual output from the network as $a_{out}$ and the desired output for a given input as $t_{out}$. We can now define an error score E which is the square of the actual error $\delta$ which we calculated before in Equation 2.5. $E$ is defined in Equation 2.7 where $\Sigma_{p}$ indexes the individual patterns:

Widrow, Bernard, and Marcian E Hoff. 1988. “Adaptive Switching Circuits.” In Neurocomputing: Foundations of Research, 123–34.

\[ E = \Sigma_{p} (t_{out} - a_{out})^2 \tag{2.7}\]

For any input pattern, there will an output pattern determined by the value of the weight in Figure 2.9. For each output, there will be a corresponding error term $E_{p}$. As $w$ varies, $E_{p}$ will vary (holding the input pattern constant). Since the square of a real number is always positive, then it is quite easy to plot an error curve.

In fact, it can be shown that for a single layered network of the kind we have considered up to now, the error curve will always look like that in Figure 2.11. The lowest point on the error curve corresponds to the minimum error that can be achieved in the network for the pattern. The minimum error will correspond to a particular value of the weight. If the minimum error is zero then the network has found a solution to the mapping problem.

If we think of the network as occupying the error space defined by Equation 2.7 and depicted in Figure 2.11, then one way to define the learning algorithm used to train the network is to perform gradient descent. For example, the network will always be perched at some point on the error curve. If the network has some means to calculate the slope of the error curve at its current position, then it has the necessary information to reduce the error. If the slope is negative, then the value of the weight must be increased. If the slope is positive, the value of must be decreased. We can calculate the slope of the error curve by performing a bit of calculus. Let us define $\Delta w$ as the change to be made to the weight $w$.

\[ \Delta w = -k \frac{dE}{dw} \tag{2.8}\]

Equation 2.8 is just the slope of the error curve in Figure 2.11. Equation 2.8 has the necessary property that the weight will be increased if the slope is positive and decreased if the slope is negative. Recall that $E = (t_{out} - a_{out})^2$ and that $a_{out} = wa_{in}$. Therefore, Equation 2.8 becomes

\[ \Delta w = -k \frac{d}{dw} (t_{out} - wa_{in})^2 = \epsilon\delta a_{in} \tag{2.9}\]

where $\epsilon = 2k$ and $\delta = t_{out} - a_{out}$. Notice that Equation 2.12 has exactly the same form as Equation 2.6. Thus, the LMS rule and the Perceptron Convergence Rule are essentially identical. In both cases, the learning algorithm leads the network to perform gradient descent to the bottom of the ‘valley’ of the error curve. In fact, both these learning algorithms are often referred to jointly as the Delta Rule.

Although we haven’t attempted to prove the claim here, it is the case that feedforward networks with only a single layer of connections will always have an error curve with just a single error minimum. Hence, if the network employs a learning algorithm that performs gradient descent, then it will always find the global minimum of the curve—if you know the valley you’re walking in is smooth with no uneven troughs, then you can be sure that if you keep walking downhill then you will eventually get to the bottom. It is now possible to understand the claim that the Perceptron Convergence Rule (or the LMS rule) will always find a solution to a mapping problem provided such a solution exists. This corresponds to saying that the network will always descend to the lowest point on the error curve. If the point on the error curve corresponds to a region of zero error then the network has solved the problem. If the error curve still hovers above the x-axis as in Figure 2.11, then there is no accurate solution to the mapping problem in the domain of a single-layered feedforward network.

Error3D — Figure 2.12: Error surface for 2 weights

It is possible to specify the conditions under which the Perceptron Convergence Rule can solve a given mapping problem. This is called the Linear Predictability Constraint. In brief, this constraint specifies that the activity on each output neuron must be computable from the linear weighted sum of the activity propagated from the input neurons. We will return to this constraint in the next section.

Waddington, Conrad Hall. 1957. The Strategy of the Genes, a Discussion of Some Aspects of Theoretical Biology. Allen; Unwin.

It is worthwhile noting here, however, that training neural networks using gradient descent type learning algorithms corresponds to moving the network through a space of states that we can think of as a landscape. In Figure 2.11, this landscape was defined by calculating the error as a single weight is varied. If we included two weights in the calculation of error, as in Figure 2.12, then the landscape would become three dimensional. Training, however, would still correspond to the network moving through landscape. It is often helpful to think of training the network as a ball bearing moving down a valley, as depicted in Figure 2.13 taken from Waddington (1957). It is always seeking the lowest potential energy.

Linear Separability

boolplus — Table 2.5: Boolean AND, OR & XOR

Single-layered, feedforward networks trained with the Delta rule can perform multiple mappings of input/output pairs provided the classification inherent in the mappings are linearly separable. To illustrated this constraint consider the three Boolean functions AND, OR and XOR shown in Table 2.5. Each function takes two input values and classifies them along one dimension. The mappings can thus be considered a projection from a two-dimensional to a one-dimensional space. We can represent the input space on a simple X,Y graph. The four input coordinates are shown in Figure 2.14.

boolspace — Figure 2.14: Boolean AND, OR & XOR

Notice how it is possible to partition the space into two categories simply by drawing a straight line through the region. In one case, the partitioning corresponds to Boolean AND, i.e., only the point (1, 1) inhabits the upper right partitioning of the space. Hence, only the input (1, 1) will be way, i.e., 0.

Notice also that there are in fact an infinite number of lines which can partition the space in this manner — they all have slightly different slopes. The different lines correspond to the variety of solutions that a single-layered, feedforward network can find to the problem Boolean AND, i.e., the different values of the connections in the weight matrix.

A different partitioning of the space in Figure 2.16 is required for Boolean OR. However, this can also be achieved by separating the space with a straight line such that now the point (0, 0) is the only point in one of the partitions.

Again, there are an infinite number of such lines that can do the job. The Boolean functions AND and OR are linearly separable. The functions inhabit a two-dimensional space where it is possible to divide the space appropriately with a single line. If the target function inhabited a three-dimensional space (three input units), then the problem space must be amenable to partitioning by a plane for the function to be linearly separable. In more than three dimensions, the partitioning must be achieved with a hyperplane.

Boolean XOR is an example of a function which is not linearly separable. A possible partitioning of the two-dimensional space for XOR is shown in Figure 2.17. The important factor to note is that it is not possible to partition the space with a single straight line such that the points (0, 0) and (1, 1) are in one partition while the points (1, 0) and (0, 1) are in another partition.

Internal Representations

A network that can solve Boolean XOR is shown in Figure 2.18. The network consists of three layers of units; an input layer consisting of two linear units to represent the binary activations for the four input patterns of XOR; an output unit to provide the correct classification of the input patterns; and an intermediate layer of units which we shall call hidden units because they are connected only to other units and are thus hidden to the external environment of the network. The hidden units and output unit are linear threshold units where $\theta = 1.0$. Before proceeding, verify for yourself that this network can indeed solve all four mappings involved in the XOR problem. Remember that the hidden units and the output unit accumulate net input in just the same fashion as we have observed in single-layered networks. The only difference is that activity is passed through an intermediate layer of units before being passed onto the output layer.

XORprob — Figure 2.18: A multi-layered perceptron for XOR

Let us now consider what the hidden units are actually doing in the network depicted in Figure 2.18. Table 2.6 shows the activation values of the units as the input patterns pass through successive layers of the network. Notice that while there are four distinct patterns of activity at the input layer, there are only three distinct patterns of activity at the hidden layer. The input activations and are both represented as (0, 0) at the hidden layer. Consequently, both (0, 0) and (1, 1) produce the same output, i.e., 0.

XORsim — Table 2.6: Representing similarity relations in XOR

In terms of the two-dimensional representation of the problem shown in Figure 2.17, the point (1, 1) has been transformed into (0, 0) by the time it reaches the hidden layer. In other words, the input space has been folded so that the similarity structure of the problem has been changed, as depicted in Figure 2.19. The point (1, 1) is now more similar to (0, 0) than it is to either of the points (1, 0) or (0, 1).

Furthermore, the fashion in which the hidden units represent the problem XOR is linearly separable. It will always be possible to draw a straight line in a two-dimensional space such that three points can be classified such that any one of them can be isolated in its own partition. Hence, it will be possible for the final layer of connections between the hidden units and the output units to transform the hidden activations into the correct response patterns demanded by XOR.

In summary, the hidden units represent a transformation of the input patterns such that the similarity structure of the input patterns is different. The hidden units thus form their own internal representations of the input patterns. By folding the input space in an appropriate manner the hidden units are able to turn a linearly inseparable problem into a linearly separable one.

Backpropagation

The weights and thresholds of the network in Figure 2.18 have all been preset to produce the appropriate mapping properties for XOR. We now introduce a learning algorithm that is able to adjust weights and thresholds in a multi-layered perceptron. The algorithm is called Backpropagation or the Generalised Delta Rule (Rumelhart, Hinton, and Williams 1986).

Informal account

Back propagation, like the Delta rule, uses the discrepancy between the desired output of a unit and its actual output to determine the changes to the connections feeding into the unit and the changes to the threshold of the unit. In fact, back propagation works very much like the delta rule at least as far as the output unit is concerned. The weight changes made to the connections between the hidden units and the output unit are proportional to the product of the error on the output unit and the activation of the hidden units themselves (compare this to the formulation of the Perceptron Convergence Procedure or LMS learning in Equation 2.12). However, the adjustment of the connections between the input units and the hidden units pose a special problem: In order to assign some measure of blame for the error to these connections, the learning algorithm needs to know how much error is already apparent at the level of the hidden units — even before the output unit is reached. Unfortunately, we do not have a predefined target for the hidden units — only the output unit. So we cannot say what their activation levels should be and hence cannot specify an error signal at this level of the network.

The designers of back propagation came up with a simple heuristic which seems to solve the error assignment problem most of the time. They suggested that an error can be assigned to the hidden units that is proportional to the error on the output unit. Furthermore, the error assigned to each output unit need not be identical. In fact, Rumelhart, Hinton, and Williams (1986) proposed that error assigned to a hidden unit should be proportional to the strength of the connection between the output unit and the given hidden unit. In effect, they assign blame for the output error to the hidden unit in proportion to the strength of the connection between the output unit and the hidden unit. Hence, the name of the learning algorithm — back propagation — reflects the fact that error is propagated backwards in the network from the output unit to the hidden units. Once the hidden units have been assigned an error signal, the changes to the connections between the hidden units and the input units can proceed in the usual fashion, i.e., weight changes will be proportional to the product of the error on the hidden units and the activation of the input units.

It is also worth considering here what happens to networks with more than one output unit. In such cases, we can assume that there will be an error term associated with each of the output units. How do we know which error term to use to propagate backwards to each of the hidden units? The answer is that, in general, we use all of the output error terms to calculate the error terms for each of the hidden units. For each hidden unit, we determine the output units to which it is connected. In a fully feedforward network, all the hidden units will be connected to all the output units. Error is then propagated backwards from each output unit to the target hidden unit in proportion to the strength of the connections between them. Thus, the hidden unit accumulates error from the output units in much the same fashion as a neuron accumulates input from other active neurons.

Formal Account

The back propagation learning algorithm is derived in a similar fashion to the LMS algorithm described earlier. That is, back propagation is also a learning procedure that performs gradient descent. In the account that follows it might help to refer to a very simple multi-layered perceptron like the one depicted in Figure 2.20 which possesses just two connections; one connection from the input unit to the hidden unit ($w_1$) and one connection from the hidden unit to the output unit ($w_2$).

MLPsimp — Figure 2.20: Simple multi-layered perceptron

Let us first consider the learning algorithm for changing the connection between the hidden unit and the output unit $w_2$. Just as with LMS, the algorithm attempt to perform gradient descent on the error curve (see Figure 2.11). Thus, the change in the weight $\Delta w_2$ will be proportional to the rate of change of the error with respect to changes in $w_2$. This is given in Equation 2.10. Note that the form of Equation 2.10 is identical to the algorithm defining LMS in Equation 2.8. \[ \Delta w_2 = -k \frac{dE}{dw_2} \tag{2.10}\]

Remember that

\[ E = \Sigma_{p} (t_{out} - a_{out})^2 = \Sigma_p (t_{out} -f(w_2 a_{in}))^2 \tag{2.11}\]

where $f(w_2 a_{in})$ is just the activation function of the output unit¹. Using the chain rule (see Rumelhart, Hinton, and Williams 1986), it can be shown that \[\Delta w_2 = \epsilon\delta a_{in} \tag{2.12}\] where $\epsilon$ is the learning rate, $a_{in}$ is the activation of the hidden unit and \[\delta = (t_{out} - a_{out}) \frac{da_{out}}{dnet_{in}} \tag{2.13}\] The last term in Equation 2.13 ($\frac{da_{out}}{dnet_{in}}$) is just the rate of change of the output of the unit as the input changes while the first term ($t_{out} - a_{out}$) is, of course, just the raw error.

¹ Note that we used a linear threshold unit fo the Perceptron Convergence Procedure. The activation function used here will be different—see below.

Rumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36.

Useful to know

It is useful here to take a brief digression. It will be recalled that the activation function used in the Perceptron Convergence Procedure was a linear threshold function. This is shown in Figure 2.4. Note that the linear threshold function contains a discontinuity at 0.0 and 1.5, i.e., there is a sudden change in gradient at these points. Therefore, it is not possible to take the first derivative of this activation function as Equation 2.13 requires. In contrast, the squashing activation function shown in Figure 2.4 has a finite gradient at all points. It is, therefore, possible to define the gradient for all possible inputs to this activation function. The squashing curve in Figure 2.4 is called the Logistic function defined in Equation 2.14:

\[a_{out} = \frac{1}{1 + e^{-net_{in}}} \tag{2.14}\]

It is often called the squashing function becomes it constrains output between the values of $0.0$ and $1.0$ irrespective of the net input. The continuity of the gradient in the logistic function makes it a popular activation function for neurons in artificial neural networks that possess multiple layers of neurons and learn through gradient descent.

It is quite easy to show that \[\frac{da_{out}}{dnet_{in}} = \frac{d}{dnet_{in}} (\frac{1}{1 + e^{-net_{in}}}) = a_{out} (1 - a_{out}) \tag{2.15}\] Substituting Equation 2.15 in Equation 2.13 we obtain \[ \delta_{out} = (t_{out} - a_{out}) a_{out} (1 - a_{out}) \tag{2.16}\] for the output unit. However, for the hidden unit we must assign blame in proportion to the strength of the weight $w_2$ connecting the hidden unit and the output unit. Thus, the $\delta$ for the hidden unit is given by \[ \delta_{hidd} = a_{hidd} (1 - a_{hidd}) \delta_{out} w_2 \tag{2.17}\] Substituting the values for $\delta_{out}$ and $\delta_{hidd}$ in Equation 2.12 the weight changes for $w_1$ and $w_2$ and can be calculated.

Before moving on from this quasi-formal derivation of back propagation, it is worth noting several properties of the squashing function which affects the learning in networks which use activation functions of this type.

The squashing function is almost linear in the region where the net input to the unit is around zero. This means that even non-linear units can behave like linear units under certain conditions.
The logistic function outputs a value of $0.5$ when its net input is zero. Neurons will therefore participate in the learning process even though they are not receiving any input.
The gradient of the squashing function is steepest around the point where the net input is zero. This means that most learning will occur for those neurons which are as yet uncommitted to the task, i.e., those neurons which receive a net input of around zero. Conversely, those neurons which already receive non-zero net input and are on a relatively shallow gradient are less likely to change their current role in the mapping problem.

These special properties of the squashing function around the zero point determine the manner in which many simulations are initiated. In particular, networks are initialised (before any training) with a set of random weights that are constrained within the range . As a result, the net input to any unit is likely to sum to zero and learning is maximised.

Local Minima

It turns out that the a multi-layered perceptron can, in principle, learn any (continuous) mapping function you care to throw at it, unlike the Hebbian learning rule and the Perceptron Convergence Procedure which are severely constrained. Whereas the Perceptron Convergence Procedure guarantees to find a solution to the mapping problem if such a solution exists, back propagation guarantees that a solution exists but does not guarantee to find it. To understand why this is the case, we need to introduce the concept of a local minimum.

localglobal — Figure 2.21: Local and global minima

In Figure 2.11, we plotted the error curve for a single-layered, feedforward network trained with the LMS algorithm. Recall that the curve had no bumps in it and consequently had only a single global minimum which would eventually be attained if training was continued for long enough. It turns out that the error surface for a multi-layered perceptron looks rather different.

An hypothetical example is shown in Figure 2.21. Note that there is a local bump or minimum in the curve. In general, there may be many such bumps in the error curve for a multi-layered perceptron but one will suffice for purposes of illustration. Also, recall that gradient descent algorithms adapt the weights in the network by calculating the slope of the curve that corresponds to the current state of the network. If you like, the learning algorithm can only see the slope in the local vicinity — rather like determining the slope in a mountainous terrain at night. Furthermore, the algorithm adapts the connections in the network in proportion to the steepness of the slope. Thus, if the slope is shallow the connections will not be changed very much and if the slope is zero the connections will not be changed at all.

Now suppose that the initial state of the network corresponds to the point A depicted in Figure 2.21. Back propagation will force the network to increase the value of the connection so that the network descends the error gradient. However, if the increments in the weight changes are not very large (determined by the learning rate parameter $\epsilon$), then the network may end up in the local minimum. Since the error gradient at the local minimum is zero, the learning algorithm will behave as if it has found a solution to the mapping problem and make no further attempts to change the value of the weights. Hence the network gets stuck in a local minimum. There are various fixes that can be applied to reduce the chances that a network gets caught in this fashion (such as manipulating the learning rate) but they are not foolproof solutions.