Perceptrons and Neurons: A Simple NN Model

Chapter 13: From Perceptrons to Deep Networks

Learning objectives

  • Explain the biological analogy for artificial neurons
  • Compute the output of a perceptron given weights and inputs
  • Apply the perceptron learning rule
  • Identify the XOR limitation and the need for multi-layer networks
  • Describe common activation functions: sigmoid, ReLU, tanh
  • Understand forward propagation through a multi-layer perceptron

From Biology to Artificial Neurons

The human brain contains roughly 86 billion neurons, each receiving signals through dendrites, processing them in the cell body, and transmitting output through the axon. An artificial neuron mimics this: it receives numerical inputs, computes a weighted sum, applies an activation function, and produces an output.

1. The Perceptron

Model

A perceptron takes $n$ inputs $x_1, x_2, \ldots, x_n$, multiplies each by a weight $w_i$, adds a bias $b$, and applies a step activation function:

$$z = \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^T\mathbf{x} + b$$
$$\hat{y} = f(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$

The perceptron is a binary classifier: it divides the input space with a hyperplane (a line in 2D) and assigns one class to each side.
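The model above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `perceptron_output` and the AND weights are assumptions chosen for the example.

```python
def perceptron_output(weights, inputs, bias):
    """Return 1 if the weighted sum plus bias is non-negative, else 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else 0

# Illustrative (hand-picked) weights that implement logical AND.
and_weights, and_bias = [1.0, 1.0], -1.5
and_outputs = [perceptron_output(and_weights, [x1, x2], and_bias)
               for x1 in (0, 1) for x2 in (0, 1)]
print(and_outputs)  # [0, 0, 0, 1]
```

Only the input (1, 1) gives $z = 0.5 \geq 0$; the other three inputs give a negative $z$ and are assigned class 0.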

Perceptron Learning Rule

Given a training example $(\mathbf{x}, y)$ where $y$ is the true label, the weights are updated as:

$$w_i \leftarrow w_i + \alpha(y - \hat{y})x_i$$
$$b \leftarrow b + \alpha(y - \hat{y})$$

where $\alpha$ is the learning rate (typically 0.01 to 1). If the prediction $\hat{y}$ is correct, $y - \hat{y} = 0$ and no update occurs. If wrong, the weights shift in the direction that would make the prediction closer to correct.
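As a rough sketch of the update rule in action, the loop below trains a perceptron on the logical OR function. The helper names (`predict`, `train_perceptron`), the zero initialisation, the value $\alpha = 0.1$, and the epoch cap are all assumptions made for this example.

```python
def predict(w, b, x):
    """Step-activated perceptron output for input vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else 0

def train_perceptron(data, alpha=0.1, max_epochs=100):
    """Apply the perceptron learning rule until an epoch has no mistakes."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            error = y - predict(w, b, x)  # 0 if correct, +1 or -1 if wrong
            if error != 0:
                mistakes += 1
                w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
                b += alpha * error
        if mistakes == 0:
            break
    return w, b

# Logical OR is linearly separable, so the rule converges.
or_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w, b = train_perceptron(or_data)
or_predictions = [predict(w, b, x) for x, _ in or_data]
print(or_predictions)  # [0, 1, 1, 1]
```

The early exit relies on the data being linearly separable; on non-separable data the loop would simply run until `max_epochs`.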

Geometric Interpretation

The perceptron defines a decision boundary: the set of points where $\mathbf{w}^T\mathbf{x} + b = 0$. In 2D with inputs $x_1, x_2$, this is the line $w_1 x_1 + w_2 x_2 + b = 0$. Data on one side is classified as 1, and on the other as 0.
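To make the geometry concrete, the 2D boundary can be solved for $x_2$ as a function of $x_1$ whenever $w_2 \neq 0$. The sketch below uses hypothetical helper names and illustrative weights ($w_1 = w_2 = 1$, $b = -1.5$).

```python
def boundary_x2(w1, w2, b, x1):
    """Solve w1*x1 + w2*x2 + b = 0 for x2 (requires w2 != 0)."""
    return -(w1 * x1 + b) / w2

def side(w1, w2, b, x1, x2):
    """Class assigned to the point (x1, x2): 1 on or above the boundary, else 0."""
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

w1, w2, b = 1.0, 1.0, -1.5
print(boundary_x2(w1, w2, b, 0.0))  # 1.5, so the line passes through (0, 1.5)
print(side(w1, w2, b, 1, 1))        # 1
print(side(w1, w2, b, 0, 0))        # 0
```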

2. Limitations: The XOR Problem

Linear Separability

A perceptron can only classify data that is linearly separable—data where a single straight line (or hyperplane) can separate the two classes. The logical AND and OR functions are linearly separable, but the XOR function is not:

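| $x_1$ | $x_2$ | AND | OR | XOR |
|-------|-------|-----|----|-----|
| 0     | 0     | 0   | 0  | 0   |
| 0     | 1     | 0   | 1  | 1   |
| 1     | 0     | 0   | 1  | 1   |
| 1     | 1     | 1   | 1  | 0   |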

No single line can separate the 1s from the 0s in XOR. This limitation, highlighted by Minsky and Papert (1969), led to reduced interest in neural networks for over a decade.
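One way to see this empirically is to search a grid of candidate weights and record the best number of correctly classified rows for each function. The grid range and spacing below are arbitrary assumptions, and a finite search is only an illustration, not a proof.

```python
from itertools import product

def step(z):
    return 1 if z >= 0 else 0

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
TARGETS = {
    "AND": [0, 0, 0, 1],
    "OR":  [0, 1, 1, 1],
    "XOR": [0, 1, 1, 0],
}

def best_accuracy(targets, grid):
    """Largest number of rows (out of 4) any single perceptron on the grid gets right."""
    best = 0
    for w1, w2, b in product(grid, repeat=3):
        correct = sum(step(w1 * x1 + w2 * x2 + b) == t
                      for (x1, x2), t in zip(INPUTS, targets))
        best = max(best, correct)
    return best

grid = [i * 0.25 for i in range(-8, 9)]  # -2.0 to 2.0 in steps of 0.25
best = {name: best_accuracy(t, grid) for name, t in TARGETS.items()}
print(best)  # {'AND': 4, 'OR': 4, 'XOR': 3}
```

AND and OR each admit a perfect separating line on this grid, while the best any single perceptron manages on XOR is three rows out of four.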

3. Multi-Layer Perceptron (MLP)

Architecture

The solution to the XOR problem (and other nonlinear problems) is to stack multiple layers of neurons:

  • Input layer: receives the raw features (e.g., well-log values)
  • Hidden layer(s): intermediate layers that learn nonlinear representations
  • Output layer: produces the final prediction

Each neuron in a hidden layer computes $z = \mathbf{w}^T\mathbf{x} + b$ followed by a nonlinear activation function $a = g(z)$.
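A small hand-wired example shows why one hidden layer is enough for XOR. The weights below are chosen by hand for illustration (they are not learned, and other choices work equally well): one hidden neuron computes OR, the other computes AND, and the output neuron fires when OR is true but AND is not.

```python
def step(z):
    return 1 if z >= 0 else 0

def xor_network(x1, x2):
    """Two-layer network with hand-picked weights that computes XOR."""
    h_or = step(x1 + x2 - 0.5)        # hidden neuron 1: logical OR
    h_and = step(x1 + x2 - 1.5)       # hidden neuron 2: logical AND
    return step(h_or - h_and - 0.5)   # output: OR and not AND

xor_outputs = [xor_network(x1, x2) for x1 in (0, 1) for x2 in (0, 1)]
print(xor_outputs)  # [0, 1, 1, 0]
```

In the hidden-layer space $(h_{\text{or}}, h_{\text{and}})$ the four inputs map to $(0,0)$, $(1,0)$, $(1,0)$ and $(1,1)$, which a single line can separate.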

4. Activation Functions

Sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Output range: $(0, 1)$. Smooth and differentiable. Drawback: gradients become very small for large $|z|$ (vanishing gradient problem).
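The vanishing gradient can be checked numerically using the standard identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which peaks at 0.25 when $z = 0$. The sketch below is one possible implementation; the branch on the sign of $z$ is an assumption added to avoid overflow in `math.exp` for very negative inputs.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (0.0, 2.0, 5.0, 10.0):
    print(z, sigmoid(z), sigmoid_grad(z))
```

At $z = 10$ the gradient is already below $10^{-4}$, so a neuron saturated this far passes almost no error signal backward.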

ReLU (Rectified Linear Unit)

$$\text{ReLU}(z) = \max(0, z)$$

Output range: $[0, \infty)$. Computationally efficient and avoids vanishing gradients for positive values. Most widely used in modern deep learning.
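A minimal sketch of ReLU and its gradient follows. The gradient is 1 for every positive input, however large, which is why it does not shrink the way the sigmoid gradient does. ReLU is not differentiable at $z = 0$; using 0 there is a common convention and an assumption of this example.

```python
def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)

def relu_grad(z):
    """Gradient of ReLU; the value at z == 0 is set to 0 by convention."""
    return 1.0 if z > 0 else 0.0

relu_values = [relu(z) for z in (-2.0, 0.0, 3.0)]
relu_grads = [relu_grad(z) for z in (-2.0, 0.0, 3.0)]
print(relu_values)  # [0.0, 0.0, 3.0]
print(relu_grads)   # [0.0, 0.0, 1.0]
```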

Tanh (Hyperbolic Tangent)

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

Output range: $(-1, 1)$. Zero-centred, which often helps training converge faster than sigmoid. Still suffers from vanishing gradients at extremes.
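The sketch below evaluates tanh from its exponential definition and compares it against Python's built-in `math.tanh`, then uses the standard derivative $1 - \tanh^2(z)$ to show the gradient shrinking at the extremes. The function names `tanh_direct` and `tanh_grad` are illustrative.

```python
import math

def tanh_direct(z):
    """tanh from its exponential definition (adequate for moderate |z|)."""
    ez, enz = math.exp(z), math.exp(-z)
    return (ez - enz) / (ez + enz)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    t = math.tanh(z)
    return 1.0 - t * t

for z in (-5.0, 0.0, 5.0):
    # Outputs are symmetric about zero; the gradient is largest at z = 0.
    print(z, math.tanh(z), tanh_grad(z))
```

In practice `math.tanh` is preferable to the direct formula, which overflows once `math.exp(z)` exceeds the floating-point range.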

5. Forward Propagation

The Computation Flow

For a network with one hidden layer:

  • Hidden layer: $\mathbf{z}^{[1]} = W^{[1]}\mathbf{x} + \mathbf{b}^{[1]}$, then $\mathbf{a}^{[1]} = g(\mathbf{z}^{[1]})$
  • Output layer: $\mathbf{z}^{[2]} = W^{[2]}\mathbf{a}^{[1]} + \mathbf{b}^{[2]}$, then $\hat{\mathbf{y}} = g(\mathbf{z}^{[2]})$

The superscript $[l]$ denotes the layer number.
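The two steps can be written out directly. The sketch below assumes a sigmoid for $g$ in both layers, stores each weight matrix as a list of rows, and uses made-up weights for a network with two inputs, two hidden neurons, and one output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, b):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(x, W1, b1, W2, b2, g=sigmoid):
    """Forward pass through one hidden layer and one output layer."""
    z1 = affine(W1, x, b1)     # z[1] = W[1] x + b[1]
    a1 = [g(z) for z in z1]    # a[1] = g(z[1])
    z2 = affine(W2, a1, b2)    # z[2] = W[2] a[1] + b[2]
    return [g(z) for z in z2]  # y_hat = g(z[2])

# Hypothetical weights, chosen only to make the arithmetic easy to follow.
W1, b1 = [[0.5, -0.5], [0.25, 0.75]], [0.0, 0.0]
W2, b2 = [[1.0, -1.0]], [0.0]
y_hat = forward([0.0, 0.0], W1, b1, W2, b2)
print(y_hat)  # [0.5]
```

With a zero input, both hidden activations equal $\sigma(0) = 0.5$, they cancel in the output layer, and the prediction is again $\sigma(0) = 0.5$.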

6. Backpropagation (Intuition)

How Does the Network Learn?

After a forward pass, we compute the loss (e.g., cross-entropy or MSE). Backpropagation uses the chain rule of calculus to compute how much each weight contributed to the error, then updates all weights simultaneously using gradient descent. The key insight: errors at the output layer propagate backward through the network, allowing each layer to adjust its weights to reduce the overall error.
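The following is a minimal sketch of this idea for a tiny network with two inputs, two hidden neurons, and one output. It assumes sigmoid activations, a squared-error loss $L = \tfrac{1}{2}(\hat{y} - y)^2$ on a single example, and hypothetical starting weights; none of these choices come from the text above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Return hidden activations and the scalar prediction."""
    W1, b1, W2, b2 = params
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [sigmoid(z) for z in z1]
    z2 = sum(w * a for w, a in zip(W2, a1)) + b2
    return a1, sigmoid(z2)

def loss(params, x, y):
    _, y_hat = forward(params, x)
    return 0.5 * (y_hat - y) ** 2

def backprop(params, x, y):
    """Chain rule, applied from the output layer back to the hidden layer."""
    W1, b1, W2, b2 = params
    a1, y_hat = forward(params, x)
    # Output layer: dL/dz2 = (y_hat - y) * sigma'(z2)
    delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)
    dW2 = [delta2 * a for a in a1]
    # Hidden layer: push delta2 back through W2, then through sigma'(z1)
    delta1 = [delta2 * w * a * (1.0 - a) for w, a in zip(W2, a1)]
    dW1 = [[d * xi for xi in x] for d in delta1]
    return dW1, delta1, dW2, delta2

def gradient_step(params, grads, alpha):
    """Update every weight and bias at once: theta <- theta - alpha * gradient."""
    W1, b1, W2, b2 = params
    dW1, db1, dW2, db2 = grads
    return (
        [[w - alpha * g for w, g in zip(r, gr)] for r, gr in zip(W1, dW1)],
        [b - alpha * g for b, g in zip(b1, db1)],
        [w - alpha * g for w, g in zip(W2, dW2)],
        b2 - alpha * db2,
    )

# Hypothetical starting weights and a single training example.
params = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1], [0.5, -0.6], 0.05)
x, y = [1.0, 0.0], 1.0
grads = backprop(params, x, y)
loss_before = loss(params, x, y)
loss_after = loss(gradient_step(params, grads, alpha=0.5), x, y)
print(loss_before, loss_after)  # the loss decreases after one step
```

Note how `delta1` reuses `delta2`: the error signal computed at the output is what the hidden layer uses to work out its own gradients.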

Geoscience Applications

Neural networks are used for well-log facies classification: given a set of well-log measurements (gamma ray, resistivity, density, neutron porosity), classify each depth interval into a rock type (sandstone, shale, limestone, etc.). They are also used for seismic inversion, earthquake detection, and mineral prospectivity mapping.

[Refs: Goodfellow et al., Deep Learning; Haykin, Neural Networks and Learning Machines]

References

  • Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, ch. 6 (feedforward networks, activations). MIT Press.
  • LeCun, Y., Bengio, Y., Hinton, G. (2015). Deep learning. Nature 521, 436–444.
  • Bishop, C.M. (2006). Pattern Recognition and Machine Learning, ch. 5 (neural networks). Springer.
  • Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction, ch. 13 (neural networks). MIT Press.
