Deep Learning: Convolutional Neural Networks

Chapter 14: Convolutional Networks for Spatial Data

Learning objectives

Explain why spatial structure motivates convolutional layers over fully connected layers
Describe the convolution operation and the role of kernels/filters
Calculate output dimensions after convolution and pooling
Identify the components of a CNN architecture (Conv, ReLU, Pool, Flatten, Dense)
Explain transfer learning and its benefits
Apply CNN concepts to geoscience problems such as seismic facies classification and satellite imagery

From Fully Connected to Convolutional Layers

In a standard fully connected (dense) neural network, every neuron in one layer is connected to every neuron in the next. For an image with $n$ pixels, the first hidden layer alone would need $n \times h$ weights (where $h$ is the number of hidden neurons). For a modest $256 \times 256$ grayscale image, that is over 65,000 input features — and a single hidden layer of 1,000 neurons would require over 65 million parameters. This is computationally prohibitive and ignores the spatial structure of images: nearby pixels are related, and useful patterns (edges, textures) appear regardless of their absolute position.

Convolutional Neural Networks (CNNs) solve both problems by exploiting three key ideas:

Local connectivity: each neuron connects only to a small local region of the input (the receptive field)
Weight sharing: the same small set of weights (a kernel or filter) is applied across the entire input
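Translation equivariance: because the same kernel is applied everywhere, a feature detector responds to the same pattern wherever it appears, and shifting the input shifts the response accordingly (pooling later adds a degree of invariance)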

These properties make CNNs dramatically more efficient for grid-structured data such as images, seismic sections, and spatial geophysical maps.
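To make the contrast concrete, the short sketch below counts the parameters of the dense layer described above and compares them with a single convolutional layer on the same image. The choice of 32 filters of size $3 \times 3$ is an illustrative assumption, not a figure from the text.

```python
# Dense layer on a flattened 256 x 256 grayscale image with 1,000 hidden neurons.
height, width, channels = 256, 256, 1
hidden_neurons = 1000

n_inputs = height * width * channels            # 65,536 input features
dense_weights = n_inputs * hidden_neurons       # one weight per input-neuron pair
dense_params = dense_weights + hidden_neurons   # plus one bias per neuron

# Convolutional layer on the same image.
# Assumption for illustration: 32 filters, each 3 x 3, one bias per filter.
n_filters, kernel_size = 32, 3
conv_params = n_filters * (kernel_size * kernel_size * channels + 1)

print(f"dense layer: {dense_params:,} parameters")   # 65,537,000
print(f"conv layer:  {conv_params:,} parameters")    # 320
```

Because the kernel weights are shared across all positions, the convolutional layer's parameter count does not depend on the image size at all.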

The Convolution Operation

Mathematically, discrete convolution of a 1-D signal $f$ with a kernel $g$ is defined as:

$$(f * g)(t) = \sum_{\tau} f(\tau)\, g(t - \tau)$$
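As a minimal sketch of this definition, the function below evaluates the sum directly for finite lists, treating both signals as zero outside their stored range. The name `convolve_1d` and the example values are chosen here for illustration only.

```python
def convolve_1d(f, g):
    """Full discrete convolution: (f * g)(t) = sum over tau of f[tau] * g[t - tau]."""
    out = [0.0] * (len(f) + len(g) - 1)
    for t in range(len(out)):
        for tau in range(len(f)):
            if 0 <= t - tau < len(g):   # g is treated as zero outside its support
                out[t] += f[tau] * g[t - tau]
    return out


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [1.0, 0.0, -1.0]
print(convolve_1d(signal, kernel))   # [1.0, 2.0, 2.0, 2.0, -3.0, -4.0]
```

Note the index `t - tau`: the kernel is traversed in reverse relative to the signal, which is what distinguishes convolution from cross-correlation.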

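In CNNs operating on 2-D images, the operation becomes a 2-D cross-correlation (commonly called "convolution" in deep learning). Unlike true convolution, cross-correlation does not flip the kernel; because the kernel weights are learned, the distinction makes no practical difference. Given an input matrix $X$ and a kernel $K$ of size $k \times k$, the output at position $(i, j)$ is: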

$$Y(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} K(m, n) \cdot X(i + m,\, j + n)$$

The kernel slides across the input, computing a weighted sum at each position. This produces a feature map (also called an activation map) that highlights where in the input a particular pattern is detected.
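The sliding weighted sum can be written out directly. The sketch below is a plain-Python illustration of the formula for $Y(i, j)$ with stride 1 and no padding; the function name and the small test matrices are assumptions made for this example.

```python
def cross_correlate_2d(X, K):
    """Valid 2-D cross-correlation: Y(i, j) = sum_m sum_n K(m, n) * X(i + m, j + n)."""
    k = len(K)                      # square k x k kernel
    out_h = len(X) - k + 1
    out_w = len(X[0]) - k + 1
    Y = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):          # slide the kernel down the rows
        for j in range(out_w):      # and across the columns
            for m in range(k):
                for n in range(k):
                    Y[i][j] += K[m][n] * X[i + m][j + n]
    return Y


X = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
K = [
    [1, 0],
    [0, -1],
]
feature_map = cross_correlate_2d(X, K)
print(feature_map)   # 3 x 3 map; every entry is X(i, j) - X(i+1, j+1) = -5.0
```

A $4 \times 4$ input with a $2 \times 2$ kernel yields a $3 \times 3$ feature map, since the kernel fits in three positions along each axis.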

Kernels (Filters) as Learned Feature Detectors

A kernel is a small matrix of learnable weights, typically $3 \times 3$ or $5 \times 5$. During training, the network learns kernel values that detect useful features:

Early layers learn simple features: edges, gradients, corners
Middle layers combine simple features into textures and patterns
Deep layers detect complex, high-level structures (e.g., a fault plane in a seismic image)

Each convolutional layer typically has multiple kernels. If a layer has $F$ filters, it produces $F$ feature maps — one per filter. These feature maps become the input "channels" for the next layer.
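To illustrate how a kernel acts as a feature detector, and how $F$ filters yield $F$ feature maps, the sketch below applies two kernels to a tiny image containing one vertical edge. In a trained CNN the kernel values are learned; here they are set by hand (simple edge detectors) purely for illustration, and all names are assumptions of this example.

```python
def feature_maps(image, kernels):
    """Apply each k x k kernel to a single-channel image (stride 1, no padding)."""
    maps = []
    for K in kernels:
        k = len(K)
        rows = len(image) - k + 1
        cols = len(image[0]) - k + 1
        fmap = [
            [
                sum(K[m][n] * image[i + m][j + n] for m in range(k) for n in range(k))
                for j in range(cols)
            ]
            for i in range(rows)
        ]
        maps.append(fmap)
    return maps


# 5 x 5 image: dark left side (0), bright right side (1), so one vertical edge.
image = [[0, 0, 0, 1, 1] for _ in range(5)]

vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]      # responds to left-right change
horizontal_edge = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]    # responds to top-bottom change

maps = feature_maps(image, [vertical_edge, horizontal_edge])
print(maps[0])   # [[0, 3, 3], [0, 3, 3], [0, 3, 3]]: strong response near the edge
print(maps[1])   # all zeros: the image has no horizontal edge
```

Two filters give two feature maps, which would form a two-channel input to the next layer.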

Output Size Formula

Given an input of size $n_{in}$, a kernel of size $k$, stride $s$, and zero-padding $p$, the output dimension is:

$$n_{out} = \left\lfloor \frac{n_{in} + 2p - k}{s} \right\rfloor + 1$$

For 2-D inputs, this formula applies independently to height and width. With $s = 1$ and $p = 0$ (the usual defaults in deep learning frameworks), a $3 \times 3$ kernel reduces each spatial dimension by 2.
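The formula translates into a one-line helper. This is a minimal sketch; the function name `conv_output_size` is chosen here for illustration.

```python
def conv_output_size(n_in, k, s=1, p=0):
    """Output size along one dimension: floor((n_in + 2p - k) / s) + 1."""
    return (n_in + 2 * p - k) // s + 1


print(conv_output_size(64, k=3))            # 62: a 3 x 3 kernel removes 2 pixels
print(conv_output_size(64, k=3, p=1))       # 64: one pixel of padding restores the size
print(conv_output_size(64, k=3, s=2))       # 31: floor(61 / 2) + 1
print(conv_output_size(64, k=5, s=2, p=2))  # 32: floor(63 / 2) + 1
```

For a 2-D input, call the helper once for the height and once for the width.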

Stride and Padding

Stride ($s$) controls how many pixels the kernel moves between positions. Stride 1 means the kernel shifts one pixel at a time (maximum overlap). Stride 2 means it shifts two pixels at a time, roughly halving each spatial dimension of the output.

Padding ($p$) adds rows and columns of zeros around the input border. Same padding ($p = \lfloor k/2 \rfloor$ with $s = 1$ and odd $k$) keeps the output the same size as the input. Valid padding ($p = 0$) adds no zeros, so the output shrinks.
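The effect of stride and padding is easiest to see in one dimension. The sketch below is an illustrative 1-D version (the 2-D case applies the same logic along each axis); the function name and the example signal are assumptions of this example.

```python
def conv1d(x, kernel, stride=1, padding=0):
    """1-D cross-correlation with zero-padding on both ends and a given stride."""
    padded = [0] * padding + list(x) + [0] * padding
    k = len(kernel)
    return [
        sum(kernel[m] * padded[start + m] for m in range(k))
        for start in range(0, len(padded) - k + 1, stride)
    ]


x = [1, 2, 3, 4, 5, 6]
kernel = [1, 1, 1]

valid = conv1d(x, kernel)                            # p = 0: length 6 -> 4
same = conv1d(x, kernel, padding=len(kernel) // 2)   # p = floor(k / 2): length stays 6
strided = conv1d(x, kernel, stride=2, padding=1)     # s = 2: length 6 -> 3

print(valid)     # [6, 9, 12, 15]
print(same)      # [3, 6, 9, 12, 15, 11]
print(strided)   # [3, 9, 15]
```

The strided result keeps every second entry of the same-padded result, which is why stride 2 roughly halves the output length.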

Activation Functions

After each convolution, a non-linear activation function is applied element-wise. The most common is the Rectified Linear Unit (ReLU):

$$\text{ReLU}(x) = \max(0, x)$$

ReLU is fast to compute, has a constant gradient of 1 for positive inputs (which mitigates the vanishing gradient problem), and introduces the non-linearity needed for the network to learn complex mappings.
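A minimal sketch of ReLU and its gradient, applied element-wise to a small feature map. Setting the gradient at exactly $x = 0$ to zero is a common convention assumed here, and the example values are illustrative.

```python
def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0


feature_map = [[-2.0, 0.5], [3.0, -0.1]]
activated = [[relu(v) for v in row] for row in feature_map]
gradients = [[relu_grad(v) for v in row] for row in feature_map]

print(activated)   # [[0.0, 0.5], [3.0, 0.0]]
print(gradients)   # [[0.0, 1.0], [1.0, 0.0]]
```

Negative responses are zeroed out, while positive responses pass through unchanged with gradient 1.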

Pooling Layers

Pooling layers downsample the feature maps, reducing their spatial dimensions while retaining the most important information. This reduces computation, controls overfitting, and introduces a degree of translation invariance.

Max pooling: takes the maximum value in each pooling window. A $2 \times 2$ max pool with stride 2 halves each spatial dimension.

Average pooling: computes the mean value in each pooling window. Less aggressive than max pooling; sometimes used in the final layers.

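Pooling follows the same output size formula with $k$ = pool size $k_{\text{pool}}$ and $p = 0$:

$$n_{out}^{\text{pool}} = \left\lfloor \frac{n_{in} - k_{\text{pool}}}{s_{\text{pool}}} \right\rfloor + 1$$

In the common case where the stride equals the pool size ($s_{\text{pool}} = k_{\text{pool}}$), this simplifies to:

$$n_{out}^{\text{pool}} = \left\lfloor \frac{n_{in}}{s_{\text{pool}}} \right\rfloor$$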
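Both pooling variants can be sketched with one function that slides a window over the feature map and reduces each window to a single value. The function name, its parameters, and the example feature map are assumptions made for this illustration.

```python
def pool_2d(fmap, size=2, stride=2, mode="max"):
    """Max or average pooling over size x size windows, without padding."""
    aggregate = max if mode == "max" else (lambda w: sum(w) / len(w))
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                fmap[i * stride + m][j * stride + n]
                for m in range(size)
                for n in range(size)
            ]
            row.append(aggregate(window))
        out.append(row)
    return out


fmap = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
print(pool_2d(fmap, mode="max"))   # [[6, 4], [8, 9]]
print(pool_2d(fmap, mode="avg"))   # [[3.75, 2.25], [4.5, 4.0]]
```

A $2 \times 2$ window with stride 2 turns the $4 \times 4$ map into a $2 \times 2$ map in both cases; max pooling keeps the strongest response in each window, while average pooling smooths over it.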

CNN Architecture: The Complete Pipeline

A typical CNN processes an image through these stages:

Input layer: the raw image (e.g., $64 \times 64 \times 3$ for an RGB image)
Convolutional block (repeated): Conv2D $\to$ ReLU $\to$ (optional) Pooling
Flatten: reshape the final feature maps into a 1-D vector
Dense (fully connected) layers: standard neural network layers for classification/regression
Output layer: softmax for classification, linear for regression
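The stages above can be followed by tracing the tensor shape through a small network. The sketch below does only the shape bookkeeping, with no actual computation. The specific architecture (two convolutional blocks, a 128-unit dense layer, four output classes) is a hypothetical example, not one prescribed by the text.

```python
def conv_out(n, k, s=1, p=0):
    """Output size along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1


def trace_shapes(input_shape, layers):
    """Return (layer kind, output shape) for each stage. ReLU leaves shapes unchanged."""
    h, w, c = input_shape
    shapes = [("input", (h, w, c))]
    for layer in layers:
        kind = layer[0]
        if kind == "conv":            # ("conv", filters, kernel_size), stride 1, no padding
            _, filters, k = layer
            h, w, c = conv_out(h, k), conv_out(w, k), filters
            shapes.append((kind, (h, w, c)))
        elif kind == "pool":          # ("pool", pool_size), stride equal to pool size
            _, size = layer
            h, w = conv_out(h, size, s=size), conv_out(w, size, s=size)
            shapes.append((kind, (h, w, c)))
        elif kind == "flatten":       # collapse height, width, channels into one axis
            shapes.append((kind, (h * w * c,)))
        elif kind == "dense":         # ("dense", units)
            shapes.append((kind, (layer[1],)))
    return shapes


# Hypothetical architecture for a 4-class problem on 64 x 64 RGB patches.
layers = [
    ("conv", 32, 3), ("pool", 2),
    ("conv", 64, 3), ("pool", 2),
    ("flatten",),
    ("dense", 128),
    ("dense", 4),
]
shapes = trace_shapes((64, 64, 3), layers)
for name, shape in shapes:
    print(f"{name:8s} {shape}")
```

The spatial size shrinks from 64 to 14 while the channel count grows from 3 to 64, so the flattened vector has $14 \times 14 \times 64 = 12{,}544$ entries.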

Feature Maps and Channels

The input to the first Conv layer has a certain number of channels: 1 for grayscale, 3 for RGB, or more for multispectral satellite imagery. Each Conv layer with $F$ filters outputs $F$ channels. So if the first Conv has 32 filters applied to a $64 \times 64 \times 3$ input (with $3 \times 3$ kernels, stride 1, no padding), the output is $62 \times 62 \times 32$.

The number of parameters in that layer is: $F \times (k \times k \times C_{in} + 1)$ where $C_{in}$ is the number of input channels and $+1$ is for the bias. So: $32 \times (3 \times 3 \times 3 + 1) = 32 \times 28 = 896$ parameters.
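The parameter formula is easy to check in code. The helper name `conv2d_params` is an assumption of this sketch, and the second call (64 filters on a 32-channel input) is an added illustrative case.

```python
def conv2d_params(filters, kernel_size, in_channels, bias=True):
    """Parameters in a conv layer: F * (k * k * C_in + 1), the +1 being the bias."""
    per_filter = kernel_size * kernel_size * in_channels + (1 if bias else 0)
    return filters * per_filter


print(conv2d_params(32, 3, 3))    # 896, the example from the text
print(conv2d_params(64, 3, 32))   # 18496: a second layer reading 32 input channels
```

The count depends on the number of input channels and filters but not on the spatial size of the input.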

Classic Architectures (Overview)

LeNet-5 (1998): one of the earliest CNNs. Two convolutional layers, each followed by pooling (subsampling), then dense layers. Originally built for handwritten digit recognition. Simple but foundational.

VGG-16 (2014): uses only $3 \times 3$ convolutions stacked very deep (16 weight layers). Showed that depth matters. ~138 million parameters.

Modern architectures (ResNet, Inception, EfficientNet) use skip connections, multi-scale filters, and other tricks, but the Conv-ReLU-Pool building block remains central.
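The ~138 million parameter figure quoted for VGG-16 can be reproduced from the per-layer formulas. The layer configuration below is not given in this chapter; it is the standard published VGG-16 layout (13 convolutional layers with $3 \times 3$ kernels and same padding, five $2 \times 2$ max-pooling steps, three dense layers, $224 \times 224 \times 3$ input, 1,000 classes) and should be read as an assumption of this sketch.

```python
def conv_params(c_in, c_out, k=3):
    """Conv layer parameters: c_out * (k * k * c_in + 1)."""
    return c_out * (k * k * c_in + 1)


def dense_params(n_in, n_out):
    """Dense layer parameters: one weight per input plus a bias, per output unit."""
    return n_out * (n_in + 1)


# (input channels, output channels) for the 13 conv layers, grouped in five blocks.
conv_channels = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
# Five 2 x 2 poolings reduce 224 to 7, so the flattened vector has 7 * 7 * 512 entries.
dense_sizes = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

conv_total = sum(conv_params(c_in, c_out) for c_in, c_out in conv_channels)
dense_total = sum(dense_params(n_in, n_out) for n_in, n_out in dense_sizes)
total = conv_total + dense_total

print(f"conv layers:  {conv_total:,}")    # 14,714,688
print(f"dense layers: {dense_total:,}")   # 123,642,856
print(f"total:        {total:,}")         # 138,357,544
```

Most of the parameters sit in the three dense layers rather than in the 13 convolutional layers, which again shows how economical weight sharing is.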

Transfer Learning

Transfer learning reuses a model trained on a large dataset (e.g., ImageNet with millions of natural images) as a starting point for a new task. Because early CNN layers learn generic features (edges, textures) that are useful across domains, we can:

Take a pre-trained model and freeze its early layers
Replace the final dense layers with new ones suited to our geoscience task
Fine-tune on a (potentially small) geoscience dataset

This dramatically reduces the amount of labeled training data needed — crucial in geoscience where labeled examples are expensive to obtain.
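The bookkeeping behind these steps can be sketched without any deep learning framework. The toy model below is entirely hypothetical (two convolutional layers, a 1,000-class head, and a replacement head for a 4-class facies problem), and for simplicity it freezes every retained layer. In practice the same steps would be carried out with a framework's own pre-trained models and freezing mechanism.

```python
def adapt_for_new_task(pretrained, n_old_head_layers, new_head):
    """Freeze the pre-trained feature extractor and attach a new trainable head."""
    base = pretrained[:-n_old_head_layers]                  # drop the old head
    frozen = [dict(layer, trainable=False) for layer in base]
    return frozen + list(new_head)


def count_trainable(model):
    """Total parameters that fine-tuning would update."""
    return sum(layer["params"] for layer in model if layer["trainable"])


# Hypothetical pre-trained model; parameter counts follow the usual layer formulas.
pretrained = [
    {"name": "conv1", "params": 32 * (3 * 3 * 3 + 1), "trainable": True},
    {"name": "conv2", "params": 64 * (3 * 3 * 32 + 1), "trainable": True},
    {"name": "dense_1000_classes", "params": 1000 * (12544 + 1), "trainable": True},
]
# New head for a hypothetical 4-class facies task.
new_head = [{"name": "dense_facies", "params": 4 * (12544 + 1), "trainable": True}]

model = adapt_for_new_task(pretrained, n_old_head_layers=1, new_head=new_head)

print(count_trainable(pretrained))   # 12564392 parameters if trained from scratch
print(count_trainable(model))        # 50180 parameters to fine-tune
```

With the feature extractor frozen, only the new head is updated, which is one reason a small labeled dataset can suffice.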

Geoscience Applications of CNNs

Seismic facies classification: CNNs classify 2-D patches of seismic amplitude data into facies categories (channel fill, salt body, carbonate platform, etc.). The network learns to recognize subtle reflection patterns.

Satellite image classification: land-use mapping, vegetation monitoring, flood extent detection from multispectral satellite imagery. Multi-channel inputs naturally fit the CNN framework.

Core image analysis: classifying rock types and identifying sedimentary structures from photographs of drill cores.

Thin section mineral identification: recognizing minerals in polarized light microscopy images of rock thin sections — quartz, feldspar, mica, etc.
