Deep Learning: Convolutional Neural Networks
Learning objectives
- Explain why spatial structure motivates convolutional layers over fully connected layers
- Describe the convolution operation and the role of kernels/filters
- Calculate output dimensions after convolution and pooling
- Identify the components of a CNN architecture (Conv, ReLU, Pool, Flatten, Dense)
- Explain transfer learning and its benefits
- Apply CNN concepts to geoscience problems such as seismic facies classification and satellite imagery
From Fully Connected to Convolutional Layers
In a standard fully connected (dense) neural network, every neuron in one layer is connected to every neuron in the next. For an image with $H \times W$ pixels, the first hidden layer alone would need $H \times W \times N$ weights (where $N$ is the number of hidden neurons). For a modest $256 \times 256$ grayscale image, that is over 65,000 input features, and a single hidden layer of 1,000 neurons would require over 65 million parameters. This is computationally expensive and ignores the spatial structure of images: nearby pixels are related, and useful patterns (edges, textures) appear regardless of their absolute position.
Convolutional Neural Networks (CNNs) solve both problems by exploiting three key ideas:
- Local connectivity: each neuron connects only to a small local region of the input (the receptive field)
- Weight sharing: the same small set of weights (a kernel or filter) is applied across the entire input
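- Translation equivariance: a feature detector responds to the same pattern wherever it appears, so a shifted input gives a correspondingly shifted feature map (pooling later adds a degree of translation invariance)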
These properties make CNNs dramatically more efficient for grid-structured data such as images, seismic sections, and spatial geophysical maps.
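As a rough check on the numbers above, the sketch below counts the weights of a dense layer on a 256 x 256 grayscale image and compares them with a small convolutional layer. The function names and the choice of 32 filters of size 3 x 3 are illustrative assumptions, not values taken from the text; biases are left out to keep the comparison simple.

```python
def dense_weights(height, width, hidden_neurons):
    """Weights in a fully connected layer on a flattened image (biases excluded)."""
    return height * width * hidden_neurons


def conv_weights(kernel_size, in_channels, num_filters):
    """Weights in a conv layer: one k x k x in_channels kernel per filter (biases excluded)."""
    return kernel_size * kernel_size * in_channels * num_filters


# Dense layer from the text: 256 x 256 grayscale image, 1,000 hidden neurons.
dense = dense_weights(256, 256, 1000)

# Assumed conv layer for comparison: 32 filters of 3 x 3 on a single channel.
conv = conv_weights(3, 1, 32)

print(f"dense: {dense:,} weights")  # dense: 65,536,000 weights
print(f"conv:  {conv:,} weights")   # conv:  288 weights
```

Because the convolutional weights are shared across all positions, their count does not depend on the image size at all.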
The Convolution Operation
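Mathematically, discrete convolution of a 1-D signal $x$ with a kernel $w$ is defined as:
$$(x * w)[t] = \sum_{\tau} x[\tau]\, w[t - \tau]$$
In the context of CNNs operating on 2-D images, the operation becomes a 2-D cross-correlation (commonly called "convolution" in deep learning). Given an input matrix $X$ and a kernel $K$ of size $k \times k$, the output at position $(i, j)$ is:
$$Y[i, j] = \sum_{u=0}^{k-1} \sum_{v=0}^{k-1} X[i + u,\, j + v]\, K[u, v]$$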
The kernel slides across the input, computing a weighted sum at each position. This produces a feature map (also called an activation map) that highlights where in the input a particular pattern is detected.
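Both operations can be written in a few lines of plain Python. This is a minimal sketch on nested lists, assuming stride 1 and no padding for the 2-D case; the function names, the toy image, and the vertical-edge kernel are illustrative choices, and a real network would use an optimized library routine instead.

```python
def conv1d(signal, kernel):
    """Full discrete 1-D convolution: y[t] = sum over tau of signal[tau] * kernel[t - tau]."""
    out = [0] * (len(signal) + len(kernel) - 1)
    for tau, x in enumerate(signal):
        for j, w in enumerate(kernel):
            out[tau + j] += x * w
    return out


def cross_correlate2d(image, kernel):
    """2-D cross-correlation with stride 1 and no padding (the kernel is not flipped)."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the k_h x k_w window whose top-left corner is (i, j).
            row.append(sum(image[i + u][j + v] * kernel[u][v]
                           for u in range(k_h) for v in range(k_w)))
        out.append(row)
    return out


print(conv1d([1, 2, 3], [1, 0, -1]))  # [1, 2, 2, -2, -3]

# Toy 4 x 5 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 0, 1, 1] for _ in range(4)]
# Hand-written vertical-edge kernel (a trained network would learn its own values).
edge_kernel = [[-1, 0, 1] for _ in range(3)]

feature_map = cross_correlate2d(image, edge_kernel)
print(feature_map)  # [[0, 3, 3], [0, 3, 3]]
```

The feature map is zero over the flat dark region and positive wherever the window straddles the dark-to-bright edge.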
Kernels (Filters) as Learned Feature Detectors
A kernel is a small matrix of learnable weights, typically $3 \times 3$ or $5 \times 5$. During training, the network learns kernel values that detect useful features:
- Early layers learn simple features: edges, gradients, corners
- Middle layers combine simple features into textures and patterns
- Deep layers detect complex, high-level structures (e.g., a fault plane in a seismic image)
Each convolutional layer typically has multiple kernels. If a layer has $F$ filters, it produces $F$ feature maps, one per filter. These feature maps become the input "channels" for the next layer.
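To make the filter-to-channel bookkeeping concrete, here is a minimal sketch of one convolutional layer on a multi-channel input, assuming stride 1, no padding, and square kernels. The function name `conv_layer` and the toy values are illustrative assumptions. Each filter holds one kernel per input channel, and the per-channel responses are summed into a single feature map.

```python
def conv_layer(channels, filters, biases):
    """Apply each filter to a multi-channel input (stride 1, no padding).

    channels: list of C 2-D lists (H x W each)
    filters:  list of F filters, each a list of C 2-D kernels (k x k each)
    biases:   list of F numbers, one per filter
    Returns a list of F feature maps.
    """
    height, width = len(channels[0]), len(channels[0][0])
    feature_maps = []
    for kernels, bias in zip(filters, biases):
        k = len(kernels[0])
        # Start every output position at the filter's bias.
        fmap = [[bias] * (width - k + 1) for _ in range(height - k + 1)]
        for channel, kernel in zip(channels, kernels):
            for i in range(height - k + 1):
                for j in range(width - k + 1):
                    fmap[i][j] += sum(channel[i + u][j + v] * kernel[u][v]
                                      for u in range(k) for v in range(k))
        feature_maps.append(fmap)
    return feature_maps


# Toy input: 2 channels of 3 x 3 ones.
channels = [[[1] * 3 for _ in range(3)] for _ in range(2)]
# Three filters; filter w uses 2 x 2 kernels filled with the value w on both channels.
filters = [[[[w] * 2 for _ in range(2)] for _ in range(2)] for w in (1, 2, 3)]
biases = [0, 0, 0]

maps = conv_layer(channels, filters, biases)
print(len(maps))  # 3 feature maps, one per filter
print(maps[0])    # [[8, 8], [8, 8]]
```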
Output Size Formula
Given an input of size $n_{in}$, a kernel of size $k$, stride $s$, and zero-padding $p$, the output dimension is:
$$n_{out} = \left\lfloor \frac{n_{in} + 2p - k}{s} \right\rfloor + 1$$
For 2-D inputs, this formula applies independently to height and width. With $s = 1$ and $p = 0$ (the defaults), a $3 \times 3$ kernel reduces each spatial dimension by 2.
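The output size formula is easy to turn into a helper for checking layer shapes by hand. This is a small sketch using the standard floor-division form of the formula; the function name and the 28-pixel input are illustrative assumptions.

```python
def conv_output_size(n_in, kernel, stride=1, padding=0):
    """Output length along one spatial dimension: floor((n_in + 2p - k) / s) + 1."""
    return (n_in + 2 * padding - kernel) // stride + 1


# Assumed 28-pixel input dimension, 3 x 3 kernel.
print(conv_output_size(28, 3))                       # 26 (shrinks by 2)
print(conv_output_size(28, 3, padding=1))            # 28 (size preserved)
print(conv_output_size(28, 3, stride=2, padding=1))  # 14 (halved)
```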
Stride and Padding
Stride ($s$) controls how many pixels the kernel moves between positions. Stride 1 means the kernel shifts one pixel at a time (maximum overlap). Stride 2 means it shifts two pixels, roughly halving each output dimension.
Padding ($p$) adds $p$ rows and columns of zeros on each side of the input. Same padding ($p = (k - 1)/2$ with $s = 1$, for an odd kernel size $k$) keeps the output the same size as the input. Valid padding ($p = 0$) uses no padding, so the output shrinks.
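As an illustration of zero-padding, the sketch below pads a small nested-list image and computes the padding needed to preserve size at stride 1. The helper names are illustrative assumptions, and the sketch only handles odd kernel sizes.

```python
def same_padding(kernel_size):
    """Padding p = (k - 1) / 2 that preserves the size at stride 1 (odd k only)."""
    if kernel_size % 2 == 0:
        raise ValueError("this sketch only handles odd kernel sizes")
    return (kernel_size - 1) // 2


def zero_pad(image, p):
    """Return a copy of a 2-D nested list with p rows/columns of zeros on every side."""
    width = len(image[0]) + 2 * p
    top_bottom = [[0] * width for _ in range(p)]
    middle = [[0] * p + list(row) + [0] * p for row in image]
    return top_bottom + middle + [[0] * width for _ in range(p)]


padded = zero_pad([[1, 2], [3, 4]], same_padding(3))
for row in padded:
    print(row)
# [0, 0, 0, 0]
# [0, 1, 2, 0]
# [0, 3, 4, 0]
# [0, 0, 0, 0]
```

A 3 x 3 kernel slid over this 4 x 4 padded array with stride 1 gives a 2 x 2 output, the same size as the unpadded input.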
Activation Functions
After each convolution, a non-linear activation function is applied element-wise. The most common is the Rectified Linear Unit (ReLU):
$$\mathrm{ReLU}(x) = \max(0, x)$$
ReLU is cheap to compute, has a constant gradient of 1 for positive inputs (which mitigates the vanishing gradient problem), and introduces the non-linearity needed for the network to learn complex mappings.
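Applied to a feature map, ReLU simply clips negative responses to zero. A minimal sketch on nested lists follows; the function names and sample values are illustrative assumptions.

```python
def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def relu_map(feature_map):
    """Apply ReLU element-wise to a 2-D feature map."""
    return [[relu(value) for value in row] for row in feature_map]


print(relu_map([[-2.0, 0.5], [3.0, -0.1]]))  # [[0.0, 0.5], [3.0, 0.0]]
```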
Pooling Layers
Pooling layers downsample the feature maps, reducing their spatial dimensions while retaining the most important information. This reduces computation, controls overfitting, and introduces a degree of translation invariance.
Max pooling: takes the maximum value in each pooling window. A $2 \times 2$ max pool with stride 2 halves each spatial dimension.
Average pooling: computes the mean value in each pooling window. Less aggressive than max pooling; sometimes used in the final layers.
Pooling also uses the output size formula, with $k$ equal to the pool size and $p = 0$:
$$n_{out} = \left\lfloor \frac{n_{in} - k}{s} \right\rfloor + 1$$
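Max and average pooling differ only in how each window is reduced to a single value. The sketch below implements both on a nested-list feature map, assuming square windows and no padding; the function name `pool2d` and the 4 x 4 sample values are illustrative assumptions.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample a 2-D feature map with max or average pooling (no padding)."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect the size x size window starting at (i * stride, j * stride).
            window = [feature_map[i * stride + u][j * stride + v]
                      for u in range(size) for v in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        pooled.append(row)
    return pooled


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
print(pool2d(fmap, mode="max"))  # [[4, 2], [2, 8]]
print(pool2d(fmap, mode="avg"))  # [[2.5, 1.0], [1.25, 6.5]]
```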
CNN Architecture: The Complete Pipeline
A typical CNN processes an image through these stages:
- Input layer: the raw image (e.g., $H \times W \times 3$ for an RGB image)
- Convolutional block (repeated): Conv2D → ReLU → (optional) Pooling
- Flatten: reshape the final feature maps into a 1-D vector
- Dense (fully connected) layers: standard neural network layers for classification/regression
- Output layer: softmax for classification, linear for regression
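The stages above can be followed by tracing tensor shapes through a small network. This sketch only does the shape arithmetic and trains nothing; the 28 x 28 x 1 input, the two conv blocks, and the dense sizes are illustrative assumptions, as are the function names. Convolutions use stride 1 with no padding, and pooling uses a stride equal to the pool size.

```python
def conv_out(n, k, s=1, p=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1


def trace_shapes(input_shape, conv_blocks, dense_units):
    """Return (stage, shape) pairs for a Conv -> ReLU -> Pool -> Flatten -> Dense pipeline.

    conv_blocks: list of (num_filters, kernel_size, pool_size); pool_size 0 skips pooling.
    """
    h, w, c = input_shape
    trace = [("input", (h, w, c))]
    for num_filters, kernel, pool in conv_blocks:
        h, w, c = conv_out(h, kernel), conv_out(w, kernel), num_filters
        trace.append(("conv+relu", (h, w, c)))
        if pool:
            h, w = conv_out(h, pool, s=pool), conv_out(w, pool, s=pool)
            trace.append(("pool", (h, w, c)))
    trace.append(("flatten", (h * w * c,)))
    for units in dense_units:
        trace.append(("dense", (units,)))
    return trace


# Assumed toy network: two conv blocks, then dense layers ending in 10 classes.
trace = trace_shapes((28, 28, 1), [(32, 3, 2), (64, 3, 2)], [128, 10])
for stage, shape in trace:
    print(f"{stage:10s} {shape}")
# input      (28, 28, 1)
# conv+relu  (26, 26, 32)
# pool       (13, 13, 32)
# conv+relu  (11, 11, 64)
# pool       (5, 5, 64)
# flatten    (1600,)
# dense      (128,)
# dense      (10,)
```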
Feature Maps and Channels
The input to the first Conv layer has a certain number of channels: 1 for grayscale, 3 for RGB, or more for multispectral satellite imagery. Each Conv layer with $F$ filters outputs $F$ channels. So if the first Conv layer applies 32 filters to an $H \times W \times C_{in}$ input (with $k \times k$ kernels, stride 1, no padding), the output is $(H - k + 1) \times (W - k + 1) \times 32$.
The number of parameters in that layer is $(k \times k \times C_{in} + 1) \times F$, where $C_{in}$ is the number of input channels and the $+1$ is for the bias. So with $F = 32$: $(k^2 C_{in} + 1) \times 32$ parameters.
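A concrete instance makes the parameter count easier to check. The sketch below assumes 32 filters of size 3 x 3 applied to a 28 x 28 grayscale image; these numbers and the function name are chosen purely for illustration and are not taken from the text.

```python
def conv_params(kernel_size, in_channels, num_filters):
    """Learnable parameters in a conv layer: (k * k * C_in + 1) * F, with one bias per filter."""
    return (kernel_size * kernel_size * in_channels + 1) * num_filters


# Assumed example: 28 x 28 x 1 input, 32 filters of 3 x 3, stride 1, no padding.
height = width = 28
kernel, filters = 3, 32
output_shape = (height - kernel + 1, width - kernel + 1, filters)

print(output_shape)                     # (26, 26, 32)
print(conv_params(kernel, 1, filters))  # 320
print(conv_params(kernel, 3, filters))  # 896 for an RGB input
```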
Classic Architectures (Overview)
LeNet-5 (1998): one of the earliest CNNs. Two convolutional layers, each followed by pooling, then dense layers. Originally designed for handwritten digit recognition. Simple but foundational.
VGG-16 (2014): uses only $3 \times 3$ convolutions stacked very deep (16 weight layers). Showed that depth matters. ~138 million parameters.
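The ~138 million figure can be reproduced with the per-layer parameter formula. The sketch below assumes the commonly published VGG-16 layout (13 conv layers with 3 x 3 kernels and same padding, five 2 x 2 max pools, three dense layers, a 224 x 224 RGB input, 1,000 output classes). That layout is not given in the text, so treat it as an assumption.

```python
# Assumed VGG-16 layout: numbers are conv filter counts, "M" marks a 2 x 2 max pool.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]
VGG16_DENSE = [4096, 4096, 1000]


def vgg16_param_count(input_size=224, in_channels=3):
    """Sum weights and biases over all conv and dense layers."""
    total, channels, size = 0, in_channels, input_size
    for item in VGG16_CONV:
        if item == "M":
            size //= 2  # pooling halves each spatial dimension, adds no parameters
        else:
            total += (3 * 3 * channels + 1) * item  # same padding keeps the size
            channels = item
    features = size * size * channels  # flattened length: 7 * 7 * 512
    for units in VGG16_DENSE:
        total += (features + 1) * units
        features = units
    return total


print(f"{vgg16_param_count():,}")  # 138,357,544
```

Under these assumptions, most of the parameters sit in the first dense layer rather than in the convolutional stack.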
Modern architectures (ResNet, Inception, EfficientNet) use skip connections, multi-scale filters, and other tricks, but the Conv-ReLU-Pool building block remains central.
Transfer Learning
Transfer learning reuses a model trained on a large dataset (e.g., ImageNet with millions of natural images) as a starting point for a new task. Because early CNN layers learn generic features (edges, textures) that are useful across domains, we can:
- Take a pre-trained model and freeze its early layers
- Replace the final dense layers with new ones suited to our geoscience task
- Fine-tune on a (potentially small) geoscience dataset
This dramatically reduces the amount of labeled training data needed — crucial in geoscience where labeled examples are expensive to obtain.
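The three steps can be sketched as framework-agnostic bookkeeping. Nothing here loads or trains a real model: the `Layer` record, the layer names, and all parameter counts are hypothetical. In practice a deep-learning framework provides its own mechanism for marking layers as non-trainable.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Layer:
    """Hypothetical stand-in for a network layer: a name, a size, and a trainable flag."""
    name: str
    n_params: int
    trainable: bool = True


def build_transfer_model(backbone, new_head, n_frozen):
    """Freeze the first n_frozen backbone layers and attach a new task-specific head."""
    frozen = [replace(layer, trainable=False) for layer in backbone[:n_frozen]]
    return frozen + list(backbone[n_frozen:]) + list(new_head)


def trainable_params(model):
    """Number of parameters that fine-tuning would actually update."""
    return sum(layer.n_params for layer in model if layer.trainable)


# Hypothetical pre-trained conv backbone (its original dense head already removed).
backbone = [
    Layer("conv_block1", 10_000),
    Layer("conv_block2", 50_000),
    Layer("conv_block3", 200_000),
]
# Hypothetical new head for a geoscience classification task.
new_head = [Layer("dense_facies", 5_000)]

model = build_transfer_model(backbone, new_head, n_frozen=2)
print(trainable_params(model))          # 205000
print(sum(l.n_params for l in model))   # 265000
```

Freezing more of the backbone leaves fewer parameters to fit, which is why a small labeled dataset can be enough.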
Geoscience Applications of CNNs
Seismic facies classification: CNNs classify 2-D patches of seismic amplitude data into facies categories (channel fill, salt body, carbonate platform, etc.). The network learns to recognize subtle reflection patterns.
Satellite image classification: land-use mapping, vegetation monitoring, flood extent detection from multispectral satellite imagery. Multi-channel inputs naturally fit the CNN framework.
Core image analysis: classifying rock types and identifying sedimentary structures from photographs of drill cores.
Thin section mineral identification: recognizing minerals in polarized light microscopy images of rock thin sections — quartz, feldspar, mica, etc.
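For patch-based tasks such as the seismic facies example, a common preprocessing step is to cut a large 2-D section into fixed-size patches that the network classifies one at a time. A minimal sketch on a nested-list array follows; the function name, patch size, stride, and synthetic values are illustrative assumptions rather than settings from any particular study.

```python
def extract_patches(section, patch_size, stride):
    """Slide a square window over a 2-D array and return ((top, left), patch) pairs."""
    rows, cols = len(section), len(section[0])
    patches = []
    for top in range(0, rows - patch_size + 1, stride):
        for left in range(0, cols - patch_size + 1, stride):
            patch = [row[left:left + patch_size]
                     for row in section[top:top + patch_size]]
            patches.append(((top, left), patch))
    return patches


# Synthetic 6 x 8 "section" whose values encode their own position (row * 10 + column).
section = [[r * 10 + c for c in range(8)] for r in range(6)]

patches = extract_patches(section, patch_size=4, stride=2)
print(len(patches))       # 6
print(patches[0][0])      # (0, 0)
print(patches[-1][0])     # (2, 4)
print(patches[-1][1][-1]) # [54, 55, 56, 57]
```

A stride smaller than the patch size gives overlapping patches, which increases the number of training examples drawn from one section.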
References
- Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, ch. 9 (convolutional networks). MIT Press.
- LeCun, Y., Bengio, Y., Hinton, G. (2015). Deep learning. Nature 521, 436–444.
- Krizhevsky, A., Sutskever, I., Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. NeurIPS, 1097–1105.
- Mousavi, S.M., Beroza, G.C. (2022). Deep-learning seismology. Science 377, eabm4470.