Recurrent Neural Networks
Learning objectives
- Explain why sequential/temporal data requires specialized architectures
- Describe the vanilla RNN and its hidden state update equation
- Identify the vanishing gradient problem and explain why it limits vanilla RNNs
- Describe LSTM architecture: forget gate, input gate, output gate, and cell state
- Compare LSTM and GRU architectures
- Explain bidirectional RNNs and the attention mechanism
- Apply RNN concepts to geoscience time series and sequence problems
Sequential Data and Why Standard Networks Fail
Many geoscience datasets are sequential: the data has an inherent ordering and temporal or spatial dependencies. Examples include:
- Seismic traces: amplitude as a function of time
- Well logs: measurements as a function of depth
- Earthquake catalogs: event sequences in time
- Climate records: temperature and CO₂ concentration over centuries
- GPS displacement: tectonic motion time series
Standard feedforward networks (including CNNs) process each input independently: they have no memory of previous inputs. But for sequential data, the current observation depends on what came before. A seismic reflection at time $t$ is related to reflections at earlier times $t-1, t-2, \ldots$. Recurrent Neural Networks (RNNs) address this by maintaining a hidden state that evolves as the network processes each element of the sequence.
Vanilla RNN
The simplest RNN processes a sequence one step at a time. At each time step $t$, it:
- Takes the current input $x_t$ and the previous hidden state $h_{t-1}$
- Computes a new hidden state: $h_t = \phi(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$
- Optionally produces an output: $y_t = W_{hy} h_t + b_y$
where $W_{xh}$, $W_{hh}$, and $W_{hy}$ are weight matrices, $b_h$ and $b_y$ are biases, and $\phi$ is typically $\tanh$ or ReLU. The hidden state $h_t$ serves as the network's memory, carrying information about all previous inputs.
Crucially, the same weights are used at every time step — this is weight sharing through time, analogous to weight sharing across space in CNNs.
Unrolling the RNN
To understand RNN computation, we "unroll" it through time. For a sequence of length 3:
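- Step 1: $h_1 = \phi(W_{xh} x_1 + W_{hh} h_0 + b_h)$, where $h_0$ is initialized (usually to zeros)
- Step 2: $h_2 = \phi(W_{xh} x_2 + W_{hh} h_1 + b_h)$
- Step 3: $h_3 = \phi(W_{xh} x_3 + W_{hh} h_2 + b_h)$
The final hidden state $h_3$ encodes information about the entire sequence $(x_1, x_2, x_3)$. For classification, we might pass $h_3$ through a Dense layer. For sequence-to-sequence tasks, we use all hidden states $h_1, h_2, h_3$.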
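The unrolled computation can be sketched in plain Python. This is a minimal illustration, assuming a tanh activation and list-based vectors so that it has no dependencies; the helper names (`matvec`, `rnn_step`, `rnn_forward`) and the toy weights are hypothetical and not taken from the text.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN update: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    from_input = matvec(W_xh, x_t)
    from_state = matvec(W_hh, h_prev)
    return [math.tanh(a + r + b) for a, r, b in zip(from_input, from_state, b_h)]

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Unroll the RNN over a sequence, reusing the same weights at every step."""
    h = [0.0] * len(b_h)              # h_0 initialized to zeros
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states                     # [h_1, ..., h_T]

# Toy example (assumed values): 1-dimensional inputs, 2 hidden units, length 3.
W_xh = [[0.5], [-0.3]]
W_hh = [[0.1, 0.2], [0.0, 0.4]]
b_h = [0.0, 0.1]
xs = [[1.0], [0.5], [-1.0]]

states = rnn_forward(xs, W_xh, W_hh, b_h)
h_3 = states[-1]                      # summary of the whole sequence
```

Here `h_3` would feed a classifier, while the full `states` list would be used for a sequence-to-sequence task.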
The Vanishing Gradient Problem
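When training RNNs via backpropagation through time (BPTT), gradients must flow backward through every time step. The gradient of the hidden state at step $t$ with respect to an earlier hidden state $h_k$ involves the product:
$$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}$$
Each factor contains the recurrent weight matrix $W_{hh}$, scaled by the derivative of the activation function. If the eigenvalues of $W_{hh}$ are less than 1 in magnitude, this product vanishes exponentially as the sequence gets longer. If they exceed 1 in magnitude, gradients can explode.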
Practical consequence: vanilla RNNs struggle to learn long-range dependencies. Information from early time steps "fades away" before it can influence the loss, so patterns that span more than ~10-20 time steps are very hard to learn in practice.
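The effect is easy to see numerically in a one-dimensional RNN. The sketch below is an illustration only, assuming a scalar recurrence with no input term; the function name `gradient_product` and the chosen weights are hypothetical. It multiplies the per-step derivative factors to obtain the gradient of the last hidden state with respect to the first.

```python
import math

def gradient_product(w, steps, activation="tanh", h0=0.1):
    """Return d h_T / d h_0 for the scalar recurrence h_t = act(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        a = w * h
        if activation == "tanh":
            h = math.tanh(a)
            grad *= w * (1.0 - h * h)   # d tanh(a)/da = 1 - tanh(a)^2
        else:                           # linear recurrence, derivative is w
            h = a
            grad *= w
    return grad

for steps in (5, 20, 50):
    small = gradient_product(0.5, steps)                      # |w| < 1: shrinks
    large = gradient_product(1.5, steps, activation="linear") # |w| > 1: grows
    print(steps, small, large)
```

With a recurrent weight of 0.5 each factor has magnitude at most 0.5, so the gradient is bounded by $0.5^{T}$ after $T$ steps; with a weight of 1.5 and a linear activation it grows as $1.5^{T}$.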
Long Short-Term Memory (LSTM)
The LSTM (Hochreiter & Schmidhuber, 1997) mitigates the vanishing gradient problem by introducing a cell state, a "highway" that lets information flow through time with minimal interference, and three gates that control information flow:
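1. Forget gate: decides what information to discard from the cell state.
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
where $\sigma$ is the sigmoid function (output in $(0, 1)$) and $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input. Values near 0 mean "forget this" and near 1 mean "keep this."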
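2. Input gate: decides what new information to store.
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \qquad \tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$
The input gate $i_t$ controls how much of the candidate update $\tilde{C}_t$ to add.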
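3. Cell state update:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
The cell state is a linear combination of the old state (gated by forget) and the new candidate (gated by input). The $\odot$ denotes element-wise multiplication. Because this update is additive (not a repeated multiplication by the recurrent weights, as in the vanilla RNN), gradients can flow through $C_t$ over many steps without vanishing, as long as the forget gate stays close to 1.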
4. Output gate: decides what part of the cell state to output.
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \qquad h_t = o_t \odot \tanh(C_t)$$
The hidden state $h_t$ is a filtered version of the cell state, used for the current output and passed to the next time step.
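The four steps above can be sketched as a single update function. This is a minimal illustration under stated assumptions: each gate has its own weight matrix acting on the concatenation of the previous hidden state and the current input, vectors are plain Python lists, and the names `lstm_step`, `affine`, and `params` are hypothetical.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, b, v):
    """Compute W v + b for a matrix W (list of rows), bias b, and vector v."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM update.

    params maps "f", "i", "C", "o" to (W, b) pairs; each W has shape
    hidden x (hidden + input) and acts on the concatenation [h_prev, x_t].
    """
    hx = h_prev + x_t                                            # list concatenation
    f = [sigmoid(a) for a in affine(*params["f"], hx)]           # forget gate
    i = [sigmoid(a) for a in affine(*params["i"], hx)]           # input gate
    C_tilde = [math.tanh(a) for a in affine(*params["C"], hx)]   # candidate
    o = [sigmoid(a) for a in affine(*params["o"], hx)]           # output gate
    # Additive cell update: keep part of the old state, add part of the candidate.
    C = [ft * c + it * ct for ft, c, it, ct in zip(f, C_prev, i, C_tilde)]
    h = [ot * math.tanh(c) for ot, c in zip(o, C)]
    return h, C

# Toy usage with assumed sizes and random weights (not trained).
rng = random.Random(0)
n_in, n_hid = 1, 2

def init_block():
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_hid + n_in)] for _ in range(n_hid)]
    return W, [0.0] * n_hid

params = {name: init_block() for name in "fiCo"}
h, C = [0.0] * n_hid, [0.0] * n_hid
for x_t in ([0.3], [-0.1], [0.8]):
    h, C = lstm_step(x_t, h, C, params)
```

Note that `C` is only ever scaled element-wise and added to, while `h` is recomputed from it at every step.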
LSTM Parameter Count
An LSTM with input size $n$ and hidden size $m$ has four weight matrices (one per gate plus the candidate), each of size $m \times (n + m)$, plus biases. Total parameters:
$$4 \times \left(m(n + m) + m\right)$$
For : parameters.
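The count is simple to check in code. In the sketch below, `n` is the input size and `m` the hidden size; the function name and the example values (10 inputs, 64 hidden units) are illustrative assumptions, not figures from the text.

```python
def lstm_param_count(n, m):
    """Parameters in one LSTM layer with input size n and hidden size m.

    Each of the four blocks (three gates plus the candidate) has an
    m x (n + m) weight matrix and an m-dimensional bias.
    """
    per_block = m * (n + m) + m
    return 4 * per_block

print(lstm_param_count(10, 64))   # 4 * (64 * 74 + 64) = 19200
```

Framework-reported counts can differ slightly from this formula. PyTorch's `nn.LSTM`, for example, stores separate input and recurrent bias vectors, which adds another $4m$ parameters.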
Gated Recurrent Unit (GRU)
The GRU (Cho et al., 2014) is a simplified variant of LSTM with only two gates and no separate cell state:
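Update gate (combines forget + input gates): $z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)$
Reset gate (controls how much of previous hidden state to forget): $r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$
Candidate hidden state: $\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)$
Final hidden state: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
When $z_t = 0$, the hidden state is kept unchanged (memory preserved). When $z_t = 1$, it is replaced with the new candidate. The GRU has three weight blocks where the LSTM has four, so it has ~25% fewer parameters for the same sizes, and it often performs comparably.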
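A GRU update can be sketched in the same style. This illustration assumes the convention in which the new state is `(1 - z) * h_prev + z * h_candidate`, so an update gate near 0 preserves memory (some references swap the roles of `z` and `1 - z`). The names `gru_step`, `affine`, and `params` are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, b, v):
    """Compute W v + b for a matrix W (list of rows), bias b, and vector v."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def gru_step(x_t, h_prev, params):
    """One GRU update; params maps "z", "r", "h" to (W, b) pairs."""
    hx = h_prev + x_t                                        # [h_{t-1}, x_t]
    z = [sigmoid(a) for a in affine(*params["z"], hx)]       # update gate
    r = [sigmoid(a) for a in affine(*params["r"], hx)]       # reset gate
    gated = [rt * hp for rt, hp in zip(r, h_prev)] + x_t     # [r * h_{t-1}, x_t]
    h_tilde = [math.tanh(a) for a in affine(*params["h"], gated)]
    return [(1.0 - zt) * hp + zt * ht for zt, hp, ht in zip(z, h_prev, h_tilde)]

# Toy check with zero weights: the update-gate bias alone decides the behavior.
W0 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]      # 2 hidden units, 1 input

def make_params(z_bias):
    return {"z": (W0, [z_bias] * 2), "r": (W0, [0.0] * 2), "h": (W0, [0.3] * 2)}

h_prev = [0.5, -0.5]
kept = gru_step([1.0], h_prev, make_params(-50.0))      # z near 0: memory kept
replaced = gru_step([1.0], h_prev, make_params(50.0))   # z near 1: candidate used
```

With the update gate saturated near 0, `kept` matches `h_prev`; saturated near 1, `replaced` matches the candidate, which here is `tanh(0.3)` in both units.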
Bidirectional RNNs
A standard RNN processes the sequence left-to-right. A bidirectional RNN processes the sequence in both directions:
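- Forward RNN: processes $x_1, x_2, \ldots, x_T$, producing $\overrightarrow{h}_1, \ldots, \overrightarrow{h}_T$
- Backward RNN: processes $x_T, x_{T-1}, \ldots, x_1$, producing $\overleftarrow{h}_T, \ldots, \overleftarrow{h}_1$
At each position, the outputs are concatenated: $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$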
This gives each position context from both the past and the future. This is especially useful for well log analysis where a formation at depth is characterized by measurements both above and below it.
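The two passes and the concatenation can be sketched with a scalar RNN. This is a toy illustration: the function names, the weights, and the made-up log values are all assumptions.

```python
import math

def run_rnn(xs, w_x, w_h):
    """Scalar tanh RNN; returns the hidden state after each input."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional(xs, fwd_weights, bwd_weights):
    """Run one RNN left-to-right and another right-to-left, then pair them up."""
    forward = run_rnn(xs, *fwd_weights)
    # Process the reversed sequence, then flip the states back so that
    # backward[t] lines up with position t of the original sequence.
    backward = run_rnn(xs[::-1], *bwd_weights)[::-1]
    return [(f, b) for f, b in zip(forward, backward)]

# Hypothetical depth-ordered log values (shallow to deep).
log_values = [0.0, 0.0, 1.0, 0.0]
combined = bidirectional(log_values, fwd_weights=(1.0, 0.5), bwd_weights=(1.0, 0.5))
```

At the first position the forward state is still zero because nothing has been seen yet, while the backward state is already non-zero: it carries information about the spike deeper in the sequence.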
Sequence-to-Sequence Models
Some tasks require mapping an input sequence to an output sequence (possibly of different length). The encoder-decoder architecture handles this:
- Encoder RNN: processes the entire input sequence and produces a final hidden state (context vector)
- Decoder RNN: takes the context vector as its initial hidden state and generates the output sequence one step at a time
In geoscience, this could map a sequence of seismic attributes to a sequence of rock property predictions at different depths.
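The encoder-decoder flow can be sketched with two scalar RNNs. This is a deliberately small illustration: the weights are untrained, the input values are made up, and feeding the previous output back into the decoder is one common design choice assumed here rather than something stated in the text.

```python
import math

def encode(xs, w_x=0.8, w_h=0.5):
    """Encoder: read the whole input sequence and return the final hidden state."""
    h = 0.0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
    return h                          # context vector (a scalar in this toy)

def decode(context, out_len, u_h=0.9, u_y=0.3, v=2.0):
    """Decoder: start from the context and emit out_len outputs one at a time."""
    h, y_prev, outputs = context, 0.0, []
    for _ in range(out_len):
        h = math.tanh(u_h * h + u_y * y_prev)   # previous output is fed back in
        y_prev = v * h                          # linear readout
        outputs.append(y_prev)
    return outputs

seismic_attributes = [0.2, -0.1, 0.4, 0.3, -0.2]    # made-up input sequence
context = encode(seismic_attributes)
rock_property = decode(context, out_len=3)          # output length differs from input
```

The only link between the two networks is `context`, so everything the decoder knows about the five inputs has to pass through that single value.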
Attention Mechanism (Introduction)
The basic encoder-decoder has a bottleneck: the entire input sequence must be compressed into a single context vector. For long sequences, this loses information. The attention mechanism solves this by allowing the decoder to "look back" at all encoder hidden states.
At each decoder step $t$, attention computes a weighted sum of all encoder states:
$$c_t = \sum_{s} \alpha_{ts} h_s$$
The weights $\alpha_{ts}$, which are non-negative and sum to 1 over $s$, indicate how much the decoder at step $t$ should "attend" to encoder step $s$. This mechanism is the foundation of modern Transformer architectures used in large language models.
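The weighted sum can be sketched as follows. The text does not say how the weights are scored, so this illustration assumes dot-product scores passed through a softmax, which is one common choice; the function names and toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into non-negative weights that sum to 1."""
    top = max(scores)                           # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return the context vector and the attention weights for one decoder step."""
    # Assumed scoring rule: dot product between decoder and encoder states.
    scores = [sum(d * h for d, h in zip(decoder_state, h_s)) for h_s in encoder_states]
    alphas = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * h_s[j] for a, h_s in zip(alphas, encoder_states))
               for j in range(dim)]
    return context, alphas

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, alphas = attend([10.0, 0.0], encoder_states)
```

With this decoder state, the first and third encoder states score equally and receive almost all of the weight, while the second is nearly ignored.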
Geoscience Applications of RNNs
Seismic time series prediction: LSTMs have been used to forecast seismic activity from historical patterns. The hidden state captures temporal correlations in earthquake occurrence.
Well log sequence analysis: Bidirectional LSTMs analyze well log curves (gamma ray, resistivity, porosity) as depth sequences to predict facies, identify formation boundaries, or fill gaps in measurements.
Earthquake aftershock prediction: Given the sequence of events after a main shock, RNNs can model the temporal decay of aftershock rate (competing with statistical models like Omori's law).
Paleoclimate reconstruction: RNNs model proxy records (ice cores, tree rings, ocean sediment cores) as time series to reconstruct past climate variables, capturing non-linear relationships between proxies and climate.
Production forecasting: LSTMs predict future oil/gas production from wells based on historical production time series and operational parameters.
References
- Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, ch. 10 (sequence modeling: RNNs). MIT Press.
- Hochreiter, S., Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9(8), 1735–1780.
- Cho, K., van Merriënboer, B., Gulcehre, C., et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation (GRU). EMNLP.
- Mousavi, S.M., Beroza, G.C. (2022). Deep-learning seismology. Science 377, eabm4470.