Recurrent Neural Networks

Chapter 15: Recurrent Networks for Sequential Data

Learning objectives

Explain why sequential/temporal data requires specialized architectures
Describe the vanilla RNN and its hidden state update equation
Identify the vanishing gradient problem and explain why it limits vanilla RNNs
Describe LSTM architecture: forget gate, input gate, output gate, and cell state
Compare LSTM and GRU architectures
Explain bidirectional RNNs and the attention mechanism
Apply RNN concepts to geoscience time series and sequence problems

Sequential Data and Why Standard Networks Fail

Many geoscience datasets are sequential: the data has an inherent ordering and temporal or spatial dependencies. Examples include:

Seismic traces: amplitude as a function of time
Well logs: measurements as a function of depth
Earthquake catalogs: event sequences in time
Climate records: temperature, CO$_2$ concentration over centuries
GPS displacement: tectonic motion time series

Standard feedforward networks (including CNNs) process each input independently and have no memory of previous inputs. For sequential data, however, the current observation depends on what came before: a seismic reflection at time $t$ is related to reflections at $t-1, t-2, \ldots$. Recurrent Neural Networks (RNNs) address this by maintaining a hidden state that evolves as the network processes each element of the sequence.

Vanilla RNN

The simplest RNN processes a sequence $x_1, x_2, \ldots, x_T$ one step at a time. At each time step $t$, it:

Takes the current input $x_t$ and the previous hidden state $h_{t-1}$
Computes a new hidden state: $h_t = \sigma(W_h h_{t-1} + W_x x_t + b)$
Optionally produces an output: $y_t = W_y h_t + b_y$

where $W_h, W_x, W_y$ are weight matrices, $b, b_y$ are biases, and $\sigma$ is typically $\tanh$ or ReLU. The hidden state $h_t$ serves as the network's memory, carrying information about all previous inputs.

Crucially, the same weights $W_h, W_x$ are used at every time step — this is weight sharing through time, analogous to weight sharing across space in CNNs.
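
The update equation above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name rnn_step, the layer sizes, and the random weights are assumptions chosen for the example, and $\tanh$ is used for $\sigma$.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_h, W_x, b):
    """One vanilla RNN update: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
b = np.zeros(hidden_size)

h_prev = np.zeros(hidden_size)      # h_0 initialized to zeros
x_t = rng.normal(size=input_size)   # one input vector
h_t = rnn_step(x_t, h_prev, W_h, W_x, b)
print(h_t.shape)                    # (4,)
```

Because the weights are arguments of the function rather than tied to a time index, calling rnn_step repeatedly with the same W_h and W_x is exactly the weight sharing through time described above.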

Unrolling the RNN

To understand RNN computation, we "unroll" it through time. For a sequence of length 3:

Step 1: $h_1 = \sigma(W_h h_0 + W_x x_1 + b)$ where $h_0$ is initialized (usually to zeros)
Step 2: $h_2 = \sigma(W_h h_1 + W_x x_2 + b)$
Step 3: $h_3 = \sigma(W_h h_2 + W_x x_3 + b)$

The final hidden state $h_3$ encodes information about the entire sequence $(x_1, x_2, x_3)$. For classification, we might pass $h_T$ through a Dense layer. For sequence-to-sequence tasks, we use all $h_t$.
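
The three unrolled steps correspond to a simple loop. The sketch below assumes a $\tanh$ activation and uses hypothetical names (rnn_forward, xs) and random weights purely for illustration; it returns every hidden state so that either $h_T$ or the full sequence of $h_t$ can be used downstream.

```python
import numpy as np

def rnn_forward(xs, h0, W_h, W_x, b):
    """Run a vanilla RNN over xs of shape (T, d); return hidden states of shape (T, n)."""
    h = h0
    states = []
    for x_t in xs:                      # the same weights are reused at every step
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(1)
T, d, n = 3, 2, 5
xs = rng.normal(size=(T, d))
W_h = rng.normal(scale=0.5, size=(n, n))
W_x = rng.normal(scale=0.5, size=(n, d))
b = np.zeros(n)

hs = rnn_forward(xs, np.zeros(n), W_h, W_x, b)
h_final = hs[-1]    # h_3: a summary of the whole sequence, e.g. for classification
print(hs.shape)     # (3, 5)
```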

The Vanishing Gradient Problem

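When training RNNs via backpropagation through time (BPTT), gradients must flow backward through every time step. The influence of the hidden state at an earlier step $t$ on the state at a later step $T$ is given by a product of Jacobians:

\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}} = \prod_{k=t+1}^{T} \text{diag}(\sigma'(a_k)) \, W_h, \quad a_k = W_h h_{k-1} + W_x x_k + b

The gradient of the loss with respect to the shared weights contains this product as a factor. If the largest singular value of $W_h$ is below 1 (and $|\sigma'| \le 1$, as for $\tanh$), the norm of the product shrinks exponentially with the distance $T - t$, and the gradient vanishes. If the singular values are well above 1, the product can instead grow exponentially, and gradients explode.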

Practical consequence: vanilla RNNs struggle to learn long-range dependencies. Information from early time steps "fades away" before it can influence the loss, so patterns that span more than roughly 10-20 time steps are very hard to learn in practice.
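
The shrinking product of Jacobians can be observed numerically. The following sketch is an illustration under stated assumptions: a $\tanh$ RNN with zero bias, random inputs, and a recurrent matrix built as 0.5 times an orthogonal matrix so that every singular value of $W_h$ is 0.5. The function name jacobian_product_norms is hypothetical.

```python
import numpy as np

def jacobian_product_norms(W_h, W_x, xs):
    """Spectral norm of d h_k / d h_0 for k = 1..T in a tanh RNN with zero bias."""
    n = W_h.shape[0]
    h = np.zeros(n)
    J = np.eye(n)
    norms = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t)
        # Chain rule: multiply in one more factor diag(tanh'(a_k)) W_h,
        # using tanh'(a) = 1 - tanh(a)^2.
        J = np.diag(1.0 - h**2) @ W_h @ J
        norms.append(np.linalg.norm(J, 2))
    return np.array(norms)

rng = np.random.default_rng(0)
n, d, T = 8, 3, 50
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # orthogonal: all singular values equal 1
W_x = rng.normal(scale=0.5, size=(n, d))
xs = rng.normal(size=(T, d))

norms = jacobian_product_norms(0.5 * Q, W_x, xs)
print(norms[[0, 9, 49]])   # norms after 1, 10, and 50 steps
```

Each factor has spectral norm at most 0.5, so after 50 steps the product is bounded by $0.5^{50} \approx 10^{-15}$: the first input has essentially no measurable effect on the final hidden state.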

Long Short-Term Memory (LSTM)

The LSTM (Hochreiter & Schmidhuber, 1997) mitigates the vanishing gradient problem by introducing a cell state $C_t$, a "highway" along which information can flow through time with minimal interference, and three gates that control information flow:

1. Forget gate — decides what information to discard from the cell state:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)

where $\sigma$ here denotes the sigmoid function specifically (output in $(0, 1)$), not the generic activation used in the vanilla RNN equations. Values near 0 mean "forget this" and values near 1 mean "keep this."

2. Input gate — decides what new information to store:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)

\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)

The input gate $i_t$ controls how much of the candidate update $\tilde{C}_t$ to add.

3. Cell state update:

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t

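The cell state is a gated combination of the old state (scaled by the forget gate) and the new candidate (scaled by the input gate). The $\odot$ denotes element-wise multiplication. Along the direct path from $C_{t-1}$ to $C_t$, the gradient is simply scaled element-wise by $f_t$ rather than passed through a weight matrix and a squashing nonlinearity at every step as in the vanilla RNN, so gradients can flow through the cell state with little attenuation as long as the forget gate stays close to 1.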

4. Output gate — decides what part of the cell state to output:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)

h_t = o_t \odot \tanh(C_t)

The hidden state $h_t$ is a filtered version of the cell state, used for the current output and passed to the next time step.
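
The four LSTM equations above translate almost line by line into NumPy. This is a minimal sketch, not a production implementation: the function name lstm_step, the parameter dictionary p, and the sizes are assumptions for illustration. Weight matrices are stored with shape (n, n + d) so that they multiply the concatenated column vector $[h_{t-1}, x_t]$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step. p holds weights of shape (n, n + d) and biases of shape (n,)."""
    hx = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f = sigmoid(p["W_f"] @ hx + p["b_f"])         # forget gate
    i = sigmoid(p["W_i"] @ hx + p["b_i"])         # input gate
    C_tilde = np.tanh(p["W_C"] @ hx + p["b_C"])   # candidate cell state
    C = f * C_prev + i * C_tilde                  # cell state update
    o = sigmoid(p["W_o"] @ hx + p["b_o"])         # output gate
    h = o * np.tanh(C)                            # new hidden state
    return h, C

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
d, n = 3, 4
p = {}
for name in ("f", "i", "C", "o"):
    p[f"W_{name}"] = rng.normal(scale=0.1, size=(n, n + d))
    p[f"b_{name}"] = np.zeros(n)

h, C = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(5, d)):               # process a length-5 sequence
    h, C = lstm_step(x_t, h, C, p)
print(h.shape, C.shape)                           # (4,) (4,)
```

Note that $h_t$ is bounded in magnitude by 1 (a sigmoid times a $\tanh$), whereas the cell state $C_t$ is not, which is what lets it accumulate information over many steps.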

LSTM Parameter Count

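An LSTM with input size $d$ and hidden size $h$ (here $h$ denotes the number of hidden units, not the hidden state vector) has four weight matrices (one per gate plus the candidate), each of size $(h + d) \times h$, plus a bias vector of length $h$ for each. Total parameters: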

4 \times [(h + d) \times h + h] = 4h(h + d + 1)

For $d = 10$ and $h = 32$: $4 \times 32 \times (32 + 10 + 1) = 4 \times 32 \times 43 = 5{,}504$ parameters.
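
The count is easy to check programmatically. The helper below (lstm_param_count is a hypothetical name) follows the formula above, with one bias vector per block. Library implementations may report a different total; PyTorch's nn.LSTM, for instance, stores two bias vectors per block.

```python
def lstm_param_count(d, h):
    """Parameters of one LSTM layer: 4 blocks, each an (h + d) x h matrix plus h biases."""
    return 4 * ((h + d) * h + h)

print(lstm_param_count(d=10, h=32))   # 5504
```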

Gated Recurrent Unit (GRU)

The GRU (Cho et al., 2014) is a simplified variant of LSTM with only two gates and no separate cell state:

Update gate (combines forget + input gates):

z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)

Reset gate (controls how much of previous hidden state to forget):

r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)

Candidate hidden state:

\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t] + b)

Final hidden state:

h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

When $z_t \approx 0$, the hidden state is kept unchanged (memory preserved). When $z_t \approx 1$, it is replaced with the new candidate. With three weight blocks instead of four, a GRU has about 25% fewer parameters than an LSTM of the same input and hidden size, and it often performs comparably.
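
A GRU step can be sketched in the same style. The names gru_step and p, the sizes, and the random weights are assumptions for illustration; the update follows the four equations above, including the convention that $z_t \approx 0$ preserves the previous state. The last lines compare parameter counts under the one-bias-per-block convention.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step. p holds weights of shape (n, n + d) and biases of shape (n,)."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(p["W_z"] @ hx + p["b_z"])     # update gate
    r = sigmoid(p["W_r"] @ hx + p["b_r"])     # reset gate
    h_tilde = np.tanh(p["W"] @ np.concatenate([r * h_prev, x_t]) + p["b"])
    return (1.0 - z) * h_prev + z * h_tilde   # interpolate old state and candidate

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
d, n = 3, 4
p = {k: rng.normal(scale=0.1, size=(n, n + d)) for k in ("W_z", "W_r", "W")}
p.update({k: np.zeros(n) for k in ("b_z", "b_r", "b")})

h = np.zeros(n)
for x_t in rng.normal(size=(5, d)):
    h = gru_step(x_t, h, p)

gru_params = sum(v.size for v in p.values())  # 3 * n * (n + d + 1)
lstm_params = 4 * n * (n + d + 1)             # same sizes, four blocks
print(gru_params / lstm_params)               # 0.75
```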

Bidirectional RNNs

A standard RNN processes the sequence left-to-right. A bidirectional RNN processes the sequence in both directions:

Forward RNN: processes $x_1 \to x_2 \to \cdots \to x_T$, producing $\overrightarrow{h_1}, \ldots, \overrightarrow{h_T}$
Backward RNN: processes $x_T \to x_{T-1} \to \cdots \to x_1$, producing $\overleftarrow{h_1}, \ldots, \overleftarrow{h_T}$

At each position, the outputs are concatenated: $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$

This gives each position context from both the past and the future. This is especially useful for well log analysis where a formation at depth $d$ is characterized by measurements both above and below it.
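
A bidirectional pass can be sketched with two independent vanilla RNNs. The helper names (run_rnn, init_params), the sizes, and the random weights are assumptions for illustration. The key detail is reversing the backward outputs again so that row $t$ of both arrays refers to the same position before concatenation.

```python
import numpy as np

def run_rnn(xs, W_h, W_x, b):
    """Vanilla tanh RNN over xs of shape (T, d); returns states of shape (T, n)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)

def init_params(rng, d, n):
    """Random illustrative weights (W_h, W_x, b)."""
    return (rng.normal(scale=0.5, size=(n, n)),
            rng.normal(scale=0.5, size=(n, d)),
            np.zeros(n))

rng = np.random.default_rng(0)
T, d, n = 6, 2, 3
xs = rng.normal(size=(T, d))      # e.g. T depth samples of d log measurements
fwd_params = init_params(rng, d, n)
bwd_params = init_params(rng, d, n)   # separate weights for the backward direction

h_fwd = run_rnn(xs, *fwd_params)
h_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]    # reverse input, then re-align outputs
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)   # h_t = [forward h_t ; backward h_t]
print(h_bi.shape)                               # (6, 6)
```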

Sequence-to-Sequence Models

Some tasks require mapping an input sequence to an output sequence (possibly of different length). The encoder-decoder architecture handles this:

Encoder RNN: processes the entire input sequence and produces a final hidden state (context vector)
Decoder RNN: takes the context vector as its initial hidden state and generates the output sequence one step at a time

In geoscience, this could map a sequence of seismic attributes to a sequence of rock property predictions at different depths.
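
The encoder-decoder idea can be sketched with two small vanilla RNNs. Everything specific here is an assumption made for illustration: the function names encode and decode, the sizes, the random untrained weights, and the choice to feed the decoder's previous output back in as its next input. The point is only to show how a single context vector links an input of one length to an output of another.

```python
import numpy as np

def encode(xs, W_h, W_x, b):
    """Encoder RNN: consume the whole input and return the final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
    return h

def decode(context, steps, W_h, W_y, b, W_out):
    """Decoder RNN: start from the context vector and emit one output per step."""
    h = context
    y = np.zeros(W_out.shape[0])          # initial "previous output"
    outputs = []
    for _ in range(steps):
        h = np.tanh(W_h @ h + W_y @ y + b)
        y = W_out @ h                     # linear readout
        outputs.append(y)
    return np.stack(outputs)

# Illustrative sizes and random, untrained weights (assumptions).
rng = np.random.default_rng(0)
T_in, T_out, d_in, d_out, n = 10, 5, 4, 2, 8
xs = rng.normal(size=(T_in, d_in))

context = encode(xs,
                 rng.normal(scale=0.3, size=(n, n)),
                 rng.normal(scale=0.3, size=(n, d_in)),
                 np.zeros(n))
ys = decode(context, T_out,
            rng.normal(scale=0.3, size=(n, n)),
            rng.normal(scale=0.3, size=(n, d_out)),
            np.zeros(n),
            rng.normal(scale=0.3, size=(d_out, n)))
print(context.shape, ys.shape)            # (8,) (5, 2)
```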

Attention Mechanism (Introduction)

The basic encoder-decoder has a bottleneck: the entire input sequence must be compressed into a single context vector. For long sequences, this compression loses information. The attention mechanism addresses the problem by allowing the decoder to "look back" at all encoder hidden states:

At each decoder step $t$, attention computes a weighted sum of all encoder states:

\alpha_{t,s} = \frac{\exp(\text{score}(h_t^{\text{dec}}, h_s^{\text{enc}}))} {\sum_{s'} \exp(\text{score}(h_t^{\text{dec}}, h_{s'}^{\text{enc}}))}

c_t = \sum_s \alpha_{t,s} \cdot h_s^{\text{enc}}

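The weights $\alpha_{t,s}$ indicate how much the decoder at step $t$ should "attend" to encoder step $s$; they are non-negative and sum to 1 over $s$. The score function measures how well a decoder state matches an encoder state, for example through a dot product or a small learned network. This mechanism is the foundation of modern Transformer architectures used in large language models.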
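
The two attention equations amount to a softmax followed by a weighted average. The sketch below assumes a plain dot-product score, which is only one possible choice, and uses hypothetical names (attention, H_enc, h_dec) with random states for illustration.

```python
import numpy as np

def attention(h_dec, H_enc):
    """Dot-product attention for one decoder step.
    h_dec: decoder state, shape (n,). H_enc: encoder states, shape (S, n)."""
    scores = H_enc @ h_dec                    # score(h_dec, h_s) for every s (assumed dot product)
    weights = np.exp(scores - scores.max())   # subtract the max for numerical stability
    alpha = weights / weights.sum()           # softmax over encoder positions
    context = alpha @ H_enc                   # c_t = sum_s alpha_{t,s} h_s
    return alpha, context

rng = np.random.default_rng(0)
S, n = 7, 4
H_enc = rng.normal(size=(S, n))   # illustrative encoder states
h_dec = rng.normal(size=n)        # illustrative decoder state

alpha, c_t = attention(h_dec, H_enc)
print(alpha.shape, c_t.shape)     # (7,) (4,)
```

Subtracting the maximum score before exponentiating does not change the result of the softmax but avoids overflow when scores are large.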

Geoscience Applications of RNNs

Seismic time series prediction: LSTMs have been applied to forecasting seismic activity from historical patterns, with the hidden state capturing temporal correlations in earthquake occurrence.

Well log sequence analysis: Bidirectional LSTMs analyze well log curves (gamma ray, resistivity, porosity) as depth sequences to predict facies, identify formation boundaries, or fill gaps in measurements.

Earthquake aftershock prediction: Given the sequence of events after a main shock, RNNs can model the temporal decay of aftershock rate, as an alternative to statistical models such as Omori's law.

Paleoclimate reconstruction: RNNs model proxy records (ice cores, tree rings, ocean sediment cores) as time series to reconstruct past climate variables, capturing non-linear relationships between proxies and climate.

Production forecasting: LSTMs predict future oil/gas production from wells based on historical production time series and operational parameters.

References

Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, ch. 10 (sequence modeling: RNNs). MIT Press.
Hochreiter, S., Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9(8), 1735–1780.
Cho, K., van Merriënboer, B., Gulcehre, C., et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation (GRU). EMNLP.
Mousavi, S.M., Beroza, G.C. (2022). Deep-learning seismology. Science 377, eabm4470.