Recurrent Neural Networks

Chapter 15: Recurrent Networks for Sequential Data

Learning objectives

  • Explain why sequential/temporal data requires specialized architectures
  • Describe the vanilla RNN and its hidden state update equation
  • Identify the vanishing gradient problem and explain why it limits vanilla RNNs
  • Describe LSTM architecture: forget gate, input gate, output gate, and cell state
  • Compare LSTM and GRU architectures
  • Explain bidirectional RNNs and the attention mechanism
  • Apply RNN concepts to geoscience time series and sequence problems

Sequential Data and Why Standard Networks Fail

Many geoscience datasets are sequential: the data has an inherent ordering and temporal or spatial dependencies. Examples include:

  • Seismic traces: amplitude as a function of time
  • Well logs: measurements as a function of depth
  • Earthquake catalogs: event sequences in time
  • Climate records: temperature, CO$_2$ concentration over centuries
  • GPS displacement: tectonic motion time series

Standard feedforward networks (including CNNs) process each input independently: they have no memory of previous inputs. But for sequential data, the current observation depends on what came before. A seismic reflection at time $t$ is related to reflections at $t-1, t-2, \ldots$. Recurrent Neural Networks (RNNs) address this by maintaining a hidden state that evolves as the network processes each element of the sequence.

Vanilla RNN

The simplest RNN processes a sequence $x_1, x_2, \ldots, x_T$ one step at a time. At each time step $t$, it:

  • Takes the current input $x_t$ and the previous hidden state $h_{t-1}$
  • Computes a new hidden state: $h_t = \sigma(W_h h_{t-1} + W_x x_t + b)$
  • Optionally produces an output: $y_t = W_y h_t + b_y$

where $W_h, W_x, W_y$ are weight matrices, $b, b_y$ are biases, and $\sigma$ is typically $\tanh$ or ReLU. The hidden state $h_t$ serves as the network's memory, carrying information about all previous inputs.

Crucially, the same weights $W_h, W_x$ are used at every time step. This is weight sharing through time, analogous to weight sharing across space in CNNs.
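
A single hidden-state update can be sketched in NumPy as follows. This is a minimal illustration, assuming $\sigma = \tanh$; the function name `rnn_step`, the layer sizes, and the random initialization are hypothetical choices made for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_h, W_x, b):
    """One vanilla RNN update: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

# Hypothetical sizes, chosen only for illustration.
input_size, hidden_size = 3, 4
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
b = np.zeros(hidden_size)

h0 = np.zeros(hidden_size)          # initial hidden state
x1 = rng.normal(size=input_size)    # one input vector
h1 = rnn_step(x1, h0, W_h, W_x, b)  # the same W_h, W_x, b would be reused for x2, x3, ...
print(h1.shape)
```

The weights are passed in rather than created inside the function, which makes the weight sharing explicit: every call uses the same `W_h`, `W_x`, and `b`.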

Unrolling the RNN

To understand RNN computation, we "unroll" it through time. For a sequence of length 3:

  • Step 1: $h_1 = \sigma(W_h h_0 + W_x x_1 + b)$, where $h_0$ is initialized (usually to zeros)
  • Step 2: $h_2 = \sigma(W_h h_1 + W_x x_2 + b)$
  • Step 3: $h_3 = \sigma(W_h h_2 + W_x x_3 + b)$

The final hidden state $h_3$ encodes information about the entire sequence $(x_1, x_2, x_3)$. For classification, we might pass $h_T$ through a Dense layer. For sequence-to-sequence tasks, we use all $h_t$.
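
The unrolled computation above is a loop over time steps. The sketch below assumes $\sigma = \tanh$ and a zero initial state; `rnn_forward` and all sizes are hypothetical names and values used only for illustration.

```python
import numpy as np

def rnn_forward(xs, h0, W_h, W_x, b):
    """Unroll a vanilla RNN over a sequence and return every hidden state."""
    h = h0
    hs = []
    for x_t in xs:  # one iteration per time step, same weights each time
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hs.append(h)
    return np.stack(hs)  # shape (T, hidden_size)

rng = np.random.default_rng(0)
T, input_size, hidden_size = 3, 2, 5
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
b = np.zeros(hidden_size)
xs = rng.normal(size=(T, input_size))  # a length-3 input sequence

hs = rnn_forward(xs, np.zeros(hidden_size), W_h, W_x, b)
h_last = hs[-1]  # h_3: summary of the whole sequence, e.g. for a classifier
```

Returning all hidden states covers both use cases mentioned above: `hs[-1]` for classification and the full array `hs` for sequence-to-sequence tasks.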

The Vanishing Gradient Problem

When training RNNs via backpropagation through time (BPTT), gradients must flow backward through every time step. The gradient at step $t$ involves the product:

$$\frac{\partial L}{\partial W} \propto \prod_{k=t}^{T} \frac{\partial h_k}{\partial h_{k-1}} = \prod_{k=t}^{T} W_h^T \cdot \text{diag}(\sigma'(\cdot))$$

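Because $|\sigma'| \le 1$ for $\tanh$, this product shrinks exponentially with sequence length whenever the largest singular value of $W_h$ is below 1, so gradients vanish. If $W_h$ has eigenvalues with magnitude above 1, the product can instead grow exponentially and gradients explode.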

Practical consequence: vanilla RNNs struggle to learn long-range dependencies. Information from early time steps "fades away" before it can influence the loss, so patterns that span more than roughly 10-20 time steps are very hard to learn in practice.
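
The decay can be observed numerically by multiplying the per-step Jacobians and tracking the norm of the product. The sketch below is an illustration under stated assumptions: $\sigma = \tanh$, a random $W_h$ rescaled so its largest singular value is 0.5, and random noise standing in for the input term $W_x x_t + b$. The helper name `jacobian_product_norm` is hypothetical.

```python
import numpy as np

def jacobian_product_norm(W_h, steps, rng):
    """Spectral norm of a product of `steps` Jacobians diag(tanh'(a_k)) @ W_h."""
    hidden_size = W_h.shape[0]
    J = np.eye(hidden_size)
    h = np.zeros(hidden_size)
    for _ in range(steps):
        a = W_h @ h + rng.normal(size=hidden_size)  # noise stands in for W_x x_t + b
        h = np.tanh(a)
        J = np.diag(1.0 - h**2) @ W_h @ J           # tanh'(a) = 1 - tanh(a)^2
    return np.linalg.norm(J, 2)

rng = np.random.default_rng(0)
hidden_size = 8
W = rng.normal(size=(hidden_size, hidden_size))
W_small = 0.5 * W / np.linalg.norm(W, 2)  # rescale: largest singular value is 0.5

norm_5 = jacobian_product_norm(W_small, 5, rng)
norm_50 = jacobian_product_norm(W_small, 50, rng)
print(f"5 steps: {norm_5:.2e}   50 steps: {norm_50:.2e}")
```

Each factor has norm at most 0.5, so the product over $n$ steps is bounded by $0.5^n$. After 50 steps that bound is already below $10^{-15}$, which is why a loss at step 50 carries almost no gradient signal back to step 1 in this setting.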

Long Short-Term Memory (LSTM)

The LSTM (Hochreiter & Schmidhuber, 1997) mitigates the vanishing gradient problem by introducing a cell state $C_t$, a "highway" that lets information flow through time with minimal interference, and three gates that control information flow:

1. Forget gate: decides what information to discard from the cell state.

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

Here $\sigma$ is the sigmoid function (output in $(0, 1)$), and $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input. Values near 0 mean "forget this" and near 1 mean "keep this."

2. Input gate: decides what new information to store.

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

The input gate $i_t$ controls how much of the candidate update $\tilde{C}_t$ to add.

3. Cell state update:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

The cell state is a linear combination of the old state (gated by forget) and the new candidate (gated by input). The symbol $\odot$ denotes element-wise multiplication. Because this update is additive, rather than a repeated matrix multiplication followed by squashing as in the vanilla RNN, gradients can flow back through $C_t$ largely intact as long as the forget gate stays close to 1.

4. Output gate: decides what part of the cell state to output.

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

The hidden state $h_t$ is a filtered version of the cell state, used for the current output and passed to the next time step.
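
The four steps above can be collected into one function. This is a minimal NumPy sketch, not a library implementation: `lstm_step`, the `params` dictionary layout, and the sizes are hypothetical, and each weight matrix is stored with shape $(h, h + d)$ so that it multiplies the concatenated vector from the left.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step following the forget/input/cell/output equations."""
    concat = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])      # input gate
    C_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])  # candidate update
    C_t = f_t * C_prev + i_t * C_tilde                         # additive cell update
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])      # output gate
    h_t = o_t * np.tanh(C_t)                                   # filtered cell state
    return h_t, C_t

rng = np.random.default_rng(0)
d, h = 3, 4  # hypothetical input and hidden sizes
params = {}
for name in ("f", "i", "C", "o"):
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(h, h + d))
    params[f"b_{name}"] = np.zeros(h)

h_t, C_t = np.zeros(h), np.zeros(h)
for x_t in rng.normal(size=(5, d)):  # run over a length-5 sequence
    h_t, C_t = lstm_step(x_t, h_t, C_t, params)
```

Setting the forget-gate bias to a large positive value and the input-gate bias to a large negative value makes `C_t` nearly equal to `C_prev`, which is the "highway" behaviour described above.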

LSTM Parameter Count

An LSTM with input size $d$ and hidden size $h$ has four weight matrices (one per gate plus the candidate), each of size $(h + d) \times h$, plus biases. Total parameters:

$$4 \times [(h + d) \times h + h] = 4h(h + d + 1)$$

For $d = 10, h = 32$: $4 \times 32 \times (32 + 10 + 1) = 4 \times 32 \times 43 = 5{,}504$ parameters.
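
The count is easy to check in code. The helper name `lstm_param_count` is hypothetical; it implements the formula above directly.

```python
def lstm_param_count(d, h):
    """Parameters in one LSTM layer: 4 blocks of (h + d) * h weights plus h biases."""
    return 4 * ((h + d) * h + h)

print(lstm_param_count(10, 32))  # 5504
```

Library implementations may report a different number depending on how they store biases. PyTorch's `nn.LSTM`, for example, keeps two bias vectors per block, which gives $4h(h + d + 2)$ instead.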

Gated Recurrent Unit (GRU)

The GRU (Cho et al., 2014) is a simplified variant of LSTM with only two gates and no separate cell state:

Update gate (combines forget + input gates):

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$

Reset gate (controls how much of previous hidden state to forget):

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$

Candidate hidden state:

$$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t] + b)$$

Final hidden state:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

When $z_t \approx 0$, the hidden state is kept unchanged (memory preserved). When $z_t \approx 1$, it is replaced with the new candidate. GRU has ~25% fewer parameters than LSTM and often performs comparably.
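
A GRU step can be sketched the same way as the other cells. The names `gru_step` and `gru_param_count`, the parameter dictionary, and the sizes are hypothetical. The parameter count assumes the same bookkeeping as the LSTM formula earlier in the chapter (one bias vector per block), with three weight blocks instead of four.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step following the update, reset, and candidate equations."""
    concat = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    z_t = sigmoid(p["W_z"] @ concat + p["b_z"])   # update gate
    r_t = sigmoid(p["W_r"] @ concat + p["b_r"])   # reset gate
    gated = np.concatenate([r_t * h_prev, x_t])   # [r_t * h_{t-1}, x_t]
    h_tilde = np.tanh(p["W"] @ gated + p["b"])    # candidate hidden state
    return (1.0 - z_t) * h_prev + z_t * h_tilde   # blend old state and candidate

def gru_param_count(d, h):
    """Three blocks (update, reset, candidate) of (h + d) * h weights plus h biases."""
    return 3 * ((h + d) * h + h)

rng = np.random.default_rng(0)
d, h = 3, 4  # hypothetical input and hidden sizes
p = {name: rng.normal(scale=0.1, size=(h, h + d)) for name in ("W_z", "W_r", "W")}
p.update({name: np.zeros(h) for name in ("b_z", "b_r", "b")})

h_t = np.zeros(h)
for x_t in rng.normal(size=(5, d)):  # run over a length-5 sequence
    h_t = gru_step(x_t, h_t, p)

# GRU count relative to the LSTM count 4h(h + d + 1) for d = 10, h = 32
ratio = gru_param_count(10, 32) / (4 * 32 * (32 + 10 + 1))
```

With three blocks against four, `ratio` comes out at three quarters, which is where the "~25% fewer parameters" figure comes from.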

Bidirectional RNNs

A standard RNN processes the sequence left-to-right. A bidirectional RNN processes the sequence in both directions:

  • Forward RNN: processes $x_1 \to x_2 \to \cdots \to x_T$, producing $\overrightarrow{h_1}, \ldots, \overrightarrow{h_T}$
  • Backward RNN: processes $x_T \to x_{T-1} \to \cdots \to x_1$, producing $\overleftarrow{h_1}, \ldots, \overleftarrow{h_T}$

At each position, the outputs are concatenated: $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$

This gives each position context from both the past and the future. This is especially useful for well log analysis where a formation at depth $d$ is characterized by measurements both above and below it.
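
A bidirectional pass can be sketched as two independent RNNs whose outputs are aligned and concatenated. The example uses vanilla $\tanh$ cells for brevity (an assumption; in practice the two directions are usually LSTMs or GRUs), and the helper names `init_params`, `run_rnn`, and `bidirectional_rnn` are hypothetical.

```python
import numpy as np

def init_params(rng, input_size, hidden_size):
    """Random weights (W_h, W_x, b) for one direction."""
    W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
    W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
    return W_h, W_x, np.zeros(hidden_size)

def run_rnn(xs, W_h, W_x, b):
    """Vanilla tanh RNN over xs, starting from a zero hidden state."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hs.append(h)
    return np.stack(hs)

def bidirectional_rnn(xs, fwd, bwd):
    """Concatenate forward and backward hidden states at every position."""
    h_fwd = run_rnn(xs, *fwd)
    h_bwd = run_rnn(xs[::-1], *bwd)[::-1]  # run reversed, then re-align to input order
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
T, input_size, hidden_size = 6, 3, 4
xs = rng.normal(size=(T, input_size))  # e.g. T depth samples of 3 log curves
fwd = init_params(rng, input_size, hidden_size)
bwd = init_params(rng, input_size, hidden_size)

H = bidirectional_rnn(xs, fwd, bwd)  # shape (T, 2 * hidden_size)
```

The two directions have separate weights. The second reversal matters: without it, row `t` of the backward output would describe position `T - 1 - t` rather than position `t`.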

Sequence-to-Sequence Models

Some tasks require mapping an input sequence to an output sequence (possibly of different length). The encoder-decoder architecture handles this:

  • Encoder RNN: processes the entire input sequence and produces a final hidden state (context vector)
  • Decoder RNN: takes the context vector as its initial hidden state and generates the output sequence one step at a time

In geoscience, this could map a sequence of seismic attributes to a sequence of rock property predictions at different depths.
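
The encoder-decoder flow can be sketched with two small vanilla RNNs. Everything here is an illustrative assumption rather than a prescribed design: the $\tanh$ cells, the zero vector used as the first decoder input, feeding each prediction back as the next decoder input, and the names `encode` and `decode`.

```python
import numpy as np

def encode(xs, W_h, W_x, b):
    """Run the encoder over the input sequence; return the final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
    return h  # context vector

def decode(context, n_steps, W_h, W_y_in, b, W_out, b_out):
    """Generate n_steps outputs, feeding each output back in as the next input."""
    h = context                   # decoder starts from the context vector
    y = np.zeros(W_out.shape[0])  # zero vector stands in for a start token
    ys = []
    for _ in range(n_steps):
        h = np.tanh(W_h @ h + W_y_in @ y + b)
        y = W_out @ h + b_out
        ys.append(y)
    return np.stack(ys)

rng = np.random.default_rng(0)
in_dim, out_dim, hidden, T_in, T_out = 3, 2, 6, 7, 4  # hypothetical sizes

enc_W_h = rng.normal(scale=0.5, size=(hidden, hidden))
enc_W_x = rng.normal(scale=0.5, size=(hidden, in_dim))
enc_b = np.zeros(hidden)
dec_W_h = rng.normal(scale=0.5, size=(hidden, hidden))
dec_W_y = rng.normal(scale=0.5, size=(hidden, out_dim))
dec_b = np.zeros(hidden)
W_out = rng.normal(scale=0.5, size=(out_dim, hidden))
b_out = np.zeros(out_dim)

xs = rng.normal(size=(T_in, in_dim))  # input sequence of length 7
context = encode(xs, enc_W_h, enc_W_x, enc_b)
ys = decode(context, T_out, dec_W_h, dec_W_y, dec_b, W_out, b_out)  # length 4
```

The input and output lengths differ (7 and 4 here), and the only link between them is `context`, a single vector of size `hidden`.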

Attention Mechanism (Introduction)

The basic encoder-decoder has a bottleneck: the entire input sequence must be compressed into a single context vector. For long sequences, this loses information. The attention mechanism addresses this by allowing the decoder to "look back" at all encoder hidden states.

At each decoder step $t$, attention computes a weighted sum of all encoder states:

$$\alpha_{t,s} = \frac{\exp(\text{score}(h_t^{\text{dec}}, h_s^{\text{enc}}))}{\sum_{s'} \exp(\text{score}(h_t^{\text{dec}}, h_{s'}^{\text{enc}}))}$$
$$c_t = \sum_s \alpha_{t,s} \cdot h_s^{\text{enc}}$$

The weights $\alpha_{t,s}$ indicate how much the decoder at step $t$ should "attend" to encoder step $s$. This mechanism is the foundation of modern Transformer architectures used in large language models.
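
The two attention equations translate almost line for line into NumPy. The text leaves the score function unspecified, so this sketch assumes a plain dot product between decoder and encoder states, which is one common choice among several. The function name `attention` and the sizes are hypothetical.

```python
import numpy as np

def attention(h_dec, H_enc):
    """Dot-product attention for one decoder step.

    h_dec: decoder state, shape (hidden,)
    H_enc: encoder states, shape (T, hidden)
    Returns the weights alpha (shape (T,)) and the context vector c (shape (hidden,)).
    """
    scores = H_enc @ h_dec           # score(h_dec, h_s) for every encoder step s
    scores = scores - scores.max()   # shift for numerical stability; softmax unchanged
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over encoder steps
    context = weights @ H_enc        # weighted sum of encoder states
    return weights, context

rng = np.random.default_rng(0)
T, hidden_size = 5, 4
H_enc = rng.normal(size=(T, hidden_size))  # stand-in encoder hidden states
h_dec = rng.normal(size=hidden_size)       # stand-in decoder state at step t

weights, context = attention(h_dec, H_enc)
```

The weights are non-negative and sum to 1, so `context` is a convex combination of the encoder states. A new context vector is computed at every decoder step, which removes the single-vector bottleneck.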

Geoscience Applications of RNNs

Seismic time series prediction: LSTMs have been applied to forecasting seismic activity from historical patterns. The hidden state captures temporal correlations in earthquake occurrence.

Well log sequence analysis: Bidirectional LSTMs analyze well log curves (gamma ray, resistivity, porosity) as depth sequences to predict facies, identify formation boundaries, or fill gaps in measurements.

Earthquake aftershock prediction: Given the sequence of events after a main shock, RNNs can model the temporal decay of aftershock rate (competing with statistical models like Omori's law).

Paleoclimate reconstruction: RNNs model proxy records (ice cores, tree rings, ocean sediment cores) as time series to reconstruct past climate variables, capturing non-linear relationships between proxies and climate.

Production forecasting: LSTMs predict future oil/gas production from wells based on historical production time series and operational parameters.

References

  • Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, ch. 10 (sequence modeling: RNNs). MIT Press.
  • Hochreiter, S., Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9(8), 1735–1780.
  • Cho, K., van Merriënboer, B., Gulcehre, C., et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation (GRU). EMNLP.
  • Mousavi, S.M., Beroza, G.C. (2022). Deep-learning seismology. Science 377, eabm4470.
