Auto-encoders

Chapter 16: Representation Learning with Auto-encoders

Learning objectives

Describe the encoder-bottleneck-decoder architecture of an auto-encoder
Explain the reconstruction loss and how auto-encoders are trained
Distinguish between vanilla, denoising, and variational auto-encoders
Explain the KL divergence term in the VAE loss
Apply auto-encoders to anomaly detection and dimensionality reduction in geoscience

What Is an Auto-encoder?

An auto-encoder is a neural network that learns to compress its input into a smaller representation and then reconstruct the original input from that compressed form. The network is trained so that the output $\hat{x}$ is as close as possible to the input $x$.

The key insight is this: by forcing the data through a bottleneck (a hidden layer smaller than the input), the network must learn to capture the most important features of the data in a compact representation. This makes auto-encoders powerful tools for dimensionality reduction, feature learning, and anomaly detection.

Architecture: Encoder $\to$ Bottleneck $\to$ Decoder

An auto-encoder has three conceptual parts:

Encoder: maps the input $x \in \mathbb{R}^n$ to a lower-dimensional latent representation $z \in \mathbb{R}^d$ where $d < n$: $z = f_{\text{enc}}(x)$
Bottleneck (latent space): the compressed representation $z$. Its dimension $d$ controls how much compression occurs.
Decoder: maps the latent representation back to the original space: $\hat{x} = f_{\text{dec}}(z) \in \mathbb{R}^n$

For example, an auto-encoder might compress 100-dimensional well log data into a 10-dimensional latent representation and then reconstruct it back to 100 dimensions.
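As a minimal sketch of the three parts above, the snippet below pushes one sample through a single-layer encoder and a single-layer decoder in plain Python. The sizes ($n = 6$, $d = 2$), the random weights, and the helper names `init_layer`, `dense`, `encode`, and `decode` are illustrative assumptions, not part of the chapter. The weights are untrained, so only the shapes are meaningful.

```python
import random

random.seed(0)

n, d = 6, 2  # input and latent dimensions (illustrative; requires d < n)

def init_layer(n_in, n_out):
    """Small random weights and zero biases for one dense layer."""
    weights = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def relu(v):
    return max(0.0, v)

def dense(x, layer, activation=None):
    """Affine map W x + b followed by an optional element-wise activation."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out

enc_layer = init_layer(n, d)  # f_enc: R^n -> R^d
dec_layer = init_layer(d, n)  # f_dec: R^d -> R^n

def encode(x):
    return dense(x, enc_layer, relu)

def decode(z):
    return dense(z, dec_layer)  # linear output layer

x = [0.2, 0.4, 0.1, 0.9, 0.5, 0.3]
z = encode(x)      # bottleneck representation
x_hat = decode(z)  # reconstruction in the original space

print(len(x), len(z), len(x_hat))  # prints: 6 2 6
```

The bottleneck is enforced purely by the layer shapes: whatever the decoder needs in order to rebuild all six inputs has to pass through the two numbers in `z`.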

Training: Reconstruction Loss

The auto-encoder is trained to minimize the reconstruction error — the difference between the input and the output:

L = \|x - \hat{x}\|^2 = \sum_{i=1}^{n}(x_i - \hat{x}_i)^2

This is the squared-error loss; dividing it by $n$ gives the mean squared error (MSE), which has the same minimizer. Alternatively, binary cross-entropy can be used when inputs are normalized to $[0, 1]$:

L_{BCE} = -\sum_{i=1}^{n}\left[x_i \log(\hat{x}_i) + (1-x_i) \log(1-\hat{x}_i)\right]

The network has no external labels — it is self-supervised, using the input itself as the target.
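To make the two losses concrete, the following sketch evaluates the summed squared error from the formula above, its mean over the $n$ components, and the binary cross-entropy for a single sample. The function names, the example vectors, and the clipping constant `eps` are assumptions chosen for illustration.

```python
import math

def squared_error(x, x_hat):
    """Sum of squared differences, as in the reconstruction-loss formula."""
    return sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))

def mean_squared_error(x, x_hat):
    """Squared error averaged over the n input components."""
    return squared_error(x, x_hat) / len(x)

def binary_cross_entropy(x, x_hat, eps=1e-12):
    """BCE for inputs and outputs in [0, 1]; outputs are clipped to avoid log(0)."""
    total = 0.0
    for xi, xh in zip(x, x_hat):
        xh = min(max(xh, eps), 1.0 - eps)
        total -= xi * math.log(xh) + (1.0 - xi) * math.log(1.0 - xh)
    return total

x = [0.0, 1.0, 0.5, 0.25]
x_hat = [0.1, 0.9, 0.5, 0.25]

sse = squared_error(x, x_hat)
mse = mean_squared_error(x, x_hat)
bce = binary_cross_entropy(x, x_hat)

# For a non-binary target the BCE is minimized at x_hat = x but is not zero there.
bce_at_perfect_half = binary_cross_entropy([0.5], [0.5])  # equals log(2)
```

One practical consequence: a BCE value cannot be read as "distance from a perfect reconstruction" when the targets are not exactly 0 or 1, whereas the squared error is zero for a perfect reconstruction.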

Vanilla Auto-encoder

The simplest auto-encoder uses fully connected layers:

Encoder: Input($n$) $\to$ Dense($h_1$, relu) $\to$ Dense($d$, relu)
Decoder: Dense($h_1$, relu) $\to$ Dense($n$, sigmoid)

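If the encoder and decoder are single linear layers, the data are mean-centered, and the loss is MSE, the optimal auto-encoder spans the same subspace as the first $d$ principal components, although its latent axes need not be orthogonal or ordered by variance as in PCA. Adding non-linearity (ReLU) allows it to capture non-linear structure that PCA cannot. The sigmoid output layer assumes inputs scaled to $[0, 1]$; for unbounded inputs, a linear output layer is the usual choice.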
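The link to PCA can be checked numerically on a toy problem. The sketch below trains a linear auto-encoder with one latent unit on zero-mean 2-D data whose dominant direction is $(1, 1)/\sqrt{2}$ by construction. Tied weights (the decoder reuses the encoder vector `w`), the learning rate, the iteration count, and the data are simplifying assumptions made for this illustration.

```python
import math

# Deterministic, zero-mean 2-D data stretched along the direction (1, 1).
data = [(t + 0.1 * s, t - 0.1 * s) for t in (-2, -1, 0, 1, 2) for s in (-1, 0, 1)]

w = [0.3, -0.1]       # tied weights: z = w . x and x_hat = w * z
learning_rate = 0.01  # illustrative value

def mean_loss(w):
    """Mean squared reconstruction error over the data set."""
    total = 0.0
    for x in data:
        z = w[0] * x[0] + w[1] * x[1]
        total += (x[0] - w[0] * z) ** 2 + (x[1] - w[1] * z) ** 2
    return total / len(data)

initial_loss = mean_loss(w)
for _ in range(500):
    grad = [0.0, 0.0]
    for x in data:
        z = w[0] * x[0] + w[1] * x[1]
        r = (x[0] - w[0] * z, x[1] - w[1] * z)  # residual x - x_hat
        rw = r[0] * w[0] + r[1] * w[1]
        for k in range(2):
            # d/dw_k ||x - w (w . x)||^2 = -2 * (z * r_k + (r . w) * x_k)
            grad[k] += -2.0 * (z * r[k] + rw * x[k]) / len(data)
    w = [w[k] - learning_rate * grad[k] for k in range(2)]
final_loss = mean_loss(w)

norm = math.hypot(w[0], w[1])
# Absolute cosine between the learned direction and the first principal axis.
alignment = abs(w[0] + w[1]) / (math.sqrt(2.0) * norm)
```

After training, `alignment` is close to 1: the learned weight vector points along the first principal axis up to sign, which is the sense in which the linear auto-encoder recovers the PCA subspace.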

Denoising Auto-encoder (DAE)

A denoising auto-encoder deliberately corrupts the input with noise before feeding it to the encoder, but trains the network to reconstruct the clean original:

Corrupt: $\tilde{x} = x + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$
Encode: $z = f_{\text{enc}}(\tilde{x})$
Decode: $\hat{x} = f_{\text{dec}}(z)$
Loss: $L = \|x - \hat{x}\|^2$ (compare to the clean $x$)

This forces the network to learn robust features rather than simply copying the input. The DAE learns to "see through" noise, making it excellent for denoising applications.
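The corruption step and the choice of target can be sketched as follows. The noise level, the example vector, and the function names are assumptions. An identity map stands in for the trained network to show why plain copying is penalized.

```python
import random

def corrupt(x, sigma, rng):
    """Additive Gaussian corruption: each component gets independent N(0, sigma^2) noise."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def denoising_loss(x_clean, x_hat):
    """Squared error measured against the CLEAN input, not the corrupted one."""
    return sum((xc - xh) ** 2 for xc, xh in zip(x_clean, x_hat))

rng = random.Random(0)
x = [0.2, 0.4, 0.1, 0.9]
x_tilde = corrupt(x, sigma=0.1, rng=rng)

# Stand-in for f_dec(f_enc(x_tilde)): an identity "network" that copies its input.
x_hat_copy = list(x_tilde)
copy_loss = denoising_loss(x, x_hat_copy)  # equals the squared norm of the noise
```

Because the target is the clean `x`, a network that merely copies `x_tilde` is left with a loss equal to the noise energy, so it can only improve by learning structure that separates signal from noise.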

Geoscience application: denoising well logs that contain measurement noise, or cleaning up seismic traces affected by random noise.

Variational Auto-encoder (VAE)

The Variational Auto-encoder is a generative model that learns a smooth, continuous latent space from which new data can be sampled. Unlike vanilla auto-encoders, the encoder does not output a single point $z$ but rather the parameters of a probability distribution.

The encoder outputs $\mu$ (mean) and $\log \sigma^2$ (log-variance) for each latent dimension; predicting the log-variance keeps $\sigma = \exp(\tfrac{1}{2} \log \sigma^2)$ positive without constraining the network output. A sample is drawn using the reparameterization trick:

z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

where $\odot$ is element-wise multiplication. This trick allows gradients to flow through the sampling operation during backpropagation.
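A small sketch of this sampling step, assuming the encoder has already produced `mu` and `log_var` for a two-dimensional latent space (the numerical values and the function name `reparameterize` are illustrative):

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with sigma = exp(0.5 * log_var) and eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]
    return z, eps

rng = random.Random(0)
mu = [0.5, -1.0]
log_var = [0.0, math.log(0.25)]  # corresponds to sigma = [1.0, 0.5]

# Empirical check that the samples follow N(mu, diag(sigma^2)).
samples = [reparameterize(mu, log_var, rng)[0] for _ in range(20000)]
mean_z = [sum(s[j] for s in samples) / len(samples) for j in range(2)]
std_z = [
    math.sqrt(sum((s[j] - mean_z[j]) ** 2 for s in samples) / len(samples))
    for j in range(2)
]
```

The empirical mean and standard deviation of the samples approach `mu` and `exp(0.5 * log_var)`, confirming that shifting and scaling standard normal noise yields draws from the intended distribution.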

VAE Loss Function

The VAE loss has two terms:

L_{VAE} = \underbrace{\|x - \hat{x}\|^2}_{\text{reconstruction}} + \underbrace{D_{KL}(q(z|x) \| p(z))}_{\text{regularization}}

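The KL divergence measures how different the learned latent distribution $q(z|x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ is from the prior $p(z) = \mathcal{N}(0, I)$. Because $z$ is continuous, it is an integral over the latent space rather than a sum:

D_{KL}(q \| p) = \int q(z|x) \log \frac{q(z|x)}{p(z)} \, dz

For a diagonal Gaussian $q$ and a standard normal prior, the integral factorizes over the $d$ latent dimensions and has a closed-form expression: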

D_{KL} = -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)

The KL term penalizes the encoder for producing latent distributions that deviate from a standard normal. This ensures the latent space is smooth and well-structured, allowing meaningful interpolation and generation.
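The closed-form expression is short enough to evaluate directly. The sketch below uses an illustrative function name and checks a few cases by hand: the divergence vanishes when $q$ equals the prior and grows when the mean moves away from zero or the variance shrinks.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), with log_var = log sigma^2."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

kl_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])    # q equals the prior -> 0
kl_shifted = kl_to_standard_normal([1.0], [0.0])            # mean shifted by 1 -> 0.5
kl_narrow = kl_to_standard_normal([0.0], [math.log(0.01)])  # sigma = 0.1, positive
```

Working with `log_var` rather than `sigma` mirrors the encoder output described earlier and avoids taking the logarithm of a quantity that could underflow to zero.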

Applications in Geoscience

Anomaly detection: Train an auto-encoder on "normal" data (e.g., typical seismic patterns). At test time, if the reconstruction error for a new sample is high, it is likely anomalous. This is used to detect unusual seismic events, equipment malfunctions in well logs, or abnormal reservoir conditions.
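One simple way to turn reconstruction errors into anomaly flags is to threshold them at a high quantile of the errors seen on normal training data. The sketch below assumes that rule; the quantile, the error values, and the function names are hypothetical, and in practice the errors would come from a trained auto-encoder.

```python
def reconstruction_error(x, x_hat):
    """Squared reconstruction error for one sample."""
    return sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))

def fit_threshold(train_errors, quantile=0.95):
    """Return an error value at roughly the given quantile of the training errors."""
    ordered = sorted(train_errors)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]

def is_anomalous(error, threshold):
    return error > threshold

# Hypothetical reconstruction errors on "normal" training samples.
train_errors = [0.010, 0.012, 0.009, 0.015, 0.011, 0.013, 0.008, 0.014, 0.010, 0.030]
threshold = fit_threshold(train_errors, quantile=0.9)

# Two hypothetical test-time errors: one typical, one far above the training range.
flags = [is_anomalous(e, threshold) for e in (0.012, 0.250)]
```

The quantile sets the false-alarm rate on normal data, so it is usually tuned on a held-out set of normal samples rather than fixed in advance.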

Dimensionality reduction: The bottleneck representation $z$ serves as a non-linear compression of the data. Unlike PCA (linear), auto-encoders can capture curved manifolds in the data space. This is useful for visualizing high-dimensional geochemical data or seismic attributes.

Data generation (VAE): Sample from the latent space to generate synthetic data. In geoscience, VAEs can generate synthetic core images for data augmentation, or create plausible seismic sections for training other models.

Well log denoising (DAE): Train on clean well logs that are corrupted with synthetic noise, or on pairs of noisy and clean logs where both are available. The DAE learns to remove measurement artifacts while preserving the true geological signal.

Reparameterization Trick in Detail

Making Sampling Differentiable

The core challenge in training a VAE is that the sampling step $z \sim q(z|x)$ is stochastic and not differentiable. The reparameterization trick solves this elegantly:

z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

Here $\mu$ and $\sigma$ are outputs of the encoder (deterministic, differentiable), and $\epsilon$ is random noise sampled from a standard normal distribution (independent of model parameters). The key insight: the randomness is "externalized" into $\epsilon$, so gradients with respect to $\mu$ and $\sigma$ can flow through normally.

Without this trick, we would need score-function estimators such as REINFORCE, whose gradient estimates typically have much higher variance and therefore make training slower and less stable.
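The gradient path can be verified on a scalar example. With $\epsilon$ held fixed, $z = \mu + \sigma \epsilon$ is an ordinary function of $\mu$ and $\sigma$, so $\partial z / \partial \mu = 1$ and $\partial z / \partial \sigma = \epsilon$. The sketch below compares these chain-rule gradients with finite differences; the `downstream` function is a hypothetical stand-in for the decoder and loss, and the numbers are arbitrary.

```python
def sample_z(mu, sigma, eps):
    """Reparameterized sample for a scalar latent variable."""
    return mu + sigma * eps

def downstream(z):
    """Any differentiable function of z; a stand-in for decoder plus loss."""
    return (z - 2.0) ** 2

mu, sigma, eps = 0.5, 1.5, -0.7  # eps is drawn once from N(0, 1), then held fixed

z = sample_z(mu, sigma, eps)
dL_dz = 2.0 * (z - 2.0)
grad_mu = dL_dz * 1.0     # dz/dmu = 1
grad_sigma = dL_dz * eps  # dz/dsigma = eps

# Central finite differences with the same eps.
h = 1e-6
fd_mu = (downstream(sample_z(mu + h, sigma, eps))
         - downstream(sample_z(mu - h, sigma, eps))) / (2 * h)
fd_sigma = (downstream(sample_z(mu, sigma + h, eps))
            - downstream(sample_z(mu, sigma - h, eps))) / (2 * h)
```

The analytic and finite-difference values agree, which would not be possible if $z$ were drawn directly from $q(z|x)$ with fresh randomness on every evaluation.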

Beta-VAE and the VAE Loss

Balancing Reconstruction and Regularization

The full VAE loss with a tunable coefficient $\beta$ is:

\mathcal{L} = \underbrace{\mathbb{E}_{q(z|x)}[\|x - \hat{x}\|^2]}_{\text{reconstruction}} + \beta \cdot \underbrace{D_{KL}(q(z|x) \| p(z))}_{\text{regularization}}

When $\beta = 1$, this is the standard VAE objective: the negative Evidence Lower Bound (ELBO) for a Gaussian decoder with fixed variance, up to constants. The Beta-VAE sets $\beta > 1$ to encourage disentangled representations, where each latent dimension ideally controls a single, independent factor of variation.

Trade-offs:

$\beta \to 0$: approaches a plain auto-encoder (good reconstruction, but an unstructured latent space, so samples drawn from the prior decode poorly)
$\beta = 1$: standard VAE (balanced)
$\beta > 1$: Beta-VAE (disentangled latent dimensions, but blurrier reconstruction)
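The effect of $\beta$ on the objective can be seen by evaluating the loss for one sample at several values. This is a sketch under stated assumptions: the expectation over $q(z|x)$ is replaced by a single given reconstruction `x_hat`, and the inputs and the function name `beta_vae_loss` are illustrative.

```python
import math

def beta_vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Single-sample estimate of reconstruction + beta * KL; returns (total, recon, kl)."""
    recon = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))
    return recon + beta * kl, recon, kl

x, x_hat = [0.2, 0.4, 0.1], [0.25, 0.35, 0.1]
mu, log_var = [1.0, 0.0], [0.0, 0.0]

# Same encoder and decoder outputs, three different weightings of the KL term.
losses = {beta: beta_vae_loss(x, x_hat, mu, log_var, beta)[0] for beta in (0.0, 1.0, 4.0)}
```

Here the reconstruction term is 0.005 and the KL term is 0.5, so at $\beta = 4$ the regularizer dominates the total. That is the pressure that pulls the latent code toward the prior at the cost of reconstruction detail.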

In geoscience, disentangled representations from Beta-VAE might separate factors like lithology, fluid content, and noise in well-log embeddings.

Sparse and Contractive Auto-encoders

Sparse Auto-encoders

Instead of compressing via a small bottleneck, a sparse auto-encoder uses a wide hidden layer but adds a sparsity constraint that forces most neurons to be inactive for any given input:

L = \|x - \hat{x}\|^2 + \lambda \sum_j |h_j|

where $h_j$ are the hidden activations. Alternatively, a KL-divergence penalty encourages the average activation of each neuron to be close to a small target $\rho$ (e.g., $\rho = 0.05$). Either penalty yields a sparse representation in which only a few units respond to any given input, loosely analogous to sparse coding in biological neurons.
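Both sparsity penalties are easy to compute. The sketch below assumes the common Bernoulli form of the KL penalty, which treats each unit's mean activation $\hat{\rho}_j$ (taken over a batch, and lying strictly between 0 and 1, as with sigmoid units) as a firing probability. The function names and example values are illustrative.

```python
import math

def l1_penalty(h, lam):
    """lambda * sum_j |h_j| for the hidden activations of one sample."""
    return lam * sum(abs(hj) for hj in h)

def kl_sparsity_penalty(mean_activations, rho=0.05):
    """Sum over hidden units of KL( Bernoulli(rho) || Bernoulli(rho_hat_j) )."""
    total = 0.0
    for rho_hat in mean_activations:
        total += rho * math.log(rho / rho_hat)
        total += (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat))
    return total

l1 = l1_penalty([0.0, -2.0, 0.5], lam=0.1)

on_target = kl_sparsity_penalty([0.05, 0.05])  # mean activations equal rho -> 0
too_active = kl_sparsity_penalty([0.5, 0.05])  # one unit fires half the time
```

The L1 term acts on each sample's activations directly, whereas the KL term constrains batch averages, so a unit may still respond strongly to a few inputs as long as it is quiet on most.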

Contractive Auto-encoders

A contractive auto-encoder penalizes the sensitivity of the encoder to its inputs by adding a penalty on the Frobenius norm of the Jacobian:

L = \|x - \hat{x}\|^2 + \lambda \left\|\frac{\partial h}{\partial x}\right\|_F^2

This encourages the encoder to produce representations that are locally insensitive to small changes in input, learning features that are robust to noise. Contractive auto-encoders are theoretically connected to denoising auto-encoders: both encourage robustness to small input perturbations, the former through an explicit penalty on the encoder's derivatives and the latter through training on corrupted inputs.
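For a single sigmoid encoder layer $h = \sigma(Wx + b)$, the Jacobian penalty has a simple closed form: $\|\partial h / \partial x\|_F^2 = \sum_j (h_j(1 - h_j))^2 \sum_i W_{ji}^2$. The sketch below assumes that layer type and checks the formula against a finite-difference Jacobian. The weights, the input, and the function names are illustrative.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def encoder(x, W, b):
    """Single sigmoid layer: h = sigmoid(W x + b)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj) for row, bj in zip(W, b)]

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of dh/dx, using dh_j/dx_i = h_j (1 - h_j) W_ji."""
    h = encoder(x, W, b)
    return sum((hj * (1.0 - hj)) ** 2 * sum(w * w for w in row) for hj, row in zip(h, W))

def finite_difference_penalty(x, W, b, step=1e-6):
    """Same quantity from a central-difference approximation of the Jacobian."""
    total = 0.0
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += step
        x_minus[i] -= step
        h_plus, h_minus = encoder(x_plus, W, b), encoder(x_minus, W, b)
        total += sum(((hp - hm) / (2.0 * step)) ** 2 for hp, hm in zip(h_plus, h_minus))
    return total

W = [[0.5, -0.3, 0.8], [0.1, 0.9, -0.4]]
b = [0.0, 0.1]
x = [0.2, 0.4, 0.1]

analytic = contractive_penalty(x, W, b)
numeric = finite_difference_penalty(x, W, b)
```

The closed form costs about as much as one forward pass, which is why the penalty is practical for a single sigmoid layer. Deeper encoders generally need automatic differentiation to obtain the Jacobian.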

Beyond Reconstruction: Feature Learning and Generation

Auto-encoders as Feature Extractors

The encoder portion of a trained auto-encoder can serve as a powerful feature extractor. The latent representation $z$ captures the most salient information about the input in a compact form. These learned features can then be fed to downstream classifiers (SVM, Random Forest) that may outperform models trained on raw features.

Pre-training strategy: (1) Train an auto-encoder on a large unlabeled dataset (plentiful in geoscience). (2) Discard the decoder. (3) Use the encoder to extract features for a labeled dataset (scarce). (4) Train a classifier on the extracted features. This is a form of self-supervised pre-training.

Geoscience: Seismic Attribute Extraction and Well Log Imputation

Seismic attribute extraction: Train an auto-encoder on multi-attribute seismic volumes. The bottleneck features capture the essential variability in seismic character, which can then be used for facies classification or reservoir characterization with fewer, more informative inputs.

Well log imputation: When certain logs (e.g., sonic, density) are missing in older wells, a denoising auto-encoder trained on complete well-log suites, with curves randomly masked during training, can reconstruct the missing curves from the available ones. The DAE learns the multivariate relationships between log curves and uses them to fill in missing data while remaining tolerant of measurement noise.
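For imputation, the corruption is typically masking rather than additive noise: whole curves are hidden during training and the loss is still computed against the complete sample. The sketch below shows that corruption step only. The curve names, the normalized values, the drop probability, and the use of zero as the "missing" value are assumptions made for illustration.

```python
import random

def mask_curves(sample, drop_prob, rng):
    """Masking corruption: zero out each curve independently with probability drop_prob.

    Returns the corrupted sample and a 0/1 mask (1 = curve kept, 0 = curve hidden).
    """
    mask = [0 if rng.random() < drop_prob else 1 for _ in sample]
    return [v * m for v, m in zip(sample, mask)], mask

def imputation_loss(complete, reconstruction):
    """Squared error against the COMPLETE sample, including the hidden curves."""
    return sum((c - r) ** 2 for c, r in zip(complete, reconstruction))

rng = random.Random(1)

# One depth sample with hypothetical, already-normalized curves: GR, RHOB, NPHI, DT.
complete = [0.62, 0.48, 0.31, 0.55]
corrupted, mask = mask_curves(complete, drop_prob=0.5, rng=rng)

# A network would map `corrupted` (optionally together with `mask`) to a reconstruction;
# here the loss is evaluated for the trivial reconstruction that returns the input.
trivial_loss = imputation_loss(complete, corrupted)
```

Passing the mask to the network as an extra input is a common design choice because it lets the model distinguish a hidden curve from a curve whose true value is zero.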

Synthetic data generation: VAEs trained on core photographs or thin-section images can generate synthetic training examples, augmenting small labeled datasets for image-based classification tasks.

References

Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, ch. 14 (autoencoders). MIT Press.
Kingma, D.P., Welling, M. (2014). Auto-encoding variational Bayes. ICLR.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A. (2010). Stacked denoising autoencoders. J. Mach. Learn. Res. 11, 3371–3408.
Mousavi, S.M., Beroza, G.C. (2022). Deep-learning seismology. Science 377, eabm4470.