Auto-encoders
Learning objectives
- Describe the encoder-bottleneck-decoder architecture of an auto-encoder
- Explain the reconstruction loss and how auto-encoders are trained
- Distinguish between vanilla, denoising, and variational auto-encoders
- Explain the KL divergence term in the VAE loss
- Apply auto-encoders to anomaly detection and dimensionality reduction in geoscience
What Is an Auto-encoder?
An auto-encoder is a neural network that learns to compress its input into a smaller representation and then reconstruct the original input from that compressed form. The network is trained so that the output $\hat{x}$ is as close as possible to the input $x$.
The key insight is this: by forcing the data through a bottleneck (a hidden layer smaller than the input), the network must learn to capture the most important features of the data in a compact representation. This makes auto-encoders powerful tools for dimensionality reduction, feature learning, and anomaly detection.
Architecture: Encoder → Bottleneck → Decoder
An auto-encoder has three conceptual parts:
- Encoder: maps the input $x \in \mathbb{R}^n$ to a lower-dimensional latent representation $z \in \mathbb{R}^d$, where $d < n$: $z = f_\phi(x)$
- Bottleneck (latent space): the compressed representation $z$. Its dimension $d$ controls how much compression occurs.
- Decoder: maps the latent representation back to the original space: $\hat{x} = g_\theta(z)$
For example, an auto-encoder might compress 100-dimensional well log data into a 10-dimensional latent representation and then reconstruct it back to 100 dimensions.
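The 100 → 10 → 100 example can be sketched as a forward pass in NumPy. This is only an illustration of the shapes involved: the weights are random and untrained, and the names `W_enc`, `W_dec`, `encode`, and `decode` are hypothetical, not taken from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 10  # input and latent sizes from the example above

# Hypothetical, untrained weights; a real model would learn these.
W_enc = rng.normal(scale=0.1, size=(n, d))
b_enc = np.zeros(d)
W_dec = rng.normal(scale=0.1, size=(d, n))
b_dec = np.zeros(n)

def encode(x):
    """Map inputs of shape (batch, n) to latent codes of shape (batch, d)."""
    return np.maximum(0.0, x @ W_enc + b_enc)  # ReLU activation

def decode(z):
    """Map latent codes of shape (batch, d) back to shape (batch, n)."""
    return z @ W_dec + b_dec

x = rng.normal(size=(32, n))  # 32 synthetic vectors standing in for well-log samples
z = encode(x)
x_hat = decode(z)
print(x.shape, z.shape, x_hat.shape)  # prints (32, 100) (32, 10) (32, 100)
```

The bottleneck is visible in the middle shape: every sample is squeezed to 10 numbers before being expanded again.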
Training: Reconstruction Loss
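The auto-encoder is trained to minimize the reconstruction error, the difference between the input $x \in \mathbb{R}^n$ and the output $\hat{x}$:
$$\mathcal{L}_{\text{MSE}}(x, \hat{x}) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2$$
This is the mean squared error (MSE) loss. Alternatively, binary cross-entropy can be used when inputs are normalized to $[0, 1]$:
$$\mathcal{L}_{\text{BCE}}(x, \hat{x}) = -\frac{1}{n} \sum_{i=1}^{n} \left[ x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i) \right]$$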
The network has no external labels — it is self-supervised, using the input itself as the target.
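As a minimal sketch, the two reconstruction losses can be written in NumPy as follows. The function names and the clipping constant `eps` are illustrative choices, and both losses are averaged over all elements.

```python
import numpy as np

def mse_loss(x, x_hat):
    """Mean squared error between input and reconstruction."""
    return np.mean((x - x_hat) ** 2)

def bce_loss(x, x_hat, eps=1e-7):
    """Binary cross-entropy; assumes x and x_hat lie in [0, 1]."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

x = np.array([[0.0, 1.0, 0.5, 0.25]])  # the input doubles as the target
perfect = x.copy()                      # an ideal reconstruction
poor = np.full_like(x, 0.5)             # a constant, uninformative reconstruction

print(mse_loss(x, perfect), mse_loss(x, poor))
print(bce_loss(x, perfect), bce_loss(x, poor))
```

Note that no label array appears anywhere: `x` is both the input and the target. For targets that are not exactly 0 or 1, the cross-entropy of a perfect reconstruction is positive rather than zero, so its absolute value is less interpretable than the MSE.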
Vanilla Auto-encoder
The simplest auto-encoder uses fully connected layers, with input dimension $n$, hidden width $h$, and latent dimension $d$:
- Encoder: Input($n$) → Dense($h$, relu) → Dense($d$, relu)
- Decoder: Dense($h$, relu) → Dense($n$, sigmoid)
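If the encoder and decoder are single linear layers and the loss is MSE, the optimal auto-encoder spans the same subspace as PCA with the same number of components (on centered data), although its latent axes need not coincide with the individual principal components. Adding non-linearity (ReLU) allows it to capture non-linear structure that PCA cannot.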
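The sense in which a linear auto-encoder is "equivalent" to PCA can be checked numerically. The sketch below does not train anything. It builds a linear encoder and decoder whose weights span the PCA subspace in a different basis (the mixing matrix `A` is an arbitrary assumption) and confirms that the reconstruction matches PCA.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3

# Correlated synthetic data, centered because PCA assumes zero mean.
X = rng.normal(size=(200, n)) @ rng.normal(size=(n, n))
X = X - X.mean(axis=0)

# PCA: the top-d right singular vectors span the principal subspace.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_d = Vt[:d].T                      # shape (n, d)
X_pca = X @ V_d @ V_d.T             # rank-d PCA reconstruction

# A linear auto-encoder using the same subspace in a different basis.
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))  # arbitrary invertible mixing
W_enc = V_d @ A                     # encoder weights, shape (n, d)
W_dec = np.linalg.inv(A) @ V_d.T    # decoder weights, shape (d, n)
X_ae = (X @ W_enc) @ W_dec

same_reconstruction = np.allclose(X_pca, X_ae)
same_codes = np.allclose(X @ V_d, X @ W_enc)
print(same_reconstruction, same_codes)
```

The reconstructions agree while the latent codes differ, which is why the equivalence is stated in terms of the subspace rather than the individual components.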
Denoising Auto-encoder (DAE)
A denoising auto-encoder deliberately corrupts the input with noise before feeding it to the encoder, but trains the network to reconstruct the clean original:
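- Corrupt: $\tilde{x} = x + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$
- Encode: $z = f_\phi(\tilde{x})$
- Decode: $\hat{x} = g_\theta(z)$
- Loss: $\mathcal{L} = \lVert x - \hat{x} \rVert^2$ (compare to the clean $x$)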
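The four steps above can be sketched as a single loss function. Additive Gaussian noise is assumed here as the corruption, and `encode` and `decode` are placeholders for whatever network is being trained. The example passes identity functions to show what happens when a model merely copies its input.

```python
import numpy as np

def corrupt(x, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return x + sigma * rng.normal(size=x.shape)

def dae_loss(x_clean, encode, decode, sigma, rng):
    x_noisy = corrupt(x_clean, sigma, rng)   # 1. corrupt
    z = encode(x_noisy)                      # 2. encode the noisy input
    x_hat = decode(z)                        # 3. decode
    return np.mean((x_clean - x_hat) ** 2)   # 4. compare with the CLEAN input

rng = np.random.default_rng(0)
x = rng.uniform(size=(2000, 50))

def identity(v):
    """Stand-in for a network that simply copies its input."""
    return v

loss_copy = dae_loss(x, identity, identity, sigma=0.3, rng=rng)
print(loss_copy)  # close to sigma**2 = 0.09
```

A copying model pays the full noise variance as its loss, so the only way to do better is to learn structure in the data that separates signal from noise.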
This forces the network to learn robust features rather than simply copying the input. The DAE learns to "see through" noise, making it excellent for denoising applications.
Geoscience application: denoising well logs that contain measurement noise, or cleaning up seismic traces affected by random noise.
Variational Auto-encoder (VAE)
The Variational Auto-encoder is a generative model that learns a smooth, continuous latent space from which new data can be sampled. Unlike vanilla auto-encoders, the encoder does not output a single point but rather the parameters of a probability distribution:
$$q_\phi(z \mid x) = \mathcal{N}\big(z;\ \mu, \operatorname{diag}(\sigma^2)\big)$$
The encoder outputs $\mu$ (mean) and $\log \sigma^2$ (log-variance) for each latent dimension. A sample $z$ is drawn using the reparameterization trick:
$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
where $\odot$ is element-wise multiplication. This trick allows gradients to flow through the sampling operation during backpropagation.
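A minimal NumPy sketch of the sampling step, assuming the encoder has already produced `mu` and `log_var` (fixed values are used here in place of a real encoder):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    sigma = np.exp(0.5 * log_var)             # log-variance -> standard deviation
    eps = rng.standard_normal(size=mu.shape)  # noise independent of the parameters
    return mu + sigma * eps

rng = np.random.default_rng(0)

# Pretend encoder outputs: the same 2-D distribution repeated for many samples.
mu = np.tile([1.0, -2.0], (100_000, 1))
log_var = np.tile(np.log([0.25, 4.0]), (100_000, 1))

z = reparameterize(mu, log_var, rng)
print(z.mean(axis=0), z.std(axis=0))  # roughly [1, -2] and [0.5, 2]
```

Predicting the log-variance rather than the variance keeps the network output unconstrained, since `exp` guarantees a positive standard deviation.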
VAE Loss Function
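The VAE loss has two terms:
$$\mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{recon}} + D_{\text{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big), \qquad \mathcal{L}_{\text{recon}} = \mathbb{E}_{q_\phi(z \mid x)}\big[-\log p_\theta(x \mid z)\big]$$
The KL divergence measures how different the learned latent distribution $q_\phi(z \mid x)$ is from the prior $p(z) = \mathcal{N}(0, I)$:
$$D_{\text{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{q_\phi(z \mid x)}{p(z)}\right]$$
For Gaussian distributions, this has a closed-form expression:
$$D_{\text{KL}} = -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
where $\mu_j$ and $\sigma_j^2$ are the encoder's mean and variance for latent dimension $j$, and $d$ is the number of latent dimensions.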
The KL term penalizes the encoder for producing latent distributions that deviate from a standard normal. This ensures the latent space is smooth and well-structured, allowing meaningful interpolation and generation.
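The closed-form Gaussian KL term translates directly into code. The sketch below assumes a diagonal Gaussian encoder and a standard normal prior, with the log-variance as input. The function name is illustrative.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

d = 4
kl_match = kl_to_standard_normal(np.zeros(d), np.zeros(d))           # encoder equals the prior
kl_shift = kl_to_standard_normal(np.ones(d), np.zeros(d))            # mean moved to 1
kl_wide = kl_to_standard_normal(np.zeros(d), np.full(d, np.log(4)))  # variance 4

print(kl_match, kl_shift, kl_wide)
```

The penalty vanishes only when the encoder output matches the prior exactly. Shifting the mean to 1 in each of the four dimensions costs 0.5 per dimension, for a total of 2.0.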
Applications in Geoscience
Anomaly detection: Train an auto-encoder on "normal" data (e.g., typical seismic patterns). At test time, if the reconstruction error for a new sample is high, it is likely anomalous. This is used to detect unusual seismic events, equipment malfunctions in well logs, or abnormal reservoir conditions.
Dimensionality reduction: The bottleneck representation serves as a non-linear compression of the data. Unlike PCA (linear), auto-encoders can capture curved manifolds in the data space. Useful for visualizing high-dimensional geochemical data or seismic attributes.
Data generation (VAE): Sample from the latent space to generate synthetic data. In geoscience, VAEs can generate synthetic core images for data augmentation, or create plausible seismic sections for training other models.
Well log denoising (DAE): Train on pairs of noisy and clean well logs. The DAE learns to remove measurement artifacts while preserving true geological signal.
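The anomaly-detection recipe described above can be sketched with a stand-in model. To keep the example short and deterministic, a linear encoder and decoder obtained from PCA replaces a trained network. The synthetic data and the 99th-percentile threshold are also assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3

# Synthetic "normal" data lying near a d-dimensional subspace.
basis = rng.normal(size=(d, n))
X_train = rng.normal(size=(500, d)) @ basis + 0.05 * rng.normal(size=(500, n))

# Stand-in model: linear encoder/decoder from PCA (NOT a trained neural network).
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
V_d = Vt[:d].T

def reconstruct(x):
    return (x - mean) @ V_d @ V_d.T + mean

def reconstruction_error(x):
    """Per-sample mean squared reconstruction error."""
    return np.mean((x - reconstruct(x)) ** 2, axis=1)

# Threshold chosen from the error distribution on normal data (assumed: 99th percentile).
threshold = np.percentile(reconstruction_error(X_train), 99)

x_typical = rng.normal(size=(1, d)) @ basis + 0.05 * rng.normal(size=(1, n))
x_odd = 3.0 * rng.normal(size=(1, n))  # ignores the structure of the training data
errors = reconstruction_error(np.vstack([x_typical, x_odd]))
is_anomaly = errors > threshold
print(errors, threshold, is_anomaly)
```

The threshold is a design choice: a lower percentile flags more samples and raises the false-alarm rate, so in practice it would be tuned against known events where available.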
Reparameterization Trick in Detail
Making Sampling Differentiable
The core challenge in training a VAE is that the sampling step $z \sim \mathcal{N}(\mu, \sigma^2)$ is stochastic and not differentiable. The reparameterization trick solves this elegantly:
$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
Here $\mu$ and $\sigma$ are outputs of the encoder (deterministic, differentiable), and $\epsilon$ is random noise sampled from a standard normal distribution (independent of model parameters). The key insight: the randomness is "externalized" into $\epsilon$, so gradients with respect to $\mu$ and $\sigma$ can flow through normally.
Without this trick, we would need score-function estimators such as REINFORCE, whose gradient estimates typically have much higher variance and therefore tend to make training slow and unstable.
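To see the gradients flow through a reparameterized sample, consider the toy objective $\mathbb{E}[z^2]$ with scalar $\mu$ and $\sigma$. This objective is chosen only because its exact gradients, $2\mu$ and $2\sigma$, are known. The sketch below estimates them by Monte Carlo using the chain rule through $z = \mu + \sigma \epsilon$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8
eps = rng.standard_normal(200_000)  # noise drawn independently of mu and sigma

z = mu + sigma * eps                # reparameterized sample

# f(z) = z**2, so df/dz = 2z. Chain rule: dz/dmu = 1 and dz/dsigma = eps.
grad_mu = np.mean(2.0 * z * 1.0)
grad_sigma = np.mean(2.0 * z * eps)

# Analytic check: E[z^2] = mu^2 + sigma^2, so the gradients are 2*mu and 2*sigma.
print(grad_mu, 2 * mu)
print(grad_sigma, 2 * sigma)
```

Because `eps` is treated as a constant input, each sample contributes an ordinary derivative, which is what automatic differentiation frameworks compute during backpropagation.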
Beta-VAE and the VAE Loss
Balancing Reconstruction and Regularization
The full VAE loss with a tunable coefficient $\beta$ is:
$$\mathcal{L}_{\beta} = \mathcal{L}_{\text{recon}} + \beta \, D_{\text{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
When $\beta = 1$, this is the standard VAE objective (the negative Evidence Lower Bound, ELBO). The Beta-VAE sets $\beta > 1$ to encourage disentangled representations, where each latent dimension controls a single, independent factor of variation.
Trade-offs:
- $\beta = 0$: pure auto-encoder (good reconstruction, unstructured latent space, cannot generate)
- $\beta = 1$: standard VAE (balanced)
- $\beta > 1$: Beta-VAE (disentangled latent dimensions, but blurrier reconstruction)
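The weighted loss can be sketched as one function. A squared-error reconstruction term summed over features is assumed here (a Gaussian decoder up to constants), with a diagonal Gaussian encoder. The argument names are illustrative.

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Batch-averaged reconstruction error plus beta times the Gaussian KL term."""
    recon = np.sum((x - x_hat) ** 2, axis=-1)
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)
    return np.mean(recon + beta * kl)

# Toy batch: reconstruction error is 3 per sample and the KL term is 1 per sample.
x = np.zeros((2, 3))
x_hat = np.ones((2, 3))
mu = np.ones((2, 2))
log_var = np.zeros((2, 2))

for beta in (0.0, 1.0, 4.0):
    print(beta, beta_vae_loss(x, x_hat, mu, log_var, beta))
```

With these toy values the loss is 3.0, 4.0, and 7.0 for the three settings: raising `beta` leaves the reconstruction term untouched and only scales the pressure toward the prior.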
In geoscience, disentangled representations from Beta-VAE might separate factors like lithology, fluid content, and noise in well-log embeddings.
Sparse and Contractive Auto-encoders
Sparse Auto-encoders
Instead of compressing via a small bottleneck, a sparse auto-encoder uses a wide hidden layer but adds a sparsity constraint that forces most neurons to be inactive for any given input:
$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \lambda \sum_{j} |h_j|$$
where $h_j$ are the hidden activations and $\lambda$ sets the strength of the penalty. Alternatively, a KL-divergence penalty encourages the average activation $\hat{\rho}_j$ of each neuron to be close to a small target value $\rho$. This learns a distributed, sparse representation, loosely analogous to the sparse firing of biological neurons.
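Both sparsity penalties can be sketched in a few lines. The KL version assumes activations in (0, 1), for example from a sigmoid layer, and the values of `lam` and `rho` below are arbitrary illustrative choices, not recommendations from the text.

```python
import numpy as np

def l1_sparsity(h, lam):
    """L1 penalty on hidden activations h of shape (batch, units), batch-averaged."""
    return lam * np.mean(np.sum(np.abs(h), axis=1))

def kl_sparsity(h, rho, eps=1e-8):
    """Sum over units of KL(rho || rho_hat_j), where rho_hat_j is the mean activation."""
    rho_hat = np.clip(h.mean(axis=0), eps, 1.0 - eps)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

rho = 0.1                       # hypothetical target activation
h_sparse = np.full((4, 5), 0.1) # units fire at the target rate
h_dense = np.full((4, 5), 0.5)  # units fire much more often

print(l1_sparsity(h_sparse, lam=0.01), l1_sparsity(h_dense, lam=0.01))
print(kl_sparsity(h_sparse, rho), kl_sparsity(h_dense, rho))
```

The KL penalty is zero when every unit's average activation sits at the target and grows as units become more active, whereas the L1 penalty always prefers smaller activations.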
Contractive Auto-encoders
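A contractive auto-encoder penalizes the sensitivity of the encoder to its inputs by adding a penalty on the Frobenius norm of the Jacobian:
$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \lambda \left\lVert J_f(x) \right\rVert_F^2, \qquad J_f(x) = \frac{\partial f_\phi(x)}{\partial x}$$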
This encourages the encoder to produce representations that are locally insensitive to small changes in input, learning features that are robust to noise. Contractive auto-encoders are theoretically connected to denoising auto-encoders: both encourage robustness, but from different mathematical angles.
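For a single sigmoid encoder layer $h = \operatorname{sigmoid}(Wx + b)$, the Jacobian is $\operatorname{diag}(h \odot (1 - h))\, W$, so the penalty has a cheap closed form. The sketch below assumes that one-layer encoder (deeper encoders would need automatic differentiation) and checks the closed form against the explicit Jacobian.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of dh/dx for h = sigmoid(W @ x + b), single input x."""
    h = sigmoid(W @ x + b)
    # Row j of the Jacobian is h_j * (1 - h_j) * W[j], so the squared norm factorizes.
    return np.sum((h * (1.0 - h)) ** 2 * np.sum(W**2, axis=1))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))  # hypothetical encoder weights: 6 inputs, 4 hidden units
b = rng.normal(size=4)
x = rng.normal(size=6)

# Cross-check against the explicitly built Jacobian.
h = sigmoid(W @ x + b)
J = (h * (1.0 - h))[:, None] * W
penalty = contractive_penalty(x, W, b)
print(penalty, np.sum(J**2))
```

The factor $h_j (1 - h_j)$ is largest for units near 0.5, so the penalty pushes hidden units toward saturation in directions where the input varies without changing the reconstruction.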
Beyond Reconstruction: Feature Learning and Generation
Auto-encoders as Feature Extractors
The encoder portion of a trained auto-encoder can serve as a powerful feature extractor. The latent representation captures the most salient information about the input in a compact form. These learned features can then be fed to downstream classifiers (SVM, Random Forest) that may outperform models trained on raw features.
Pre-training strategy: (1) Train an auto-encoder on a large unlabeled dataset (plentiful in geoscience). (2) Discard the decoder. (3) Use the encoder to extract features for a labeled dataset (scarce). (4) Train a classifier on the extracted features. This is a form of self-supervised pre-training.
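The four-step strategy above can be sketched end to end on synthetic data. Two simplifications are assumed so the example stays short and dependency-free: a PCA projection stands in for the trained encoder, and a nearest-centroid rule stands in for the SVM or Random Forest classifier. The data generator `sample` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 2
basis = rng.normal(size=(d, n))

def sample(label, count):
    """Synthetic class data: the two classes differ along one latent direction."""
    center = np.array([3.0, 0.0]) if label == 1 else np.array([-3.0, 0.0])
    latent = center + rng.normal(size=(count, d))
    return latent @ basis + 0.1 * rng.normal(size=(count, n))

# (1) Fit the stand-in encoder on a large unlabeled set (PCA, NOT a trained network).
X_unlabeled = np.vstack([sample(0, 500), sample(1, 500)])
mean = X_unlabeled.mean(axis=0)
_, _, Vt = np.linalg.svd(X_unlabeled - mean, full_matrices=False)
V_d = Vt[:d].T

# (2) Keep only the encoder; there is no decoder from here on.
def encode(x):
    return (x - mean) @ V_d

# (3) Extract features for a small labeled set.
X_lab = np.vstack([sample(0, 10), sample(1, 10)])
y_lab = np.array([0] * 10 + [1] * 10)
Z_lab = encode(X_lab)

# (4) Train a simple classifier on the extracted features (nearest centroid).
centroids = np.stack([Z_lab[y_lab == c].mean(axis=0) for c in (0, 1)])

def predict(x):
    dists = np.linalg.norm(encode(x)[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

X_test = np.vstack([sample(0, 200), sample(1, 200)])
y_test = np.array([0] * 200 + [1] * 200)
accuracy = np.mean(predict(X_test) == y_test)
print(accuracy)
```

Only 20 labeled samples are used to fit the classifier. The 1000 unlabeled samples contribute solely through the encoder, which is the point of the strategy.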
Geoscience: Seismic Attribute Extraction and Well Log Imputation
Seismic attribute extraction: Train an auto-encoder on multi-attribute seismic volumes. The bottleneck features capture the essential variability in seismic character, which can then be used for facies classification or reservoir characterization with fewer, more informative inputs.
Well log imputation: When certain logs (e.g., sonic, density) are missing in older wells, a denoising auto-encoder trained on complete well-log suites can reconstruct the missing curves from available ones. The DAE learns the multivariate relationships between log curves and "fills in" missing data while properly handling measurement noise.
Synthetic data generation: VAEs trained on core photographs or thin-section images can generate synthetic training examples, augmenting small labeled datasets for image-based classification tasks.
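For the well log imputation use case, one way to build training pairs is to mask entire curves in complete wells and ask the network to reconstruct the full suite. The source does not specify the corruption scheme, so zero-masking is an assumption here, and the column layout is hypothetical.

```python
import numpy as np

def mask_curves(logs, missing_cols):
    """Zero out whole curves (columns) of a (depth, curves) array.

    Returns the corrupted copy and a boolean mask marking the available curves.
    """
    available = np.ones(logs.shape[1], dtype=bool)
    available[missing_cols] = False
    corrupted = logs * available  # broadcasts the column mask over all depths
    return corrupted, available

# Hypothetical suite with three curves (columns) at four depth samples.
logs = np.arange(12, dtype=float).reshape(4, 3) + 1.0

corrupted, available = mask_curves(logs, missing_cols=[1])
# Training pair for a denoising auto-encoder: input = corrupted, target = logs.
print(corrupted)
print(available)
```

In practice the logs would be standardized first so that zero corresponds to each curve's mean, and the set of masked curves would be varied during training to mimic the different gaps found in older wells.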
References
- Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, ch. 14 (autoencoders). MIT Press.
- Kingma, D.P., Welling, M. (2014). Auto-encoding variational Bayes. ICLR.
- Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A. (2010). Stacked denoising autoencoders. J. Mach. Learn. Res. 11, 3371–3408.
- Mousavi, S.M., Beroza, G.C. (2022). Deep-learning seismology. Science 377, eabm4470.