From Sensors to Petabytes: The Geoscience Data Pipeline

Part 12, Chapter 12: Big Data Pipelines in Earth Science

Learning objectives

Define big data and the 5 V's
Describe the end-to-end ML pipeline
Understand train/test/validation splits and their purpose
Formulate a cost function for a learning task
Recognize overfitting and underfitting

What Is Big Data?

Big data refers to datasets that are so large, fast-moving, or complex that traditional tools cannot store, manage, or analyze them effectively. In geosciences, big data is everywhere:

3D/4D Seismic Surveys: A single survey can produce terabytes of data, made up of billions of amplitude samples arranged on a 3D grid.
Satellite Imagery: NASA's Landsat program generates petabytes of Earth observation data. Sentinel satellites add terabytes daily.
Well Logs: A large oil company may have hundreds of thousands of wells, each with dozens of log curves sampled every 15 cm.
Sensor Networks: Seismograph networks, GPS stations, weather stations, and ocean buoys all stream data 24/7.
Geochemical Databases: Millions of rock, soil, and water samples with multi-element analyses.
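To make the seismic volume figure concrete, here is a back-of-the-envelope estimate. The grid dimensions, sample size, and prestack fold are illustrative assumptions, not values from a real survey.

```python
# Hypothetical survey geometry (illustrative numbers only).
n_inlines = 2_000
n_crosslines = 2_000
samples_per_trace = 1_500
bytes_per_sample = 4   # assumes 32-bit float amplitudes
prestack_fold = 60     # assumed number of traces per bin before stacking


def cube_bytes(n_il, n_xl, n_samp, n_bytes):
    """Raw amplitude storage for a regular 3D grid, ignoring trace headers."""
    return n_il * n_xl * n_samp * n_bytes


n_samples = n_inlines * n_crosslines * samples_per_trace
stacked = cube_bytes(n_inlines, n_crosslines, samples_per_trace, bytes_per_sample)
prestack = stacked * prestack_fold

print(f"samples in stacked cube: {n_samples:,}")
print(f"stacked cube:  {stacked / 1e9:.1f} GB")
print(f"prestack data: {prestack / 1e12:.2f} TB")
```

Under these assumptions the stacked cube holds 6 billion samples (24 GB), and the prestack data behind it reaches about 1.4 TB.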

The 5 V's of Big Data

1. Volume: The sheer amount of data. A seismic survey may contain 10+ TB. Global satellite archives are measured in petabytes and are growing toward the exabyte scale.

2. Velocity: The speed at which data is generated. Real-time seismic monitoring, streaming GPS, hourly satellite passes.

3. Variety: Different data types and formats. Structured tables (well logs), unstructured text (geological reports), images (thin sections, satellite), time series (seismograms).

4. Veracity: Uncertainty and quality of data. Sensor noise, missing measurements, inconsistent labeling, human error in geological descriptions.

5. Value: The actionable insights extracted from data. A terabyte of seismic data is worthless unless it leads to better subsurface models, safer construction, or successful exploration.

The Machine Learning Pipeline

Building an ML model is not just "throw data at an algorithm." It follows a structured pipeline:

The pipeline diagram traces these stages end to end and shows how the data volume shrinks at each one. Watch the funnel: raw signal (~10 TB/day) is ingested, cleaned, and distilled into compact feature vectors (~GB) before the model ever sees it, and the prediction is only a few KB. The model is the small end; most of the work, and most of the bugs, live upstream in ingest, QC, and feature engineering.

Step 1: Data Collection

Gather relevant data. In geoscience: download well logs from databases, acquire seismic surveys, collect field samples. This is often the most time-consuming step.

Step 2: Data Preprocessing

Clean and prepare the data. This includes:

Handling missing values (imputation, removal)
Outlier detection (erroneous sensor readings)
Normalization/standardization: Scale features to similar ranges so no single feature dominates. Common approaches:
Min-max scaling: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
Z-score standardization: $x' = \frac{x - \mu}{\sigma}$
Format conversion: merging databases, aligning coordinate systems
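The imputation and scaling steps above can be sketched in a few lines of NumPy. The gamma-ray readings and the choice of median imputation are assumptions made for illustration.

```python
import numpy as np

# Hypothetical gamma-ray readings (API units) with one missing value.
gamma_ray = np.array([45.0, 60.0, np.nan, 120.0, 75.0])

# Impute the missing value with the median of the observed readings.
median = np.nanmedian(gamma_ray)
filled = np.where(np.isnan(gamma_ray), median, gamma_ray)


def min_max_scale(x):
    """Map values linearly onto [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())


def z_score(x):
    """Shift to zero mean and scale to unit standard deviation."""
    return (x - x.mean()) / x.std()


scaled = min_max_scale(filled)
standardized = z_score(filled)

print("filled:      ", filled)
print("min-max:     ", np.round(scaled, 3))
print("standardized:", np.round(standardized, 3))
```

In a real pipeline the scaling statistics ($x_{\min}$, $x_{\max}$, $\mu$, $\sigma$) should be computed on the training set only and then reused for validation and test data. Otherwise information from the held-out data leaks into training.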

Step 3: Feature Engineering

Create informative input features from raw data. Examples: compute the derivative of a well-log curve, extract texture features from a thin-section image, compute spectral decomposition of seismic data. Good features often matter more than the choice of algorithm.
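As one example, the depth derivative of a well-log curve can be computed with `numpy.gradient` and stacked next to the raw curve. The density values below are synthetic, and the 15 cm spacing simply echoes the sampling interval mentioned earlier.

```python
import numpy as np

# Hypothetical log: 20 depth samples at 15 cm spacing, starting at 1000 m.
depth = 1000.0 + 0.15 * np.arange(20)

# Synthetic bulk-density curve (g/cm^3): a linear trend, for illustration only.
density = 2.3 + 0.05 * (depth - 1000.0)

# Derivative with respect to depth (central differences in the interior,
# one-sided differences at the two ends).
density_gradient = np.gradient(density, depth)

# Feature matrix: one row per depth sample, columns = [raw curve, derivative].
features = np.column_stack([density, density_gradient])

print(features.shape)
print(np.round(density_gradient[:3], 4))
```

Real log curves are noisy, so the curve would usually be smoothed before differentiating, since differentiation amplifies noise.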

Step 4: Model Selection

Choose an appropriate ML algorithm. This depends on the problem type (classification vs. regression), data size, feature types, and domain knowledge. We will study many algorithms in this course.

Step 5: Training

Feed the training data to the algorithm and let it learn the model parameters by minimizing a cost function (also called loss function or objective function).
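A minimal sketch of what "minimizing a cost function" looks like in practice is batch gradient descent on a linear model with a mean-squared-error cost. The synthetic depth and temperature data, the learning rate, and the iteration count are all assumed values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: temperature rising linearly with scaled depth (assumed values).
x = rng.uniform(0.0, 1.0, size=200)
y = 10.0 + 30.0 * x  # noiseless, so the exact parameters are recoverable

X = np.column_stack([np.ones_like(x), x])  # bias column + one feature
theta = np.zeros(2)                        # parameters to learn
learning_rate = 0.5                        # assumed step size


def mse_cost(theta, X, y):
    """Mean squared error of the linear model X @ theta."""
    residual = X @ theta - y
    return np.mean(residual ** 2)


initial_cost = mse_cost(theta, X, y)
for _ in range(2000):
    residual = X @ theta - y
    gradient = (2.0 / len(y)) * (X.T @ residual)  # gradient of the MSE cost
    theta -= learning_rate * gradient
final_cost = mse_cost(theta, X, y)

print("learned parameters:", np.round(theta, 4))
print(f"cost: {initial_cost:.3f} -> {final_cost:.3e}")
```

Each iteration moves the parameters a small step against the gradient of the cost. Too large a learning rate makes the updates diverge, and too small a rate makes training slow.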

Step 6: Evaluation

Assess model performance on held-out data (data the model has never seen). Common metrics: accuracy, precision, recall, F1-score (classification); MSE, RMSE, R-squared (regression).
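The metrics listed above can be computed directly from their definitions. The labels and predictions below are made-up held-out examples, used only to show the arithmetic.

```python
import numpy as np

# Classification: hypothetical held-out labels, 1 = sandstone, 0 = shale.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Regression: hypothetical true and predicted values.
r_true = np.array([2.0, 4.0, 6.0, 8.0])
r_pred = np.array([2.5, 3.5, 6.5, 7.5])

mse = np.mean((r_pred - r_true) ** 2)
rmse = np.sqrt(mse)
ss_res = np.sum((r_true - r_pred) ** 2)
ss_tot = np.sum((r_true - r_true.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F1={f1:.2f}")
print(f"MSE={mse:.2f} RMSE={rmse:.2f} R^2={r_squared:.2f}")
```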

Step 7: Deployment

Put the trained model into production. In geoscience: apply the lithology classifier to a new well, use the seismic facies model on an unexplored survey area, integrate into a real-time monitoring system.

Train / Test / Validation Split

Never evaluate your model on the same data you trained it on! This would be like a student grading their own homework. We split data into:

Training set (typically 60-80%): Used to fit the model parameters.
Validation set (typically 10-20%): Used to tune hyperparameters (learning rate, model complexity) and prevent overfitting during development.
Test set (typically 10-20%): Used only once at the very end to estimate real-world performance. The model never sees this data during training or tuning.

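For small datasets, use k-fold cross-validation: divide the data into $k$ folds, train on $k - 1$ folds, validate on the remaining fold, and repeat $k$ times so that each fold serves as the validation set exactly once. The $k$ validation scores are then averaged.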
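Both splitting schemes can be sketched with index arrays. The 70/15/15 proportions, the sample count, and the helper name `k_fold_indices` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 100

# Shuffle once, then cut into train / validation / test (70 / 15 / 15).
indices = rng.permutation(n_samples)
n_train, n_val = 70, 15
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]


def k_fold_indices(n, k, rng):
    """Yield (train, validation) index arrays for each of k folds."""
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val


folds = list(k_fold_indices(n_samples, k=5, rng=rng))
for i, (train, val) in enumerate(folds):
    print(f"fold {i}: {len(train)} training samples, {len(val)} validation samples")
```

A purely random split assumes the samples are independent. When many samples come from the same well or survey line, splitting by well or by area is usually the safer choice, because neighbouring samples are strongly correlated.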

The Cost Function

A cost function (or loss function) measures how wrong the model's predictions are. Training = finding parameters that minimize the cost function.

The general form of an empirical cost function is:

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L\bigl(h_\theta(x^{(i)}),\, y^{(i)}\bigr)$

where:

$\theta$ = model parameters (weights, biases)
$m$ = number of training examples
$h_\theta(x^{(i)})$ = model prediction for input $x^{(i)}$
$y^{(i)}$ = true label/value for example $i$
$L$ = loss function for a single example

Common Loss Functions

Mean Squared Error (MSE), for regression:

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2$

Cross-Entropy Loss, for binary classification:

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \bigl[y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)}))\bigr]$
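Both losses translate almost line for line into NumPy. The clipping constant `eps` is an assumed safeguard against taking $\log 0$, and it is not part of the formula itself.

```python
import numpy as np


def mse(y_pred, y_true):
    """Mean squared error over all examples."""
    return np.mean((y_pred - y_true) ** 2)


def binary_cross_entropy(p_pred, y_true, eps=1e-12):
    """Binary cross-entropy; p_pred holds predicted probabilities of class 1."""
    # Clip so that log(0) never occurs (eps is an assumed numerical safeguard).
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))


y = np.array([1.0, 0.0, 1.0, 0.0])
confident_right = np.array([0.9, 0.1, 0.9, 0.1])
confident_wrong = np.array([0.1, 0.9, 0.1, 0.9])

loss_right = binary_cross_entropy(confident_right, y)
loss_wrong = binary_cross_entropy(confident_wrong, y)
mse_example = mse(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0]))

print(f"cross-entropy, confident and right: {loss_right:.4f}")
print(f"cross-entropy, confident and wrong: {loss_wrong:.4f}")
print(f"MSE example: {mse_example:.4f}")
```

Cross-entropy punishes confident mistakes much more heavily than cautious ones, which is why the second loss is far larger than the first.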

Overfitting and Underfitting

Underfitting (High Bias)

The model is too simple to capture the underlying pattern. Training error is high, and test error is high as well. Example: fitting a straight line to a clearly curved relationship between depth and temperature.

Overfitting (High Variance)

The model is too complex and memorizes the training data, including noise. Training error is very low, but test error is high. Example: fitting a 20th-degree polynomial to 25 data points. The curve passes through or very near every point but oscillates wildly between them.

Good Fit

The model captures the true pattern without memorizing noise. Both training and test errors are acceptably low. This is the sweet spot we aim for.
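The three regimes can be reproduced by fitting polynomials of increasing degree to the same noisy data and comparing training and test error. The sine-shaped ground truth, the noise level, and the degrees 1, 3, and 15 are assumptions for this sketch. Degree 15 is used instead of 20 only to keep the least-squares fit numerically well behaved.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)


def truth(x):
    """Assumed underlying pattern (unknown in a real problem)."""
    return np.sin(2.0 * np.pi * x)


# 25 noisy training points and 200 noisy test points from the same process.
x_train = np.sort(rng.uniform(0.0, 1.0, 25))
y_train = truth(x_train) + rng.normal(0.0, 0.2, 25)
x_test = np.sort(rng.uniform(0.0, 1.0, 200))
y_test = truth(x_test) + rng.normal(0.0, 0.2, 200)


def fit_and_score(degree):
    """Least-squares polynomial fit; returns (training MSE, test MSE)."""
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    test_mse = np.mean((model(x_test) - y_test) ** 2)
    return train_mse, test_mse


results = {degree: fit_and_score(degree) for degree in (1, 3, 15)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

Training error can only go down as the degree increases, because each higher-degree model contains the lower-degree ones. The test error is what reveals the problem: it is typically high for the straight line (underfitting), lowest near the middle degree, and rises again once the polynomial starts chasing noise.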

How to combat overfitting:

Get more training data
Reduce model complexity (fewer parameters)
Regularization (add a penalty for large weights)
Early stopping (stop training before the model memorizes noise)
Dropout (randomly disable neurons during training, for neural networks)
Cross-validation to detect overfitting early
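As one concrete instance of regularization, here is a sketch of ridge regression (an L2 penalty on the weights) solved in closed form. The synthetic data, the penalty strength `lam`, and the helper name `ridge_fit` are assumptions for illustration. The model has no intercept, so every weight is penalized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic problem: 30 samples, 10 features, only the first two matter.
X = rng.normal(size=(30, 10))
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]
y = X @ true_w + rng.normal(0.0, 0.5, 30)


def ridge_fit(X, y, lam):
    """Minimize ||X w - y||^2 + lam * ||w||^2 in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


w_ols = ridge_fit(X, y, lam=0.0)     # no penalty: ordinary least squares
w_ridge = ridge_fit(X, y, lam=10.0)  # assumed penalty strength

print(f"weight norm without penalty: {np.linalg.norm(w_ols):.3f}")
print(f"weight norm with penalty:    {np.linalg.norm(w_ridge):.3f}")
```

The penalty shrinks the weights toward zero, trading a little extra training error for a smoother, less noise-sensitive model. The penalty strength is a hyperparameter and would normally be chosen on the validation set.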

References

Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), ch. 7 (model assessment and selection). Springer.
James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.), ch. 2 & 5 (statistical learning, resampling). Springer.
Bergen, K.J., Johnson, P.A., de Hoop, M.V., Beroza, G.C. (2019). Machine learning for data-driven discovery in solid Earth geoscience. Science 363, eaau0323.
Reichstein, M., Camps-Valls, G., Stevens, B., et al. (2019). Deep learning and process understanding for data-driven Earth-system science. Nature 566, 195-204.