From Sensors to Petabytes: The Geoscience Data Pipeline
Learning objectives
- Define big data and the 5 V's
- Describe the end-to-end ML pipeline
- Understand train/test/validation splits and their purpose
- Formulate a cost function for a learning task
- Recognize overfitting and underfitting
What Is Big Data?
Big data refers to datasets that are so large, fast-moving, or complex that traditional tools cannot store, manage, or analyze them effectively. In geosciences, big data is everywhere:
- 3D/4D Seismic Surveys: A single survey can produce terabytes of data in the form of billions of amplitude samples arranged on a 3D grid.
- Satellite Imagery: NASA's Landsat program generates petabytes of Earth observation data. Sentinel satellites add terabytes daily.
- Well Logs: A large oil company may have hundreds of thousands of wells, each with dozens of log curves sampled every 15 cm.
- Sensor Networks: Seismograph networks, GPS stations, weather stations, and ocean buoys stream data 24/7.
- Geochemical Databases: Millions of rock, soil, and water samples with multi-element analyses.
The 5 V's of Big Data
1. Volume: The sheer amount of data. A seismic survey may contain 10+ TB. Global satellite archives are approaching the exabyte scale.
2. Velocity: The speed at which data is generated. Real-time seismic monitoring, streaming GPS, hourly satellite passes.
3. Variety: Different data types and formats. Structured tables (well logs), unstructured text (geological reports), images (thin sections, satellite), time series (seismograms).
4. Veracity: Uncertainty and quality of data. Sensor noise, missing measurements, inconsistent labeling, human error in geological descriptions.
5. Value: The actionable insights extracted from data. A terabyte of seismic data is worthless unless it leads to better subsurface models, safer construction, or successful exploration.
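To make the Volume figure concrete, here is a back-of-the-envelope estimate of how large a seismic survey can get. Every survey dimension below is a hypothetical value chosen for illustration, not a number from a real acquisition.

```python
# Rough size estimate for a 3D seismic survey.
# All dimensions are hypothetical and only meant to show the arithmetic.
n_inlines = 2000
n_crosslines = 2000
n_samples = 1500        # time samples per trace
bytes_per_sample = 4    # amplitudes stored as 32-bit floats
n_offsets = 60          # assumed number of prestack traces per bin

n_amplitudes = n_inlines * n_crosslines * n_samples
stacked_bytes = n_amplitudes * bytes_per_sample
prestack_bytes = stacked_bytes * n_offsets

stacked_gb = stacked_bytes / 1e9
prestack_tb = prestack_bytes / 1e12

print(f"Stacked volume: {n_amplitudes:,} samples, about {stacked_gb:.0f} GB")
print(f"Prestack data:  about {prestack_tb:.2f} TB")
```

Under these assumptions the stacked volume alone holds billions of samples, and keeping the prestack traces pushes the survey into the terabyte range.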
The Machine Learning Pipeline
Building an ML model is not just "throw data at an algorithm." It follows a structured pipeline:
Step 1: Data Collection
Gather relevant data. In geoscience: download well logs from databases, acquire seismic surveys, collect field samples. This is often the most time-consuming step.
Step 2: Data Preprocessing
Clean and prepare the data. This includes:
- Handling missing values (imputation, removal)
- Outlier detection (erroneous sensor readings)
- Normalization/standardization: Scale features to similar ranges so no single feature dominates. Common approaches:
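Min-max scaling: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$, which maps each feature to the range $[0, 1]$.
Z-score standardization: $z = \frac{x - \mu}{\sigma}$, which gives each feature zero mean and unit standard deviation.
- Format conversion: merging databases, aligning coordinate systems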
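The preprocessing steps above can be sketched in a few lines of NumPy. The gamma-ray values are made up for illustration, and mean imputation is only one of several reasonable ways to handle a gap.

```python
import numpy as np

# Hypothetical gamma-ray readings (API units) with one missing value.
gamma_ray = np.array([45.0, 60.0, np.nan, 120.0, 75.0])

# Handle the missing value: replace NaN with the mean of the observed readings.
mean_observed = np.nanmean(gamma_ray)
filled = np.where(np.isnan(gamma_ray), mean_observed, gamma_ray)

# Min-max scaling: map the feature to the range [0, 1].
min_max = (filled - filled.min()) / (filled.max() - filled.min())

# Z-score standardization: zero mean, unit standard deviation.
z_score = (filled - filled.mean()) / filled.std()

print("filled: ", filled)
print("min-max:", np.round(min_max, 3))
print("z-score:", np.round(z_score, 3))
```

In a real project the minimum, maximum, mean, and standard deviation should be computed on the training set only and then reused for the validation and test sets, so that no information from held-out data leaks into training.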
Step 3: Feature Engineering
Create informative input features from raw data. Examples: compute the derivative of a well-log curve, extract texture features from a thin-section image, compute spectral decomposition of seismic data. Good features often matter more than the choice of algorithm.
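As one small illustration of feature engineering, the sketch below computes the depth derivative of a well-log curve with `np.gradient`. The helper name `log_gradient` and the log values are hypothetical; the idea is simply that a sharp change in the curve often marks a boundary between rock units.

```python
import numpy as np

def log_gradient(curve, depth):
    """Rate of change of a log curve with depth (curve units per metre)."""
    return np.gradient(curve, depth)

# Hypothetical gamma-ray log sampled every 15 cm, with a jump in the middle.
depth_m = 1000.0 + 0.15 * np.arange(8)
gamma_ray = np.array([40.0, 41.0, 40.5, 42.0, 85.0, 110.0, 112.0, 111.0])

gr_gradient = log_gradient(gamma_ray, depth_m)

# The depth of the steepest change is a candidate boundary.
steepest = int(np.argmax(np.abs(gr_gradient)))
print(f"Steepest change near {depth_m[steepest]:.2f} m")
```

The gradient array can be added as an extra input column next to the raw curve, giving the model direct access to "how fast is this log changing here".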
Step 4: Model Selection
Choose an appropriate ML algorithm. This depends on the problem type (classification vs. regression), data size, feature types, and domain knowledge. We will study many algorithms in this course.
Step 5: Training
Feed the training data to the algorithm and let it learn the model parameters by minimizing a cost function (also called loss function or objective function).
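A minimal sketch of what "learning the parameters by minimizing a cost function" can look like: gradient descent on the mean squared error of a straight-line model. The synthetic depth-temperature data, the learning rate, and the number of steps are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: temperature rises with depth, plus measurement noise.
depth_km = rng.uniform(0.0, 3.0, size=100)
temp_c = 15.0 + 30.0 * depth_km + rng.normal(0.0, 1.0, size=100)

# Model: temp = w * depth + b. Start from zero and improve step by step.
w, b = 0.0, 0.0
learning_rate = 0.1
costs = []

for step in range(2000):
    error = (w * depth_km + b) - temp_c
    costs.append(float(np.mean(error ** 2)))      # MSE cost at this step

    # Gradients of the MSE with respect to w and b.
    grad_w = 2.0 * np.mean(error * depth_km)
    grad_b = 2.0 * np.mean(error)

    # Move the parameters a small step downhill.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w = {w:.2f}, b = {b:.2f}, final cost = {costs[-1]:.3f}")
```

The learned slope and intercept should end up close to the values used to generate the data (30 and 15), and the recorded cost should fall as training proceeds.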
Step 6: Evaluation
Assess model performance on held-out data (data the model has never seen). Common metrics: accuracy, precision, recall, F1-score (classification); MSE, RMSE, R-squared (regression).
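The metrics listed above are simple to compute by hand, which helps in understanding what library functions report. The labels and predictions below are toy values invented for the example.

```python
import numpy as np

# --- Classification (1 = sandstone, 0 = shale; toy labels) ---
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))   # true positives
tn = int(np.sum((y_pred == 0) & (y_true == 0)))   # true negatives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))   # false positives
fn = int(np.sum((y_pred == 0) & (y_true == 1)))   # false negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# --- Regression (toy porosity values in percent) ---
t_true = np.array([2.0, 4.0, 6.0, 8.0])
t_pred = np.array([2.5, 3.5, 6.5, 7.5])

mse = float(np.mean((t_pred - t_true) ** 2))
rmse = float(np.sqrt(mse))
ss_res = float(np.sum((t_true - t_pred) ** 2))
ss_tot = float(np.sum((t_true - t_true.mean()) ** 2))
r_squared = 1.0 - ss_res / ss_tot

print(accuracy, precision, recall, f1)
print(mse, rmse, r_squared)
```

Precision and recall matter most when classes are imbalanced, for example when only a small fraction of depth intervals belong to the facies of interest.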
Step 7: Deployment
Put the trained model into production. In geoscience: apply the lithology classifier to a new well, use the seismic facies model on an unexplored survey area, integrate into a real-time monitoring system.
Train / Test / Validation Split
Never evaluate your model on the same data you trained it on! This would be like a student grading their own homework. We split data into:
- Training set (typically 60–80%): Used to fit the model parameters.
- Validation set (typically 10–20%): Used to tune hyperparameters (learning rate, model complexity) and prevent overfitting during development.
- Test set (typically 10–20%): Used only once at the very end to estimate real-world performance. The model never sees this data during training or tuning.
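One simple way to produce the three sets is to shuffle the sample indices once and cut the shuffled list into three pieces. The function name `train_val_test_split` and the 70/15/15 proportions are choices made for this sketch.

```python
import numpy as np

def train_val_test_split(n_samples, val_frac=0.15, test_frac=0.15, seed=0):
    """Return three disjoint index arrays covering 0..n_samples-1."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)       # shuffle once

    n_test = int(round(test_frac * n_samples))
    n_val = int(round(val_frac * n_samples))

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]       # everything that is left
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The returned index arrays can be used to slice both the feature matrix and the label vector, for example `X[train_idx]` and `y[train_idx]`. Fixing the seed makes the split reproducible.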
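For small datasets, use k-fold cross-validation: divide the data into $k$ folds, train on $k-1$ folds, validate on the remaining fold, and repeat $k$ times so that each fold serves as the validation set exactly once. The $k$ validation scores are then averaged.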
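The k-fold procedure can be sketched with a small generator that yields one train/validation index pair per fold. The helper `k_fold_indices`, the synthetic data, and the straight-line model are assumptions for illustration.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs, one per fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)         # k nearly equal parts
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Small synthetic dataset: temperature versus depth with noise.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 3.0, 30)
y = 15.0 + 30.0 * x + rng.normal(0.0, 1.0, size=x.size)

fold_mse = []
for train_idx, val_idx in k_fold_indices(len(x), k=5):
    # Fit a straight line on the training folds only.
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    pred = slope * x[val_idx] + intercept
    fold_mse.append(float(np.mean((pred - y[val_idx]) ** 2)))

cv_mse = float(np.mean(fold_mse))
print(f"MSE per fold: {np.round(fold_mse, 3)}, average: {cv_mse:.3f}")
```

Because every sample is used for validation exactly once, the averaged score makes fuller use of a small dataset than a single fixed validation split.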
The Cost Function
A cost function (or loss function) measures how wrong the model's predictions are. Training = finding parameters that minimize the cost function.
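The general form of an empirical cost function is:
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L\left(f_\theta(x^{(i)}),\, y^{(i)}\right)$$
where:
- $\theta$ = model parameters (weights, biases)
- $m$ = number of training examples
- $f_\theta(x^{(i)})$ = model prediction for input $x^{(i)}$
- $y^{(i)}$ = true label/value for example $i$
- $L$ = loss function for a single example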
Common Loss Functions
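Mean Squared Error (MSE), for regression:
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(f_\theta(x^{(i)}) - y^{(i)}\right)^2$$
Cross-Entropy Loss, for binary classification, where $y^{(i)} \in \{0, 1\}$ and $f_\theta(x^{(i)})$ is the predicted probability of class 1:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log f_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - f_\theta(x^{(i)})\right) \right]$$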
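Both loss functions translate almost directly into NumPy. The sketch below assumes predictions are already available as arrays; the clipping constant `eps` is an implementation detail added here to avoid taking the logarithm of zero.

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Mean squared error between predictions and true values."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

def binary_cross_entropy(p_pred, y_true, eps=1e-12):
    """Cross-entropy for labels in {0, 1} and predicted probabilities of class 1."""
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y_true, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

labels = [1, 0, 1]
confident_right = binary_cross_entropy([0.9, 0.1, 0.8], labels)
confident_wrong = binary_cross_entropy([0.1, 0.9, 0.2], labels)
print(f"good predictions: {confident_right:.3f}, bad predictions: {confident_wrong:.3f}")
```

Cross-entropy punishes confident mistakes heavily: predicting a probability near 0 for an example whose true label is 1 contributes a very large loss.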
Overfitting and Underfitting
Underfitting (High Bias)
The model is too simple to capture the underlying pattern. Training error is high, and test error is high as well. Example: fitting a straight line to a clearly curved relationship between depth and temperature.
Overfitting (High Variance)
The model is too complex and memorizes the training data, including noise. Training error is very low, but test error is high. Example: fitting a 20th-degree polynomial to 25 data points. It passes through or very close to every point but oscillates wildly between them.
Good Fit
The model captures the true pattern without memorizing noise. Both training and test errors are acceptably low. This is the sweet spot we aim for.
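The three regimes can be reproduced with polynomial fits of increasing degree. This is a sketch on synthetic data: the curved depth-temperature relationship, the noise level, and the degrees 1, 2, and 20 are all assumptions chosen to mirror the examples above.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def true_temperature(depth_km):
    # Hypothetical curved depth-temperature relationship (illustration only).
    return 15.0 + 40.0 * depth_km - 6.0 * depth_km ** 2

# 25 noisy training points and a separate, larger test set.
x_train = np.linspace(0.0, 3.0, 25)
y_train = true_temperature(x_train) + rng.normal(0.0, 1.0, size=x_train.size)
x_test = rng.uniform(0.0, 3.0, size=200)
y_test = true_temperature(x_test) + rng.normal(0.0, 1.0, size=x_test.size)

train_mse, test_mse = {}, {}
for degree in (1, 2, 20):
    # Least-squares polynomial fit of the given degree.
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = float(np.mean((model(x_train) - y_train) ** 2))
    test_mse[degree] = float(np.mean((model(x_test) - y_test) ** 2))
    print(f"degree {degree:2d}: train MSE = {train_mse[degree]:.3f}, "
          f"test MSE = {test_mse[degree]:.3f}")
```

The pattern to look for is the gap between the two columns. The straight line has a high error on both sets, the degree-20 fit drives the training error down while the test error stays clearly above it, and the quadratic keeps both errors at a similar, low level.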
How to combat overfitting:
- Get more training data
- Reduce model complexity (fewer parameters)
- Regularization (add a penalty for large weights)
- Early stopping (stop training before the model memorizes noise)
- Dropout (randomly disable neurons during training, for neural networks)
- Cross-validation to detect overfitting early
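As an illustration of regularization, the sketch below fits a degree-9 polynomial with an L2 penalty on the weights (ridge regression) using its closed-form solution. The data, the polynomial degree, and the two penalty strengths are assumptions for this example; for simplicity the penalty is also applied to the intercept, which many implementations avoid.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: minimizes ||Xw - y||^2 + lam * ||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 25)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=x.size)

# Polynomial features 1, x, x^2, ..., x^9 (a deliberately flexible model).
X = np.vander(x, N=10, increasing=True)

w_weak = ridge_fit(X, y, lam=1e-6)     # almost no penalty
w_strong = ridge_fit(X, y, lam=1.0)    # noticeable penalty

print(f"weight norm with weak penalty:   {np.linalg.norm(w_weak):.3f}")
print(f"weight norm with strong penalty: {np.linalg.norm(w_strong):.3f}")
```

A larger penalty shrinks the weights, which smooths the fitted curve and reduces its tendency to chase noise. The penalty strength is a hyperparameter and is typically chosen on the validation set or by cross-validation.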