Workshop: Working with NumPy, Pandas, and Matplotlib

Part 12, Chapter 12: Big Data Pipelines in Earth Science

Learning objectives

Create and manipulate NumPy arrays
Perform basic operations with Pandas DataFrames
Create plots with Matplotlib
Apply these tools to geoscience data

Lab preview. This lab works with NumPy arrays, Pandas DataFrames, and Matplotlib. These libraries serve a data pipeline in which raw signals become arrays, tables, and feature vectors: the shapes you'll manipulate in this lab.

NumPy: Numerical Computing

NumPy (Numerical Python) is the foundation of scientific computing in Python. Its core object is the ndarray, a fast, memory-efficient multidimensional array.

Import NumPy with the standard alias:

%%%python
import numpy as np
%%%

Creating Arrays

%%%python
# From a list
depths = np.array([100, 250, 500, 750, 1000])
print(depths)        # [100 250 500 750 1000]
print(type(depths))  # <class 'numpy.ndarray'>

# Useful constructors
zeros = np.zeros(5)               # [0. 0. 0. 0. 0.]
ones = np.ones(3)                 # [1. 1. 1.]
range_arr = np.arange(0, 10, 2)   # [0 2 4 6 8]
linspace = np.linspace(0, 1, 5)   # [0. 0.25 0.5 0.75 1.]

# 2D array (matrix)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
print(matrix.shape)  # (2, 3) -> 2 rows, 3 columns
%%%

Array Properties

%%%python
data = np.array([2.65, 2.71, 2.54, 2.80, 2.62])
print(data.shape)  # (5,) -> 1D array with 5 elements
print(data.dtype)  # float64
print(data.size)   # 5
print(data.ndim)   # 1 (number of dimensions)
%%%
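The dtype fixes how many bytes each element occupies, which is where much of the ndarray's memory efficiency comes from. The following is a minimal sketch under stated assumptions: the name `densities` and the conversion to float32 are illustrative choices, not part of the lab data.

```python
import numpy as np

# Same five values as `data` above, under an illustrative name
densities = np.array([2.65, 2.71, 2.54, 2.80, 2.62])

print(densities.dtype)     # float64
print(densities.itemsize)  # 8 (bytes per element)
print(densities.nbytes)    # 40 (5 elements * 8 bytes)

# astype returns a converted copy; the original array is unchanged
densities32 = densities.astype(np.float32)
print(densities32.dtype)   # float32
print(densities32.nbytes)  # 20 (5 elements * 4 bytes)
```

Halving the element size halves the memory footprint, at the cost of precision (float32 keeps roughly 7 significant decimal digits).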

Array Operations (Vectorized)

NumPy performs operations on every element at once (vectorized), which is much faster than Python loops:

%%%python
# Element-wise arithmetic
velocities = np.array([2000, 3500, 4500, 5500])
thicknesses = np.array([200, 350, 500, 150])
times = thicknesses / velocities  # element-wise division
print(times)  # [0.1 0.1 0.1111 0.0273]

# Mathematical functions
porosity = np.array([0.15, 0.22, 0.31, 0.18])
log_poro = np.log10(porosity)  # log of each element
print(log_poro)  # [-0.824 -0.658 -0.509 -0.745]

# Statistics
print(np.mean(porosity))  # 0.215
print(np.std(porosity))   # 0.0602
print(np.min(porosity))   # 0.15
print(np.max(porosity))   # 0.31
%%%
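To make the contrast with Python loops concrete, the sketch below computes the same element-wise division twice, once with an explicit loop and once vectorized, and checks that the results agree. The names `times_loop` and `times_vec` are introduced here for illustration only.

```python
import numpy as np

velocities = np.array([2000, 3500, 4500, 5500])
thicknesses = np.array([200, 350, 500, 150])

# Explicit Python loop: one division per iteration
times_loop = []
for h, v in zip(thicknesses, velocities):
    times_loop.append(h / v)
times_loop = np.array(times_loop)

# Vectorized: the same divisions written as a single expression
times_vec = thicknesses / velocities

print(np.allclose(times_loop, times_vec))  # True
```

Both versions give the same numbers; the vectorized form runs the loop in compiled code, so the gap in speed grows with array size.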

Indexing and Slicing

%%%python
arr = np.array([10, 20, 30, 40, 50])
print(arr[0])         # 10
print(arr[1:4])       # [20 30 40]
print(arr[arr > 25])  # [30 40 50] (boolean indexing!)

# 2D indexing
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
print(matrix[0, 2])    # 3 (row 0, column 2)
print(matrix[:, 1])    # [2 5 8] (all rows, column 1)
print(matrix[1:, :2])  # [[4,5],[7,8]] (rows 1+, cols 0-1)
%%%
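Boolean indexing works because a comparison produces a boolean array of the same shape, which can be stored, combined, and counted. A short sketch follows; the names `mask`, `between`, and `clipped`, and the cutoff values, are illustrative assumptions.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# A comparison yields a boolean array of the same shape
mask = arr > 25
print(mask.tolist())   # [False, False, True, True, True]

# Combine conditions with & (and) or | (or); the parentheses are required
between = arr[(arr > 15) & (arr < 45)]
print(between)         # [20 30 40]

# Count matches: each True counts as 1
print(mask.sum())      # 3

# Assign through a mask, working on a copy so arr stays unchanged
clipped = arr.copy()
clipped[clipped > 25] = 25
print(clipped)         # [10 20 25 25 25]
```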

Reshaping

%%%python
arr = np.arange(12)           # [0, 1, 2, ..., 11]
reshaped = arr.reshape(3, 4)  # 3 rows, 4 columns
print(reshaped)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print(reshaped.T)  # transpose: 4 rows, 3 columns
%%%
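A reshape only succeeds when the new shape holds exactly the same number of elements. The sketch below illustrates that constraint, along with the `-1` placeholder that lets NumPy infer one dimension; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to infer that dimension from the array size
two_rows = arr.reshape(2, -1)
print(two_rows.shape)             # (2, 6)

# The transpose swaps the axes
print(arr.reshape(3, 4).T.shape)  # (4, 3)

# A shape whose product differs from arr.size raises ValueError
try:
    arr.reshape(5, 3)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)

# ravel flattens back to 1D
flat = two_rows.ravel()
print(flat.shape)                 # (12,)
```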

Pandas: Data Analysis

Pandas provides the DataFrame, a table (like a spreadsheet) with named columns. It is the go-to tool for loading, cleaning, and exploring structured data.

%%%python
import pandas as pd
%%%

Creating a DataFrame

%%%python
# From a dictionary
data = {
    "well_id": ["WL-001", "WL-002", "WL-003", "WL-004", "WL-005"],
    "depth_m": [1200, 1850, 950, 2100, 1600],
    "porosity": [0.22, 0.18, 0.31, 0.15, 0.24],
    "lithology": ["sandstone", "shale", "sandstone", "limestone", "sandstone"]
}
df = pd.DataFrame(data)
print(df)
%%%

  well_id  depth_m  porosity  lithology
0  WL-001     1200      0.22  sandstone
1  WL-002     1850      0.18      shale
2  WL-003      950      0.31  sandstone
3  WL-004     2100      0.15  limestone
4  WL-005     1600      0.24  sandstone
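Since Pandas is also the tool for loading data, here is a minimal sketch of reading the same table from CSV text. The names `csv_text` and `df_csv` are illustrative assumptions, and `io.StringIO` stands in for a real file path so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical CSV content mirroring the table above; in practice
# pd.read_csv would usually be given a path to a file on disk
csv_text = """well_id,depth_m,porosity,lithology
WL-001,1200,0.22,sandstone
WL-002,1850,0.18,shale
WL-003,950,0.31,sandstone
WL-004,2100,0.15,limestone
WL-005,1600,0.24,sandstone
"""

df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.shape)              # (5, 4)
print(df_csv["porosity"].dtype)  # float64
print(list(df_csv.columns))      # ['well_id', 'depth_m', 'porosity', 'lithology']
```

`read_csv` infers the column types from the text, so the numeric columns arrive ready for arithmetic.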

Exploring Data

%%%python
# First few rows
print(df.head(3))  # first 3 rows

# Summary statistics (output abridged: quartile rows omitted)
print(df.describe())
#        depth_m  porosity
# count     5.00      5.00
# mean   1540.00      0.22
# std     468.24      0.06
# min     950.00      0.15
# max    2100.00      0.31

# Column names and types
print(df.columns)  # Index(['well_id', 'depth_m', ...])
print(df.dtypes)

# Shape
print(df.shape)  # (5, 4) -> 5 rows, 4 columns
%%%

Accessing Columns and Filtering

%%%python
# Access a single column (returns a Series)
print(df["porosity"])

# Access multiple columns
print(df[["well_id", "porosity"]])

# Filter rows
high_poro = df[df["porosity"] > 0.20]
print(high_poro)
#   well_id  depth_m  porosity  lithology
# 0  WL-001     1200      0.22  sandstone
# 2  WL-003      950      0.31  sandstone
# 4  WL-005     1600      0.24  sandstone

# Multiple conditions (use & for AND, | for OR)
deep_porous = df[(df["depth_m"] > 1000) & (df["porosity"] > 0.20)]
print(deep_porous)
%%%
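The example above demonstrates `&` only. The sketch below rounds it out with `|` (OR), `~` (NOT), and `isin` on the same well table; the thresholds and variable names are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "well_id": ["WL-001", "WL-002", "WL-003", "WL-004", "WL-005"],
    "depth_m": [1200, 1850, 950, 2100, 1600],
    "porosity": [0.22, 0.18, 0.31, 0.15, 0.24],
    "lithology": ["sandstone", "shale", "sandstone", "limestone", "sandstone"],
})

# OR: wells that are shallow or highly porous (each condition in parentheses)
shallow_or_porous = df[(df["depth_m"] < 1000) | (df["porosity"] > 0.23)]
print(shallow_or_porous["well_id"].tolist())  # ['WL-003', 'WL-005']

# NOT: invert a condition with ~
not_sandstone = df[~(df["lithology"] == "sandstone")]
print(not_sandstone["well_id"].tolist())      # ['WL-002', 'WL-004']

# Membership test against a list of values
clastic = df[df["lithology"].isin(["sandstone", "shale"])]
print(len(clastic))                           # 4
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the symbol operators are used here.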

Adding and Modifying Columns

%%%python
# Add a new column
df["permeability_mD"] = [150, 5, 420, 2, 200]

# Computed column
df["depth_ft"] = df["depth_m"] * 3.281
print(df.head())
%%%
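Computed columns can also come from NumPy functions, which apply element-wise to a column. The following sketch assumes the permeability values above; the column names `log_perm` and `quality`, the 100 mD cutoff, and the rename are hypothetical choices made only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "well_id": ["WL-001", "WL-002", "WL-003", "WL-004", "WL-005"],
    "porosity": [0.22, 0.18, 0.31, 0.15, 0.24],
    "permeability_mD": [150, 5, 420, 2, 200],
})

# NumPy functions apply element-wise to a column
df["log_perm"] = np.log10(df["permeability_mD"])

# Conditional column; the 100 mD cutoff is an arbitrary illustrative threshold
df["quality"] = np.where(df["permeability_mD"] >= 100, "good", "poor")
print(df["quality"].tolist())  # ['good', 'poor', 'good', 'poor', 'good']

# Rename a column; rename returns a new DataFrame
df = df.rename(columns={"porosity": "porosity_frac"})
print("porosity_frac" in df.columns)  # True
```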

Matplotlib: Visualization

Matplotlib is Python's most widely used plotting library. We typically import its pyplot module:

%%%python
import matplotlib.pyplot as plt
%%%

Line Plot

%%%python
# Geothermal gradient
depths = np.linspace(0, 5000, 100)  # 0 to 5000 m
temp = 15 + 0.03 * depths           # 15 C surface, 30 C/km

plt.figure(figsize=(6, 4))
plt.plot(temp, depths)     # note: temp on x, depth on y
plt.gca().invert_yaxis()   # depth increases downward
plt.xlabel("Temperature (C)")
plt.ylabel("Depth (m)")
plt.title("Geothermal Gradient")
plt.grid(True)
plt.tight_layout()
plt.show()
%%%

Scatter Plot

%%%python
# Porosity vs. Depth
porosity = np.array([0.35, 0.30, 0.25, 0.22, 0.18, 0.15, 0.12])
depth = np.array([200, 500, 800, 1100, 1500, 2000, 2500])

plt.figure(figsize=(6, 4))
plt.scatter(porosity, depth, c="steelblue", s=60)
plt.gca().invert_yaxis()
plt.xlabel("Porosity")
plt.ylabel("Depth (m)")
plt.title("Porosity vs. Depth")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
%%%

Histogram

%%%python
# Distribution of seismic amplitudes
np.random.seed(42)
amplitudes = np.random.normal(0, 1, 1000)  # 1000 samples

plt.figure(figsize=(6, 4))
plt.hist(amplitudes, bins=30, color="coral", edgecolor="black")
plt.xlabel("Amplitude")
plt.ylabel("Frequency")
plt.title("Seismic Amplitude Distribution")
plt.tight_layout()
plt.show()
%%%

Multiple Subplots

%%%python
# Reuses temp, depths, and amplitudes from the examples above
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Left: line plot
axes[0].plot(temp, depths)
axes[0].invert_yaxis()
axes[0].set_xlabel("Temperature (C)")
axes[0].set_ylabel("Depth (m)")
axes[0].set_title("Geothermal Gradient")

# Right: histogram
axes[1].hist(amplitudes, bins=30, color="coral")
axes[1].set_xlabel("Amplitude")
axes[1].set_title("Amplitude Distribution")

plt.tight_layout()
plt.show()
%%%

Matplotlib Essentials

plt.figure(figsize=(w, h)): create a new figure
plt.plot(x, y), plt.scatter(x, y), plt.hist(data): plot types
plt.xlabel(), plt.ylabel(), plt.title(): labels
plt.legend(): show legend
plt.grid(True): add grid lines
plt.savefig("plot.png"): save to file
plt.show(): display the plot
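Two entries in the list above, the legend and saving to file, do not appear in the earlier examples. The sketch below covers both under stated assumptions: the second gradient (25 C/km), the temporary output path, and the non-interactive Agg backend are illustrative choices, made so the script runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no window is opened
import matplotlib.pyplot as plt
import numpy as np

depths = np.linspace(0, 5000, 100)  # 0 to 5000 m

fig, ax = plt.subplots(figsize=(6, 4))
# Two hypothetical gradients; each label= becomes a legend entry
ax.plot(15 + 0.025 * depths, depths, label="25 C/km")
ax.plot(15 + 0.030 * depths, depths, label="30 C/km")
ax.invert_yaxis()
ax.set_xlabel("Temperature (C)")
ax.set_ylabel("Depth (m)")
ax.legend()
ax.grid(True)
fig.tight_layout()

# Save to a temporary, illustrative location, then release the figure
out_path = os.path.join(tempfile.mkdtemp(), "gradient.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)
print(os.path.exists(out_path))  # True
```

The legend only lists plotted elements that were given a `label`.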

References

Harris, C.R., Millman, K.J., van der Walt, S.J., et al. (2020). Array programming with NumPy. Nature 585, 357-362.
McKinney, W. (2017). Python for Data Analysis (2nd ed.), ch. 4-6 (NumPy, pandas). O’Reilly.
VanderPlas, J. (2016). Python Data Science Handbook, ch. 2-4 (NumPy, pandas, Matplotlib). O’Reilly.