Lab: Working with NumPy, Pandas, and Matplotlib
Learning objectives
- Create and manipulate NumPy arrays
- Perform basic operations with Pandas DataFrames
- Create plots with Matplotlib
- Apply these tools to geoscience data
NumPy: Numerical Computing
NumPy (Numerical Python) is the foundation of scientific computing in Python. Its core object is the ndarray, a fast, memory-efficient multidimensional array.
Import NumPy with the standard alias:
%%%python
import numpy as np
%%%
Creating Arrays
%%%python
# From a list
depths = np.array([100, 250, 500, 750, 1000])
print(depths)        # [ 100  250  500  750 1000]
print(type(depths))  # <class 'numpy.ndarray'>

# Useful constructors
zeros = np.zeros(5)                # [0. 0. 0. 0. 0.]
ones = np.ones(3)                  # [1. 1. 1.]
range_arr = np.arange(0, 10, 2)    # [0 2 4 6 8]  (stop value is excluded)
linspace = np.linspace(0, 1, 5)    # [0.   0.25 0.5  0.75 1.  ]  (stop value is included)

# 2D array (matrix)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
print(matrix.shape)  # (2, 3) -> 2 rows, 3 columns
%%%
Array Properties
%%%python
data = np.array([2.65, 2.71, 2.54, 2.80, 2.62])
print(data.shape)  # (5,) -> 1D array with 5 elements
print(data.dtype)  # float64
print(data.size)   # 5
print(data.ndim)   # 1 (number of dimensions)
%%%
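To make the memory-efficiency claim concrete, the sketch below inspects how many bytes an array occupies. It reuses the sample values from the block above; `itemsize` and `nbytes` are standard ndarray attributes, and the variable name `data32` is only illustrative.

```python
import numpy as np

# Same sample values as in the Array Properties example
data = np.array([2.65, 2.71, 2.54, 2.80, 2.62], dtype=np.float64)

print(data.itemsize)  # 8  (bytes per float64 element)
print(data.nbytes)    # 40 (5 elements * 8 bytes)

# Storing the same values at lower precision halves the footprint
data32 = data.astype(np.float32)
print(data32.nbytes)  # 20 (5 elements * 4 bytes)
```

Every element shares one dtype, so the array stores raw numbers back to back instead of a separate Python object per value.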
Array Operations (Vectorized)
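NumPy applies an operation to every element of an array in a single call (vectorization). The loop runs in compiled code rather than in the Python interpreter, which is typically much faster than an explicit Python loop: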
%%%python
# Element-wise arithmetic
velocities = np.array([2000, 3500, 4500, 5500])
thicknesses = np.array([200, 350, 500, 150])
times = thicknesses / velocities  # element-wise division
print(times)  # approx. [0.1 0.1 0.1111 0.0273]

# Mathematical functions
porosity = np.array([0.15, 0.22, 0.31, 0.18])
log_poro = np.log10(porosity)  # base-10 log of each element
print(log_poro)  # approx. [-0.824 -0.658 -0.509 -0.745]

# Statistics
print(np.mean(porosity))  # 0.215
print(np.std(porosity))   # approx. 0.0602 (population std, ddof=0)
print(np.min(porosity))   # 0.15
print(np.max(porosity))   # 0.31
%%%
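As a rough illustration of what vectorization replaces, the sketch below computes the same layer travel times with an explicit loop and with a single array expression, then checks that the two agree. The units (m and m/s) are an assumption; the original example does not state them.

```python
import numpy as np

velocities = np.array([2000, 3500, 4500, 5500])  # assumed m/s
thicknesses = np.array([200, 350, 500, 150])     # assumed m

# Explicit Python loop: one division per iteration
times_loop = []
for h, v in zip(thicknesses, velocities):
    times_loop.append(h / v)

# Vectorized: one expression over the whole arrays
times_vec = thicknesses / velocities

print(np.allclose(times_loop, times_vec))  # True

# Reductions are vectorized too: total one-way time through all layers
total_time = times_vec.sum()
```

On four elements the speed difference is negligible; it becomes noticeable as arrays grow to thousands or millions of elements.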
Indexing and Slicing
%%%python
arr = np.array([10, 20, 30, 40, 50])
print(arr[0])         # 10
print(arr[1:4])       # [20 30 40]
print(arr[arr > 25])  # [30 40 50] (boolean indexing!)

# 2D indexing
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
print(matrix[0, 2])    # 3 (row 0, column 2)
print(matrix[:, 1])    # [2 5 8] (all rows, column 1)
print(matrix[1:, :2])  # [[4 5], [7 8]] (rows 1+, cols 0-1)
%%%
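Boolean indexing works by first building an array of True/False values (a mask) and then selecting the elements where the mask is True. The sketch below makes the mask explicit and combines two conditions; the thresholds are arbitrary illustrative values.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# Element-wise AND; the parentheses around each comparison are required
mask = (arr > 15) & (arr < 45)
print(mask)        # [False  True  True  True False]
print(arr[mask])   # [20 30 40]
print(arr[~mask])  # [10 50]  (~ inverts the mask)

# Boolean indexing returns a copy, so editing the result leaves arr unchanged
subset = arr[mask]
subset[0] = -1
print(arr)         # [10 20 30 40 50]
```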
Reshaping
%%%python
arr = np.arange(12)           # [0, 1, 2, ..., 11]
reshaped = arr.reshape(3, 4)  # 3 rows, 4 columns
print(reshaped)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
print(reshaped.T)             # transpose: 4 rows, 3 columns
%%%
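Two details of `reshape` are worth seeing in code: one dimension can be given as -1 and NumPy infers it, and the new shape must hold exactly the same number of elements. A minimal sketch, with illustrative variable names:

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to infer that dimension from the array size
inferred = arr.reshape(3, -1)
print(inferred.shape)  # (3, 4)

# The element count must match; 5 * 3 = 15 != 12 raises ValueError
try:
    arr.reshape(5, 3)
except ValueError as err:
    print("cannot reshape:", err)

# ravel() flattens back to 1D
flat = inferred.ravel()
print(flat.shape)      # (12,)
```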
Pandas: Data Analysis
Pandas provides the DataFrame, a table (like a spreadsheet) with named columns. It is the go-to tool for loading, cleaning, and exploring structured data.
%%%python
import pandas as pd
%%%
Creating a DataFrame
%%%python
# From a dictionary: keys become column names, lists become column values
data = {
    "well_id": ["WL-001", "WL-002", "WL-003", "WL-004", "WL-005"],
    "depth_m": [1200, 1850, 950, 2100, 1600],
    "porosity": [0.22, 0.18, 0.31, 0.15, 0.24],
    "lithology": ["sandstone", "shale", "sandstone", "limestone", "sandstone"],
}
df = pd.DataFrame(data)
print(df)
%%%
  well_id  depth_m  porosity  lithology
0  WL-001     1200      0.22  sandstone
1  WL-002     1850      0.18      shale
2  WL-003      950      0.31  sandstone
3  WL-004     2100      0.15  limestone
4  WL-005     1600      0.24  sandstone
Exploring Data
%%%python
# First few rows
print(df.head(3))  # first 3 rows

# Summary statistics (numeric columns only)
print(df.describe())
#        depth_m  porosity
# count     5.00      5.00
# mean   1540.00      0.22
# std     468.24      0.06
# min     950.00      0.15
# max    2100.00      0.31
# (25%, 50%, and 75% rows omitted here)

# Column names and types
print(df.columns)  # Index(['well_id', 'depth_m', ...])
print(df.dtypes)

# Shape
print(df.shape)    # (5, 4) -> 5 rows, 4 columns
%%%
Accessing Columns and Filtering
%%%python
# Access a single column (returns a Series)
print(df["porosity"])

# Access multiple columns (returns a DataFrame)
print(df[["well_id", "porosity"]])

# Filter rows
high_poro = df[df["porosity"] > 0.20]
print(high_poro)
#   well_id  depth_m  porosity  lithology
# 0  WL-001     1200      0.22  sandstone
# 2  WL-003      950      0.31  sandstone
# 4  WL-005     1600      0.24  sandstone

# Multiple conditions (use & for AND, | for OR; wrap each condition in parentheses)
deep_porous = df[(df["depth_m"] > 1000) & (df["porosity"] > 0.20)]
print(deep_porous)
%%%
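The filtering block above shows only the AND case. The sketch below rebuilds the same well table and adds an OR filter, a membership test with `isin()`, and a combined row-and-column selection with `.loc`. The thresholds and variable names are illustrative assumptions, not values from the lab data description.

```python
import pandas as pd

df = pd.DataFrame({
    "well_id": ["WL-001", "WL-002", "WL-003", "WL-004", "WL-005"],
    "depth_m": [1200, 1850, 950, 2100, 1600],
    "porosity": [0.22, 0.18, 0.31, 0.15, 0.24],
    "lithology": ["sandstone", "shale", "sandstone", "limestone", "sandstone"],
})

# OR: shallow wells or low-porosity wells
shallow_or_tight = df[(df["depth_m"] < 1000) | (df["porosity"] < 0.16)]
print(shallow_or_tight["well_id"].tolist())  # ['WL-003', 'WL-004']

# isin() tests membership in a list of values
non_sand = df[df["lithology"].isin(["shale", "limestone"])]
print(non_sand["well_id"].tolist())          # ['WL-002', 'WL-004']

# .loc selects rows by condition and columns by name in one step
deep_ids = df.loc[df["depth_m"] > 1500, ["well_id", "depth_m"]]
print(deep_ids)
```

Python's `and` / `or` keywords do not work element-wise on a Series, which is why `&` and `|` are used instead.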
Adding and Modifying Columns
%%%python
# Add a new column (one value per row, in row order)
df["permeability_mD"] = [150, 5, 420, 2, 200]

# Computed column: the multiplication applies to every row
df["depth_ft"] = df["depth_m"] * 3.281
print(df.head())
%%%
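Computed columns can also use NumPy functions, and once a table has a categorical column such as lithology, a grouped summary is a natural next step. The sketch below is one possible way to do both on the same well table; the column name `log_perm` and the choice of `groupby` are assumptions added here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "well_id": ["WL-001", "WL-002", "WL-003", "WL-004", "WL-005"],
    "porosity": [0.22, 0.18, 0.31, 0.15, 0.24],
    "lithology": ["sandstone", "shale", "sandstone", "limestone", "sandstone"],
    "permeability_mD": [150, 5, 420, 2, 200],
})

# NumPy functions operate on a whole column at once
df["log_perm"] = np.log10(df["permeability_mD"])

# Mean porosity per lithology (hypothetical summary step)
mean_poro = df.groupby("lithology")["porosity"].mean()
print(mean_poro)
```

With this data the sandstone group averages three wells, while shale and limestone each contain a single well.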
Matplotlib: Visualization
Matplotlib is Python's most widely used plotting library. We typically import its pyplot module:
%%%python
import matplotlib.pyplot as plt
%%%
Line Plot
%%%python
# Geothermal gradient
depths = np.linspace(0, 5000, 100)  # 0 to 5000 m
temp = 15 + 0.03 * depths           # 15 C surface, 30 C/km

plt.figure(figsize=(6, 4))
plt.plot(temp, depths)              # note: temp on x, depth on y
plt.gca().invert_yaxis()            # depth increases downward
plt.xlabel("Temperature (C)")
plt.ylabel("Depth (m)")
plt.title("Geothermal Gradient")
plt.grid(True)
plt.tight_layout()
plt.show()
%%%
Scatter Plot
%%%python
# Porosity vs. Depth
porosity = np.array([0.35, 0.30, 0.25, 0.22, 0.18, 0.15, 0.12])
depth = np.array([200, 500, 800, 1100, 1500, 2000, 2500])

plt.figure(figsize=(6, 4))
plt.scatter(porosity, depth, c="steelblue", s=60)
plt.gca().invert_yaxis()
plt.xlabel("Porosity")
plt.ylabel("Depth (m)")
plt.title("Porosity vs. Depth")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
%%%
Histogram
%%%python
# Distribution of seismic amplitudes
np.random.seed(42)
amplitudes = np.random.normal(0, 1, 1000)  # 1000 samples

plt.figure(figsize=(6, 4))
plt.hist(amplitudes, bins=30, color="coral", edgecolor="black")
plt.xlabel("Amplitude")
plt.ylabel("Frequency")
plt.title("Seismic Amplitude Distribution")
plt.tight_layout()
plt.show()
%%%
Multiple Subplots
%%%python
# Reuses temp, depths, and amplitudes from the examples above
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Left: line plot
axes[0].plot(temp, depths)
axes[0].invert_yaxis()
axes[0].set_xlabel("Temperature (C)")
axes[0].set_ylabel("Depth (m)")
axes[0].set_title("Geothermal Gradient")

# Right: histogram
axes[1].hist(amplitudes, bins=30, color="coral")
axes[1].set_xlabel("Amplitude")
axes[1].set_title("Amplitude Distribution")

plt.tight_layout()
plt.show()
%%%
Matplotlib Essentials
- plt.figure(figsize=(w, h)): create a new figure
- plt.plot(x, y), plt.scatter(x, y), plt.hist(data): plot types
- plt.xlabel(), plt.ylabel(), plt.title(): labels
- plt.legend(): show legend
- plt.grid(True): add grid lines
- plt.savefig("plot.png"): save to file
- plt.show(): display the plot
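Two items in this list, the legend and saving to a file, do not appear in the earlier plots. The sketch below shows one way to use them. The second gradient (25 C/km), the output file name, and the use of the non-interactive "Agg" backend (so the script runs without a display) are assumptions made for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

depths = np.linspace(0, 5000, 100)  # 0 to 5000 m

fig, ax = plt.subplots(figsize=(6, 4))
# Two hypothetical gradients, in C per metre, for comparison
for grad in (0.025, 0.030):
    ax.plot(15 + grad * depths, depths, label=f"{grad * 1000:.0f} C/km")
ax.invert_yaxis()
ax.set_xlabel("Temperature (C)")
ax.set_ylabel("Depth (m)")
ax.legend()  # builds the legend from the label= text of each plot call

# Hypothetical output location in the system temp directory
out_path = os.path.join(tempfile.gettempdir(), "gradient_comparison.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)  # release the figure's memory
```

In an interactive session it is usually safest to call `savefig` before `plt.show()`, since the figure may no longer be available for saving once the window has been closed.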