Workshop: Implementing Gradient Descent from Scratch

Part 3, Chapter 3: Numerical Optimization for Learning

Learning objectives

Implement gradient descent from scratch in Python
Trace through iterations and visualize convergence
Experiment with different learning rates
Apply optimization to a curve-fitting problem

Lab preview. Below is the gradient-descent optimiser you will implement from scratch in this lab. Drag the learning rate and step through the iterations to watch it converge (or diverge), then build it yourself in Python.

Implementing Gradient Descent in Python

Let's implement gradient descent step by step, starting with the simple function $J(x) = x^2$ .

Step 1: Define the Function and Its Derivative

%%%python # Cost function def J(x): return x ** 2 # Gradient (derivative) def dJ(x): return 2 * x %%%

Step 2: Implement the Gradient Descent Loop

%%%python def gradient_descent(x0, lr, num_iters): """Run gradient descent. Args: x0: initial value lr: learning rate (alpha) num_iters: number of iterations Returns: history: list of (x, cost) at each step """ x = x0 history = [(x, J(x))] for i in range(num_iters): grad = dJ(x) # compute gradient x = x - lr * grad # update step cost = J(x) # compute new cost history.append((x, cost)) # Print progress every 5 steps if i % 5 == 0 or i == num_iters - 1: print(f"Step {i+1:3d}: x = {x:8.4f}, J(x) = {cost:10.6f}") return history %%%

Step 3: Run It

%%%python print("--- Gradient Descent on J(x) = x^2 ---") print(f"Starting at x0 = 4.0, learning rate = 0.1\n") history = gradient_descent(x0=4.0, lr=0.1, num_iters=30) final_x = history[-1][0] final_cost = history[-1][1] print(f"\nFinal: x = {final_x:.6f}, J(x) = {final_cost:.6f}") %%%

Expected output (selected steps):

Step 1: x = 3.2000, J(x) = 10.240000 Step 6: x = 1.0486, J(x) = 1.099511 Step 11: x = 0.3436, J(x) = 0.118060 Step 16: x = 0.1126, J(x) = 0.012677 Step 21: x = 0.0369, J(x) = 0.001361 Step 26: x = 0.0121, J(x) = 0.000146 Step 30: x = 0.0053, J(x) = 0.000028

Final: x = 0.005262, J(x) = 0.000028

Step 4: Visualize the Convergence

%%%python import matplotlib.pyplot as plt import numpy as np # Extract x and cost values from history x_vals = [h[0] for h in history] cost_vals = [h[1] for h in history] # Plot 1: Cost vs. iteration fig, axes = plt.subplots(1, 2, figsize=(12, 4)) axes[0].plot(cost_vals, "o-", markersize=3) axes[0].set_xlabel("Iteration") axes[0].set_ylabel("Cost J(x)") axes[0].set_title("Cost vs. Iteration") axes[0].grid(True, alpha=0.3) # Plot 2: Path on the function x_curve = np.linspace(-5, 5, 200) y_curve = x_curve ** 2 axes[1].plot(x_curve, y_curve, "b-", label="J(x) = x^2") axes[1].plot(x_vals, [x**2 for x in x_vals], "ro-", markersize=4, label="GD path") axes[1].set_xlabel("x") axes[1].set_ylabel("J(x)") axes[1].set_title("Gradient Descent Path") axes[1].legend() axes[1].grid(True, alpha=0.3) plt.tight_layout() plt.show() %%%

Comparing Learning Rates

Let's run gradient descent with three different learning rates and compare:

%%%python import matplotlib.pyplot as plt learning_rates = [0.01, 0.1, 0.5] colors = ["red", "blue", "green"] plt.figure(figsize=(8, 5)) for lr, color in zip(learning_rates, colors): history = gradient_descent(x0=4.0, lr=lr, num_iters=30) costs = [h[1] for h in history] plt.plot(costs, color=color, label=f"lr = {lr}", linewidth=2) plt.xlabel("Iteration") plt.ylabel("Cost J(x)") plt.title("Effect of Learning Rate on Convergence") plt.legend() plt.grid(True, alpha=0.3) plt.yscale("log") # log scale to see small values plt.tight_layout() plt.show() %%%

You will see that $\alpha = 0.5$ converges fastest, $\alpha = 0.1$ converges moderately, and $\alpha = 0.01$ is very slow.

Application: Linear Regression with Gradient Descent

Let's fit a line $y = wx + b$ to geoscience data (depth vs. temperature) using gradient descent:

%%%python import numpy as np import matplotlib.pyplot as plt # Geothermal data: depth (km) vs temperature (C) depth = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]) temp = np.array([30, 45, 58, 72, 88, 97, 115, 128]) # Normalize depth for better convergence depth_norm = (depth - depth.mean()) / depth.std() # Initialize parameters w = 0.0 # slope b = 0.0 # intercept lr = 0.1 m = len(depth) # Gradient descent cost_history = [] for epoch in range(100): # Predictions y_pred = w * depth_norm + b # Cost (MSE) cost = (1 / m) * np.sum((y_pred - temp) ** 2) cost_history.append(cost) # Gradients dw = (2 / m) * np.sum((y_pred - temp) * depth_norm) db = (2 / m) * np.sum(y_pred - temp) # Update w = w - lr * dw b = b - lr * db print(f"Final: w = {w:.2f}, b = {b:.2f}") print(f"Final cost: {cost_history[-1]:.2f}") %%%

Visualize the Fit

%%%python fig, axes = plt.subplots(1, 2, figsize=(12, 4)) # Left: cost convergence axes[0].plot(cost_history) axes[0].set_xlabel("Epoch") axes[0].set_ylabel("MSE") axes[0].set_title("Training Loss") axes[0].grid(True, alpha=0.3) # Right: data + fitted line axes[1].scatter(depth, temp, c="steelblue", s=60, label="Data") x_line = np.linspace(0, 4.5, 100) x_norm = (x_line - depth.mean()) / depth.std() y_line = w * x_norm + b axes[1].plot(x_line, y_line, "r-", linewidth=2, label="GD fit") axes[1].set_xlabel("Depth (km)") axes[1].set_ylabel("Temperature (C)") axes[1].set_title("Geothermal Gradient Fit") axes[1].legend() axes[1].grid(True, alpha=0.3) plt.tight_layout() plt.show() %%%

Key Takeaway

Gradient descent successfully found the slope and intercept for the geothermal gradient. In practice, you would use library functions (like np.polyfit or sklearn.linear_model.LinearRegression), but understanding the mechanics of gradient descent is essential for understanding how all ML models are trained.

References

Harris, C.R., et al. (2020). Array programming with NumPy. Nature 585, 357-362.
Pedregosa, F., Varoquaux, G., Gramfort, A., et al. (2011). Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825-2830.
Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.), ch. 4 (training models, gradient descent). O’Reilly.