Lab: Implementing Gradient Descent from Scratch

Part 3, Chapter 3: Numerical Optimization for Learning

Learning objectives

  • Implement gradient descent from scratch in Python
  • Trace through iterations and visualize convergence
  • Experiment with different learning rates
  • Apply optimization to a curve-fitting problem

Implementing Gradient Descent in Python

Let's implement gradient descent step by step, starting with the simple function J(x)=x2J(x) = x^2.

Step 1: Define the Function and Its Derivative

%%%python # Cost function def J(x): return x ** 2 # Gradient (derivative) def dJ(x): return 2 * x %%%

Step 2: Implement the Gradient Descent Loop

%%%python def gradient_descent(x0, lr, num_iters): """Run gradient descent. Args: x0: initial value lr: learning rate (alpha) num_iters: number of iterations Returns: history: list of (x, cost) at each step """ x = x0 history = [(x, J(x))] for i in range(num_iters): grad = dJ(x) # compute gradient x = x - lr * grad # update step cost = J(x) # compute new cost history.append((x, cost)) # Print progress every 5 steps if i % 5 == 0 or i == num_iters - 1: print(f"Step {i+1:3d}: x = {x:8.4f}, J(x) = {cost:10.6f}") return history %%%

Step 3: Run It

%%%python print("--- Gradient Descent on J(x) = x^2 ---") print(f"Starting at x0 = 4.0, learning rate = 0.1\n") history = gradient_descent(x0=4.0, lr=0.1, num_iters=30) final_x = history[-1][0] final_cost = history[-1][1] print(f"\nFinal: x = {final_x:.6f}, J(x) = {final_cost:.6f}") %%%

Expected output (selected steps):

Step 1: x = 3.2000, J(x) = 10.240000 Step 6: x = 1.0486, J(x) = 1.099511 Step 11: x = 0.3436, J(x) = 0.118060 Step 16: x = 0.1126, J(x) = 0.012677 Step 21: x = 0.0369, J(x) = 0.001361 Step 26: x = 0.0121, J(x) = 0.000146 Step 30: x = 0.0053, J(x) = 0.000028

Final: x = 0.005262, J(x) = 0.000028

Step 4: Visualize the Convergence

%%%python import matplotlib.pyplot as plt import numpy as np # Extract x and cost values from history x_vals = [h[0] for h in history] cost_vals = [h[1] for h in history] # Plot 1: Cost vs. iteration fig, axes = plt.subplots(1, 2, figsize=(12, 4)) axes[0].plot(cost_vals, "o-", markersize=3) axes[0].set_xlabel("Iteration") axes[0].set_ylabel("Cost J(x)") axes[0].set_title("Cost vs. Iteration") axes[0].grid(True, alpha=0.3) # Plot 2: Path on the function x_curve = np.linspace(-5, 5, 200) y_curve = x_curve ** 2 axes[1].plot(x_curve, y_curve, "b-", label="J(x) = x^2") axes[1].plot(x_vals, [x**2 for x in x_vals], "ro-", markersize=4, label="GD path") axes[1].set_xlabel("x") axes[1].set_ylabel("J(x)") axes[1].set_title("Gradient Descent Path") axes[1].legend() axes[1].grid(True, alpha=0.3) plt.tight_layout() plt.show() %%%

Comparing Learning Rates

Let's run gradient descent with three different learning rates and compare:

%%%python import matplotlib.pyplot as plt learning_rates = [0.01, 0.1, 0.5] colors = ["red", "blue", "green"] plt.figure(figsize=(8, 5)) for lr, color in zip(learning_rates, colors): history = gradient_descent(x0=4.0, lr=lr, num_iters=30) costs = [h[1] for h in history] plt.plot(costs, color=color, label=f"lr = {lr}", linewidth=2) plt.xlabel("Iteration") plt.ylabel("Cost J(x)") plt.title("Effect of Learning Rate on Convergence") plt.legend() plt.grid(True, alpha=0.3) plt.yscale("log") # log scale to see small values plt.tight_layout() plt.show() %%%

You will see that alpha=0.5\alpha = 0.5 converges fastest, alpha=0.1\alpha = 0.1 converges moderately, and alpha=0.01\alpha = 0.01 is very slow.

Application: Linear Regression with Gradient Descent

Let's fit a line y=wx+by = wx + b to geoscience data (depth vs. temperature) using gradient descent:

%%%python import numpy as np import matplotlib.pyplot as plt # Geothermal data: depth (km) vs temperature (C) depth = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]) temp = np.array([30, 45, 58, 72, 88, 97, 115, 128]) # Normalize depth for better convergence depth_norm = (depth - depth.mean()) / depth.std() # Initialize parameters w = 0.0 # slope b = 0.0 # intercept lr = 0.1 m = len(depth) # Gradient descent cost_history = [] for epoch in range(100): # Predictions y_pred = w * depth_norm + b # Cost (MSE) cost = (1 / m) * np.sum((y_pred - temp) ** 2) cost_history.append(cost) # Gradients dw = (2 / m) * np.sum((y_pred - temp) * depth_norm) db = (2 / m) * np.sum(y_pred - temp) # Update w = w - lr * dw b = b - lr * db print(f"Final: w = {w:.2f}, b = {b:.2f}") print(f"Final cost: {cost_history[-1]:.2f}") %%%

Visualize the Fit

%%%python fig, axes = plt.subplots(1, 2, figsize=(12, 4)) # Left: cost convergence axes[0].plot(cost_history) axes[0].set_xlabel("Epoch") axes[0].set_ylabel("MSE") axes[0].set_title("Training Loss") axes[0].grid(True, alpha=0.3) # Right: data + fitted line axes[1].scatter(depth, temp, c="steelblue", s=60, label="Data") x_line = np.linspace(0, 4.5, 100) x_norm = (x_line - depth.mean()) / depth.std() y_line = w * x_norm + b axes[1].plot(x_line, y_line, "r-", linewidth=2, label="GD fit") axes[1].set_xlabel("Depth (km)") axes[1].set_ylabel("Temperature (C)") axes[1].set_title("Geothermal Gradient Fit") axes[1].legend() axes[1].grid(True, alpha=0.3) plt.tight_layout() plt.show() %%%

Key Takeaway

Gradient descent successfully found the slope and intercept for the geothermal gradient. In practice, you would use library functions (like np.polyfit or sklearn.linear_model.LinearRegression), but understanding the mechanics of gradient descent is essential for understanding how all ML models are trained.

References

  • Harris, C.R., et al. (2020). Array programming with NumPy. Nature 585, 357–362.
  • Pedregosa, F., Varoquaux, G., Gramfort, A., et al. (2011). Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
  • Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.), ch. 4 (training models, gradient descent). O’Reilly.

This page is prerendered for SEO and accessibility. The interactive widgets above hydrate on JavaScript load.