Building a Simple Neural Network

Chapter 13: From Perceptrons to Deep Networks

Learning objectives

Use scikit-learn MLPClassifier to build a neural network
Prepare well-log data with proper feature scaling for NN training
Train, evaluate, and tune a simple MLP for facies classification
Interpret model performance with confusion matrices and classification reports
Visualise training loss curves and diagnose convergence issues
Perform hyperparameter tuning across architectures, activations, and regularisation

Neural Networks in Python with scikit-learn

scikit-learn's MLPClassifier provides a quick way to build multi-layer perceptrons without needing deep-learning frameworks like TensorFlow or PyTorch. In this lab, we walk through the complete process of preparing well-log data, building an MLP, evaluating its performance, and tuning hyperparameters for a lithofacies classification task.

Step 1: Preparing Well-Log Data

Real well-log data typically comes as a table with columns for each logging tool measurement. For this lab, we simulate data with three common logs: gamma ray (GR), bulk density (RHOB), and neutron porosity (NPHI). Each log measures different physical properties and responds differently to rock types.

Gamma Ray (GR): measures natural radioactivity. Shales have high GR (~90 API), sandstones moderate (~40 API), limestones low (~20 API).
Bulk Density (RHOB): measures formation density. Sandstones (~2.30 g/cc), shales (~2.55 g/cc), limestones (~2.70 g/cc).
Neutron Porosity (NPHI): responds to hydrogen content. Shales show high apparent porosity (~35%), sandstones moderate (~20%), limestones low (~3%).

%%%python import numpy as np import pandas as pd from sklearn.preprocessing import StandardScaler from sklearn.model_selection import train_test_split

Simulated well-log data for facies classification

Features: gamma_ray (API), density (g/cc), neutron_porosity (%)

Labels: 0=shale, 1=sandstone, 2=limestone

np.random.seed(42)

gamma_ray = np.concatenate([ np.random.normal(90, 15, 80), # shale: high GR np.random.normal(40, 10, 70), # sandstone: moderate GR np.random.normal(20, 8, 50) # limestone: low GR ]) density = np.concatenate([ np.random.normal(2.55, 0.05, 80), # shale np.random.normal(2.30, 0.08, 70), # sandstone np.random.normal(2.70, 0.04, 50) # limestone ]) neutron = np.concatenate([ np.random.normal(35, 5, 80), # shale: high NPHI np.random.normal(20, 6, 70), # sandstone: moderate np.random.normal(3, 2, 50) # limestone: low ]) labels = np.array([0]*80 + [1]*70 + [2]*50)

Build the feature matrix and label vector

X = np.column_stack([gamma_ray, density, neutron]) y = labels

print(f"Dataset shape: {X.shape}") print(f"Class distribution: shale={np.sum(y==0)}, " f"sandstone={np.sum(y==1)}, limestone={np.sum(y==2)}") %%%

Step 2: Train/Test Split and Feature Scaling

Neural networks require feature scaling because they use gradient-based optimisation. Features with different scales (e.g., GR in 0-150 API vs density in 2.0-2.8 g/cc) cause some weights to receive disproportionately large gradients, leading to slow or unstable convergence. Always standardise or normalise your features before training an NN.

%%%python

Train/test split with stratification to preserve class ratios

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42, stratify=y )

print(f"Training set: {X_train.shape[0]} samples") print(f"Test set: {X_test.shape[0]} samples")

Standardise features (zero mean, unit variance)

CRITICAL: fit on training data ONLY, then transform both sets

scaler = StandardScaler() X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.transform(X_test) # use the SAME scaler

Verify scaling

print(f"\nAfter scaling:") print(f"Train means: {X_train_scaled.mean(axis=0).round(2)}") print(f"Train stds: {X_train_scaled.std(axis=0).round(2)}") %%%

After scaling, each feature has approximately zero mean and unit standard deviation. The test set will not have exactly zero mean because the scaler was fitted on the training data only — this is correct and avoids data leakage.

Step 3: Building and Training the MLP

The MLPClassifier key parameters are:

hidden_layer_sizes: a tuple defining the number of neurons in each hidden layer. E.g., (16, 8) means two hidden layers with 16 and 8 neurons.
activation: the activation function — "relu" (default, most common), "tanh", or "logistic" (sigmoid).
solver: the optimisation algorithm — "adam" (default, works well for most problems), "sgd" (stochastic gradient descent, more tuneable), or "lbfgs" (good for small datasets).
max_iter: maximum number of training epochs. Increase if you see a convergence warning.
alpha: L2 regularisation strength (default 0.0001). Increase to reduce overfitting.
learning_rate_init: initial learning rate for sgd/adam (default 0.001).

%%%python from sklearn.neural_network import MLPClassifier

Create the MLP: 2 hidden layers with 16 and 8 neurons

mlp = MLPClassifier( hidden_layer_sizes=(16, 8), # two hidden layers activation="relu", # ReLU activation solver="adam", # Adam optimiser max_iter=500, # training epochs random_state=42, verbose=False # set True to see progress )

Train the network

mlp.fit(X_train_scaled, y_train)

Evaluate

train_acc = mlp.score(X_train_scaled, y_train) test_acc = mlp.score(X_test_scaled, y_test) print(f"Train accuracy: {train_acc:.3f}") print(f"Test accuracy: {test_acc:.3f}") print(f"Number of iterations: {mlp.n_iter_}") print(f"Final loss: {mlp.loss_:.4f}") %%%

Step 4: Evaluating with Classification Report and Confusion Matrix

Accuracy alone can be misleading, especially with imbalanced classes. The classification report provides precision, recall, and F1-score for each class. The confusion matrix shows exactly which classes are being confused with each other.

%%%python from sklearn.metrics import confusion_matrix, classification_report import matplotlib.pyplot as plt

y_pred = mlp.predict(X_test_scaled)

Classification report — precision, recall, F1 per class

print(classification_report(y_test, y_pred, target_names=["Shale", "Sandstone", "Limestone"])) %%%

Precision = of all samples predicted as class X, what fraction actually is class X? Recall = of all actual class X samples, what fraction did we correctly predict? F1 = harmonic mean of precision and recall.

%%%python

Confusion matrix visualisation

cm = confusion_matrix(y_test, y_pred) fig, ax = plt.subplots(figsize=(6, 5)) im = ax.imshow(cm, cmap="Blues") plt.colorbar(im) class_names = ["Shale", "Sandstone", "Limestone"] ax.set_xticks([0, 1, 2]) ax.set_xticklabels(class_names) ax.set_yticks([0, 1, 2]) ax.set_yticklabels(class_names) ax.set_xlabel("Predicted") ax.set_ylabel("Actual") ax.set_title("Confusion Matrix — Facies Classification") for i in range(3): for j in range(3): color = "white" if cm[i, j] > cm.max() / 2 else "black" ax.text(j, i, str(cm[i, j]), ha="center", va="center", color=color, fontsize=14) plt.tight_layout() plt.show() %%%

Step 5: Visualising the Training Loss Curve

The loss curve shows how the training loss decreases over epochs. It is crucial for diagnosing convergence problems.

%%%python plt.figure(figsize=(8, 4)) plt.plot(mlp.loss_curve_, linewidth=2) plt.xlabel("Epoch") plt.ylabel("Loss") plt.title("Training Loss Curve") plt.grid(alpha=0.3) plt.tight_layout() plt.show() %%%

How to read the loss curve:

Smooth decrease to a plateau: good convergence. The model has learned the patterns.
Loss still decreasing at the end: increase max_iter — the model needs more training time.
Oscillating wildly: the learning rate is too high. Reduce with learning_rate_init=0.0005.
Flat from the beginning: the learning rate may be too low, or the model architecture is too simple.
Sharp drop then flat: the model quickly learned major patterns and is done. Normal for simple problems.

Step 6: Hyperparameter Tuning — Architectures

Systematically compare different network architectures to find the best configuration:

%%%python

Compare different architectures

results = [] for layers in [(8,), (16,), (16, 8), (32, 16), (32, 16, 8), (64, 32)]: mlp = MLPClassifier( hidden_layer_sizes=layers, max_iter=500, random_state=42 ) mlp.fit(X_train_scaled, y_train) train_a = mlp.score(X_train_scaled, y_train) test_a = mlp.score(X_test_scaled, y_test) results.append((str(layers), train_a, test_a)) print(f"Layers {str(layers):>15}: " f"train={train_a:.3f}, test={test_a:.3f}")

Find best architecture by test accuracy

best = max(results, key=lambda r: r[2]) print(f"\nBest architecture: {best[0]} " f"(test acc = {best[2]:.3f})") %%%

General guidelines: start simple and increase complexity. More layers and neurons allow the model to learn more complex patterns, but also increase the risk of overfitting. For well-log data with 3-5 features and a few hundred samples, architectures like (16, 8) or (32, 16) usually work well.

Step 7: Hyperparameter Tuning — Activation and Regularisation

%%%python

Compare activation functions

print("=== Activation Functions ===") for act in ["relu", "tanh", "logistic"]: mlp = MLPClassifier( hidden_layer_sizes=(16, 8), activation=act, max_iter=500, random_state=42 ) mlp.fit(X_train_scaled, y_train) print(f"{act:>10}: test acc = " f"{mlp.score(X_test_scaled, y_test):.3f}")

Compare regularisation strengths

print("\n=== Regularisation (alpha) ===") for alpha in [0.0001, 0.001, 0.01, 0.1]: mlp = MLPClassifier( hidden_layer_sizes=(16, 8), alpha=alpha, max_iter=500, random_state=42 ) mlp.fit(X_train_scaled, y_train) tr = mlp.score(X_train_scaled, y_train) te = mlp.score(X_test_scaled, y_test) print(f"alpha={alpha:.4f}: train={tr:.3f}, test={te:.3f}") %%%

If train accuracy is much higher than test accuracy, increase alpha (more regularisation). If both are low, decrease alpha or use a larger network.

Step 8: Comparing Loss Curves Across Architectures

%%%python fig, ax = plt.subplots(figsize=(10, 5)) for layers in [(8,), (16, 8), (32, 16, 8)]: mlp = MLPClassifier( hidden_layer_sizes=layers, max_iter=500, random_state=42 ) mlp.fit(X_train_scaled, y_train) ax.plot(mlp.loss_curve_, label=str(layers), linewidth=2)

ax.set_xlabel("Epoch") ax.set_ylabel("Loss") ax.set_title("Loss Curves for Different Architectures") ax.legend() ax.grid(alpha=0.3) plt.tight_layout() plt.show() %%%

Larger networks typically converge faster (steeper initial drop) but may overfit if the dataset is small.

Step 9: Making Predictions on New Data

%%%python

Predict the lithology for a new well-log reading

GR=75 API, RHOB=2.50 g/cc, NPHI=30%

new_sample = np.array([[75, 2.50, 30]])

IMPORTANT: apply the same scaler

new_scaled = scaler.transform(new_sample) prediction = mlp.predict(new_scaled) proba = mlp.predict_proba(new_scaled)

class_names = ["Shale", "Sandstone", "Limestone"] print(f"Predicted class: {class_names[prediction[0]]}") print(f"Class probabilities:") for name, p in zip(class_names, proba[0]): print(f" {name}: {p:.3f}") %%%

The predict_proba method returns the probability of each class. This is useful for identifying uncertain predictions (e.g., when no class has probability above 0.5, the model is unsure).

[Library: scikit-learn MLPClassifier]

References

Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.), ch. 10 (introduction to artificial neural networks with Keras). O’Reilly.
Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, ch. 6 (feedforward networks). MIT Press.