Naive Bayes Practice

Chapter 6: Probabilistic Classification — Naive Bayes

Learning objectives

Train a Gaussian Naive Bayes classifier using scikit-learn
Compare NB accuracy to KNN and Decision Tree baselines
Examine class-conditional distributions learned by NB
Evaluate NB on well-log lithofacies classification
Interpret predict_proba output and assess probability calibration
Understand when NB works well versus poorly

Preparing Well Log Data for Classification

Before training any classifier, we need to prepare the data properly. For Naive Bayes, no feature scaling is strictly required (NB is scale-invariant because it models each feature distribution independently). However, we still need to handle missing values, remove outliers, and split into train/test sets.

%%%python import numpy as np import matplotlib.pyplot as plt from sklearn.naive_bayes import GaussianNB from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score, classification_report

Simulated well-log lithofacies data

np.random.seed(42) n = 300

Generate features by class

Sandstone: low GR, medium RHOB, medium NPHI, medium RT

Shale: high GR, high RHOB, high NPHI, low RT

Limestone: low GR, high RHOB, low NPHI, high RT

classes = [0, 1, 2] class_names = ["Sandstone", "Shale", "Limestone"] samples_per_class = [120, 100, 80]

Class-conditional parameters: [mean_GR, mean_RHOB, mean_NPHI, mean_RT]

means = { 0: [40, 2.35, 0.18, 15], # Sandstone 1: [120, 2.55, 0.35, 5], # Shale 2: [20, 2.71, 0.02, 200], # Limestone } stds = { 0: [10, 0.05, 0.03, 5], 1: [15, 0.08, 0.05, 2], 2: [8, 0.04, 0.02, 50], }

X_list, y_list = [], [] for cls, n_cls in zip(classes, samples_per_class): for i in range(4): col = np.random.normal(means[cls][i], stds[cls][i], n_cls) if i == 0: X_cls = col.reshape(-1, 1) else: X_cls = np.column_stack([X_cls, col]) X_list.append(X_cls) y_list.extend([cls] * n_cls)

X = np.vstack(X_list) y = np.array(y_list) feature_names = ["GR", "RHOB", "NPHI", "RT"]

print(f"Dataset shape: {X.shape}") print(f"Class distribution: {dict(zip(class_names, [sum(y==c) for c in classes]))}")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y) print(f"Train: {X_train.shape[0]}, Test: {X_test.shape[0]}") %%%

Training Gaussian NB and Getting Predictions

GaussianNB is one of the simplest classifiers to train. It has no hyperparameters to tune (unlike KNN, Decision Trees, or Random Forests). Training is instantaneous because it simply computes means and variances for each feature within each class.

%%%python

Train Gaussian NB

nb = GaussianNB() nb.fit(X_train, y_train) y_pred = nb.predict(X_test)

print("Gaussian NB Accuracy:", accuracy_score(y_test, y_pred)) print("\nClassification Report:") print(classification_report(y_test, y_pred, target_names=class_names)) %%%

Inspecting Learned Parameters

A key advantage of NB is interpretability. After training, we can inspect exactly what the model learned: the mean ( $\mu$ ) and variance ( $\sigma^2$ ) for each feature within each class, plus the class prior probabilities.

%%%python

The parameters NB learned from the training data

print("Class priors:", np.round(nb.class_prior_, 3)) print("\nClass means (theta_) - rows=classes, cols=features:") header = " " for f in feature_names: header += f"{f:>10}" print(header) for k in range(3): row = f"{class_names[k]:>12}" for j in range(4): row += f"{nb.theta_[k][j]:>10.3f}" print(row)

print("\nClass variances (var_):") for k in range(3): row = f"{class_names[k]:>12}" for j in range(4): row += f"{nb.var_[k][j]:>10.3f}" print(row)

print("\nTotal parameters:", 3 * 4 * 2 + 3, "(3 classes x 4 features x 2 [mean+var] + 3 priors)") %%%

Probability Predictions and Confidence

Unlike some classifiers that only output class labels, NB naturally provides class probability estimates via predict_proba. These probabilities tell us how confident the model is in each prediction. However, NB probabilities are often poorly calibrated — they tend to be too extreme (too close to 0 or 1).

%%%python

NB provides class probability estimates

probs = nb.predict_proba(X_test[:8]) print("Predicted probabilities for first 8 test samples:") header = f"{"Sample":>8}" for name in class_names: header += f"{name:>12}" header += f"{"Predicted":>12}{"True":>10}" print(header) print("-" * 66)

for i in range(8): row = f"{i+1:>8}" for p in probs[i]: row += f"{p:>12.4f}" pred_name = class_names[y_pred[i]] true_name = class_names[y_test[i]] marker = " " if pred_name != true_name else "" row += f"{pred_name:>12}{true_name:>10}{marker}" print(row) print("\n = misclassified") %%%

Comparing NB vs. KNN vs. Decision Tree

NB serves as an excellent baseline model. Comparing it against KNN and Decision Trees reveals whether the data has structure that warrants more complex models. NB is typically much faster to train but may sacrifice accuracy when the independence assumption is badly violated.

%%%python from sklearn.neighbors import KNeighborsClassifier from sklearn.tree import DecisionTreeClassifier from sklearn.model_selection import cross_val_score import time

models = { "Gaussian NB": GaussianNB(), "KNN (k=5)": KNeighborsClassifier(n_neighbors=5), "Decision Tree": DecisionTreeClassifier(random_state=42), }

print("5-Fold Cross-Validation Accuracy:") print("-" * 45) for name, model in models.items(): scores = cross_val_score(model, X, y, cv=5, scoring="accuracy") print(f"{name:>15}: {scores.mean():.4f} +/- {scores.std():.4f}")

Training time comparison

print("\nTraining Speed (100 fits):") print("-" * 45) for name, model in models.items(): start = time.time() for _ in range(100): model.fit(X_train, y_train) elapsed = time.time() - start print(f"{name:>15}: {elapsed:.4f}s for 100 fits")

NB is typically 10-100x faster than KNN and DT

%%%

Visualizing Class-Conditional Distributions

This is perhaps the most informative plot for understanding NB. For each feature, we plot the histogram of training values for each class and overlay the Gaussian density that NB learned. Where the Gaussians are well-separated, NB will classify accurately. Where they overlap heavily, NB will struggle.

%%%python fig, axes = plt.subplots(2, 2, figsize=(12, 10)) colors = ["gold", "gray", "skyblue"]

for idx, (ax, fname) in enumerate(zip(axes.flat, feature_names)): for k in range(3): mask = y_train == k ax.hist(X_train[mask, idx], bins=20, alpha=0.5, color=colors[k], label=class_names[k], density=True) # Overlay the Gaussian that NB learned x_range = np.linspace(X_train[:, idx].min(), X_train[:, idx].max(), 100) pdf = (1 / np.sqrt(2 * np.pi * nb.var_[k][idx])) *
np.exp(-0.5 * (x_range - nb.theta_[k][idx])**2 / nb.var_[k][idx]) ax.plot(x_range, pdf, color=colors[k], linewidth=2) ax.set_title(fname) ax.set_ylabel("Density") ax.legend()

plt.suptitle("Class-Conditional Distributions (Histogram + NB Gaussian Fit)") plt.tight_layout() plt.show() %%%

When Does NB Work Well vs. Poorly?

NB Works Well When:

Features are approximately independent within each class (rare but ideal).
Training data is scarce: NB has very few parameters ( $2 \times p \times K + K$ ), so it generalizes well from small samples.
Features are approximately Gaussian: for GaussianNB specifically.
Number of classes is large: NB scales gracefully because it only needs per-class statistics.
Speed is critical: real-time classification, very large datasets.

NB Struggles When:

Features are highly correlated: e.g., GR and SP are both driven by clay content. The independence assumption double-counts evidence.
Decision boundary is non-linear and complex: NB produces quadratic boundaries (in Gaussian case), which may be too simple.
Feature distributions are multimodal: a single Gaussian per class cannot capture bimodal distributions (e.g., a clean sand + silty sand in one class).
Well-calibrated probabilities are needed: NB probabilities are typically too extreme.

Confusion Matrix Visualization

%%%python from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred) disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names) fig, ax = plt.subplots(figsize=(8, 6)) disp.plot(ax=ax, cmap="Blues") plt.title("Gaussian NB Confusion Matrix") plt.tight_layout() plt.show()

Per-class accuracy

for k in range(3): mask = y_test == k acc = accuracy_score(y_test[mask], y_pred[mask]) print(f"{class_names[k]:>12}: {acc:.3f} ({sum(mask)} samples)") %%%

Probability Calibration Check

A well-calibrated classifier assigns probability 0.7 to events that actually occur 70% of the time. NB is known for poor calibration. We can visualize this with a calibration curve.

%%%python from sklearn.calibration import calibration_curve

probs_all = nb.predict_proba(X_test)

fig, axes = plt.subplots(1, 3, figsize=(15, 4)) for k, ax in enumerate(axes): y_binary = (y_test == k).astype(int) prob_true, prob_pred = calibration_curve(y_binary, probs_all[:, k], n_bins=8) ax.plot(prob_pred, prob_true, "o-", label="GaussianNB") ax.plot([0, 1], [0, 1], "k--", label="Perfect") ax.set_xlabel("Predicted probability") ax.set_ylabel("True frequency") ax.set_title(f"Calibration: {class_names[k]}") ax.legend()

plt.suptitle("NB Probability Calibration Curves") plt.tight_layout() plt.show() print("NB probabilities are typically too extreme (too close to 0 or 1).") print("This is due to the independence assumption double-counting correlated evidence.") %%%

References

Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.), ch. 3 (classification). O’Reilly.
Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction, ch. 9. MIT Press.