Naive Bayes Practice
Learning objectives
- Train a Gaussian Naive Bayes classifier using scikit-learn
- Compare NB accuracy to KNN and Decision Tree baselines
- Examine class-conditional distributions learned by NB
- Evaluate NB on well-log lithofacies classification
- Interpret predict_proba output and assess probability calibration
- Understand when NB works well versus poorly
Preparing Well Log Data for Classification
Before training any classifier, we need to prepare the data properly. For Naive Bayes, no feature scaling is strictly required (NB is scale-invariant because it models each feature distribution independently). However, we still need to handle missing values, remove outliers, and split into train/test sets.
%%%python import numpy as np import matplotlib.pyplot as plt from sklearn.naive_bayes import GaussianNB from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score, classification_report
Simulated well-log lithofacies data
np.random.seed(42) n = 300
Generate features by class
Sandstone: low GR, medium RHOB, medium NPHI, medium RT
Shale: high GR, high RHOB, high NPHI, low RT
Limestone: low GR, high RHOB, low NPHI, high RT
classes = [0, 1, 2] class_names = ["Sandstone", "Shale", "Limestone"] samples_per_class = [120, 100, 80]
Class-conditional parameters: [mean_GR, mean_RHOB, mean_NPHI, mean_RT]
means = { 0: [40, 2.35, 0.18, 15], # Sandstone 1: [120, 2.55, 0.35, 5], # Shale 2: [20, 2.71, 0.02, 200], # Limestone } stds = { 0: [10, 0.05, 0.03, 5], 1: [15, 0.08, 0.05, 2], 2: [8, 0.04, 0.02, 50], }
X_list, y_list = [], [] for cls, n_cls in zip(classes, samples_per_class): for i in range(4): col = np.random.normal(means[cls][i], stds[cls][i], n_cls) if i == 0: X_cls = col.reshape(-1, 1) else: X_cls = np.column_stack([X_cls, col]) X_list.append(X_cls) y_list.extend([cls] * n_cls)
X = np.vstack(X_list) y = np.array(y_list) feature_names = ["GR", "RHOB", "NPHI", "RT"]
print(f"Dataset shape: {X.shape}") print(f"Class distribution: {dict(zip(class_names, [sum(y==c) for c in classes]))}")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y) print(f"Train: {X_train.shape[0]}, Test: {X_test.shape[0]}") %%%
Training Gaussian NB and Getting Predictions
GaussianNB is one of the simplest classifiers to train. It has no hyperparameters to tune (unlike KNN, Decision Trees, or Random Forests). Training is instantaneous because it simply computes means and variances for each feature within each class.
%%%python
Train Gaussian NB
nb = GaussianNB() nb.fit(X_train, y_train) y_pred = nb.predict(X_test)
print("Gaussian NB Accuracy:", accuracy_score(y_test, y_pred)) print("\nClassification Report:") print(classification_report(y_test, y_pred, target_names=class_names)) %%%
Inspecting Learned Parameters
A key advantage of NB is interpretability. After training, we can inspect exactly what the model learned: the mean () and variance () for each feature within each class, plus the class prior probabilities.
%%%python
The parameters NB learned from the training data
print("Class priors:", np.round(nb.class_prior_, 3)) print("\nClass means (theta_) - rows=classes, cols=features:") header = " " for f in feature_names: header += f"{f:>10}" print(header) for k in range(3): row = f"{class_names[k]:>12}" for j in range(4): row += f"{nb.theta_[k][j]:>10.3f}" print(row)
print("\nClass variances (var_):") for k in range(3): row = f"{class_names[k]:>12}" for j in range(4): row += f"{nb.var_[k][j]:>10.3f}" print(row)
print("\nTotal parameters:", 3 * 4 * 2 + 3, "(3 classes x 4 features x 2 [mean+var] + 3 priors)") %%%
Probability Predictions and Confidence
Unlike some classifiers that only output class labels, NB naturally provides class probability estimates via predict_proba. These probabilities tell us how confident the model is in each prediction. However, NB probabilities are often poorly calibrated — they tend to be too extreme (too close to 0 or 1).
%%%python
NB provides class probability estimates
probs = nb.predict_proba(X_test[:8]) print("Predicted probabilities for first 8 test samples:") header = f"{"Sample":>8}" for name in class_names: header += f"{name:>12}" header += f"{"Predicted":>12}{"True":>10}" print(header) print("-" * 66)
for i in range(8): row = f"{i+1:>8}" for p in probs[i]: row += f"{p:>12.4f}" pred_name = class_names[y_pred[i]] true_name = class_names[y_test[i]] marker = " " if pred_name != true_name else "" row += f"{pred_name:>12}{true_name:>10}{marker}" print(row) print("\n = misclassified") %%%
Comparing NB vs. KNN vs. Decision Tree
NB serves as an excellent baseline model. Comparing it against KNN and Decision Trees reveals whether the data has structure that warrants more complex models. NB is typically much faster to train but may sacrifice accuracy when the independence assumption is badly violated.
%%%python from sklearn.neighbors import KNeighborsClassifier from sklearn.tree import DecisionTreeClassifier from sklearn.model_selection import cross_val_score import time
models = { "Gaussian NB": GaussianNB(), "KNN (k=5)": KNeighborsClassifier(n_neighbors=5), "Decision Tree": DecisionTreeClassifier(random_state=42), }
print("5-Fold Cross-Validation Accuracy:") print("-" * 45) for name, model in models.items(): scores = cross_val_score(model, X, y, cv=5, scoring="accuracy") print(f"{name:>15}: {scores.mean():.4f} +/- {scores.std():.4f}")
Training time comparison
print("\nTraining Speed (100 fits):") print("-" * 45) for name, model in models.items(): start = time.time() for _ in range(100): model.fit(X_train, y_train) elapsed = time.time() - start print(f"{name:>15}: {elapsed:.4f}s for 100 fits")
NB is typically 10-100x faster than KNN and DT
%%%
Visualizing Class-Conditional Distributions
This is perhaps the most informative plot for understanding NB. For each feature, we plot the histogram of training values for each class and overlay the Gaussian density that NB learned. Where the Gaussians are well-separated, NB will classify accurately. Where they overlap heavily, NB will struggle.
%%%python fig, axes = plt.subplots(2, 2, figsize=(12, 10)) colors = ["gold", "gray", "skyblue"]
for idx, (ax, fname) in enumerate(zip(axes.flat, feature_names)):
for k in range(3):
mask = y_train == k
ax.hist(X_train[mask, idx], bins=20, alpha=0.5, color=colors[k],
label=class_names[k], density=True)
# Overlay the Gaussian that NB learned
x_range = np.linspace(X_train[:, idx].min(), X_train[:, idx].max(), 100)
pdf = (1 / np.sqrt(2 * np.pi * nb.var_[k][idx])) *
np.exp(-0.5 * (x_range - nb.theta_[k][idx])**2 / nb.var_[k][idx])
ax.plot(x_range, pdf, color=colors[k], linewidth=2)
ax.set_title(fname)
ax.set_ylabel("Density")
ax.legend()
plt.suptitle("Class-Conditional Distributions (Histogram + NB Gaussian Fit)") plt.tight_layout() plt.show() %%%
When Does NB Work Well vs. Poorly?
NB Works Well When:
- Features are approximately independent within each class (rare but ideal).
- Training data is scarce: NB has very few parameters (), so it generalizes well from small samples.
- Features are approximately Gaussian: for GaussianNB specifically.
- Number of classes is large: NB scales gracefully because it only needs per-class statistics.
- Speed is critical: real-time classification, very large datasets.
NB Struggles When:
- Features are highly correlated: e.g., GR and SP are both driven by clay content. The independence assumption double-counts evidence.
- Decision boundary is non-linear and complex: NB produces quadratic boundaries (in Gaussian case), which may be too simple.
- Feature distributions are multimodal: a single Gaussian per class cannot capture bimodal distributions (e.g., a clean sand + silty sand in one class).
- Well-calibrated probabilities are needed: NB probabilities are typically too extreme.
Confusion Matrix Visualization
%%%python from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_pred) disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names) fig, ax = plt.subplots(figsize=(8, 6)) disp.plot(ax=ax, cmap="Blues") plt.title("Gaussian NB Confusion Matrix") plt.tight_layout() plt.show()
Per-class accuracy
for k in range(3): mask = y_test == k acc = accuracy_score(y_test[mask], y_pred[mask]) print(f"{class_names[k]:>12}: {acc:.3f} ({sum(mask)} samples)") %%%
Probability Calibration Check
A well-calibrated classifier assigns probability 0.7 to events that actually occur 70% of the time. NB is known for poor calibration. We can visualize this with a calibration curve.
%%%python from sklearn.calibration import calibration_curve
probs_all = nb.predict_proba(X_test)
fig, axes = plt.subplots(1, 3, figsize=(15, 4)) for k, ax in enumerate(axes): y_binary = (y_test == k).astype(int) prob_true, prob_pred = calibration_curve(y_binary, probs_all[:, k], n_bins=8) ax.plot(prob_pred, prob_true, "o-", label="GaussianNB") ax.plot([0, 1], [0, 1], "k--", label="Perfect") ax.set_xlabel("Predicted probability") ax.set_ylabel("True frequency") ax.set_title(f"Calibration: {class_names[k]}") ax.legend()
plt.suptitle("NB Probability Calibration Curves") plt.tight_layout() plt.show() print("NB probabilities are typically too extreme (too close to 0 or 1).") print("This is due to the independence assumption double-counting correlated evidence.") %%%
References
- Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
- Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.), ch. 3 (classification). O’Reilly.
- Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction, ch. 9. MIT Press.