Naive Bayes

Chapter 6: Probabilistic Classification — Naive Bayes

Learning objectives

State Bayes' theorem and apply it to classification
Explain the 'naive' independence assumption and its implications
Distinguish Gaussian, Multinomial, and Bernoulli Naive Bayes
Write the Gaussian NB probability formula
Identify use cases and limitations of Naive Bayes
Apply NB to geoscience classification problems

Bayes' Theorem

Bayes' theorem is the foundation of Bayesian inference. It relates the conditional probability of a hypothesis given evidence to the probability of the evidence given the hypothesis:

Bayes' Theorem

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

where:

$P(A|B)$ = posterior: probability of hypothesis A given evidence B
$P(B|A)$ = likelihood: probability of evidence B given hypothesis A is true
$P(A)$ = prior: probability of hypothesis A before seeing evidence
$P(B)$ = evidence (normalizing constant): total probability of observing B

For classification, we want to find the class $C_k$ with the highest posterior probability given a feature vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ :

\hat{y} = \arg\max_{C_k} P(C_k | x_1, x_2, \ldots, x_n) = \arg\max_{C_k} \frac{P(x_1, x_2, \ldots, x_n | C_k) \cdot P(C_k)}{P(x_1, x_2, \ldots, x_n)}

Since $P(x_1, \ldots, x_n)$ is the same for all classes (it is just a normalizing constant), we can simplify to:

\hat{y} = \arg\max_{C_k} P(x_1, x_2, \ldots, x_n | C_k) \cdot P(C_k)
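As a small illustration of why the evidence term can be dropped, the sketch below computes both the unnormalized and the normalized posterior for two classes. The priors and likelihood values are invented for the example and do not come from any dataset.

```python
# Hypothetical numbers: assumed priors P(C_k) and joint likelihoods P(x | C_k)
priors = {"Sandstone": 0.3, "Shale": 0.7}
likelihoods = {"Sandstone": 0.020, "Shale": 0.004}

# Unnormalized posterior: P(x | C_k) * P(C_k)
unnormalized = {c: likelihoods[c] * priors[c] for c in priors}

# Evidence P(x): the sum of the unnormalized posteriors over all classes
evidence = sum(unnormalized.values())
posteriors = {c: v / evidence for c, v in unnormalized.items()}

# The argmax is the same with or without dividing by the evidence
y_hat = max(unnormalized, key=unnormalized.get)
y_hat_normalized = max(posteriors, key=posteriors.get)
print(y_hat, posteriors)
```

Dividing by the evidence rescales every class by the same factor, so it changes the reported probabilities but not which class wins.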

The "Naive" Independence Assumption

The joint likelihood $P(x_1, x_2, \ldots, x_n | C_k)$ is extremely difficult to estimate because it requires modeling the joint distribution of all features. With $n$ continuous features, this means estimating an $n$-dimensional probability density for every class, and the amount of data needed to do so reliably grows rapidly with $n$.

The naive assumption simplifies this dramatically by assuming that all features are conditionally independent given the class:

Naive Independence Assumption

P(x_1, x_2, \ldots, x_n | C_k) = \prod_{i=1}^{n} P(x_i | C_k)

This means that once the class is known, the features carry no further information about each other. Each feature contributes independently to the evidence for each class.

This assumption is almost always violated in real data (e.g., GR and NPHI are correlated within each lithofacies). However, Naive Bayes often works well in practice because:

Even if the probability estimates are inaccurate, the ranking of classes may still be correct.
The decision boundary is all that matters for classification — exact probabilities are not needed.
With limited training data, the simpler model may generalize better (bias-variance tradeoff).
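To make the "simpler model" point concrete, the sketch below counts per-class parameters under one particular assumption that the text does not impose: that the joint class-conditional density is a multivariate Gaussian with a full covariance matrix. The function name `gaussian_param_counts` is hypothetical.

```python
def gaussian_param_counts(n_features):
    """Per-class parameter counts for a Gaussian class-conditional model."""
    # Full joint Gaussian: n means plus n(n+1)/2 distinct covariance entries
    full = n_features + n_features * (n_features + 1) // 2
    # Naive (diagonal) Gaussian: one mean and one variance per feature
    naive = 2 * n_features
    return full, naive

for n in (4, 50, 1000):
    full, naive = gaussian_param_counts(n)
    print(f"n={n}: full covariance={full}, naive={naive}")
```

Under this assumption the full model grows quadratically with the number of features, while the naive model grows linearly.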

Types of Naive Bayes

Gaussian Naive Bayes

Assumes each feature follows a normal (Gaussian) distribution within each class:

P(x_i | C_k) = \frac{1}{\sqrt{2\pi\sigma_{k,i}^2}} \exp\left(-\frac{(x_i - \mu_{k,i})^2}{2\sigma_{k,i}^2}\right)

where $\mu_{k,i}$ and $\sigma_{k,i}^2$ are the mean and variance of feature $i$ for samples in class $k$ , estimated from the training data.

Use for: Continuous numerical features (well-log values, geochemical concentrations).
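The Gaussian likelihood above can be evaluated directly. The following is a minimal sketch that works with the logarithm of the density; the gamma-ray reading, class mean, and variance are hypothetical values chosen for illustration.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log of the Gaussian density for one feature in one class."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

# Hypothetical: GR reading of 60 API, class mean 55 API, class variance 100
log_p = gaussian_log_pdf(60.0, mean=55.0, var=100.0)
p = math.exp(log_p)
print(log_p, p)
```

Note that this is a density, not a probability: for features with small variance it can exceed 1, which is harmless when comparing classes.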

Multinomial Naive Bayes

Assumes features are counts (or frequencies). The likelihood follows a multinomial distribution.

P(\mathbf{x} | C_k) \propto \prod_{i=1}^{n} p_{k,i}^{x_i}

Use for: Text classification (word counts), document categorization. Can also be used for discrete count data in geoscience (e.g., mineral grain counts, fossil counts).
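A sketch of the multinomial log-likelihood follows. The grain counts and per-class category probabilities are invented for illustration, and the multinomial coefficient is omitted because it is the same for every class.

```python
import math

def multinomial_log_likelihood(counts, probs):
    """Sum of x_i * log(p_{k,i}); categories with zero count contribute nothing."""
    return sum(x * math.log(p) for x, p in zip(counts, probs) if x > 0)

# Hypothetical grain counts for (quartz, feldspar, lithics) in one thin section
counts = [12, 3, 5]
p_class_a = [0.7, 0.1, 0.2]   # assumed category probabilities for class A
p_class_b = [0.3, 0.4, 0.3]   # assumed category probabilities for class B

ll_a = multinomial_log_likelihood(counts, p_class_a)
ll_b = multinomial_log_likelihood(counts, p_class_b)
print(ll_a, ll_b)
```

The quartz-rich sample is better explained by class A, whose assumed quartz probability is higher.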

Bernoulli Naive Bayes

Assumes binary features (present/absent).

P(x_i | C_k) = p_{k,i}^{x_i} \cdot (1 - p_{k,i})^{(1-x_i)}

Use for: Binary features (mineral present/absent, fault present/absent, indicator variables).
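The Bernoulli likelihood can be sketched the same way. Unlike the multinomial case, an absent feature ($x_i = 0$) also contributes, through the factor $(1 - p_{k,i})$. The indicator features and probabilities below are hypothetical.

```python
import math

def bernoulli_log_likelihood(x, probs):
    """Sum over features of x_i*log(p_i) + (1 - x_i)*log(1 - p_i)."""
    return sum(
        math.log(p) if xi == 1 else math.log(1.0 - p)
        for xi, p in zip(x, probs)
    )

# Hypothetical indicators: (pyrite present, calcite present, fault nearby)
x = [1, 0, 1]
p_k = [0.8, 0.3, 0.5]   # assumed P(feature present | class)
log_lik = bernoulli_log_likelihood(x, p_k)
print(math.exp(log_lik))
```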

The Naive Bayes Classifier in Detail

Putting it all together for Gaussian NB with two classes (Sandstone, Shale):

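Training: For each class and each feature, compute the mean $\mu$ and variance $\sigma^2$ from the training data. Also compute the prior $P(C_k) = m_k / m$, the fraction of the $m$ training samples that belong to class $C_k$ (the symbol $m$ is used here because $n$ already denotes the number of features).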
Prediction for a new sample $\mathbf{x}$ :

For each class $C_k$ :

\text{score}(C_k) = \log P(C_k) + \sum_{i=1}^{n} \log P(x_i | C_k)

(We use log to avoid numerical underflow from multiplying many small probabilities.)

Choose the class with the highest score: $\hat{y} = \arg\max_k \text{score}(C_k)$
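The training and prediction steps above can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not a reference one: the function names are hypothetical, the six training samples (GR in API units, NPHI as a fraction) are invented, and a small variance floor is added to avoid division by zero.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior, feature means and feature variances per class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        m_k = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / m_k for col in columns]
        # Maximum-likelihood variance with a tiny floor (an added safeguard)
        variances = [
            max(sum((v - mu) ** 2 for v in col) / m_k, 1e-9)
            for col, mu in zip(columns, means)
        ]
        model[label] = (m_k / len(X), means, variances)
    return model

def log_gaussian(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(model, x):
    """Return the class with the highest log-score, plus all scores."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        scores[label] = math.log(prior) + sum(
            log_gaussian(xi, mu, var) for xi, mu, var in zip(x, means, variances)
        )
    return max(scores, key=scores.get), scores

# Hypothetical training data: columns are (GR, NPHI)
X = [[35, 0.12], [42, 0.15], [38, 0.10], [110, 0.32], [95, 0.28], [120, 0.35]]
y = ["Sandstone"] * 3 + ["Shale"] * 3

model = fit_gaussian_nb(X, y)
label, scores = predict(model, [45, 0.14])
print(label, scores)
```

The new sample has low GR and low NPHI, close to the Sandstone training samples, so the Sandstone score is the higher of the two.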

Pros and Cons

Pros	Cons
Extremely fast to train (just compute means and variances)	Independence assumption is usually violated
Works well with small training sets	Probability estimates are often poorly calibrated
Handles many features well (parameter count grows only linearly with the number of features)	Cannot model feature interactions
Excellent baseline model	Outperformed by more flexible models on complex data
Scales to large datasets (linear time complexity)	Sensitive to feature correlations
Naturally handles missing data (skip the missing feature in the product)	Gaussian assumption may not hold for all features

Text Classification with Multinomial NB

Bag-of-Words and TF-IDF

Multinomial Naive Bayes is the workhorse for text classification. The pipeline is:

Tokenize: Split documents into words (tokens).
Build vocabulary: Create a dictionary of all unique words across the corpus.
Bag-of-Words (BoW): Represent each document as a vector of word counts. The "bag" ignores word order.
TF-IDF weighting (optional): Weight each word count by its inverse document frequency to downweight common words (the, is, and) and upweight rare, informative words (sandstone, dolomite, porosity).

\text{TF-IDF}(w, d) = \text{tf}(w, d) \times \log\frac{N}{\text{df}(w)}

where $\text{tf}(w, d)$ is the term frequency in document $d$ , $N$ is the total number of documents, and $\text{df}(w)$ is the number of documents containing word $w$ .
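The pipeline and the TF-IDF formula above can be sketched without any library. The three-document corpus is invented for illustration and tokenization is a plain whitespace split. Library implementations may use a different variant; scikit-learn's `TfidfVectorizer`, for example, applies a smoothed IDF by default, so its numbers will not match this sketch.

```python
import math
from collections import Counter

# Hypothetical three-document corpus
docs = [
    "the sandstone is porous",
    "the shale is tight",
    "the dolomite is porous and the dolomite is vuggy",
]

tokenized = [d.split() for d in docs]                    # 1. tokenize
vocab = sorted({w for doc in tokenized for w in doc})    # 2. build vocabulary
N = len(docs)

# df(w): number of documents that contain word w at least once
df = Counter(w for doc in tokenized for w in set(doc))

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)                             # 3. bag-of-words counts
    return {w: tf[w] * math.log(N / df[w]) for w in tf}  # 4. TF-IDF weighting

weights = tfidf(tokenized[2])
print(weights)
```

Words that occur in every document, such as "the" and "is" here, receive a weight of zero because $\log(N / \text{df}(w)) = \log 1 = 0$.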

Laplace Smoothing

Preventing Zero Probabilities

If a word never appears with a certain class in the training data, the likelihood $P(x_i | C_k) = 0$ , which zeros out the entire posterior (one zero in the product kills everything). Laplace smoothing (additive smoothing) fixes this:

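\hat{p}_{k,i} = \frac{N_{k,i} + \alpha}{N_k + \alpha n}

where $\hat{p}_{k,i}$ is the smoothed estimate of the probability of feature $i$ in class $C_k$ (the $p_{k,i}$ of the multinomial likelihood), $N_{k,i}$ is the count of feature $i$ in class $C_k$, $N_k = \sum_i N_{k,i}$ is the total count for class $C_k$, $n$ is the number of features (the vocabulary size for text), and $\alpha$ is the smoothing parameter ($\alpha = 1$ is Laplace smoothing, $\alpha < 1$ is Lidstone smoothing).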

Smoothing adds a small "pseudocount" $\alpha$ to every feature-class combination, ensuring no probability is ever exactly zero.
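A minimal sketch of additive smoothing for a single class, using invented word counts over a four-word vocabulary. The function name `smoothed_probs` is hypothetical.

```python
def smoothed_probs(counts, alpha=1.0):
    """Additive smoothing: (count_i + alpha) / (total + alpha * n)."""
    n = len(counts)
    total = sum(counts)
    return [(c + alpha) / (total + alpha * n) for c in counts]

# Hypothetical counts: the second and fourth words never occur with this class
counts = [3, 0, 5, 0]

unsmoothed = smoothed_probs(counts, alpha=0.0)   # contains exact zeros
laplace = smoothed_probs(counts, alpha=1.0)      # every entry is positive
print(unsmoothed, laplace)
```

The smoothed estimates still sum to one, because $\alpha n$ is added to the denominator to balance the $\alpha$ added to each of the $n$ numerators.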

Log-Probability Trick for Numerical Stability

Why Work in Log Space?

The NB classifier computes $P(C_k) \prod_{i=1}^{n} P(x_i | C_k)$ . With many features, this product of small probabilities can underflow to zero in floating-point arithmetic. The solution is to work in log space:

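\log P(C_k | \mathbf{x}) = \log P(C_k) + \sum_{i=1}^{n} \log P(x_i | C_k) - \log P(\mathbf{x})

The last term is the same for every class, so it can be dropped when comparing classes. Multiplication becomes addition, and the argmax is preserved because $\log$ is monotonically increasing. This is essential in practice: with 1000 features (e.g., a word vocabulary) and per-feature probabilities of around $10^{-3}$, the raw product would be around $10^{-3000}$. That is far below the smallest positive number representable in double precision (roughly $10^{-308}$ for normalized values), so the product is stored as exactly zero.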
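The underflow is easy to reproduce. The sketch below assumes 1000 per-feature likelihoods of $10^{-3}$ each (illustrative values) and compares the raw product with the sum of logs. It also shows one common way to recover normalized posteriors from log scores, by subtracting the maximum before exponentiating; the helper name `normalize_log_scores` is hypothetical.

```python
import math

probs = [1e-3] * 1000           # assumed per-feature likelihoods

raw_product = 1.0
for p in probs:
    raw_product *= p            # underflows to 0.0 well before the loop ends

log_sum = sum(math.log(p) for p in probs)   # stays finite

def normalize_log_scores(log_scores):
    """Turn per-class log scores into probabilities that sum to one."""
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]   # largest term becomes 1
    total = sum(exps)
    return [e / total for e in exps]

# Two classes whose log scores differ by 2
posteriors = normalize_log_scores([log_sum, log_sum - 2.0])
print(raw_product, log_sum, posteriors)
```

Subtracting the maximum leaves the ratios between classes unchanged but keeps every exponent at or below zero, so the exponentials cannot all underflow.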

When NB Outperforms Complex Models

Situations Favoring Naive Bayes

Small training sets: NB has very few parameters (2 per feature per class for Gaussian NB, plus one prior per class), so it is much less prone to overfitting when data are limited. Complex models like Random Forest or neural networks may overfit severely with fewer than 50-100 samples per class.
High-dimensional data: NB scales gracefully to thousands of features (e.g., text classification with a 10,000-word vocabulary) because each feature's distribution is estimated separately. Models that must estimate joint feature distributions, or that rely on distances between samples, are hit much harder by the curse of dimensionality.
Real-time classification: NB prediction costs $O(nK)$ per sample for $n$ features and $K$ classes, making it well suited to real-time applications like seismic event classification during monitoring.
Baseline and sanity check: If a complex model cannot beat NB, that usually points to a problem with the feature engineering or the data.

Calibration Issues

NB Probabilities Are Poorly Calibrated

Although NB outputs "probabilities," these are typically poorly calibrated: they tend to be pushed toward 0 and 1 (overconfident). This happens because the independence assumption ignores correlations: correlated features are counted as if each one were separate evidence, so the product of likelihoods becomes systematically too extreme.

The ranking of classes is usually correct (the most probable class is chosen correctly), but the magnitude of the probabilities should not be interpreted literally. If calibrated probabilities are needed, apply Platt scaling (fit a logistic regression on NB outputs) or isotonic regression as a post-processing step. In scikit-learn: CalibratedClassifierCV(gnb, method="sigmoid") for Platt scaling, or method="isotonic" for isotonic regression.
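A sketch of the scikit-learn call mentioned above, on synthetic data generated only for illustration. The second feature is built as a near-copy of the first so that the independence assumption is deliberately violated. The estimator is passed positionally because the keyword name for it has changed between scikit-learn versions.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data (an assumption made for this example)
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
x1 = rng.normal(loc=y * 1.5, scale=1.0)      # class-dependent mean
x2 = x1 + rng.normal(scale=0.1, size=n)      # near-duplicate of x1
X = np.column_stack([x1, x2])

# Plain Gaussian NB versus the same model wrapped in Platt scaling
gnb = GaussianNB().fit(X, y)
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5).fit(X, y)

p_raw = gnb.predict_proba(X)[:, 1]
p_cal = calibrated.predict_proba(X)[:, 1]

# Share of samples assigned a probability within 0.05 of 0 or 1
print("raw extreme share:       ", np.mean((p_raw < 0.05) | (p_raw > 0.95)))
print("calibrated extreme share:", np.mean((p_cal < 0.05) | (p_cal > 0.95)))
```

In a real workflow the calibrated probabilities should be assessed on held-out data, for example with a reliability diagram, rather than on the training set used here for brevity.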

Geoscience Applications

Well-log facies classification (baseline): Gaussian NB on well-log features (GR, RHOB, NPHI, RT) provides a fast baseline for lithofacies classification. It is often surprisingly competitive, especially when the features are only weakly correlated within each facies.
Mineral classification from spectral data: Given reflectance spectra from drill-core spectroscopy, NB classifies mineral assemblages. Each wavelength band is treated as a feature, with Gaussian NB modeling the spectral response for each mineral class. The high dimensionality (hundreds of spectral bands) is handled naturally by NB.
Seismic facies classification: Using seismic attributes as features, NB provides a rapid first-pass classification.
Drilling report classification: Multinomial NB on TF-IDF word vectors classifies daily drilling reports by operational status (drilling, tripping, cementing, trouble) or by formation being drilled. This enables automated extraction of operational statistics from thousands of legacy reports.
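The drilling report application can be sketched with a scikit-learn pipeline. The report snippets and labels below are invented for illustration and are far too few for a real model; they only show how the pieces connect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical report snippets and operational-status labels
reports = [
    "drilled ahead from 2100 m to 2250 m with good rate of penetration",
    "drilling 8.5 inch hole section, rotating and sliding",
    "tripped out of hole to change bit",
    "tripping in hole with new bottom hole assembly",
    "pumped cement slurry and displaced, waited on cement",
    "cementing 9 5/8 inch casing, bumped plug",
    "stuck pipe, worked string and jarred free",
    "lost circulation, pumped lost circulation material pill",
]
labels = ["drilling", "drilling", "tripping", "tripping",
          "cementing", "cementing", "trouble", "trouble"]

# TF-IDF word vectors followed by Multinomial NB with Laplace smoothing
model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=1.0))
model.fit(reports, labels)

pred = model.predict(["waited on cement after pumping slurry"])
print(pred[0])
```

Words in the new report that never appeared in training, such as "after" and "pumping", are ignored by the vectorizer, while the smoothing parameter `alpha` keeps words seen in only some classes from producing zero probabilities in the others.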
