Model comparison: Bayes factors and WAIC

Part 7 — Bayesian methods

Learning objectives

Define the BAYES FACTOR as the ratio of marginal likelihoods between two models
Recognise Bayes factors' dependence on the PRIOR — specifically the role of the prior's scale in the Occam's-razor penalty
Explain LINDLEY'S PARADOX and why vague priors can paradoxically favour the simpler model
Compute and interpret WAIC and LOO-IC as out-of-sample predictive criteria
Choose between Bayes factors, WAIC, and LOO based on the comparison's scientific goal

§7.6 covered diagnostics for a single model. The complementary question is: given two or more candidate models, which is better? Bayesian model comparison has two main tools: Bayes factors (the classical Jeffreys-Kass-Raftery approach) and WAIC / LOO-CV (the modern Vehtari-Gelman-Gabry approach). They answer different versions of "better" and can disagree dramatically. This section covers both and when to use each.

The Bayes factor

Compare model M₁ to M₀. Each has its own MARGINAL LIKELIHOOD (also called "evidence"):

p(y \mid M_j) = \int p(y \mid \theta_j, M_j) \, p(\theta_j \mid M_j) \, d\theta_j.

The Bayes factor is their ratio:

\text{BF}_{10} = \frac{p(y \mid M_1)}{p(y \mid M_0)}.

The Bayes factor is the multiplicative update to the prior odds in favour of M₁: $\text{Posterior odds} = \text{Prior odds} \times \text{BF}_{10}$. If you started with equal prior weights on the two models, the posterior probability of M₁ is BF₁₀ / (1 + BF₁₀).

Jeffreys' (1961) rough scale of evidence (thresholds rounded; Jeffreys worked in half-powers of 10, so 3 stands in for 10^{1/2} ≈ 3.2):

BF > 100: decisive evidence for M₁
BF > 10: strong evidence
BF > 3: substantial evidence
1 < BF < 3: anecdotal evidence
BF < 1: evidence favours M₀ (read 1/BF on the same scale)
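
The odds update and the scale above can be sketched in a few lines of Python. The function names `posterior_prob_m1` and `jeffreys_label` are illustrative only (they come from no library), and the cut-offs are the rounded ones listed above, not Jeffreys' exact grades.

```python
def posterior_prob_m1(bf10, prior_prob_m1=0.5):
    """Posterior probability of M1 given BF10 and a prior probability of M1."""
    prior_odds = prior_prob_m1 / (1.0 - prior_prob_m1)
    posterior_odds = prior_odds * bf10  # posterior odds = prior odds x BF10
    return posterior_odds / (1.0 + posterior_odds)


def jeffreys_label(bf10):
    """Verbal label using the rounded cut-offs from the list (1, 3, 10, 100)."""
    if bf10 > 100:
        return "decisive evidence for M1"
    if bf10 > 10:
        return "strong evidence for M1"
    if bf10 > 3:
        return "substantial evidence for M1"
    if bf10 > 1:
        return "anecdotal evidence for M1"
    if bf10 == 1:
        return "no evidence either way"
    return "evidence favours M0"


# Hypothetical Bayes factors, with equal prior weights on the two models
for bf in (0.5, 2.0, 12.0, 150.0):
    print(bf, round(posterior_prob_m1(bf), 3), jeffreys_label(bf))
```

With unequal prior weights, pass `prior_prob_m1` explicitly: the same Bayes factor then maps to a different posterior probability, which is why the Bayes factor and the posterior odds should be reported separately.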

The Occam's razor built into the Bayes factor

The marginal likelihood $p(y \mid M_j)$ averages the likelihood over the prior. A model with a wider prior spreads its prior mass over more candidate parameter values, so each individual region gets LESS prior density. When the data localise on a small region of parameter space, the marginal likelihood is HIGHER for a model whose prior was already concentrated there. This is the automatic Occam's razor: a simpler model (narrower prior, fewer parameters) is rewarded for its sharper predictions, but only if the data fall where it predicted.

Quantitatively for the normal-normal example with M₀: μ = 0 vs M₁: μ ~ N(0, τ²), and Y_i ~ N(μ, σ²):

\log \text{BF}_{10} = -\tfrac{1}{2} \log\left(1 + \tfrac{N \tau^2}{\sigma^2}\right) + \tfrac{1}{2} \cdot \frac{(N \bar{y})^2 \tau^2}{\sigma^2 (\sigma^2 + N \tau^2)}.

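First term: the Occam penalty, growing logarithmically with $\tau$ (and with N). Second term: the fit advantage, growing with sample size and observed effect. Writing $z = \sqrt{N}\,\bar{y}/\sigma$, the second term equals $\tfrac{1}{2} z^2 \cdot N\tau^2 / (\sigma^2 + N\tau^2)$, so it never exceeds $z^2/2$ however large τ becomes. Vague τ (large) gives a big penalty; only large observed effects can overcome it.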
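
As a check on the formula, the sketch below computes log BF₁₀ twice: once from the two-term expression, and once as a ratio of the marginal densities of ȳ (ȳ is sufficient for μ, so the rest of the likelihood cancels in the ratio). It assumes σ is known; the function names and the inputs ȳ = 0.4, N = 30 are illustrative.

```python
import math


def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def log_bf10_closed_form(ybar, n, tau, sigma=1.0):
    """log BF10 as Occam penalty plus fit advantage (the formula above)."""
    occam = -0.5 * math.log(1.0 + n * tau**2 / sigma**2)
    fit = 0.5 * (n * ybar) ** 2 * tau**2 / (sigma**2 * (sigma**2 + n * tau**2))
    return occam + fit


def log_bf10_from_marginals(ybar, n, tau, sigma=1.0):
    """Same quantity from the marginal densities of ybar under each model.

    Under M0, ybar ~ N(0, sigma^2/n); under M1, ybar ~ N(0, sigma^2/n + tau^2).
    """
    var0 = sigma**2 / n
    var1 = sigma**2 / n + tau**2
    return normal_logpdf(ybar, 0.0, var1) - normal_logpdf(ybar, 0.0, var0)


# Illustrative inputs: same data summary, three prior scales
for tau in (0.5, 2.0, 15.0):
    a = log_bf10_closed_form(0.4, 30, tau)
    b = log_bf10_from_marginals(0.4, 30, tau)
    print(f"tau={tau:5.1f}  closed form={a:7.3f}  marginals={b:7.3f}")
```

The two columns should agree to rounding error, and log BF₁₀ should fall as τ grows with the data held fixed, which is the Occam penalty at work.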

Lindley's paradox

Lindley (1957) noted: with VAGUE M₁ priors (large τ) and LARGE sample sizes, the Bayes factor can favour M₀ EVEN when a frequentist test rejects M₀ at p < 0.05. At a fixed significance level (fixed $z = \sqrt{N}\,\bar{y}/\sigma$) the fit advantage is bounded, while the Occam penalty grows without limit in τ and N, so a sufficiently vague prior or large sample always tips the Bayes factor towards M₀. Conversely, a SHARP prior centred near the true value would give big Bayes factors for moderate effects.

This paradox isn't a bug; it's a feature. The Bayes factor is asking "do the data give evidence FOR the specific alternative encoded in the prior?". A vague prior is a weak alternative; the data have to overcome both the lack of fit at M₀ AND the lack of confidence in M₁. The implication: Bayes factors require thoughtful priors. Default vague priors give essentially arbitrary Bayes factors, because for large τ the value of BF₁₀ shrinks roughly in proportion to 1/τ. Use Bayes factors only when the prior on the alternative has substantive justification.
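
The paradox can be reproduced numerically with the normal-normal formula from the previous section. The sketch below holds the z-statistic fixed, so the frequentist p-value never changes, and varies N and τ. The value z = 2.2 and the grids of N and τ are hypothetical choices, and σ is assumed known.

```python
import math


def two_sided_p(z):
    """Two-sided p-value of a z-test, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def log_bf10_at_fixed_z(z, n, tau, sigma=1.0):
    """log BF10 for M0: mu = 0 vs M1: mu ~ N(0, tau^2), with ybar = z*sigma/sqrt(n)."""
    shrink = n * tau**2 / (sigma**2 + n * tau**2)  # lies in (0, 1), tends to 1
    occam = -0.5 * math.log(1.0 + n * tau**2 / sigma**2)
    return occam + 0.5 * z**2 * shrink


z = 2.2  # hypothetical observed z-statistic, identical in every row below
print("two-sided p =", round(two_sided_p(z), 4))
for n in (10, 100, 1000, 10000):
    for tau in (0.2, 1.0, 10.0):
        print(f"N={n:6d}  tau={tau:5.1f}  log BF10={log_bf10_at_fixed_z(z, n, tau):6.2f}")
```

Every row has the same p-value, yet log BF₁₀ changes sign as N or τ grows: the Bayes factor measures support for one particular alternative, not distance from the null.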

WAIC: the modern predictive-criterion alternative

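The Watanabe-Akaike Information Criterion (WAIC, Watanabe 2010; also expanded as "widely applicable information criterion") is a Bayesian generalisation of AIC. It estimates the expected out-of-sample log pointwise predictive density by taking the in-sample log pointwise predictive density (lppd) and subtracting a complexity penalty, then multiplying by −2 to put the result on the deviance scale: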

\text{WAIC} = -2 \left( \text{lppd} - p_{\text{WAIC}} \right),

where

\text{lppd} = \sum_i \log \mathbb{E}_\text{post}[p(y_i \mid \theta)], \quad p_{\text{WAIC}} = \sum_i \text{Var}_\text{post}[\log p(y_i \mid \theta)].

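lppd is the log of the posterior-averaged density of each observed data point, summed over observations; p_WAIC penalises model complexity via the posterior variance of the log-density (an effective number of parameters in a Bayesian sense). Compute both by drawing S posterior samples and replacing the posterior expectation and variance with their sample versions; lower WAIC = better. As a rough guide, differences in WAIC of >~4 start to be practically meaningful, but they should be read against the standard error of the difference.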
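
A minimal NumPy sketch of this computation is given below. It assumes the pointwise log-likelihoods have already been evaluated at S posterior draws and stored as an (S, N) array. The function `waic`, the simulated data, and the conjugate posterior used to generate the draws are illustrative assumptions, not part of any package.

```python
import numpy as np


def waic(log_lik):
    """WAIC from an (S, N) array with log_lik[s, i] = log p(y_i | theta_s).

    Returns (waic, lppd, p_waic).
    """
    log_lik = np.asarray(log_lik, dtype=float)
    n_draws = log_lik.shape[0]
    # log of the posterior-mean density per observation (stable log-mean-exp)
    col_max = log_lik.max(axis=0)
    lppd_i = col_max + np.log(np.exp(log_lik - col_max).sum(axis=0)) - np.log(n_draws)
    # posterior variance of the log-density per observation
    p_waic_i = log_lik.var(axis=0, ddof=1)
    lppd, p_waic = lppd_i.sum(), p_waic_i.sum()
    return -2.0 * (lppd - p_waic), lppd, p_waic


# Simulated example (assumed settings): y_i ~ N(0.4, 1), N = 30, tau = 2
rng = np.random.default_rng(0)
n, sigma, tau = 30, 1.0, 2.0
y = rng.normal(0.4, sigma, size=n)

# M1: conjugate posterior for mu under the prior N(0, tau^2)
post_var = 1.0 / (n / sigma**2 + 1.0 / tau**2)
post_mean = post_var * y.sum() / sigma**2
mu_m1 = rng.normal(post_mean, np.sqrt(post_var), size=4000)
mu_m0 = np.zeros(4000)  # M0 fixes mu = 0, so every "draw" is 0


def pointwise_log_lik(mu_draws):
    """(S, N) array of log N(y_i; mu_s, sigma^2)."""
    resid = y[None, :] - mu_draws[:, None]
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * resid**2 / sigma**2


waic_m1, lppd_m1, p_m1 = waic(pointwise_log_lik(mu_m1))
waic_m0, lppd_m0, p_m0 = waic(pointwise_log_lik(mu_m0))
print("M1: WAIC =", round(waic_m1, 2), " p_WAIC =", round(p_m1, 2))
print("M0: WAIC =", round(waic_m0, 2), " p_WAIC =", round(p_m0, 2))
```

M0 has no free parameter, so its log-density does not vary across draws and p_WAIC is zero; for M1, p_WAIC should come out near one effective parameter.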

LOO-CV: leave-one-out cross-validation

The Bayesian analogue of leave-one-out cross-validation:

\text{elpd}_\text{loo} = \sum_i \log p(y_i \mid y_{-i}) = \sum_i \log \int p(y_i \mid \theta) p(\theta \mid y_{-i}) d\theta.

Each summand is the leave-one-out predictive density for observation i. Higher elpd_loo = better. Computed exactly via brute-force refitting N times, or approximately via PSIS-LOO (Pareto-smoothed importance sampling, Vehtari-Gelman-Gabry 2017), which extracts LOO from a single MCMC fit.
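
For the conjugate normal-normal model the "refit N times" route has a closed form, which makes a compact illustration of what PSIS-LOO approximates. The sketch below is exact LOO for that special case only. It does not implement PSIS, and the function names and simulated data are assumptions made for illustration.

```python
import numpy as np


def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) at x (broadcasts over arrays)."""
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (x - mean) ** 2 / var


def exact_loo_m1(y, tau, sigma=1.0):
    """Pointwise log p(y_i | y_{-i}) for M1: mu ~ N(0, tau^2), y_i ~ N(mu, sigma^2).

    Each "refit" without y_i is the closed-form conjugate posterior.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    post_var = 1.0 / ((n - 1) / sigma**2 + 1.0 / tau**2)  # same for every i
    post_mean = post_var * (y.sum() - y) / sigma**2       # leaves y_i out
    # posterior predictive for the held-out point: N(post_mean_i, sigma^2 + post_var)
    return normal_logpdf(y, post_mean, sigma**2 + post_var)


def exact_loo_m0(y, sigma=1.0):
    """M0 has no free parameter, so leaving y_i out changes nothing."""
    return normal_logpdf(np.asarray(y, dtype=float), 0.0, sigma**2)


# Simulated example (assumed settings): y_i ~ N(0.4, 1), N = 30, tau = 2
rng = np.random.default_rng(1)
y = rng.normal(0.4, 1.0, size=30)
print("elpd_loo M1:", round(exact_loo_m1(y, tau=2.0).sum(), 2))
print("elpd_loo M0:", round(exact_loo_m0(y).sum(), 2))
```

Keeping the pointwise values, not just their sum, matters in practice: the standard error of an elpd difference between two models is computed from them.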

LOO is the gold standard for predictive comparison: closer to the "what would the model say about new data?" question than any other criterion. The R package loo (Vehtari et al.) is the standard tool; in Python, ArviZ provides the same PSIS-LOO computation.

WAIC vs Bayes factor vs LOO: when to use which

Bayes factors: when both models have SUBSTANTIVELY JUSTIFIED priors and you want to test a specific hypothesis. They are the natural Bayesian counterpart to a frequentist hypothesis test. Avoid with vague priors (Lindley's paradox).
WAIC: cheap (single MCMC fit), reasonable for prediction-oriented comparison. Use as a quick check.
LOO-CV (PSIS-LOO): the modern default for prediction-oriented model comparison. It depends on the prior only through the posterior, so it is far less sensitive to the prior's scale than a Bayes factor; it targets out-of-sample predictive performance.

WAIC and LOO answer "which model better predicts new data?"; Bayes factors answer "which model is supported by the data, given my prior beliefs?". They CAN disagree — particularly when priors are influential.

Practical workflow

Run PPCs (§7.6) on each candidate model. Reject any that misfit obviously.
Compute LOO-CV (PSIS-LOO) for the surviving models. Choose the highest elpd_loo, but check the standard error of the difference. If the difference is within 2 SEs, the models are practically equivalent — report both.
If you need a Bayes-factor-based decision (e.g., for regulatory or hypothesis-testing reasons), be EXPLICIT about your priors and conduct sensitivity analysis across reasonable prior choices.
For prediction-deployment decisions, the highest LOO (subject to PSIS-LOO Pareto-k diagnostics) wins.
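
The "within 2 SEs" check in the workflow can be sketched as follows, assuming pointwise elpd values for two models on the same observations are already in hand. The standard error used here is √N times the sample standard deviation of the pointwise differences; the function names and the four-point arrays are hypothetical.

```python
import numpy as np


def compare_elpd(elpd_a, elpd_b):
    """Difference in elpd (A minus B) and its standard error.

    elpd_a, elpd_b: pointwise log predictive densities for the SAME observations.
    """
    diff = np.asarray(elpd_a, dtype=float) - np.asarray(elpd_b, dtype=float)
    n = diff.size
    elpd_diff = diff.sum()
    se_diff = np.sqrt(n * diff.var(ddof=1))  # sqrt(N) * sd of pointwise differences
    return elpd_diff, se_diff


def practically_equivalent(elpd_diff, se_diff, n_se=2.0):
    """Apply the 'within 2 SEs' rule from the workflow."""
    return abs(elpd_diff) <= n_se * se_diff


# Hypothetical pointwise elpd values for two models on four observations
elpd_a = [-1.0, -1.2, -0.8, -1.1]
elpd_b = [-1.5, -1.0, -1.3, -1.2]
d, se = compare_elpd(elpd_a, elpd_b)
print("elpd difference:", round(d, 3), " SE:", round(se, 3))
print("practically equivalent:", practically_equivalent(d, se))
```

With only a handful of observations the standard error is itself poorly estimated, so the rule is most trustworthy when N is reasonably large.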

Try it

Default: μ_true = 0.4, N = 30, τ = 2.0. ȳ ≈ 0.4, SE(ȳ) ≈ 0.18, so the observed effect is about 2 SE from zero. −ΔWAIC/2 is moderately positive (WAIC favours M1), while log BF sits close to zero: plugging ȳ = 0.4 and σ = 1 into the formula above gives log BF₁₀ ≈ −0.02, so the Bayes factor is roughly neutral unless the sampled ȳ comes out larger.
Drag τ up to 15 (very vague prior on μ in M1). Log BF drops sharply as the Occam penalty grows (to about −2 at ȳ = 0.4). −ΔWAIC/2 hardly moves (WAIC depends on the posterior, which is barely affected by vague priors). Log BF is now clearly NEGATIVE (M0 preferred) even though WAIC still favours M1. Lindley's paradox in real time.
Drag τ back to 2, then drag N up to 500. Both criteria grow more decisive — M1 increasingly preferred. The sweep chart shows how each criterion evolves with N.
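Set μ_true to 0 (null is exactly true). Both criteria should favour M0: log BF goes negative, ΔWAIC turns positive. The Bayes factor's preference becomes decisive at very large τ; WAIC's preference stays mild, because M1 is penalised by only about one effective parameter whatever the value of τ.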
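Set μ_true to 0.05 (tiny effect) and N to 500 with τ = 10 (vague). Here SE(ȳ) = 1/√500 ≈ 0.045, so ȳ ≈ 0.05 is only about 1.1 SE from zero: a frequentist test would not reject, and the Bayes factor favours M0 firmly. For the classical Lindley setup, raise μ_true to about 0.1 (or N to about 2000) so that ȳ is a little over 2 SE from zero. The frequentist test then rejects at p < 0.05, the Bayes factor with vague τ still favours M0, and WAIC favours M1. The two are answering different questions.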

A statistician runs a Bayes factor analysis on a clinical trial and obtains BF₁₀ = 0.5 with prior τ = 100 (very vague). The frequentist test gives p = 0.02. Whose conclusion is right, and what should the statistician have done differently?

What you now know

Bayes factors compare marginal likelihoods; they require principled priors and are subject to Lindley's paradox under vague priors. WAIC and LOO-CV are prediction-oriented criteria less sensitive to prior choice; LOO-CV via PSIS is the modern default for non-trivial model comparison. The Bayesian workflow combines PPCs (§7.6) to reject misfitting models and LOO/WAIC to rank survivors. Part 7 closes here; Parts 8 (resampling) and 9 (ML for researchers) extend the analyst's toolkit further.

References

Kass, R.E., Raftery, A.E. (1995). "Bayes factors." JASA 90(430), 773–795. (Definitive review.)
Jeffreys, H. (1961). Theory of Probability (3rd ed.). Oxford. (Original Bayes-factor scale of evidence.)
Lindley, D.V. (1957). "A statistical paradox." Biometrika 44(1/2), 187–192. (Lindley's paradox.)
Watanabe, S. (2010). "Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory." JMLR 11, 3571–3594. (WAIC.)
Vehtari, A., Gelman, A., Gabry, J. (2017). "Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC." Statistics and Computing 27(5), 1413–1432. (PSIS-LOO & the modern workflow.)