Posterior-predictive checks

Part 7 — Bayesian methods

Learning objectives

Define the POSTERIOR-PREDICTIVE distribution p(y_rep | y) = ∫ p(y_rep | θ) p(θ | y) dθ
Implement a posterior-predictive check by drawing replicate datasets and computing a test statistic T
Compute and interpret the POSTERIOR-PREDICTIVE P-VALUE
Choose test statistics that target model assumptions the analyst SUSPECTS might fail
Distinguish PPCs (diagnose model misspecification) from BAYES FACTORS / WAIC (compare competing models)

You've fit your Bayesian model and computed a posterior. Now the obvious question: is the model any good? POSTERIOR-PREDICTIVE CHECKS (PPCs; Rubin 1984, Gelman, Meng & Stern 1996, Gelman et al. 2013 BDA Ch. 6) answer this by comparing REPLICATE DATA drawn from the posterior to the OBSERVED DATA. The check fails when the model can't generate data that looks like the observed data — making misspecification visible.

The posterior-predictive distribution

The posterior-predictive distribution for a new observation $y_{\text{rep}}$ is

p(y_{\text{rep}} \mid y) = \int p(y_{\text{rep}} \mid \theta) \, p(\theta \mid y) \, d\theta.

Read it as: draw θ from its posterior, then draw a new dataset from the likelihood at that θ. The integral marginalises over posterior uncertainty — the result is the distribution of "what new data should look like" GIVEN the model and the observed data. The integral is rarely tractable, but the recipe is operational: sample $\theta^{(s)} \sim p(\theta \mid y)$ via MCMC, then sample $y_{\text{rep}}^{(s)} \sim p(\cdot \mid \theta^{(s)})$ as a replicate dataset.

The PPC algorithm

Fit the model. Obtain posterior samples $\theta^{(1)}, \ldots, \theta^{(S)}$ .
For each posterior sample, generate a REPLICATE dataset $y_{\text{rep}}^{(s)}$ of the same size as the observed y.
Choose a TEST STATISTIC $T(y)$ that summarises the aspect of the data you care about (e.g. skewness, max, mean, dispersion, quantile-of-interest).
Compute $T(y_{\text{obs}})$ and the distribution of $T(y_{\text{rep}})$ across replicates.
Visualise — either overlay the histograms of observed and replicates, or plot the distribution of T(y_rep) with T(y_obs) marked — AND compute the POSTERIOR-PREDICTIVE P-VALUE:

p_{\text{PPC}} = \Pr(T(y_{\text{rep}}) \ge T(y_{\text{obs}}) \mid y) \approx \frac{1}{S} \sum_{s = 1}^{S} \mathbb{1}[T(y_{\text{rep}}^{(s)}) \ge T(y_{\text{obs}})].

Values near 0 or 1 indicate the observed statistic is in the tail of what the model predicts — misfit. Values near 0.5 indicate the model can comfortably reproduce that aspect of the observed data. Note: PPC p-values are NOT classical p-values; they shouldn't be compared to 0.05. The natural threshold is "is the observed statistic in the bulk or the tail of the replicate distribution?" — broadly, p < 0.05 or p > 0.95 is concerning; p in [0.1, 0.9] is fine.

Choosing test statistics

The KEY DESIGN CHOICE: test statistics should target aspects of the data the model MIGHT FAIL to capture. Standard choices:

Mean, median, variance: basic moment-matching. A Bayesian model usually matches these by construction, so PPCs on them tend to be uninformative.
Skewness, kurtosis: shape mismatch. A Normal-fitted-to-skewed-data fails skewness PPC dramatically.
Sample max, min, range: tail mismatch. A model that under-predicts extremes (e.g. ignores fat tails) fails these.
Number of zeros: for count data (Poisson vs zero-inflated): a vanilla Poisson under-predicts zeros for excess-zero data.
Goodness-of-fit measures conditioned on observed Y: e.g., the chi-squared discrepancy $T(y, \theta) = \sum (y_i - E[y_i \mid \theta])^2 / \text{Var}(y_i \mid \theta)$ — PPC p-values for these are interpretable as overdispersion / underdispersion checks.

BDA Ch. 6 (Gelman et al. 2013) develops the philosophy: choose T to match an aspect of the data that the model might handle wrong, given your substantive knowledge of the problem. For example, if you're fitting a Normal model to data that you suspect might be skewed, choose skewness; if you fit a Poisson but suspect zero inflation, choose the number of zeros.

Graphical PPCs

Visual checks are often more informative than a single p-value. Standard graphical PPCs:

Density overlay: overlay observed histogram with histograms (or KDE) of, say, 50 replicates. The model fits if the observed curve is in the cloud of replicates.
Discrepancy plots: scatter $T(y_{\text{rep}}, \theta)$ vs $T(y, \theta)$ for each posterior sample. The cloud near the diagonal indicates good fit.
LOO-PIT: leave-one-out posterior-predictive integral transformation. For each observation, compute the posterior-predictive CDF of $y_i$ given all other data; under correct specification these CDF values are uniform. Bayesian analogue of the frequentist QQ plot.

The R package bayesplot (Gabry & Mahr) automates all these visualisations against Stan/PyMC/NumPyro fits.

What PPCs CANNOT do

PPCs use the data TWICE (once to fit, once to check). The check is therefore CONSERVATIVE — the observed statistic is more likely to fall in the bulk than if the data hadn't been used for fitting. Bayarri & Berger (2000) and others have proposed alternative posterior-predictive distributions that avoid this double-use; in practice, the inflated calibration matters less than getting a clean visual diagnosis.

PPCs diagnose MISSPECIFICATION, not COMPARISON between competing models. For "which model fits better?" use Bayes factors or WAIC/LOO-CV (§7.7). The two tools are complementary — PPCs check whether ANY proposed model fits; Bayes factors / WAIC rank competing models.

What a failing PPC tells you

A failing PPC narrows the problem. If skewness PPC fails badly but variance PPC is fine, the residual is asymmetric — consider a skew-Normal or a transformation. If the sample-max PPC fails, the data has heavier tails than the model assumes — consider a Student-t likelihood. If the number-of-zeros PPC fails on count data, you need zero inflation. PPCs are diagnostic by design: their value is in directing the modeller to specific model improvements.

Try it

Start in CORRECT model mode. Observed data is Normal(3, 1.5²); the model is Normal with conjugate NIG prior. With test statistic T = skewness, observed T(y) is near zero (Normal data is symmetric) and the replicate T-distribution centres near zero. The PPC p-value is in the bulk (e.g., 0.4-0.6) — model fits.
Switch to MISSPECIFIED mode. Observed data is now LOGNORMAL (right-skewed). The Normal model is fit to it. Sample skewness of the observed is positive (say, around 1-2). Replicate skewness centres near zero (the Normal model predicts symmetric data). PPC p-value goes to 0 — the observed skewness is in the right tail of the replicate distribution. The model MISFITS this aspect of the data.
Switch the test statistic to T = max. In CORRECT mode, observed max is in the bulk of replicate maxes — model fits. In MISSPECIFIED mode, observed max is higher than the Normal model can produce — the tail is fatter than assumed.
Hit "Re-sample" to draw a new observed dataset and refit. PPC results vary in noise; the SCENARIO-LEVEL diagnosis (misfit or fit) is the robust signal across resamples.
Note: the LEFT-PANEL density overlay tells you the same story as the right-panel test-statistic histogram, but for the entire distribution. In CORRECT mode, the observed histogram is in the cloud of replicates. In MISSPECIFIED mode, the observed shape (right-skewed) sticks out from the cloud of symmetric Normal replicates.

A Poisson model is fit to daily count data of customer-service requests; the PPC p-value for T = number of zero-count days is 0.001. What does this tell you, and what model modification might you try?

What you now know

PPCs draw replicate datasets from the posterior-predictive distribution and compare a chosen test statistic of replicates to the observed value. Failure (T_obs in the tail of T_rep) diagnoses misspecification; success (T_obs in the bulk) reassures. The choice of test statistic should target what you suspect the model might handle wrong. PPCs diagnose IF the model fits; §7.7 covers HOW to compare competing models via Bayes factors and WAIC.

References

Rubin, D.B. (1984). "Bayesianly justifiable and relevant frequency calculations for the applied statistician." Annals of Statistics 12(4), 1151–1172. (Coined "posterior-predictive p-value".)
Gelman, A., Meng, X.L., Stern, H. (1996). "Posterior predictive assessment of model fitness via realized discrepancies." Statistica Sinica 6, 733–760.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B. (2013). Bayesian Data Analysis (3rd ed.), Chapter 6. CRC.
Bayarri, M.J., Berger, J.O. (2000). "P values for composite null models." JASA 95(452), 1127–1142.
Gabry, J., Simpson, D., Vehtari, A., Betancourt, M., Gelman, A. (2019). "Visualization in Bayesian workflow." JRSS-A 182(2), 389–402.